[XPU] Implemented 32bit optimizers in triton #1710

YangKai0616 · 2025-07-16T11:13:50Z

Depends on #1692.

Implemented 32-bit optimizers in Triton for use on XPU devices.

The PR includes two implementations:

Pure Torch implementation: utilizing torch.compile
Pure Triton implementation: utilizing triton.jit

For the benchmarking on 4096*4096 shapes, the results are as follows:

Pure Torch implementation:

Torch step (eager): 1.075ms
BNB step: 0.516ms
Torch step (eager): 1.058ms
BNB step: 0.517ms
Torch step (eager): 1.080ms
BNB step: 0.527ms
Torch step (eager): 1.069ms
BNB step: 0.539ms
Torch step (eager): 1.034ms
BNB step: 0.526ms

Pure Triton implementation:

Torch step (eager): 1.034ms
BNB step: 0.524ms
Torch step (eager): 1.054ms
BNB step: 0.488ms
Torch step (eager): 1.031ms
BNB step: 0.526ms
Torch step (eager): 1.047ms
BNB step: 0.538ms
Torch step (eager): 1.045ms
BNB step: 0.489ms

For the benchmarking on 1024*1024 shapes, the results are as follows:
Pure Torch implementation:

Torch step (eager): 0.345ms
BNB step: 0.335ms
Torch step (eager): 0.354ms
BNB step: 0.226ms
Torch step (eager): 0.347ms
BNB step: 0.227ms
Torch step (eager): 0.358ms
BNB step: 0.232ms
Torch step (eager): 0.349ms
BNB step: 0.225ms

Pure Triton implementation:

Torch step (eager): 0.346ms
BNB step: 0.226ms
Torch step (eager): 0.337ms
BNB step: 0.216ms
Torch step (eager): 0.338ms
BNB step: 0.215ms
Torch step (eager): 0.333ms
BNB step: 0.226ms
Torch step (eager): 0.349ms
BNB step: 0.235ms

The test platform is an Intel(R) Data Center GPU Max 1550. For the test script, see #1692. "Torch step (eager)" is the 32-bit optimizer from torch; "BNB step" is the bitsandbytes 32-bit optimizer from this PR.
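The actual test script lives in #1692 and is not reproduced here. As a rough illustration only, per-step timings like the ones above could be collected with a small harness along these lines. The names `time_step`, `sync`, `warmup`, and `iters` are hypothetical and do not come from the PR.

```python
import time
from typing import Callable, Optional


def time_step(
    step: Callable[[], None],
    sync: Optional[Callable[[], None]] = None,
    warmup: int = 3,
    iters: int = 10,
) -> float:
    """Return the mean wall-clock time of one `step()` call, in milliseconds.

    `sync` is an optional callable that blocks until the device has finished
    all queued work. It is an assumption of this sketch, not part of the PR.
    """
    # Warm-up calls absorb one-time costs such as kernel compilation.
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()

    start = time.perf_counter()
    for _ in range(iters):
        step()
    # Wait for asynchronous device work before stopping the clock.
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start

    return elapsed * 1000.0 / iters
```

On an accelerator, `step` would typically be something like `optimizer.step` and `sync` a device synchronization function (for XPU, presumably `torch.xpu.synchronize` on a PyTorch build with XPU support); without synchronization, asynchronous kernel launches can make the measured time misleadingly small.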

The performance gap between the torch.compile and Triton implementations is not significant, but the Triton implementation compiles faster, and #1692 was also implemented with Triton. This PR therefore submits the Triton version.
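To make the comparison easier to read, the five samples per configuration listed above can be averaged. The short script below is only a convenience sketch over the numbers reported in this PR; the dictionary layout and the `summarize` helper are illustrative and not part of the change.

```python
from statistics import mean

# Per-step times in milliseconds, copied from the benchmark listings above.
# Keys are (shape, BNB implementation); values are (torch eager, BNB) samples.
results = {
    ("4096*4096", "Pure Torch"): (
        [1.075, 1.058, 1.080, 1.069, 1.034],
        [0.516, 0.517, 0.527, 0.539, 0.526],
    ),
    ("4096*4096", "Pure Triton"): (
        [1.034, 1.054, 1.031, 1.047, 1.045],
        [0.524, 0.488, 0.526, 0.538, 0.489],
    ),
    ("1024*1024", "Pure Torch"): (
        [0.345, 0.354, 0.347, 0.358, 0.349],
        [0.335, 0.226, 0.227, 0.232, 0.225],
    ),
    ("1024*1024", "Pure Triton"): (
        [0.346, 0.337, 0.338, 0.333, 0.349],
        [0.226, 0.216, 0.215, 0.226, 0.235],
    ),
}


def summarize(data):
    """Map each configuration to (mean eager ms, mean BNB ms, eager/BNB ratio)."""
    summary = {}
    for key, (eager, bnb) in data.items():
        eager_mean = mean(eager)
        bnb_mean = mean(bnb)
        summary[key] = (eager_mean, bnb_mean, eager_mean / bnb_mean)
    return summary


summary = summarize(results)
for (shape, impl), (eager_mean, bnb_mean, ratio) in summary.items():
    print(f"{shape} {impl}: eager {eager_mean:.3f} ms, "
          f"BNB {bnb_mean:.3f} ms, ratio {ratio:.2f}x")
```

Note that the first BNB sample of the 1024*1024 Pure Torch run (0.335 ms) is noticeably higher than the other four; it may reflect a one-time warm-up cost, though the PR does not say so, and it raises that configuration's mean.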

Note: XPU does not currently support allocating memory buffers through a paging mechanism, so the tests that depend on it are skipped in tests/test_optim.py::test_optimizer32bit. This functionality will be developed in the future to support the full set of optimizer capabilities.

bitsandbytes/_ops.py

bitsandbytes/backends/triton/kernels_optim.py

bitsandbytes/functional.py


YangKai0616 added 2 commits July 16, 2025 09:22

Implemented 32bit optimizers in triton

84d40dd

Modify Comments

0350148

YangKai0616 changed the title ~~[XPU] Implemented 32bit optimizers in triton~~ [Draft][XPU] Implemented 32bit optimizers in triton Jul 16, 2025

Optimizing pure torch implementation

1669318

YangKai0616 changed the title ~~[Draft][XPU] Implemented 32bit optimizers in triton~~ [XPU] Implemented 32bit optimizers in triton Jul 17, 2025

YangKai0616 marked this pull request as ready for review July 17, 2025 10:55

jiqing-feng reviewed Jul 18, 2025


bitsandbytes/_ops.py (outdated, resolved)

bitsandbytes/backends/triton/kernels_optim.py (outdated, resolved)

bitsandbytes/functional.py (outdated, resolved)

YangKai0616 and others added 2 commits July 18, 2025 05:48

Restore the order of parameters and modify the position of pure pytorch implementation

d0a83d1

Restore files permissions

7fdb436
