Skip to content

[XPU] Implemented 32bit optimizers in triton #1710

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

YangKai0616
Copy link

@YangKai0616 YangKai0616 commented Jul 16, 2025

Depends on #1692.

Implemented 32bit optimizers in triton to use of XPU devices.

The PR includes two implementations:

  1. Pure Torch implementation: utilizing torch.compile
  2. Pure Triton implementation: utilizing triton.jit

For the benchmarking on 4096*4096 shapes, the results are as follows:

Pure Torch implementation:

Torch step (eager): 1.075ms
BNB step: 0.516ms
Torch step (eager): 1.058ms
BNB step: 0.517ms
Torch step (eager): 1.080ms
BNB step: 0.527ms
Torch step (eager): 1.069ms
BNB step: 0.539ms
Torch step (eager): 1.034ms
BNB step: 0.526ms

Pure Triton implementation:

Torch step (eager): 1.034ms
BNB step: 0.524ms
Torch step (eager): 1.054ms
BNB step: 0.488ms
Torch step (eager): 1.031ms
BNB step: 0.526ms
Torch step (eager): 1.047ms
BNB step: 0.538ms
Torch step (eager): 1.045ms
BNB step: 0.489ms

For the benchmarking on 1024*1024 shapes, the results are as follows:
Pure Torch implementation:

Torch step (eager): 0.345ms
BNB step: 0.335ms
Torch step (eager): 0.354ms
BNB step: 0.226ms
Torch step (eager): 0.347ms
BNB step: 0.227ms
Torch step (eager): 0.358ms
BNB step: 0.232ms
Torch step (eager): 0.349ms
BNB step: 0.225ms

Pure Triton implementation:

Torch step (eager): 0.346ms
BNB step: 0.226ms
Torch step (eager): 0.337ms
BNB step: 0.216ms
Torch step (eager): 0.338ms
BNB step: 0.215ms
Torch step (eager): 0.333ms
BNB step: 0.226ms
Torch step (eager): 0.349ms
BNB step: 0.235ms

The test platform is Intel(R) Data Center GPU Max 1550. Test script reference #1692. Torch(eager) is 32bit optimizer from torch, BNB is 32bit optimizer.

Considering that the performance gap between torch.compile and Triton implementations is not significant, but triton's implementation compiles faster, and #1692 was implemented with Triton, this PR adopts the Triton version for submission.

Note:Currently, XPU does not support the allocation of memory buffers using a paging mechanism. Therefore, these tests are skipped in tests/test_optim.py::test_optimizer32bit. This functionality will be developed in the future to support full optimizer capabilities.

@YangKai0616 YangKai0616 changed the title [XPU] Implemented 32bit optimizers in triton [Draft][XPU] Implemented 32bit optimizers in triton Jul 16, 2025
@YangKai0616 YangKai0616 changed the title [Draft][XPU] Implemented 32bit optimizers in triton [XPU] Implemented 32bit optimizers in triton Jul 17, 2025
@YangKai0616 YangKai0616 marked this pull request as ready for review July 17, 2025 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants