[XPU] Implemented 32bit optimizers in triton #1710
Open
+700
−2
Depends on #1692.
Implemented 32-bit optimizers in Triton for use on XPU devices.
The PR includes two implementations: a pure Torch implementation and a pure Triton implementation.
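For context, both implementations compute the same elementwise optimizer update; they differ only in how the loop over tensor elements is dispatched on the device. Below is a minimal sketch of a 32-bit Adam step for a single scalar, using illustrative names that are not the PR's actual code:

```python
import math

def adam_step_32bit(p, g, m, v, step,
                    lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One 32-bit Adam update for a single parameter value.

    This is an illustrative sketch of the elementwise math; the PR's
    Torch and Triton kernels vectorize an update of this kind over
    whole tensors on the XPU.
    """
    m = beta1 * m + (1.0 - beta1) * g        # first-moment EMA
    v = beta2 * v + (1.0 - beta2) * g * g    # second-moment EMA
    m_hat = m / (1.0 - beta1 ** step)        # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

A single step with a positive gradient nudges the parameter downward by roughly the learning rate, since the bias-corrected moments cancel the gradient's magnitude on the first step.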
For benchmarks on 4096×4096 shapes, the results are as follows (benchmark figures):
- Pure Torch implementation
- Pure Triton implementation

For benchmarks on 1024×1024 shapes, the results are as follows (benchmark figures):
- Pure Torch implementation
- Pure Triton implementation
The test platform is an Intel(R) Data Center GPU Max 1550, and the test script follows the one referenced in #1692. "Torch (eager)" is the 32-bit optimizer from torch; "BNB" is the bitsandbytes 32-bit optimizer.
Considering that the performance gap between the torch.compile and Triton implementations is not significant, that the Triton implementation compiles faster, and that #1692 was implemented with Triton, this PR adopts the Triton version.
Note: XPU does not currently support allocating memory buffers via a paging mechanism, so the affected tests in
`tests/test_optim.py::test_optimizer32bit`
are skipped. This functionality will be developed in the future to support the full set of optimizer capabilities.
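One common way to express such conditional skips is a `pytest.mark.skipif` guard. The sketch below is not the PR's actual test code; the capability-probe helper is hypothetical and hard-coded to reflect the current lack of paged-buffer support on XPU:

```python
import pytest

def xpu_supports_paged_buffers() -> bool:
    # Hypothetical capability probe: paged memory buffer allocation is
    # not yet available on XPU, so this currently always reports False.
    return False

@pytest.mark.skipif(
    not xpu_supports_paged_buffers(),
    reason="XPU does not yet support paged memory buffer allocation",
)
def test_optimizer32bit_paged():
    # Placeholder body: the real test exercises the paged optimizer path.
    pass
```

Once paged allocation lands, flipping the probe to a real device query would re-enable the tests without touching their bodies.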