SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

Zhang, Jintao; Wei, Jia; Zhang, Pengle; Xu, Xiaoming; Huang, Haofeng; Wang, Haoxu; Jiang, Kai; Zhu, Jun; Chen, Jianfei

arXiv:2505.11594 (cs)

[Submitted on 16 May 2025]

Abstract: The efficiency of attention is important because of its quadratic time complexity. We improve the efficiency of attention through two key contributions. First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer the application of low-bit attention to training tasks. Existing low-bit attention works such as FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at this https URL.
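
To make the idea of microscaling FP4 attention concrete, here is a minimal pure-Python sketch of block-wise "fake" quantization applied to the query and key vectors before the attention scores are computed. It is an illustration under stated assumptions, not the authors' kernel: the E2M1 value grid, the block size of 16, the full-precision per-block scale, and the function names (quantize_block, fake_quantize, attention_weights) are all chosen here for illustration and are not given in the abstract.

```python
import math
import random

# Magnitudes representable by an E2M1 FP4 value (sign handled separately).
# Assumption: the abstract does not state which FP4 encoding is used.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_GRID[-1]


def quantize_block(block):
    """Quantize one block to signed FP4 grid points with one shared scale.

    Returns (scale, values); the dequantized block is [v * scale for v in values].
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Assumption: the scale is kept in full precision in this sketch.
    scale = amax / FP4_MAX
    values = []
    for x in block:
        target = abs(x) / scale
        nearest = min(FP4_GRID, key=lambda g: abs(g - target))
        values.append(math.copysign(nearest, x))
    return scale, values


def fake_quantize(vec, block_size=16):
    """Quantize then dequantize a vector, one block at a time."""
    out = []
    for start in range(0, len(vec), block_size):
        scale, values = quantize_block(vec[start:start + block_size])
        out.extend(v * scale for v in values)
    return out


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, keys, quantized=False, block_size=16):
    """softmax(q . k / sqrt(d)) for one query against a list of keys."""
    d = len(q)
    if quantized:
        q = fake_quantize(q, block_size)
        keys = [fake_quantize(k, block_size) for k in keys]
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    return softmax(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    d, n = 64, 8
    q = [rng.gauss(0, 1) for _ in range(d)]
    keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    ref = attention_weights(q, keys)
    approx = attention_weights(q, keys, quantized=True)
    # Largest deviation between full-precision and fake-quantized weights.
    print(max(abs(a - b) for a, b in zip(ref, approx)))
```

The sketch only simulates the numerical effect of low-bit quantization in software; any speedup of the kind reported in the abstract depends on hardware FP4 matrix multiplication, which this code does not model.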

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Cite as:	arXiv:2505.11594 [cs.LG]
	(or arXiv:2505.11594v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2505.11594

Submission history

From: Jintao Zhang
[v1] Fri, 16 May 2025 18:01:54 UTC (15,296 KB)
