Deep Dive: Accelerating models with better Attention layers
Companion videos: https://youtu.be/2TT384U4vQg
Julien Simon
https://www.linkedin.com/in/juliensimon
https://www.youtube.com/juliensimonfr
The author of this material is Julien Simon (https://www.linkedin.com/in/juliensimon) unless explicitly mentioned.
This material is shared under the CC BY-NC 4.0 license: https://creativecommons.org/licenses/by-nc/4.0/
You are free to share and adapt this material, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes. You may not apply any restriction on what the license permits.
New Attention layers • Faster Attention layers • Framework • Hardware features 🔥
Self-attention
• The self-attention mechanism is at the core of Transformer models
• "Attention is All You Need" https://arxiv.org/abs/1706.03762 (06/2017)
• Quadratic compute and memory complexity with respect to the input sequence length
• Inference with long sequences (e.g. RAG applications) becomes very expensive
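To make the quadratic cost concrete, here is a minimal single-head self-attention sketch in PyTorch (illustrative shapes only, not taken from the talk): the score matrix is (N, N), so compute and memory grow with the square of the sequence length.

```python
import torch

def naive_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over an (N, d) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # three (N, d) projections
    scores = q @ k.T / (k.shape[-1] ** 0.5)        # (N, N): quadratic in N
    weights = torch.softmax(scores, dim=-1)        # (N, N) attention matrix
    return weights @ v                             # (N, d) output

N, d = 1024, 64
x = torch.randn(N, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = naive_self_attention(x, w_q, w_k, w_v)
print(out.shape)   # torch.Size([1024, 64]); doubling N quadruples the score matrix
```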
Multi-Head Attention (MHA)
• N: sequence length, d: embedding dimension, h: number of heads
• Q, K, V and the intermediate dot-product results (along with the KV cache) are stored in High Bandwidth Memory (HBM)
• The number of HBM accesses is quadratic with respect to the sequence length
• Memory becomes a bottleneck
[Figure: multi-head attention — each head i computes its own Qi, Ki, Vi; each head sees the full input sequence, but only a subset of the embedding dimensions (d/h)]
MHA in BERT: https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py
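A shape-level sketch of the head split in PyTorch (hypothetical sizes, not the BERT code linked above): each of the h heads attends over the full sequence but only over d/h embedding dimensions.

```python
import torch

N, d, h = 1024, 768, 12                  # sequence length, embedding dim, heads
head_dim = d // h                        # each head works on d/h = 64 dimensions

x = torch.randn(1, N, d)                 # (batch, N, d)
q, k, v = torch.nn.Linear(d, 3 * d)(x).chunk(3, dim=-1)

# reshape to (batch, h, N, d/h): every head attends over the full sequence
q = q.reshape(1, N, h, head_dim).transpose(1, 2)
k = k.reshape(1, N, h, head_dim).transpose(1, 2)
v = v.reshape(1, N, h, head_dim).transpose(1, 2)

scores = q @ k.transpose(-2, -1) / head_dim**0.5    # (1, h, N, N), one matrix per head
out = torch.softmax(scores, dim=-1) @ v              # (1, h, N, d/h)
out = out.transpose(1, 2).reshape(1, N, d)           # concatenate heads back to d
```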
Multi-Query Attention (MQA)
https://arxiv.org/abs/1911.02150 (06/2019)
• Implemented in Falcon 7B
• Much smaller KV cache (10-100x)
• Less pressure on memory
• 12x faster decoding during inference
• Reduced memory usage: batch size can be increased
• Small accuracy drop
• Models must be trained with MQA
• Tensor parallelism requires KV replication
[Figure: in multi-head attention, each attention head has its own keys Ki and values Vi; in multi-query attention, all heads share a single set of keys K and values V]
MQA in Falcon: https://github.com/huggingface/transformers/blob/main/src/transformers/models/falcon/modeling_falcon.py
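A shape-level sketch of the difference (hypothetical sizes): with a single shared key/value head, broadcasting serves all h query heads and the per-token KV cache shrinks by a factor of h.

```python
import torch

N, head_dim, h = 1024, 64, 12

q = torch.randn(1, h, N, head_dim)        # h query heads, as in MHA
k = torch.randn(1, 1, N, head_dim)        # one shared key head
v = torch.randn(1, 1, N, head_dim)        # one shared value head

# broadcasting over the head dimension lets every query head reuse the same K/V
scores = q @ k.transpose(-2, -1) / head_dim**0.5   # (1, h, N, N)
out = torch.softmax(scores, dim=-1) @ v             # (1, h, N, head_dim)

# KV cache per token: 2 * head_dim values instead of 2 * h * head_dim,
# i.e. h times smaller than MHA
```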
Grouped-Query Attention (GQA)
https://arxiv.org/abs/2305.13245v2 (05/2023)
• Implemented in Llama 2 and Mistral
• Groups of attention heads share the same set of keys and values
• Good compromise between speed and accuracy: almost as accurate as MHA, and almost as fast as MQA
• MHA models can be uptrained to GQA (the paper uptrains T5 XXL)
• Better fit for tensor parallelism
GQA in Llama: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py
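A shape-level sketch (hypothetical sizes): with h query heads and g key/value heads, each group of h/g query heads shares one KV head; g = h recovers MHA and g = 1 recovers MQA.

```python
import torch

N, head_dim = 1024, 64
h, g = 12, 4                               # 12 query heads, 4 key/value heads

q = torch.randn(1, h, N, head_dim)
k = torch.randn(1, g, N, head_dim)         # only g KV heads are cached
v = torch.randn(1, g, N, head_dim)

# expand each KV head to serve its group of h // g query heads
k = k.repeat_interleave(h // g, dim=1)     # (1, h, N, head_dim)
v = v.repeat_interleave(h // g, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim**0.5   # (1, h, N, N)
out = torch.softmax(scores, dim=-1) @ v             # (1, h, N, head_dim)
```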
Sliding Window Attention (SWA)
Longformer https://arxiv.org/abs/2004.05150 (04/2020), Mistral https://arxiv.org/abs/2310.06825 (10/2023)
• SWA limits attention to a fixed window (4,096 tokens in Mistral 7B)
• A token can only see window_size tokens from the previous layer (Mistral 7B has 32 layers)
• Maximum theoretical context size = window_size * n_layers (about 131K tokens)
• Attention complexity is reduced from quadratic to linear
SWA in Mistral: https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py
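A minimal sketch of a causal sliding-window mask (toy sizes, not Mistral's implementation): each position can only attend to the window_size most recent positions, so each row of the score matrix has at most window_size non-masked entries.

```python
import torch

def sliding_window_mask(n, window_size):
    """Boolean mask: True where a query position may attend to a key position."""
    i = torch.arange(n).unsqueeze(1)           # query positions, column vector
    j = torch.arange(n).unsqueeze(0)           # key positions, row vector
    return (j <= i) & (j > i - window_size)    # causal and within the window

mask = sliding_window_mask(n=8, window_size=3)
print(mask.int())
# each row has at most window_size ones, so masked attention costs
# O(n * window_size) instead of O(n^2)
```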
New Attention layers • Faster Attention layers • Framework • Hardware features 🔥
Flash Attention
https://arxiv.org/abs/2205.14135 (05/2022)
• Avoid reading and writing the attention matrix from and to HBM
• Load Q and K from HBM once
• Multiply Q and K, keep the score matrix S in SRAM
• Compute the softmax P incrementally in SRAM (tiling)
• Write only the final output back to HBM
• Parallelize over batch size and number of heads
• N: sequence length, d: embedding dimension, M: size of SRAM (d ≤ M ≤ Nd)
• Flash Attention requires O(N²d²/M) HBM accesses
• With M = N, that is O(Nd²) HBM accesses
• Memory complexity is now linear: 2-4x faster, 10-20x memory savings
• Both the forward and backward passes are optimized to accelerate training
Flash Attention is available in Hugging Face TGI
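As an illustration of how a fused kernel is typically invoked, PyTorch's torch.nn.functional.scaled_dot_product_attention can dispatch to a Flash Attention backend on supported GPUs (a minimal sketch, not TGI's code; tensor sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)

# Fused attention: the (N, N) score matrix is never materialized in HBM.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)   # torch.Size([1, 12, 4096, 64])
```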
Flash Attention 2
https://arxiv.org/abs/2307.08691 (07/2023)
• Reduce the number of non-matmul operations to maximize GPU throughput
• Optimize operations for Multi-Query Attention and Grouped-Query Attention
• Increase parallelism (across the sequence length)
• Optimize both prompt processing (aka prefill) and text generation
• 2x faster than Flash Attention, up to 9x faster than standard attention
Flash Attention 2 is available in Hugging Face TGI
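In Hugging Face transformers, recent versions can load models with the Flash Attention 2 kernels when the flash-attn package is installed (a hedged sketch; the model name is just an example and a compatible GPU is required):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"   # example; any model with FA2 support
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                 # FA2 requires fp16 or bf16
    attn_implementation="flash_attention_2",   # errors out if flash-attn is not installed
    device_map="auto",
)

inputs = tokenizer("Attention layers are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```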
Paged Attention
https://arxiv.org/abs/2309.06180 (09/2023)
• The KV cache grows and shrinks dynamically for each inference request
• GPU memory fragmentation wastes memory and makes it difficult to increase the batch size
• Paged Attention divides the KV cache into fixed-size, memory-aligned blocks (pages), similar to virtual memory pages in operating systems
• Allocating pages on demand reduces internal and external memory fragmentation
• Implemented in the vLLM project: https://github.com/vllm-project/vllm
Paged Attention is available in Hugging Face TGI
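vLLM's high-level API manages the paged KV cache transparently (a minimal sketch; the model name is just an example):

```python
from vllm import LLM, SamplingParams

# vLLM allocates the KV cache in fixed-size blocks (pages) behind the scenes,
# so many concurrent requests can share GPU memory with little fragmentation.
llm = LLM(model="mistralai/Mistral-7B-v0.1")    # example model
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(
    ["Paged attention improves serving because", "The KV cache is"],
    params,
)
for output in outputs:
    print(output.outputs[0].text)
```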