Deep Dive: Accelerating models with better Attention layers
Companion videos: https://youtu.be/2TT384U4vQg
Julien Simon
https://www.linkedin.com/in/juliensimon
https://www.youtube.com/juliensimonfr
The author of this material is Julien Simon (https://www.linkedin.com/in/juliensimon) unless explicitly mentioned.
This material is shared under the CC BY-NC 4.0 license: https://creativecommons.org/licenses/by-nc/4.0/
You are free to share and adapt this material, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes. You may not apply any restriction on what the license permits.
New Attention layers • Faster Attention layers • Framework • Hardware features 🔥
Self-attention
• The self-attention mechanism is at the core of Transformer models
• "Attention is All You Need" https://arxiv.org/abs/1706.03762 (06/2017)
• Quadratic compute and memory complexity with respect to the input sequence length
• Inference with long sequences (e.g. RAG applications) becomes very expensive
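To make the quadratic cost concrete, here is a minimal single-head self-attention sketch in PyTorch (illustrative shapes only, not taken from the talk): the score matrix is (N, N), so compute and memory grow with the square of the sequence length.

```python
import torch

def naive_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over an (N, d) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # three (N, d) projections
    scores = q @ k.T / (k.shape[-1] ** 0.5)        # (N, N): quadratic in N
    weights = torch.softmax(scores, dim=-1)        # (N, N) attention matrix
    return weights @ v                             # (N, d) output

N, d = 1024, 64
x = torch.randn(N, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = naive_self_attention(x, w_q, w_k, w_v)
print(out.shape)   # torch.Size([1024, 64]); doubling N quadruples the score matrix
```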
Multi-Head Attention (MHA)
• N: sequence length, d: embedding dimension, h: number of heads
• Q, K, V and the intermediate dot-product results (along with the KV cache) are stored in High Bandwidth Memory (HBM)
• The number of HBM accesses is quadratic with respect to the sequence length
• Memory becomes a bottleneck
[Figure: multi-head attention — each head i computes its own Qi, Ki, Vi; each head sees the full input sequence, but only a subset of the embedding dimensions (d/h)]
MHA in BERT: https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py
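A shape-level sketch of the head split in PyTorch (hypothetical sizes, not the BERT code linked above): each of the h heads attends over the full sequence but only over d/h embedding dimensions.

```python
import torch

N, d, h = 1024, 768, 12                  # sequence length, embedding dim, heads
head_dim = d // h                        # each head works on d/h = 64 dimensions

x = torch.randn(1, N, d)                 # (batch, N, d)
q, k, v = torch.nn.Linear(d, 3 * d)(x).chunk(3, dim=-1)

# reshape to (batch, h, N, d/h): every head attends over the full sequence
q = q.reshape(1, N, h, head_dim).transpose(1, 2)
k = k.reshape(1, N, h, head_dim).transpose(1, 2)
v = v.reshape(1, N, h, head_dim).transpose(1, 2)

scores = q @ k.transpose(-2, -1) / head_dim**0.5    # (1, h, N, N), one matrix per head
out = torch.softmax(scores, dim=-1) @ v              # (1, h, N, d/h)
out = out.transpose(1, 2).reshape(1, N, d)           # concatenate heads back to d
```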
Multi-Query Attention (MQA)
https://arxiv.org/abs/1911.02150 (06/2019)
• Implemented in Falcon 7B
• Much smaller KV cache (10-100x)
• Less pressure on memory
• 12x faster decoding during inference
• Reduced memory usage: batch size can be increased
• Small accuracy drop
• Models must be trained with MQA
• Tensor parallelism requires KV replication
[Figure: in multi-head attention, each attention head has its own keys Ki and values Vi; in multi-query attention, all heads share a single set of keys K and values V]
MQA in Falcon: https://github.com/huggingface/transformers/blob/main/src/transformers/models/falcon/modeling_falcon.py
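A shape-level sketch of the difference (hypothetical sizes): with a single shared key/value head, broadcasting serves all h query heads and the per-token KV cache shrinks by a factor of h.

```python
import torch

N, head_dim, h = 1024, 64, 12

q = torch.randn(1, h, N, head_dim)        # h query heads, as in MHA
k = torch.randn(1, 1, N, head_dim)        # one shared key head
v = torch.randn(1, 1, N, head_dim)        # one shared value head

# broadcasting over the head dimension lets every query head reuse the same K/V
scores = q @ k.transpose(-2, -1) / head_dim**0.5   # (1, h, N, N)
out = torch.softmax(scores, dim=-1) @ v             # (1, h, N, head_dim)

# KV cache per token: 2 * head_dim values instead of 2 * h * head_dim,
# i.e. h times smaller than MHA
```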
Grouped-Query Attention (GQA)
https://arxiv.org/abs/2305.13245v2 (05/2023)
• Implemented in Llama 2 and Mistral
• Groups of attention heads share the same set of keys and values
• Good compromise between speed and accuracy: almost as accurate as MHA, and almost as fast as MQA
• MHA models can be uptrained to GQA (the paper uptrains T5 XXL)
• Better fit for tensor parallelism
GQA in Llama: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py
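A shape-level sketch (hypothetical sizes): with h query heads and g key/value heads, each group of h/g query heads shares one KV head; g = h recovers MHA and g = 1 recovers MQA.

```python
import torch

N, head_dim = 1024, 64
h, g = 12, 4                               # 12 query heads, 4 key/value heads

q = torch.randn(1, h, N, head_dim)
k = torch.randn(1, g, N, head_dim)         # only g KV heads are cached
v = torch.randn(1, g, N, head_dim)

# expand each KV head to serve its group of h // g query heads
k = k.repeat_interleave(h // g, dim=1)     # (1, h, N, head_dim)
v = v.repeat_interleave(h // g, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim**0.5   # (1, h, N, N)
out = torch.softmax(scores, dim=-1) @ v             # (1, h, N, head_dim)
```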
Sliding Window Attention (SWA)
Longformer https://arxiv.org/abs/2004.05150 (04/2020), Mistral https://arxiv.org/abs/2310.06825 (10/2023)
• SWA limits attention to a fixed window (4,096 tokens in Mistral 7B)
• A token can only see window_size tokens from the previous layer (Mistral 7B has 32 layers)
• Maximum theoretical context size = window_size * n_layers (about 131K tokens)
• Attention complexity is reduced from quadratic to linear
SWA in Mistral: https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py
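A minimal sketch of a causal sliding-window mask (toy sizes, not Mistral's implementation): each position can only attend to the window_size most recent positions, so each row of the score matrix has at most window_size non-masked entries.

```python
import torch

def sliding_window_mask(n, window_size):
    """Boolean mask: True where a query position may attend to a key position."""
    i = torch.arange(n).unsqueeze(1)           # query positions, column vector
    j = torch.arange(n).unsqueeze(0)           # key positions, row vector
    return (j <= i) & (j > i - window_size)    # causal and within the window

mask = sliding_window_mask(n=8, window_size=3)
print(mask.int())
# each row has at most window_size ones, so masked attention costs
# O(n * window_size) instead of O(n^2)
```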
New Attention layers • Faster Attention layers • Framework • Hardware features 🔥
Flash Attention
https://arxiv.org/abs/2205.14135 (05/2022)
• Avoid reading and writing the attention matrix from and to HBM
• Load Q and K from HBM once
• Multiply Q and K, keep the score matrix S in SRAM
• Compute the softmax P incrementally in SRAM (tiling)
• Write only the final output back to HBM
• Parallelize over batch size and number of heads
• N: sequence length, d: embedding dimension, M: size of SRAM (d ≤ M ≤ Nd)
• Flash Attention requires O(N²d²/M) HBM accesses
• With M = N, that is O(Nd²) HBM accesses
• Memory complexity is now linear: 2-4x faster, 10-20x memory savings
• Both the forward and backward passes are optimized to accelerate training
Flash Attention is available in Hugging Face TGI
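As an illustration of how a fused kernel is typically invoked, PyTorch's torch.nn.functional.scaled_dot_product_attention can dispatch to a Flash Attention backend on supported GPUs (a minimal sketch, not TGI's code; tensor sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)

# Fused attention: the (N, N) score matrix is never materialized in HBM.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)   # torch.Size([1, 12, 4096, 64])
```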
Flash Attention 2
https://arxiv.org/abs/2307.08691 (07/2023)
• Reduce the number of non-matmul operations to maximize GPU throughput
• Optimize operations for Multi-Query Attention and Grouped-Query Attention
• Increase parallelism (across the sequence length)
• Optimize both prompt processing (aka prefill) and text generation
• 2x faster than Flash Attention, up to 9x faster than standard attention
Flash Attention 2 is available in Hugging Face TGI
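In Hugging Face transformers, recent versions can load models with the Flash Attention 2 kernels when the flash-attn package is installed (a hedged sketch; the model name is just an example and a compatible GPU is required):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"   # example; any model with FA2 support
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                 # FA2 requires fp16 or bf16
    attn_implementation="flash_attention_2",   # errors out if flash-attn is not installed
    device_map="auto",
)

inputs = tokenizer("Attention layers are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```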
Paged Attention
https://arxiv.org/abs/2309.06180 (09/2023)
• The KV cache grows and shrinks dynamically for each inference request
• GPU memory fragmentation wastes memory and makes it difficult to increase the batch size
• Paged Attention divides the KV cache into fixed-size, memory-aligned blocks (pages), similar to virtual memory pages in operating systems
• Allocating pages on demand reduces internal and external memory fragmentation
• Implemented in the vLLM project: https://github.com/vllm-project/vllm
Paged Attention is available in Hugging Face TGI
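vLLM's high-level API manages the paged KV cache transparently (a minimal sketch; the model name is just an example):

```python
from vllm import LLM, SamplingParams

# vLLM allocates the KV cache in fixed-size blocks (pages) behind the scenes,
# so many concurrent requests can share GPU memory with little fragmentation.
llm = LLM(model="mistralai/Mistral-7B-v0.1")    # example model
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(
    ["Paged attention improves serving because", "The KV cache is"],
    params,
)
for output in outputs:
    print(output.outputs[0].text)
```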