Architecture Terms of LLMs
Terms from “The Big LLM Architecture Comparison”
https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison
Terms Defined
A. KV Cache
B. Rotary Positional Embeddings (RoPE)
C. Efficiency-focused Attention Variants (GQA or MLA)
D. SwiGLU Activations
E. Normalizations (RMSNorm, QK-Norm, Norm Placement)
A. What is a KV Cache?
The Core Idea in One Sentence
The KV Cache is a performance optimization technique used in Transformer-based models (like
GPT, LLaMA, etc.) that drastically speeds up the process of generating text sequentially (one
token at a time) by caching the results of intermediate calculations, thus avoiding redundant
computation.
1. The Problem: Why Do We Need a KV Cache?
To understand the solution, we must first understand the problem.
Imagine you're using a model like ChatGPT to generate a response. The model generates text
one word (or token) at a time. Let's say it's generating the sentence: "The quick brown fox".
1. Step 1: You input "The". The model runs a full forward pass and predicts the next token:
"quick".
2. Step 2: You now input "The quick". The model runs another full forward pass to predict
"brown".
3. Step 3: You input "The quick brown". The model runs another full forward pass to
predict "fox".
Do you see the inefficiency? In Step 2, the model is re-processing the token "The" all over again,
even though it was already processed in Step 1. In Step 3, it's re-processing both "The" and
"quick". This redundancy becomes massively expensive as the sequence gets longer, because the
computation for each new token would scale quadratically with sequence length due to the self-
attention mechanism.
This is incredibly slow and wasteful.
2. The Solution: How the KV Cache Works
The key insight is that during text generation, most of the computation for the previous tokens
is repeated and can be reused.
In the Transformer's self-attention mechanism, each token is projected into three vectors:
• Query (Q): Asks "what should I pay attention to?"
• Key (K): Answers "this is what I contain."
• Value (V): Answers "this is the information I will provide if you attend to me."
Attention is calculated as a weighted sum of the Values, where the weights are determined by the
compatibility between the Query of the current token and the Keys of all previous tokens.
The KV Cache saves the computed Key and Value vectors for all previous tokens.
Let's replay our example with a KV Cache:
1. Step 1: Input is "The".
◦ The model computes the K and V vectors for "The".
◦ It uses the Q vector for "The" to predict the next token: "quick".
◦ It stores the K, V for "The" in the cache.
2. Step 2: Input is now only the new token "quick". (We already have "The" in the cache).
◦ The model computes the K and V vectors for "quick".
◦ To calculate attention, it retrieves the cached K, V for "The" and uses the new
K, V for "quick".
◦ It uses the Q for "quick" with all the Keys (from "The" and "quick") to predict
"brown".
◦ It appends the K, V for "quick" to the cache. The cache now holds K, V for
["The", "quick"].
3. Step 3: Input is only the new token "brown".
◦ The model computes the K and V for "brown".
◦ It retrieves the entire cached K, V for ["The", "quick"].
◦ It uses the Q for "brown" with all the cached Keys to predict "fox".
◦ It appends the K, V for "brown" to the cache.
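To make the replay concrete, here is a minimal sketch of the caching loop in plain NumPy: a single attention head with random, untrained weights (W_q, W_k, W_v are stand-ins, not any real model's parameters). It only illustrates the bookkeeping; real frameworks batch this and run it per layer and per head.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy embedding / head dimension
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Attention of one query against all cached keys/values (single head)."""
    scores = K @ q / np.sqrt(d)          # one score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax
    return weights @ V                   # weighted sum of the cached values

# Stand-ins for the embeddings of "The", "quick", "brown", "fox".
token_embeddings = rng.normal(size=(4, d))

K_cache, V_cache = [], []                # the KV cache
for step, x in enumerate(token_embeddings, start=1):
    q = W_q @ x                          # Q is only needed for the current token
    K_cache.append(W_k @ x)              # compute K, V once ...
    V_cache.append(W_v @ x)              # ... and keep them for later steps
    out = attend(q, np.stack(K_cache), np.stack(V_cache))
    # `out` would feed the rest of the layer / next-token prediction
    print(f"step {step}: attended over {len(K_cache)} cached positions")
```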
3. The Benefits
• Massive Speedup: The model no longer processes the entire sequence from scratch for
every new token. It only computes the Q, K, V for the new token and re-uses the cached
K, V for all previous tokens. This reduces the computational complexity for each
decoding step.
• Reduced Latency: The response time for each new token becomes much faster, leading
to a much more responsive user experience in applications like chatbots.
4. The Trade-offs
• Memory Overhead: The primary cost of the KV Cache is memory. For every new token
generated, you are storing two new vectors (K and V) for each layer and each attention
head in the model.
◦ Memory Usage ≈ 2 (for K and V) × Batch Size × Number of Layers × Number of KV Heads × Sequence Length × Head Dimension × bytes per value (see the worked example after this list).
◦ For long sequences (e.g., 32k tokens), this cache can consume several gigabytes
of VRAM. This is a key constraint and a major focus of optimization research
(like Multi-Query Attention or Grouped-Query Attention, which reduce the
size of the KV cache).
• Cache Management: In complex sampling techniques or when handling multiple
parallel sequences (beams in beam search), managing the cache correctly (e.g., rolling,
shifting, or pruning it) becomes a non-trivial engineering task.
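As a rough illustration of the memory bullet above, the arithmetic below plugs hypothetical (roughly 7B-class, but not matching any exact checkpoint) numbers into the formula; the shapes and byte sizes are assumptions for the sake of the example.

```python
# Back-of-the-envelope KV cache size for a hypothetical 7B-class model.
batch_size   = 1
n_layers     = 32
n_kv_heads   = 32        # with GQA this might drop to e.g. 8
head_dim     = 128
seq_len      = 32_000    # 32k-token context
bytes_per_el = 2         # fp16 / bf16

cache_bytes = (2          # one K and one V vector per token
               * batch_size * n_layers * n_kv_heads
               * seq_len * head_dim * bytes_per_el)
print(f"KV cache: {cache_bytes / 1e9:.1f} GB")   # ≈ 16.8 GB
```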
Analogy
Think of it like solving a long multiplication problem on paper.
• Without KV Cache: Every time you add a new digit, you re-calculate the entire
multiplication from the beginning.
• With KV Cache: You write down the intermediate products (this is the "cache"). When
you add a new digit, you only calculate the new interactions and then add them to your
previously saved intermediate products. This is far more efficient.
Summary
In essence, the KV Cache is what makes modern LLMs like GPT-4 feasible for interactive
applications, turning an intractably slow process into a fast and responsive one, at the cost of
increased memory usage.
KV Cache at a glance (without vs. with the cache):
• Computation: re-processes the entire sequence for every new token vs. processes only the new token and re-uses cached K, V for previous tokens.
• Speed: very slow generation vs. very fast generation.
• Memory: lower memory overhead per step vs. high memory overhead that grows with sequence length.
• Use case: training or single forward passes vs. autoregressive text generation (e.g., chatbots, code completion).

B. What are rotary positional embeddings (RoPE)?
Let's break down Rotary Positional Embeddings (RoPE), a brilliant and widely used technique for giving Transformer models a sense of the order or position of tokens in a sequence.
The Core Problem: "Where is the word?"
Transformers process all tokens in a sequence simultaneously (unlike RNNs, which process them one by one). This is great for speed but means the model has no innate idea of word order. The sentences:
• "The cat sat on the mat."
• "The mat sat on the cat."
would look identical to a vanilla Transformer without positional information. We need to explicitly tell the model about the position of each token.
Rotary Positional Embedding (RoPE) is a highly effective way to do this, used in state-of-the-art models like LLaMA, GPT-NeoX, and PaLM.
1. The Intuition: "Weave" Position into the Embedding
Instead of adding a positional vector to the word embedding (the classic approach), RoPE's key
idea is to rotate the embedding vector itself.
Think of it like this:
• Each token's embedding is a vector in a high-dimensional space.
• Its position in the sequence determines a specific angle of rotation.
• The further along the token is, the more its embedding vector gets "twisted" or rotated.
This rotation is done in a way that the dot product between two rotated vectors inherently
contains information about the relative distance between them. This is crucial because the self-
attention mechanism relies heavily on dot products (between Query and Key vectors) to
determine attention scores.
2. How It Works (The Simple Version)
Let's simplify the math. Imagine we have a 2-dimensional embedding vector for a word: [x, y].
1. Represent it as a Complex Number: We can think of this 2D vector as a complex
number x + yi, where i is the imaginary unit.
2. Apply a Rotation: To rotate this vector by an angle θ (which corresponds to its position
m), we multiply it by the complex exponential e^(i*m*θ).
◦ The formula for rotating a complex number (x + yi) by angle θ is:
▪ x' = x*cos(mθ) - y*sin(mθ)
▪ y' = x*sin(mθ) + y*cos(mθ)
3. Result: The new vector [x', y'] is the original vector, but rotated. This rotated vector now represents the word at that specific position.
In reality, RoPE does this not just on a single 2D plane, but across multiple, consecutive 2D
slices of the high-dimensional embedding vector. Each pair of elements in the embedding vector
is treated as a 2D plane and rotated by a fraction of the position-based angle.
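A small sketch of that pairwise rotation, assuming the common convention of rotating adjacent (even, odd) dimension pairs with the usual base-10000 frequency schedule (some implementations instead pair the first half of the dimensions with the second half):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive (even, odd) pairs of `x` by position-dependent angles.

    Pair i is rotated by pos * theta_i, with theta_i = base**(-2i/d),
    mirroring the usual RoPE frequency schedule.
    """
    d = x.shape[-1]
    assert d % 2 == 0
    half = np.arange(d // 2)
    theta = base ** (-2.0 * half / d)          # one frequency per 2D pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                  # split into the 2D pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # standard 2D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.arange(8, dtype=float)
print(rope(q, pos=0))   # position 0: unchanged
print(rope(q, pos=5))   # position 5: each pair rotated by its own angle
```

At position 0 the vector is unchanged; at larger positions each pair is rotated by its own angle, with the low-index pairs spinning fastest.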
3. The "Killer Feature": Relative Position Encoding in Attention
This is why RoPE is so powerful. Let's see what happens in the attention mechanism.
• The Query vector q for a token at position m gets rotated by angle mθ.
• The Key vector k for a token at position n gets rotated by angle nθ.
When we calculate the attention score (the dot product q • k), it becomes:
Rotate(q, mθ) • Rotate(k, nθ)
Due to the properties of rotation, this dot product is mathematically identical to the dot product of the original q with the original k rotated by the relative angle (n - m)θ (equivalently, q rotated by (m - n)θ with the original k).
In simple terms: Attention Score = f( q, k, m - n )
The attention score between a Query at position m and a Key at position n depends only on the
original content vectors (q, k) and their relative distance (m - n). The absolute positions m
and n themselves don't matter, only their difference.
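The relative-position property can be checked numerically with the complex-number view from section 2; the 2D q and k here are random toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
# Treat a 2D query/key as complex numbers, as in the derivation above.
q = complex(*rng.normal(size=2))
k = complex(*rng.normal(size=2))
theta = 0.3

def score(m: int, n: int) -> float:
    """Dot product of the rotated 2D vectors = Re( rot(q) * conj(rot(k)) )."""
    q_rot = q * np.exp(1j * m * theta)   # rotate q by m*theta
    k_rot = k * np.exp(1j * n * theta)   # rotate k by n*theta
    return (q_rot * np.conj(k_rot)).real

# Same relative offset (m - n = 2) at different absolute positions:
print(score(3, 1), score(10, 8), score(57, 55))   # all (numerically) equal
```

All three scores agree because only the difference m - n = 2 enters the result.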
4. Key Advantages of RoPE
1. Built-in Relative Encoding: As shown above, it naturally encodes relative position
information directly into the attention calculation, which models find very easy to use.
2. Better Extrapolation: Models using RoPE tend to generalize better to sequence lengths
longer than those they were trained on (though it's not perfect, it's better than absolute
positional embeddings). The rotational structure provides a natural way to "extend" the
positions.
3. Decoupling of Dimensions: Unlike some other methods, RoPE doesn't mix all
dimensions together additively. It acts on pairs of dimensions, which is a more structured
and efficient approach.
4. Stability: It avoids the potential numerical issues or saturation that can happen with other
positional encoding functions.
Analogy: A Clock Face
Imagine each word's embedding is a hand on a clock.
• The word "cat" might point to 3 o'clock.
• The word "mat" might point to 7 o'clock.
Without positions: "The cat sat on the mat" and "The mat sat on the cat" would have the hands
pointing in the same directions, making the sentences indistinguishable.
With RoPE: The position of the word determines how much we rotate the clock hand.
• The 1st word ("The") is rotated by 0°.
• The 2nd word ("cat") is rotated by 30°.
• The 3rd word ("sat") is rotated by 60°.
• ...and so on.
Now, when the model checks the relationship between "on" (at position 4) and "mat" (at position 6), it doesn't see their absolute rotated positions (90° and 150°). Instead, the dot product (attention score) intuitively captures that "mat" is 2 positions away from "on", corresponding to a relative rotation of 60°. This relative distance is what truly matters.
Summary
In essence, RoPE is an elegant and highly effective mathematical solution to the problem of
word order, and its properties make it a superior choice for training large, powerful language
models.
RoPE at a glance:
• Core idea: encode positional information by rotating the token's embedding vector based on its position.
• Mechanism: applies a rotary transformation (using sine and cosine) to pairs of elements in the Query and Key vectors.
• Key benefit: the attention score between a Query and a Key depends only on their content and their relative position (m - n).
• Used in: LLaMA, GPT-J, GPT-NeoX, and many other modern LLMs.

C. What are efficiency-focused attention variants (GQA or MLA)?
Let's dive into efficiency-focused attention variants, specifically GQA (Grouped-Query Attention) and MLA (Multi-head Latent Attention), which are crucial for making large language models faster and more memory-efficient.
The Context: The Scalability Problem with Standard Attention
First, recall the standard Multi-Head Attention (MHA) from the original Transformer:
• For each token, the model creates multiple sets of Query (Q), Key (K), and Value (V)
vectors (one set per "head").
• This allows the model to attend to information from different representation subspaces
simultaneously.
The Problem: In auto-regressive decoding (generating text token-by-token), we use the KV
Cache to avoid recalculating Keys and Values for previous tokens. In MHA, each head has its
own unique set of Keys and Values that need to be cached.
• Memory Bottleneck: The size of the KV Cache scales linearly with:
[Batch Size] * [Number of Layers] * [Number of Heads] *
[Sequence Length] * [Head Dimension]
• Memory Bandwidth Bottleneck: At every decoding step, the model must load the entire
KV Cache from memory to compute attention. This memory access (memory-bound
operation) becomes the primary bottleneck, not the actual computation.
GQA and MLA are engineered solutions to this specific problem.
1. Grouped-Query Attention (GQA)
GQA is a direct and practical middle ground between Multi-Head Attention (MHA) and Multi-Query Attention (MQA).
The Spectrum of Attention Variants
• Multi-Head Attention (MHA):
Num_Heads Queries, Num_Heads Keys, Num_Heads Values
Pros: High quality, each head can specialize.
Cons: Large, slow KV Cache.
• Multi-Query Attention (MQA):
Num_Heads Queries, 1 Key, 1 Value
All attention heads share a single set of Key and Value vectors.
Pros: Drastically reduces KV Cache size, very fast.
Cons: Can lead to a drop in model quality because the shared K/V cannot capture diverse
information.
• Grouped-Query Attention (GQA):
Num_Heads Queries, G Keys, G Values (where G is the number of groups, and G <
Num_Heads)
This is the middle ground. You group the heads into G groups. All heads within a group
share the same Key and Value vectors.
How GQA Works
1. Instead of having a separate K and V projection for each head, you have only G K and V
projections.
2. The Num_Heads Queries are divided into G groups.
3. All the Queries in a single group attend to the same, shared Key and Value set.
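A toy NumPy sketch of these three steps (random weights, causal masking omitted for brevity), showing how 8 query heads can share just 2 cached K/V groups:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, n_kv_groups, head_dim, seq_len = 8, 2, 4, 5   # toy sizes

# Per-head queries, but only one K/V set per *group* of heads.
Q = rng.normal(size=(n_heads,     seq_len, head_dim))
K = rng.normal(size=(n_kv_groups, seq_len, head_dim))
V = rng.normal(size=(n_kv_groups, seq_len, head_dim))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

heads_per_group = n_heads // n_kv_groups
outputs = []
for h in range(n_heads):
    g = h // heads_per_group                 # which shared K/V this head uses
    scores = Q[h] @ K[g].T / np.sqrt(head_dim)
    outputs.append(softmax(scores) @ V[g])
out = np.stack(outputs)                      # (n_heads, seq_len, head_dim)

# Only the 2 groups' K and V need caching -> 4x smaller than 8 full heads.
print(out.shape, "cached KV tensors:", K.shape, V.shape)
```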
Visual Analogy:
• MHA: A classroom where every student (Query) has their own private textbook (K) and
notebook (V).
• MQA: A classroom where every student shares one single textbook and one notebook.
• GQA: A classroom divided into study groups. Each group shares one textbook and one
notebook within the group.
Benefits of GQA
• Dramatically Reduced KV Cache: The KV Cache size is reduced by a factor of
Num_Heads / G. For example, going from 32 heads to 8 groups (G=8) reduces the
cache size by 4x.
• Improved Inference Speed: Much less data needs to be loaded from memory per
decoding step, alleviating the memory bandwidth bottleneck.
• Quality Preservation: It performs much closer to MHA than MQA does, because the
sharing is less extreme. Models can still maintain a degree of specialized attention.
Usage: GQA is used in LLaMA 2 & 3 and many other state-of-the-art production models
because it offers an excellent trade-off.
2. Multi-head Latent Attention (MLA)
MLA is a more complex and recent technique that takes a different approach. It was introduced in DeepSeek-V2 (and carried forward into DeepSeek-V3) and focuses on compressing the KV Cache.
The Core Idea of MLA
Instead of caching the full [Sequence Length] sized KV states for all layers and heads,
MLA learns to project (or compress) these states into a much smaller, latent space. It then only
caches this compressed representation.
How MLA Works (Simplified)
1. KV Compression: For each incoming token, instead of computing and storing full per-head Key (K) and Value (V) vectors, the model applies a learned down-projection that squeezes the token's KV information into a single low-dimensional latent vector (shared across heads).
2. Caching the Latent State: Only this compact per-token latent vector is written to the KV Cache, which is far smaller than the full set of per-head K and V states.
3. Reconstruction at Attention Time: When attention runs, learned up-projections expand the cached latents back into (approximate) per-head Keys and Values; in practice these up-projections can be absorbed into the query and output projections, so the extra decode-time compute stays modest.
Think of it as a "lossy compression" algorithm for the KV Cache. Instead of storing a giant
high-resolution image (full KV Cache), you store a highly compressed JPEG (latent state) and
decompress it when you need to use it.
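The sketch below illustrates the compress-then-expand idea with made-up toy shapes; it is a conceptual simplification, and the actual DeepSeek-style MLA adds further details (e.g., a separate RoPE path and folding the up-projections into other matrices).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes; d_latent is much smaller than n_heads * head_dim.
d_model, n_heads, head_dim, d_latent = 64, 8, 16, 12

W_down = rng.normal(size=(d_model, d_latent)) * 0.1             # hidden state -> latent
W_up_k = rng.normal(size=(d_latent, n_heads * head_dim)) * 0.1  # latent -> per-head keys
W_up_v = rng.normal(size=(d_latent, n_heads * head_dim)) * 0.1  # latent -> per-head values

latent_cache = []                     # what actually gets stored per token

def step(hidden_state):
    """Process one new token: cache only its small latent vector."""
    c = hidden_state @ W_down         # (d_latent,) -- the compressed KV
    latent_cache.append(c)
    C = np.stack(latent_cache)        # (seq_len, d_latent)
    # Reconstruct (approximate) per-head K and V only when attention runs.
    K = (C @ W_up_k).reshape(len(C), n_heads, head_dim)
    V = (C @ W_up_v).reshape(len(C), n_heads, head_dim)
    return K, V

for _ in range(3):                    # three decoding steps
    K, V = step(rng.normal(size=d_model))
print("cached floats per token:", d_latent, "vs. full KV:", 2 * n_heads * head_dim)
```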
Benefits of MLA
• Massive KV Cache Reduction: This is the biggest advantage. The cache still grows with sequence length, but each token now contributes only one small latent vector instead of full per-head K and V states, shrinking the cache by a large factor and making long context windows far cheaper in memory.
• Long Context Capability: Models using MLA (e.g., DeepSeek-V2/V3 with 128K-token contexts) can serve long contexts much more practically.
• Maintains Multi-Head Structure: Unlike GQA, it doesn't reduce the number of K/V heads; it compresses the information from all of them.
Trade-offs of MLA
• Computational Overhead: The compression and reconstruction projections add extra computation.
• Potential Quality Loss: Any compression can lose information. The model must learn to
use the latent space effectively, which might not perfectly preserve all the details from the
full context.
• Complexity: The architecture is more complex than GQA.
Comparison Summary
• Core idea: MHA gives every head its own K and V; GQA lets groups of heads share K and V; MLA compresses each token's K/V information into a small latent vector and caches that instead.
• KV cache size: MHA is large, O(layers × heads × seq_len × head_dim); GQA is reduced, O(layers × groups × seq_len × head_dim); MLA is drastically reduced, O(layers × seq_len × latent_dim) with latent_dim much smaller than heads × head_dim.
• Primary benefit: MHA offers the highest model quality; GQA offers inference speed and memory savings with an excellent quality/speed trade-off; MLA offers the smallest cache, making very long contexts affordable.
• Primary cost: MHA is slow and memory-heavy at inference; GQA has a slight potential quality drop vs. MHA; MLA adds complexity, compression overhead, and potential information loss.
• Used in: MHA in the original Transformer and BERT; GQA in LLaMA 2/3, Mistral, and Qwen2; MLA in DeepSeek-V2/V3.
Conclusion
• GQA is currently the industry standard for efficiency in general-purpose LLMs. It's a simple, effective modification that provides massive inference speedups with minimal quality loss.
• MLA is a more radical architectural change aimed at the fundamental problem of KV Cache scaling, making extremely long conversations and document processing much cheaper to serve.
Both are essential tools in the modern LLM engineer's toolkit, addressing the critical challenge of making powerful models practical to deploy.
D. What are SwiGLU activations?
Let's break down SwiGLU, a powerful activation function used in many modern large language models such as LLaMA, PaLM, and Mistral.
1. The Building Blocks: From ReLU to GLU
To understand SwiGLU, we need to start with its predecessors.
a) ReLU (Rectified Linear Unit)
• The classic activation function: ReLU(x) = max(0, x)
• Simple and effective, but can sometimes "die" (output zero for negative inputs).
b) GLU (Gated Linear Unit)
• This is a key innovation. A GLU introduces a gating mechanism, inspired by the success
of gating in LSTMs and GRUs.
• How it works:
1. You take the input x and pass it through two separate linear transformations (like
A*x + b and C*x + d).
2. One of these streams goes through a sigmoid activation function (σ), which
squishes values between 0 and 1. This acts as a gate.
3. The output is the element-wise product of the other stream and this gate.
4. Formula: GLU(x) = (A * x + b) ⊗ σ(C * x + d)
• Intuition: The sigmoid gate learns to control what information from the first stream
should be passed through. A value of 1 means "let everything through," a value of 0
means "block everything," and values in between act like a dimmer switch. This allows
the model to learn more complex, non-linear relationships.
2. The Evolution to SwiGLU
The original GLU formulation uses a sigmoid for the gate and leaves the other stream linear; this is the "vanilla" GLU. Later work ("GLU Variants Improve Transformer", Shazeer 2020) experimented with swapping different activation functions into the gate.
SwiGLU is the specific, high-performing variant that replaces the sigmoid gate with the Swish activation (also called SiLU), while the other stream remains a plain linear transformation. The two streams use separate weight matrices (W and V below).
Let's look at the components:
a) Swish / SiLU Activation
• Formula: Swish(x) = x * σ(x)
• What it does: It's a smooth, non-monotonic function that looks like ReLU but has a
"bump" or a gentle curve for negative values instead of being exactly zero. It has been
shown to often outperform ReLU in deep networks.
b) The Full SwiGLU Formula
Combining Swish and GLU gives us SwiGLU:
SwiGLU(x) = (Swish(xW + b)) ⊗ (xV + c)
Let's break it down:
• xW + b: the first linear transformation; its output is passed through the Swish function.
• xV + c: the second linear transformation; its output stays linear and is multiplied element-wise with the Swish-activated stream.
• ⊗: the outputs of the two paths are multiplied together element-wise.
In implementations like LLaMA's feed-forward network, the biases are dropped, W and V are two separate weight matrices, and a third linear projection maps the gated result back to the model dimension.
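Putting the pieces together, here is a minimal NumPy sketch of a LLaMA-style SwiGLU feed-forward block with random stand-in weights (biases omitted, as in LLaMA):

```python
import numpy as np

def swish(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """LLaMA-style feed-forward block: down( swish(x @ W_gate) * (x @ W_up) )."""
    gate = swish(x @ W_gate)      # the Swish-activated stream (xW)
    up   = x @ W_up               # the plain linear stream (xV)
    return (gate * up) @ W_down   # element-wise gate, then project back

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 43        # hidden dim often ≈ (2/3)·4·d_model in practice
W_gate = rng.normal(size=(d_model, d_hidden)) * 0.1
W_up   = rng.normal(size=(d_model, d_hidden)) * 0.1
W_down = rng.normal(size=(d_hidden, d_model)) * 0.1

x = rng.normal(size=(3, d_model))                   # a batch of 3 token vectors
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)    # (3, 16)
```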
3. Why is SwiGLU So Effective?
The power of SwiGLU comes from the combination of the gating mechanism and the smooth
Swish activation.
1. Gating Mechanism: The gate provides a form of dynamic feature selection. The model
can learn to suppress irrelevant features and amplify relevant ones for a given context.
This adds a significant amount of representational power and is more expressive than a
simple, static non-linearity like ReLU.
2. Smooth Gradient Flow: The Swish function is smooth and avoids the "dead neuron"
problem of ReLU. Its gradients are well-behaved, which helps with training very deep
networks.
3. Synergy: The combination is empirically superior. Research (e.g., from the original
PaLM paper) has shown that replacing the standard Feed-Forward Network (FFN) with a
SwiGLU-based one leads to better model performance and more stable training for the
same compute budget.
Analogy: A Smart Water Valve
Think of a standard activation like ReLU as a simple on/off water valve:
• If the input pressure is positive, water flows.
• If it's negative, the valve shuts completely.
A SwiGLU unit is like a smart, automated valve system:
• One part of the system (xV + c) carries the water (the linear stream).
• The other part (xW + b), passed through the Swish function, acts as the valve, deciding how much of that potential flow should actually be allowed through (the gate).
• The final output is the result of these two systems working together, allowing for much more precise control.
Summary
In essence, SwiGLU isn't just a minor tweak; it's a fundamental architectural improvement that
has become a standard component in the recipe for building powerful modern language models.
SwiGLU at a glance:
• What it is: a variant of the Gated Linear Unit (GLU) that uses the Swish (SiLU) activation function.
• Core idea: a gating mechanism that allows the network to dynamically control the flow of information.
• Key formula: SwiGLU(x) = Swish(xW + b) ⊗ (xV + c)
• Main benefit: a more powerful and expressive activation function, leading to better performance in large models.
• Main cost: with the same hidden width it needs three weight matrices instead of two (roughly 50% more parameters), so the hidden dimension is usually scaled down (e.g., to 2/3 of the usual size) to keep the parameter count comparable.
• Used in: LLaMA, PaLM, Mistral, and many other state-of-the-art Transformer models.

E. What are normalizations (RMSNorm, QK-Norm, norm placement)?
Normalization is a critical but often overlooked part of modern Transformer architecture. Let's break down RMSNorm, QK-Norm, and the crucial concept of norm placement.
1. The Foundation: Why Normalization?
Before diving into specifics, remember the core purpose of normalization layers:
• Stabilize Training: Prevent activations from becoming too large or too small (exploding/
vanishing gradients).
• Faster Convergence: Allow for higher learning rates.
• Regularization: Slight regularization effect.
The original Transformer used LayerNorm, which normalizes across the feature dimension for
each token:
LayerNorm(x) = (x - μ) / σ * γ + β
It subtracts the mean (μ) and divides by the standard deviation (σ), then applies a learnable scale
(γ) and bias (β).
2. RMSNorm (Root Mean Square Normalization)
RMSNorm is a popular, simplified successor to LayerNorm.
What it does:
• It removes the mean subtraction.
• It only focuses on re-scaling by the root mean square of the inputs.
Formula:
RMSNorm(x) = (x / RMS(x)) * γ
where RMS(x) = sqrt(mean(x²))
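Both norms are only a few lines; a minimal NumPy sketch of LayerNorm next to RMSNorm makes the difference (no mean subtraction, no bias) easy to see:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Classic LayerNorm: center, scale to unit variance, then apply affine params."""
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: no mean subtraction, no bias -- just divide by the RMS and scale."""
    rms = np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([[1.0, 2.0, 3.0, 4.0]])
gamma, beta = np.ones(4), np.zeros(4)
print(layer_norm(x, gamma, beta))
print(rms_norm(x, gamma))
```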
Key Differences from LayerNorm:
1. No Mean Centering: RMSNorm doesn't force the mean to be zero; it only forces the
quadratic mean to be 1. This is based on the observation that the mean-centered
component in LayerNorm contributes little to the stability.
2. No Learnable Bias (β): Only the scale parameter (γ) is learned.
3. Computationally Cheaper: Requires ~15-20% less computation than LayerNorm
because calculating the standard deviation (sqrt(mean((x - μ)²))) is more
expensive than calculating the RMS (sqrt(mean(x²))).
Why it works:
Research showed that the re-scaling invariant property (controlling the variance) is the main
factor for training stability, not the mean-invariance. By keeping it simple, RMSNorm achieves
similar or better performance with higher efficiency.
Usage: RMSNorm is used in LLaMA, Mistral, Qwen, and many other modern models.
3. QK-Norm (Query-Key Normalization)
This is a more recent and specialized normalization applied specifically within the self-attention
mechanism.
The Problem:
In standard attention, the softmax function can become "saturated" or "sharpened." For a single
query token, if one key has a much higher dot product than all others, the softmax will assign
nearly all probability (≈1.0) to that key and almost zero to the rest. This makes the gradient very
small for the non-attended keys, slowing down learning.
The Solution:
QK-Norm normalizes the Query and Key vectors individually before their dot product is taken (and therefore before the softmax), keeping the attention logits in a controlled range.
How it works:
1. For each attention head, normalize the Query (Q) and Key (K) vectors along the head
dimension.
2. This is typically done with an RMSNorm or LayerNorm; implementations differ on whether a learnable scale is included.
3. The normalized Q and K are then used to compute the attention scores.
Formula (for a single head):
Attention = Softmax( (RMSNorm(Q) * RMSNorm(K)ᵀ) / √d_k )
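A small NumPy demonstration of the effect: with deliberately large-scale Q and K the plain softmax becomes nearly one-hot, while normalizing Q and K first (a parameter-free RMSNorm here, as a simplifying assumption) keeps the distribution smoother:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Parameter-free RMSNorm along the last (head) dimension."""
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_head = 6, 8
Q = rng.normal(size=(seq_len, d_head)) * 10.0   # deliberately large-scale Q/K
K = rng.normal(size=(seq_len, d_head)) * 10.0

plain   = softmax(Q @ K.T / np.sqrt(d_head))                       # nearly one-hot rows
qk_norm = softmax(rms_norm(Q) @ rms_norm(K).T / np.sqrt(d_head))   # much smoother rows
print("max attention weight without QK-Norm:", plain.max().round(3))
print("max attention weight with    QK-Norm:", qk_norm.max().round(3))
```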
Key Benefits:
1. Prevents Softmax Saturation: By controlling the scale of Q and K, it prevents
individual attention scores from becoming extremely large, leading to a "smoother"
softmax distribution.
2. Improved Training Stability: Makes the model less sensitive to learning rate and
initialization.
3. Enables Better Depth Scaling: Crucial for training very deep transformers.
Analogy: Think of it as applying "volume control" to each speaker (Query and Key) before they
start talking, preventing anyone from shouting and drowning out others in the conversation.
Usage: QK-Norm was popularized by large vision transformers (e.g., ViT-22B) and is used in recent LLMs such as OLMo 2 and Gemma 3.
4. Norm Placement (Pre-Norm vs. Post-Norm)
This is a crucial architectural decision: Where do you put the normalization layer relative to
the sub-layer (Attention or FFN)?
a) Post-Norm (Original Transformer)
• Placement: y = Norm( x + SubLayer(x) )
• The normalization is applied after the residual connection.
• Characteristics:
◦ Often leads to training instability in very deep models.
◦ Requires careful learning rate warm-up.
◦ The gradient has to flow back through every normalization layer along the full network depth.
b) Pre-Norm (Modern Standard)
• Placement: y = x + SubLayer( Norm(x) )
• The normalization is applied to the input of the sub-layer, before the operation.
• Characteristics:
◦ Much more stable training. This is the primary reason for its adoption.
◦ Enables training of very deep transformers (100+ layers).
◦ The gradient flows directly through the residual connection, bypassing the
normalization and sub-layer, which helps mitigate vanishing gradients.
Why Pre-Norm Works Better: It creates a "path of least resistance" for the gradient via the
residual connection. Each sub-layer effectively operates on a well-behaved, normalized input,
making the learning problem easier for each component.
Visual Comparison:
```text
# Post-Norm (Original)
x -> Attention -> Add -> LayerNorm -> Output

# Pre-Norm (Modern)
x -> LayerNorm -> Attention -> Add -> Output
```
Usage: Pre-Norm is the universal standard in all modern LLMs (GPT, LLaMA, PaLM, etc.).
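The two placements differ by a single line; here is a minimal sketch with a stand-in sub-layer (a random linear map playing the role of attention or the FFN, and RMSNorm standing in for the norm):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    """Original Transformer: normalize *after* the residual addition."""
    return rms_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    """Modern LLMs: normalize the sublayer input; the residual path stays untouched."""
    return x + sublayer(rms_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1
sublayer = lambda x: x @ W            # stand-in for attention or the FFN
x = rng.normal(size=(4, 8))
print(post_norm_block(x, sublayer).shape, pre_norm_block(x, sublayer).shape)
```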
Summary & Hierarchy in a Modern Transformer Block
Here's how these concepts typically fit together in a single decoder layer of a model like LLaMA:
1. Input: Hidden states from previous layer
2. Pre-Norm (RMSNorm): Normalize the input
3. Self-Attention:
◦ Calculate Q, K, V from normalized input.
◦ QK-Norm: Apply RMSNorm to Q and K.
◦ Compute attention scores: Scores = (Norm(Q) * Norm(K)ᵀ) / √d_k
◦ Apply softmax to get weights.
◦ Output = weights * V.
4. First Residual Connection: x = x + Attention_Output
5. Pre-Norm (RMSNorm) again: Normalize the new state.
6. Feed-Forward Network (SwiGLU): Process the normalized state.
7. Second Residual Connection: Output = x + FFN_Output
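The whole assembly above fits in a short sketch; this toy single-head version uses random weights and omits causal masking and RoPE for brevity, so it shows the wiring rather than a faithful LLaMA layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 16                                   # toy model and FFN widths

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def swish(x):
    return x / (1.0 + np.exp(-x))

# Random weights standing in for a trained layer (single attention head).
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.3 for _ in range(4))
Wg, Wu = (rng.normal(size=(d, d_ff)) * 0.3 for _ in range(2))
Wd = rng.normal(size=(d_ff, d)) * 0.3

def decoder_layer(x):
    # 2. Pre-Norm, then 3. self-attention with QK-Norm (mask and RoPE omitted)
    h = rms_norm(x)
    Q, K, V = rms_norm(h @ Wq), rms_norm(h @ Wk), h @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d)) @ V @ Wo
    x = x + attn                                  # 4. first residual connection
    # 5. Pre-Norm again, 6. SwiGLU FFN, 7. second residual connection
    h = rms_norm(x)
    ffn = (swish(h @ Wg) * (h @ Wu)) @ Wd
    return x + ffn

x = rng.normal(size=(5, d))                       # 5 token vectors
print(decoder_layer(x).shape)                     # (5, 8)
```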
This combination of Pre-Norm RMSNorm and QK-Norm creates a highly stable and efficient
architecture that can be scaled to incredible depths (e.g., 80 layers in LLaMA 2) while
maintaining robust training dynamics.
