Introducing the transformer model
Despite this decisive advance, though, several problems remain in machine translation:
- The model fails to capture the meaning of the sentence and is still error-prone
- The model has problems with words that are not in the initial vocabulary (out-of-vocabulary words)
- Errors in pronouns and other grammatical forms
- The model fails to maintain context for long texts
- The model does not adapt well when the domain of the training set and that of the test data differ (for example, if it is trained on literary texts and the test set consists of finance texts)
- RNNs are not parallelizable, and you have to compute sequentially
Considering these points, researchers at Google came up with the idea of eliminating RNNs altogether rather than improving them. According to the authors of the seminal 2017 article Attention Is All You Need, all you need is a model based on multi-head self-attention. Before going into detail: the transformer is built from stacked blocks centered on multi-head self-attention. In this way, the model learns a hierarchical and increasingly sophisticated representation of the text.
The first step in the process is tokenization, which splits the text into tokens and maps each one to an ID in the vocabulary. After that, an embedding step turns each token into a numerical vector. A special feature of the transformer is the introduction of a function that records the position of each token in the sequence (self-attention on its own is not position-aware). This process is called positional encoding. The authors of the article use sine and cosine functions of the position, alternating between the two across the embedding dimensions. This allows the model to know the relative position of each token.
Before the first block, the embedding vectors are summed with the result of these functions. This is because self-attention is not aware of word order, but word order in a sentence matters, so the order is encoded directly into the vectors themselves. Note, though, that there are no learnable parameters in this function and that, for long sequences, it will have to be modified (we will discuss this in the next chapter).

Figure 2.7 – Positional encoding
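To make this concrete, the following is a minimal PyTorch sketch of the sinusoidal encoding just described; the function name and the max_len and d_model arguments are our own illustrative choices, not code from the paper:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    position = torch.arange(max_len).unsqueeze(1)                  # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
    )                                                              # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                   # odd dimensions
    return pe

# The encoding is simply summed with the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```

Because the frequencies vary across the dimensions, each position receives a unique pattern, and nearby positions receive similar ones.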
After that, we have a series of transformer blocks stacked one after the other. Each transformer block consists of four elements: multi-head self-attention, a feedforward layer, residual connections, and layer normalization.

Figure 2.8 – Flow diagram of the transformer block
The feedforward layer consists of two linear layers. It transforms the output of multi-head self-attention, applying the same weights to every position independently. It can be seen as two linear transformations with a ReLU activation in between.
This adds non-linearity on top of self-attention. The FFN layer is also chosen because it is an easily parallelized operation.
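As a rough sketch (in PyTorch, with dimension names of our own choosing), the position-wise feedforward layer can be written as follows; in the original paper, d_model = 512 and d_ff = 2048:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward: two linear layers with a ReLU in between.
    The same weights are applied independently to every position."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # first linear projection
            nn.ReLU(),                  # non-linearity
            nn.Linear(d_ff, d_model),   # project back to the model dimension
        )

    def forward(self, x):               # x: (batch, seq_len, d_model)
        return self.net(x)
```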
Residual connections are connections that pass information between two layers without going through the intermediate transformation. Initially developed for convolutional networks, they provide a shortcut between layers and help the gradient reach the lower layers. In the transformer, they are present around both the attention layer and the feedforward layer: the input of each sublayer is summed with its output. Residual connections also have the advantage of making the loss surface smoother (this helps the model find a better minimum and not get stuck in a poor local one). This powerful effect can be seen clearly in Figure 2.9:

Figure 2.9 – Effect of the residual connections on the loss
Note
Figure 2.9 is originally from Visualizing the Loss Landscape of Neural Nets by Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein (https://github.com/tomgoldstein/loss-landscape/tree/master).
The residual connection makes the loss surface smoother, which allows the model to be trained more efficiently and quickly.
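In code, a residual connection is nothing more than adding a sublayer's input back to its output; a minimal sketch, assuming a generic `sublayer` callable, looks like this:

```python
def with_residual(x, sublayer):
    """Apply a sublayer (attention or feedforward) and add its input back."""
    return x + sublayer(x)  # the shortcut lets the gradient flow directly to x
```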
Layer normalization is a form of normalization that helps training because it keeps the hidden layer values in a certain range (it is an alternative to batch normalization). Given a single vector, its mean and standard deviation are computed, and the vector is normalized by subtracting the mean and dividing by the standard deviation. In the final transformation, the normalized vector is scaled and shifted by two parameters that are learned during training.
There is a lot of uninformative variability in the activations during training, and this can hurt learning. Adding this normalization step reduces that variability and normalizes the gradient as well.
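The following is a from-scratch sketch of this normalization (in practice, you would use PyTorch's built-in nn.LayerNorm); the two learned parameters are conventionally called gamma and beta:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Normalize each vector to zero mean and unit variance, then rescale."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))    # learned scale
        self.beta = nn.Parameter(torch.zeros(d_model))    # learned shift
        self.eps = eps

    def forward(self, x):                                 # x: (..., d_model)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)   # normalize the vector
        return self.gamma * x_hat + self.beta             # learned scale and shift
```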
At this point, we can assemble everything into a single block. Consider that after embedding, we have as input X a matrix of dimension n x d (with n being the number of tokens and d the dimension of the embedding). This input X goes into a transformer block and comes out with the same dimensions. This process is repeated for all the transformer blocks, as sketched in the code that follows.
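Here is a minimal sketch of a single block that puts the pieces together, using PyTorch's nn.MultiheadAttention and the post-layer-norm arrangement of the original paper (the class and parameter names are our own, and, as noted in the list that follows, the normalization can also be placed before each sublayer):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: multi-head self-attention and a feedforward layer,
    each wrapped in a residual connection followed by layer normalization."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (batch, n, d_model)
        attn_out, _ = self.attn(x, x, x)  # self-attention: queries, keys, values all from x
        x = self.norm1(x + attn_out)      # residual connection + layer norm
        x = self.norm2(x + self.ff(x))    # residual connection + layer norm
        return x                          # the (batch, n, d_model) shape is preserved
```

Stacking several of these blocks, each one receiving the previous block's output, produces the architecture described so far; the shape of X is preserved throughout.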
Some notes on this process are as follows:
- In some architectures, LayerNorm can be after the FFN block instead of before (whether it is better or not is still debated).
- Modern models have up to 96 transformer blocks in series, but the structure is virtually identical. The idea is that the model learns an increasingly complex representation of the language.
- Starting with the embedding of an input, self-attention allows this representation to be enriched by incorporating an increasingly complex context. In addition, the model also has information about the location of each token.
- Absolute positional encoding has the defect of overrepresenting words at the beginning of the sequence. Today, there are variants that consider the relative position.
Once we have “the bricks,” we can assemble them into a functional structure. In the original description, the model was structured for machine translation and composed of two parts: an encoder (which takes the text to be translated) and a decoder (which will produce the translation).
The original transformer is composed of stacks of transformer blocks organized into an encoder and a decoder, as you can see in Figure 2.10.

Figure 2.10 – Encoder-decoder structure
The decoder, like the encoder, is composed of an embedding, a positional encoding, and a series of transformer blocks. One note is that in the decoder, in addition to self-attention, we have cross-attention. Cross-attention works exactly like self-attention, except that it takes elements from both the encoder and the decoder (because we want to condition the decoder's generation on the encoder input). In this case, the queries come from the decoder, while the keys and values come from the encoder. As you can see from Figure 2.11, the encoder and decoder sequences can have different lengths, but the output has the same length as the decoder sequence:

Figure 2.11 – Cross-attention
Input N comes from the encoder, while input M comes from the decoder. In the figure, cross-attention mixes information from the encoder and the decoder, allowing the decoder to learn from the encoder's representation.
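As a sketch, again using PyTorch's nn.MultiheadAttention with illustrative sizes of our own, cross-attention takes the queries from the decoder and the keys and values from the encoder:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

encoder_out = torch.randn(1, 10, d_model)   # N = 10 tokens from the encoder
decoder_in = torch.randn(1, 7, d_model)     # M = 7 tokens from the decoder

# Queries come from the decoder; keys and values come from the encoder.
out, weights = cross_attn(query=decoder_in, key=encoder_out, value=encoder_out)
print(out.shape)       # torch.Size([1, 7, 512]) -> keeps the decoder's length M
print(weights.shape)   # torch.Size([1, 7, 10])  -> M x N attention weights
```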
Another note on the structure: in the decoder, the first self-attention sublayer has an additional mask to prevent the model from seeing the future.
This is especially important for next-word prediction: if the model already knows the word it is supposed to predict, we have data leakage. To prevent this, we add a mask in which the upper-triangular portion of the attention scores is set to negative infinity (-∞), so those positions receive zero weight after the softmax.

Figure 2.12 – Masked attention
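A minimal sketch of how such a mask can be built (the helper name is our own); positions set to -∞ receive zero weight after the softmax:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Upper-triangular mask: position i may only attend to positions <= i."""
    mask = torch.zeros(seq_len, seq_len)
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    return mask.masked_fill(future, float("-inf"))

print(causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```

A mask like this can be passed to nn.MultiheadAttention through its attn_mask argument during training.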
The first transformer consisted of an encoder and a decoder, but today there are also models that are encoder-only or decoder-only. For generative AI, they are now practically all decoder-only. We have our model; now, how can we train a system that seems so complex? In the next section, we will see how to train it.