Exploring masked language modeling
Although the transformer itself was revolutionary, much of its popularity in the scientific community is also owed to the Bidirectional Encoder Representations from Transformers (BERT) model, a variant that demonstrated what this type of architecture can do. BERT was designed from the start with downstream applications in mind, such as question answering, summarization, and machine translation. A causal language model, like the decoder of the original transformer, reads the sequence from left to right, so when it encounters an entity, it cannot relate that entity to anything appearing on its right. For these applications, context from both directions is essential.
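To make this concrete, the following minimal sketch shows masked language modeling in practice. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is prescribed by this section; the point is only that the model fills in a hidden token by drawing on words to both its left and its right.

```python
# A minimal sketch of masked language modeling, assuming the Hugging Face
# transformers library and the bert-base-uncased checkpoint (illustrative
# choices, not the only possible ones).
from transformers import pipeline

# The fill-mask pipeline loads a BERT model trained with the masked
# language modeling objective
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Predicting the [MASK] token requires the context on both sides:
# "capital of" to the left and "is Paris" to the right
predictions = fill_mask("The capital of [MASK] is Paris.")
for p in predictions:
    print(f"{p['token_str']}: {p['score']:.3f}")
```

Running this prints the most likely fillers for the masked position with their scores; a purely left-to-right model, by contrast, could not exploit the fact that "Paris" appears after the gap.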

Figure 2.15 – Difference between a causal and a bidirectional language model
Bidirectional encoders resolve this limitation by allowing the model to find relationships over the entire...