The Transformer
Lecture 20
Xavier Giro-i-Nieto
Associate Professor
Universitat Politècnica de Catalunya
@DocXavi
xavier.giro@upc.edu
2
Video-lecture
3
Acknowledgments
Marta R. Costa-jussà
Associate Professor
Universitat Politècnica de Catalunya
Carlos Escolano
PhD Candidate
Universitat Politècnica de Catalunya
Gerard I. Gállego
PhD Student
Universitat Politècnica de Catalunya
gerard.ion.gallego@upc.edu
@geiongallego
Outline
1. Reminders
4
5
Reminder
Nikhil Shah, “Attention? An Other Perspective!” (2020).
6
Reminder
Attention is a mechanism to compute a context vector (c) for a query (Q) as a
weighted sum of values (V).
Figure: Nikhil Shah, “Attention? An Other Perspective! [Part 1]” (2020)
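A minimal sketch of this weighted sum in PyTorch (tensor sizes are illustrative, not from the slides): the context vector c is obtained by scoring the query against the keys, normalizing the scores, and averaging the values.

```python
import torch
import torch.nn.functional as F

d = 8                       # embedding size (illustrative)
q = torch.randn(d)          # query vector (Q)
K = torch.randn(5, d)       # keys of 5 input elements
V = torch.randn(5, d)       # values of the same 5 elements

scores = K @ q                       # one similarity score per element
weights = F.softmax(scores, dim=0)   # attention distribution (sums to 1)
c = weights @ V                      # context vector: weighted sum of values
```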
7
Reminder
Nikhil Shah, “Attention? An Other Perspective!” (2020).
8
Reminder: Seq2Seq with Cross-Attention
Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020
In this case, cross-attention
refers to the attention
between the encoder and
decoder states.
9
Nikhil Shah, “Attention? An Other Perspective!” (2020).
What may the term “self” refer to, in contrast to “cross”-attention?
Outline
1. Motivation
2. Self-attention
10
11
Self-Attention (or intra-Attention)
Lin, Z., Feng, M., Santos, C. N. D., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. A structured self-attentive sentence embedding.
ICLR 2017.
Figure:
Jay Alammar,
“The Illustrated Transformer”
Self-attention refers to attending to other elements from the SAME sequence.
12
Self-Attention (or intra-Attention)
Nikhil Shah, “Attention? An Other Perspective!” (2020).
Query (Q): g(x) = W_Q x
Key (K): f(x) = W_K x
Value (V): h(x) = W_V x
W_Q, W_K and W_V are projection layers shared across all words.
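As a sketch under these definitions (the sizes d_model and d_k are illustrative assumptions), the three projections are plain linear layers applied to every word embedding x:

```python
import torch
import torch.nn as nn

d_model, d_k = 512, 64      # illustrative sizes, not from the slides

W_Q = nn.Linear(d_model, d_k, bias=False)   # g(x) = W_Q x  -> queries
W_K = nn.Linear(d_model, d_k, bias=False)   # f(x) = W_K x  -> keys
W_V = nn.Linear(d_model, d_k, bias=False)   # h(x) = W_V x  -> values

x = torch.randn(4, d_model)                 # a sequence of 4 word embeddings
Q, K, V = W_Q(x), W_K(x), W_V(x)            # same projections applied to every word
```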
13
Self-Attention (or intra-Attention)
Which steps are necessary to compute the contextual representation of a word embedding e2 in a sequence of four word embeddings (e1, e2, e3, e4)?
A (scaled) dot-product is computed between each pair of word embeddings (e.g. e1 and e2)...
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
14
Self-Attention (or intra-Attention)
Which steps are necessary to compute the contextual representation of a word embedding e2 in a sequence of four word embeddings (e1, e2, e3, e4)?
… a softmax layer normalizes the attention scores to obtain the attention
distribution...
15
Self-Attention (or intra-Attention)
Which steps are necessary to compute the contextual representation of a word embedding e2 in a sequence of four word embeddings (e1, e2, e3, e4)?
...the same word embeddings are combined to obtain the contextual representation e2'.
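Putting the three steps together for e2 in a toy sequence of four embeddings (a sketch with illustrative dimensions; the W_Q, W_K, W_V projections are omitted for brevity):

```python
import torch
import torch.nn.functional as F

d = 6
E = torch.randn(4, d)               # word embeddings e1, e2, e3, e4
q2 = E[1]                           # e2 acts as the query

scores = (E @ q2) / d ** 0.5        # 1) scaled dot-product with every embedding
alpha = F.softmax(scores, dim=0)    # 2) softmax -> attention distribution
e2_ctx = alpha @ E                  # 3) weighted sum -> contextual representation e2'
```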
16
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
17
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
18
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
19
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
20
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
21
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
22
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
23
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
24
Self-Attention (or intra-Attention): Scaled dot-product attention
Figure: Jay Alammar, “The illustrated Transformer” (2018)
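In matrix form, scaled dot-product attention over a whole sequence can be sketched as follows (assuming Q, K and V have already been projected):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (seq, seq) similarity matrix
    weights = F.softmax(scores, dim=-1)             # one attention distribution per query
    return weights @ V                              # contextual representations
```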
25
Case study: Self-Attention for image generation
#SAGAN Zhang, Han, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. "Self-attention generative adversarial
networks." ICML 2019. [video]
Figure:
Frank Xu
Generator (G): Details can be generated using cues from all feature locations.
Discriminator: Can check consistency between features in distant portions of the image.
26
Case study: Self-Attention for image generation
#SAGAN Zhang, Han, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. "Self-attention generative adversarial
networks." ICML 2019. [video]
Query locations and attention maps for different query locations.
Outline
1. Motivation
2. Self-attention
3. Multi-head Self-Attention (MHSA)
27
28
Multi-Head Self-Attention (MHSA)
Nikhil Shah, “Attention? An Other Perspective!” (2020).
In vanilla self-attention, a single set of projection matrices W_Q, W_K, W_V is used.
29
Nikhil Shah, “Attention? An Other Perspective!” (2020).
In multi-head self-attention, multiple sets of projection matrices are used, and can
provide different contextual representations for the same input token.
Multi-Head Self-Attention (MHSA)
30
The multi-head self-attended E'_i matrices are concatenated:
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Multi-Head Self-Attention (MHSA)
31
A fully connected layer on top combines everything into a new E'.
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Multi-Head Self-Attention (MHSA)
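A compact sketch of multi-head self-attention using PyTorch's built-in nn.MultiheadAttention, which internally performs the per-head attention, the concatenation and the final linear projection described above (sizes are illustrative):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8                           # illustrative sizes
mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, 10, d_model)                     # (batch, sequence, d_model)
E_out, attn_weights = mhsa(x, x, x)                 # self-attention: Q = K = V = x
# Internally: each head uses its own W_Q, W_K, W_V, the per-head outputs
# are concatenated, and a final linear layer (W_O) mixes them into E'.
```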
Multi-head Self-Attention: Visualization
32
#BertViz Vig, Jesse. "A multiscale visualization of attention in the transformer model." ACL 2019. [code] [tweet]
Each colour corresponds to a head.
Blue: first head only.
Multi-color: multiple heads.
33
Self-Attention and Convolutional Layers
Cordonnier, J. B., Loukas, A., & Jaggi, M. On the relationship between self-attention and convolutional layers. ICLR 2020.
[tweet] [code]
Outline
1. Motivation
2. Self-attention
3. Multi-head Attention
4. Positional Encoding
34
Positional Encoding
35
Given that the attention mechanism allows accessing all input (and output)
tokens, we no longer need a memory through recurrent layers.
Positional Encoding
36
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Where is the relative position within the sequence encoded?
Positional Encoding
37
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Where is the relative position within the sequence encoded?
Positional Encoding
38
Maria Ribalta, Pere-Pau Vàzquez, “Visualization is all you need”. UPC GCED 2020.
Sinusoidal functions are typically used to provide positional encodings.
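A sketch of the sinusoidal encoding proposed by Vaswani et al. (2017), assuming an even d_model: even dimensions use sine and odd dimensions use cosine.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes d_model is even.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices
    angles = pos / torch.pow(10000.0, i / d_model)                  # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)     # even dimensions: sine
    pe[:, 1::2] = torch.cos(angles)     # odd dimensions: cosine
    return pe                           # added to the word embeddings
```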
Positional Encoding
39
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Outline
1. Motivation
2. Self-attention
3. Multi-head Attention
4. Positional Encoding
5. The Transformer
40
The Transformer
41
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
The Transformer removed the recurrence mechanism thanks to self-attention.
The Transformer
42
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
Positional Encoding over the output sequence.
Positional Encoding over the input sequence.
Auto-regressive (at test time).
The Transformer
43
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
Cross-Attention (or inter-attention) between input and output tokens.
Self-attention for the input tokens.
Self-attention for the output tokens.
The Transformer: Layers
44
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
N encoder layers
N decoder layers
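The full encoder-decoder stack can be sketched with PyTorch's nn.Transformer; the hyperparameters below follow the base configuration of Vaswani et al. (2017), and the embedding plus positional-encoding step is assumed to have happened already:

```python
import torch
import torch.nn as nn

# N = 6 encoder and 6 decoder layers, d_model = 512, 8 heads (base config of the paper)
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(1, 10, 512)   # input sequence (already embedded + positional encoding)
tgt = torch.randn(1, 7, 512)    # shifted output sequence (embedded + positional encoding)
tgt_mask = model.generate_square_subsequent_mask(7)   # causal mask for auto-regression
out = model(src, tgt, tgt_mask=tgt_mask)              # (1, 7, 512)
```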
The Transformer: Layers
45
#BertViz Vig, Jesse. "A multiscale visualization of attention in the transformer model." ACL 2019. [code] [tweet]
A bird's-eye view of attention across all of the model's layers and heads.
The Transformer: Visualization
46
Maria Ribalta, Pere-Pau Vàzquez, “Visualization is all you need”. UPC GCED 2020.
47
Are Transformers for language only? NO!
#ViT Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa
Dehghani et al. "An image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021. [blog] [code]
Outline
1. Motivation
2. Self-attention
3. Multi-head Attention
4. Positional Encoding
5. The Transformer
48
49
(extra) PyTorch Lab on Google Colab
DL resources from UPC Telecos:
● Lectures (with Slides & Videos)
● Labs
Gerard Gallego
gerard.ion.gallego@upc.edu
PhD Student
Universitat Politècnica de Catalunya
Technical University of Catalonia
50
Software
● Transformers in HuggingFace (see the usage sketch after this list).
● GPT-Neo by EleutherAI
○ Similar results to GPT-3, but smaller and open source.
● Andrej Karpathy, minGPT (2020).
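A minimal usage sketch of the HuggingFace transformers pipeline API (the "gpt2" checkpoint is just an illustrative, openly available model):

```python
from transformers import pipeline

# Load a small pretrained Transformer and generate a continuation of a prompt.
generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer removed recurrence thanks to", max_length=20))
```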
51
Learn more
Ashish Vaswani, Stanford CS224N 2019.
52
Learn more
● Tutorials
○ Sebastian Ruder, Deep Learning for NLP Best Practices # Attention (2017).
○ Chris Olah, Shan Carter, “Attention and Augmented Recurrent Neural Networks”. distill.pub 2016.
○ Lilian Weng, “The Transformer Family”. Lil’Log 2020.
● Twitter threads
○ Christian Wolf (INSA Lyon)
● Scientific publications
○ #Perceiver Jaegle, Andrew, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira.
"Perceiver: General perception with iterative attention." arXiv preprint arXiv:2103.03206 (2021).
○ #Longformer Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." arXiv
preprint arXiv:2004.05150 (2020).
○ Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. ICML 2020.
○ Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye
Teh, Tim Harley, Razvan Pascanu, “Multiplicative Interactions and Where to Find Them”. ICLR 2020. [tweet]
○ Self-attention in language
■ Cheng, J., Dong, L., & Lapata, M. (2016). Long short-term memory-networks for machine reading. arXiv preprint
arXiv:1601.06733.
○ Self-attention in images
■ Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., & Tran, D. (2018). Image transformer. ICML
2018.
■ Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. "Non-local neural networks." In CVPR 2018.
53
Questions?
