Min-Seo Kim
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: kms39273@naver.com
Previous work
RNN (Recurrent Neural Network)
• Uses the recurrent structure of RNNs, which is well suited to processing sequential or time-series data.
• An RNN carries information from past steps into the current decision, allowing it to capture the continuity and temporal context of the data (a minimal sketch follows below).
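A minimal sketch (not from the slides) of how a vanilla RNN cell carries a hidden state across time steps; the sizes and random inputs are illustrative assumptions.

```python
import numpy as np

# Minimal vanilla RNN cell: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
# Sizes are illustrative assumptions, not taken from the slides.
input_size, hidden_size, seq_len = 8, 16, 5
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

x_seq = rng.normal(size=(seq_len, input_size))  # one sequence of 5 time steps
h = np.zeros(hidden_size)                       # initial hidden state

for x_t in x_seq:
    # The previous hidden state h carries past information into the current step.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)

print(h.shape)  # (16,) — final hidden state summarizing the whole sequence
```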
Previous work
LSTM (Long Short-Term Memory)
• LSTM emerged as a solution to the long-term dependency problem of vanilla RNNs, in which information from early time steps is not carried through sufficiently to later steps as the sequence grows longer.
Previous work
GRU (Gated Recurrent Unit)
• LSTM is computationally demanding because each cell contains four neural-network blocks (three gates plus the candidate cell state); GRU emerged as a lighter alternative that implements a similar gating mechanism with only three blocks (compared in the sketch below).
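To make the "four blocks vs. three blocks" point concrete, the sketch below is an illustrative assumption using PyTorch (not code from the slides): it compares the parameter counts of an LSTM layer and a GRU layer of the same size.

```python
import torch.nn as nn

input_size, hidden_size = 128, 256  # illustrative sizes, not from the slides

lstm = nn.LSTM(input_size, hidden_size)  # 4 gate blocks: input, forget, cell candidate, output
gru = nn.GRU(input_size, hidden_size)    # 3 gate blocks: reset, update, candidate

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print("LSTM parameters:", n_params(lstm))  # roughly 4 * hidden * (input + hidden + 2)
print("GRU parameters:",  n_params(gru))   # roughly 3 * hidden * (input + hidden + 2)
```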
Background
Problem with the Encoder-Decoder Model
• To address the bottleneck caused by compressing an entire source sentence into a single, fixed-size context vector, machine translation research moved beyond the purely RNN-based encoder-decoder framework (the bottleneck is sketched below).
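A minimal sketch of the bottleneck (an illustrative assumption, not from the slides): an RNN encoder must squeeze a source sentence of any length into one fixed-size context vector.

```python
import numpy as np

# However long the source sentence is, the encoder produces one fixed-size context vector c.
hidden_size = 16
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, 8))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

def encode(source_tokens):  # source_tokens: (seq_len, 8) token embeddings
    h = np.zeros(hidden_size)
    for x_t in source_tokens:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
    return h  # context vector c: always shape (16,), regardless of sequence length

short_sentence = rng.normal(size=(3, 8))
long_sentence = rng.normal(size=(50, 8))
print(encode(short_sentence).shape, encode(long_sentence).shape)  # (16,) (16,)
```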
Methodology
• Does not rely on sequence-processing networks such as RNNs or CNNs.
• Instead, positional encoding injects information about token positions (a sketch of the sinusoidal encoding follows below), and self-attention is used separately to capture context.
[Figure: Transformer encoder-decoder architecture]
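A minimal sketch of the sinusoidal positional encoding described in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sequence length and model dimension below are illustrative assumptions.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sin, odd dimensions use cos."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=10, d_model=512)  # added to the token embeddings
print(pe.shape)  # (10, 512)
```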
Baseline
Scaled Dot-Product Attention
• Takes Query (Q), Key (K), and Value (V) matrices as inputs and computes Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
• Scaling by sqrt(d_k) keeps the dot products from growing so large that the softmax saturates (a sketch follows below).
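A minimal NumPy sketch of scaled dot-product attention as defined in the paper; the shapes and random inputs are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., len_q, len_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # (..., len_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))   # 4 query positions, d_k = 64 (illustrative)
K = rng.normal(size=(6, 64))   # 6 key positions
V = rng.normal(size=(6, 64))   # values aligned with the keys
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
```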
Baseline
Multi-Head Attention
• Rather than applying a single attention function to the full-dimensional queries, keys, and values, they are linearly projected h times into lower-dimensional representations; attention runs in parallel on each projection (head), and the head outputs are concatenated and projected again.
• This works better than a single full-dimensional attention function, letting the model attend to information from different representation subspaces, while the reduced dimension of each head keeps the total computational cost similar (see the sketch below).
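A minimal sketch of multi-head attention; the random projection matrices stand in for learned parameters, and all shapes are illustrative assumptions.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, applied independently in each head."""
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o):
    """Project X into per-head Q/K/V, attend in each head, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split(W):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    heads = attention(split(W_q), split(W_k), split(W_v))    # (heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                      # final linear projection

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 512, 8, 10                     # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4)]
print(multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```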
Experiments
English-to-German translation task (WMT 2014)
• Translation quality is measured with the BLEU score, which rates how similar the machine-translated output is to human reference translations.
• The Transformer achieves higher BLEU scores than the other models compared, while also requiring a lower training cost (a toy BLEU example follows below).
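As an illustration of BLEU-style evaluation, the sketch below uses NLTK's sentence-level BLEU on a toy sentence pair; this is an assumption for illustration only, not the WMT 2014 evaluation pipeline used in the paper.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy example: compare one machine translation against one human reference.
reference = ["the cat sits on the mat".split()]   # list of reference token lists
hypothesis = "the cat sat on the mat".split()     # machine-translated tokens

score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # closer to 1.0 means closer to the human reference
```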
Experiments
Model Variations (ablation study on components of the base Transformer)
Experiments
English Constituency Parsing
• To test the Transformer's generality, it was also applied to the English constituency parsing task.
• Constituency parsing analyzes a sentence into nested grammatical constituents, e.g., grouping words into noun phrases and verb phrases in a bracketed parse tree such as (S (NP The cat) (VP sat (PP on (NP the mat)))).
• Even without task-specific tuning, the Transformer performs well on this task.
Paper review
Conclusions
• The Transformer replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
• For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
• The authors are enthusiastic about the future of attention-based models and plan to apply them to other tasks.
• They plan to extend the Transformer to input and output modalities other than text, and to investigate local, restricted attention mechanisms for efficiently handling large inputs and outputs such as images, audio, and video.
• Making the generation process less sequential is another research goal.

Editor's Notes

  • #7: Input to the RNN-based encoder
  • #8: y_{t-1}: previous word, s_t: hidden state, c: context vector
  • #9: y_{t-1}: previous word, s_t: hidden state, c: context vector
  • #10: RNNencdec-30: baseline without attention; Search: attention applied
  • #11: RNNencdec-30: baseline without attention; Search: attention applied
  • #12: RNNencdec-30: baseline without attention; Search: attention applied