Min-Seo Kim
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: kms39273@naver.com
Previous work
RNN (Recurrent Neural Network)
• Uses the recurrent structure of RNNs, which is well suited to processing sequential or time-series data.
• An RNN carries information from past steps into the current decision, allowing it to capture the continuity and temporal context of the data (a minimal sketch follows below).
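A minimal sketch (not from the slides) of how a vanilla RNN cell carries a hidden state across time steps; the sizes and random inputs are illustrative assumptions.

```python
import numpy as np

# Minimal vanilla RNN cell: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
# Sizes are illustrative assumptions, not taken from the slides.
input_size, hidden_size, seq_len = 8, 16, 5
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

x_seq = rng.normal(size=(seq_len, input_size))  # one sequence of 5 time steps
h = np.zeros(hidden_size)                       # initial hidden state

for x_t in x_seq:
    # The previous hidden state h carries past information into the current step.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)

print(h.shape)  # (16,) — final hidden state summarizing the whole sequence
```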
Previous work
LSTM (Long Short-Term Memory)
• LSTM emerged as a solution to the long-term dependency problem of vanilla RNNs, in which information from early time steps is not carried through sufficiently to later steps as the sequence grows longer.
Previous work
GRU (Gated Recurrent Unit)
• LSTM is computationally demanding because each cell contains four neural-network blocks (three gates plus the candidate cell state); GRU emerged as a lighter alternative that implements a similar gating mechanism with only three blocks (compared in the sketch below).
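To make the "four blocks vs. three blocks" point concrete, the sketch below is an illustrative assumption using PyTorch (not code from the slides): it compares the parameter counts of an LSTM layer and a GRU layer of the same size.

```python
import torch.nn as nn

input_size, hidden_size = 128, 256  # illustrative sizes, not from the slides

lstm = nn.LSTM(input_size, hidden_size)  # 4 gate blocks: input, forget, cell candidate, output
gru = nn.GRU(input_size, hidden_size)    # 3 gate blocks: reset, update, candidate

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print("LSTM parameters:", n_params(lstm))  # roughly 4 * hidden * (input + hidden + 2)
print("GRU parameters:",  n_params(gru))   # roughly 3 * hidden * (input + hidden + 2)
```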
Background
Problem with the Encoder-Decoder Model
• To address the bottleneck caused by compressing an entire source sentence into a single, fixed-size context vector, machine translation research moved beyond the purely RNN-based encoder-decoder framework (the bottleneck is sketched below).
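A minimal sketch of the bottleneck (an illustrative assumption, not from the slides): an RNN encoder must squeeze a source sentence of any length into one fixed-size context vector.

```python
import numpy as np

# However long the source sentence is, the encoder produces one fixed-size context vector c.
hidden_size = 16
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, 8))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

def encode(source_tokens):  # source_tokens: (seq_len, 8) token embeddings
    h = np.zeros(hidden_size)
    for x_t in source_tokens:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
    return h  # context vector c: always shape (16,), regardless of sequence length

short_sentence = rng.normal(size=(3, 8))
long_sentence = rng.normal(size=(50, 8))
print(encode(short_sentence).shape, encode(long_sentence).shape)  # (16,) (16,)
```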
Methodology
• Does not rely on sequence-processing networks such as RNNs or CNNs.
• Instead, positional encoding injects information about token positions (a sketch of the sinusoidal encoding follows below), and self-attention is used separately to capture context.
[Figure: Transformer encoder-decoder architecture]
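A minimal sketch of the sinusoidal positional encoding described in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sequence length and model dimension below are illustrative assumptions.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sin, odd dimensions use cos."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=10, d_model=512)  # added to the token embeddings
print(pe.shape)  # (10, 512)
```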
Baseline
Scaled Dot-Product Attention
• Takes Query (Q), Key (K), and Value (V) matrices as inputs and computes Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
• Scaling by sqrt(d_k) keeps the dot products from growing so large that the softmax saturates (a sketch follows below).
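A minimal NumPy sketch of scaled dot-product attention as defined in the paper; the shapes and random inputs are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., len_q, len_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # (..., len_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))   # 4 query positions, d_k = 64 (illustrative)
K = rng.normal(size=(6, 64))   # 6 key positions
V = rng.normal(size=(6, 64))   # values aligned with the keys
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
```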
Baseline
Multi-Head Attention
• Rather than applying a single attention function to the full-dimensional queries, keys, and values, they are linearly projected h times into lower-dimensional representations; attention runs in parallel on each projection (head), and the head outputs are concatenated and projected again.
• This works better than a single full-dimensional attention function, letting the model attend to information from different representation subspaces, while the reduced dimension of each head keeps the total computational cost similar (see the sketch below).
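A minimal sketch of multi-head attention; the random projection matrices stand in for learned parameters, and all shapes are illustrative assumptions.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, applied independently in each head."""
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o):
    """Project X into per-head Q/K/V, attend in each head, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split(W):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    heads = attention(split(W_q), split(W_k), split(W_v))    # (heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                      # final linear projection

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 512, 8, 10                     # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4)]
print(multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```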
Experiments
English-to-German translation task (WMT 2014)
• Translation quality is measured with the BLEU score, which rates how similar the machine-translated output is to human reference translations.
• The Transformer achieves higher BLEU scores than the other models compared, while also requiring a lower training cost (a toy BLEU example follows below).
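As an illustration of BLEU-style evaluation, the sketch below uses NLTK's sentence-level BLEU on a toy sentence pair; this is an assumption for illustration only, not the WMT 2014 evaluation pipeline used in the paper.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy example: compare one machine translation against one human reference.
reference = ["the cat sits on the mat".split()]   # list of reference token lists
hypothesis = "the cat sat on the mat".split()     # machine-translated tokens

score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # closer to 1.0 means closer to the human reference
```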
Experiments
Model Variations (ablation study on components of the base Transformer)
Experiments
English Constituency Parsing
• To test the Transformer's generality, it was also applied to the English constituency parsing task.
• Constituency parsing analyzes a sentence into nested grammatical constituents, e.g., grouping words into noun phrases and verb phrases in a bracketed parse tree such as (S (NP The cat) (VP sat (PP on (NP the mat)))).
• Even without task-specific tuning, the Transformer performs well on this task.
Paper review
Conclusions
• The Transformer replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
• For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
• The authors are enthusiastic about the future of attention-based models and plan to apply them to other tasks.
• They plan to extend the Transformer to input and output modalities other than text, and to investigate local, restricted attention mechanisms for efficiently handling large inputs and outputs such as images, audio, and video.
• Making the generation process less sequential is another research goal.

Editor's Notes

  • #7: Input to the RNN-based encoder
  • #8: y_{t-1}: previous word, s_t: hidden state, c: context vector
  • #9: y_{t-1}: previous word, s_t: hidden state, c: context vector
  • #10: RNNencdec-30: baseline without attention; Search: attention applied
  • #11: RNNencdec-30: baseline without attention; Search: attention applied
  • #12: RNNencdec-30: baseline without attention; Search: attention applied