01 Seq2Seq (Encoder-Decoder) Model
Olivia Ni
[1] Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
[2] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems. 2014.
02 Outline
• Introduction
• Seq2Seq (Encoder-Decoder) Model
  • Preliminary: LSTM & GRU
  • Notation
  • Encoder
  • Decoder
  • Training Process
• Tips for Seq2Seq (Encoder-Decoder) Model
  • Attention
  • Scheduled Sampling
  • Beam Search
• Appendix (Word Embedding / Representation)
• Reference
03 Introduction (1/2)
• One-to-One: standard NN model without recurrent structure, from fixed-sized input to fixed-sized output. Example: image classification.
• Many-to-One: RNN-based model with recurrent structure, from non-fixed-sized sequence input to fixed-sized output. Example: sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment.
• One-to-Many: RNN-based model with recurrent structure, from fixed-sized input to non-fixed-sized sequence output. Example: image captioning, which takes an image and outputs a sentence of words.
• Many-to-Many: RNN-based model with recurrent structure, from non-fixed-sized sequence input to non-fixed-sized sequence output. Example: machine translation, where an RNN reads a sentence in English and then outputs a sentence in French.
• Many-to-Many (synced): RNN-based model with recurrent structure, from non-fixed-sized synced sequence input to non-fixed-sized sequence output. Example: video classification, where we wish to label each frame of the video.
# In the figure: each rectangle is a vector and each arrow is a function (matrix multiplication); red = input vectors, blue = output vectors, green = RNN hidden states.
04 Introduction (2/2)
• Encoder: uses one RNN to analyze the input sequence.
  • In paper [1], this encoder RNN is a GRU.
  • In paper [2], this encoder RNN is an LSTM.
• Decoder: uses another RNN to generate the output sequence.
  • In paper [1], this decoder RNN is a GRU.
  • In paper [2], this decoder RNN is an LSTM.
• Context vector: a fixed-length vector representation of the input sentence, passed from the encoder to the decoder.
05 Seq2Seq Model - Preliminary: LSTM (1/5)
• The cell state runs straight down the entire chain, with only some minor linear interactions.
⟹ Easy for information to flow along it unchanged.
• The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.
06 Seq2Seq Model - Preliminary: LSTM (2/5)
• Forget gate (sigmoid + pointwise multiplication operation): decides what information we are going to throw away from the cell state.
  • 1: "Completely keep this"
  • 0: "Completely get rid of this"
07 Seq2Seq Model - Preliminary: LSTM (3/5)
• Input gate (sigmoid + pointwise multiplication operation): decides what new information we are going to store in the cell state. The candidate values are produced by a tanh layer, just like the hidden-state update of a vanilla RNN.
08 Seq2Seq Model - Preliminary: LSTM (4/5)
• Cell state update: forget the things we decided to forget earlier and add the new candidate values, scaled by how much we decided to update.
  • 𝑓𝑡: decides what to forget
  • 𝑖𝑡: decides what to update
⟹ 𝐶𝑡 is updated at timestamp 𝑡, and it changes slowly!
09 Seq2Seq Model - Preliminary: LSTM (5/5)
• Output gate (sigmoid + pointwise multiplication operation): decides which parts of the cell state we are going to output.
⟹ ℎ𝑡 is updated at timestamp 𝑡, and it changes faster!
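Putting the four gate slides together, a single LSTM step can be sketched as follows. This is a minimal sketch only: the stacked parameter tensors W, U, b and their shapes are illustrative assumptions, and in practice one would simply use torch.nn.LSTM or torch.nn.LSTMCell.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (D, 4H), U: (H, 4H), b: (4H,) hold the stacked
    gate parameters (input, forget, candidate, output)."""
    H = h_prev.size(-1)
    gates = x_t @ W + h_prev @ U + b        # (batch, 4H)
    i, f, g, o = gates.split(H, dim=-1)
    i = torch.sigmoid(i)                    # input gate
    f = torch.sigmoid(f)                    # forget gate
    g = torch.tanh(g)                       # candidate cell values (vanilla-RNN-like)
    o = torch.sigmoid(o)                    # output gate
    c_t = f * c_prev + i * g                # cell state: changes slowly
    h_t = o * torch.tanh(c_t)               # hidden state: changes faster
    return h_t, c_t
```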
10 Seq2Seq Model - Preliminary: GRU
• Gated Recurrent Unit (GRU)
• Idea:
  • combine the forget and input gates into a single "update gate"
  • merge the cell state and hidden state
• Update gate: z_t = σ(W_z · [h_{t−1}, x_t])
• Reset gate: r_t = σ(W_r · [h_{t−1}, x_t])
• State candidate: h̃_t = tanh(W · [r_t ∗ h_{t−1}, x_t])
• Current state: h_t = (1 − z_t) ∗ h_{t−1} + z_t ∗ h̃_t
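A matching sketch of one GRU step, mirroring the equations above. Parameter names are illustrative, and note that some formulations (e.g. PyTorch's nn.GRU) swap the roles of z_t and 1 − z_t in the final interpolation.

```python
import torch

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU step: update gate z, reset gate r, state candidate h_tilde."""
    z = torch.sigmoid(x_t @ W_z + h_prev @ U_z)            # update gate
    r = torch.sigmoid(x_t @ W_r + h_prev @ U_r)            # reset gate
    h_tilde = torch.tanh(x_t @ W_h + (r * h_prev) @ U_h)   # state candidate
    h_t = (1 - z) * h_prev + z * h_tilde                   # current state
    return h_t
```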
11 Seq2Seq Model - Notation
V^(s)   : the vocabulary size of the input
x_i     : the one-hot vector of the i-th word in the input sentence
x̄_i     : the embedding vector of the i-th word in the input sentence
E^(s)   : the embedding matrix of the encoder
h_i^(s) : the i-th hidden vector of the encoder
V^(t)   : the vocabulary size of the output
y_j     : the one-hot vector of the j-th word in the output sentence
ȳ_j     : the embedding vector of the j-th word in the output sentence
E^(t)   : the embedding matrix of the decoder
h_j^(t) : the j-th hidden vector of the decoder
12 Seq2Seq Model - Encoder
Encoder Embedding Layer
The encoder embedding layer converts each word in the input sentence to its embedding vector. When processing the i-th word in the input sentence, the input and the output of the layer are:
• Input: x_i, the one-hot vector representing the i-th word
• Output: x̄_i, the embedding vector representing the i-th word
Each embedding vector is computed as x̄_i = E^(s) x_i, where E^(s) ∈ R^(D × V^(s)) is the embedding matrix of the encoder.
Encoder Recurrent Layer
The encoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the i-th embedding vector, the input and the output of the layer are:
• Input: x̄_i, the embedding vector representing the i-th word
• Output: h_i^(s), the hidden vector at the i-th position
Each hidden vector is computed as h_i^(s) = ψ^(s)(x̄_i, h_{i−1}^(s)), where ψ^(s) can be an LSTM, a GRU, etc.
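To make the two encoder layers concrete, here is a minimal PyTorch-style sketch. The class, argument names, and the choice of a GRU for ψ^(s) are illustrative assumptions, not the exact implementation of [1] or [2].

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size_src, emb_dim, hidden_dim):
        super().__init__()
        # Encoder embedding layer: E^(s) ∈ R^(D × V^(s)), applied as x̄_i = E^(s) x_i
        self.embedding = nn.Embedding(vocab_size_src, emb_dim)
        # Encoder recurrent layer ψ^(s): here a GRU; an LSTM works the same way
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):               # src_ids: (batch, I) word indices
        x_bar = self.embedding(src_ids)       # (batch, I, D) embedding vectors
        h_all, h_last = self.rnn(x_bar)       # h_all: all h_i^(s); h_last: h_I^(s)
        return h_all, h_last                  # h_last becomes the context vector z
```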
13 Seq2Seq Model - Decoder
Decoder Embedding Layer
The decoder embedding layer converts the previously generated word to its embedding vector. When processing the j-th word in the output sentence, the input and the output of the layer are:
• Input: y_{j−1}, the one-hot vector of the (j−1)-th word generated by the decoder output layer
• Output: ȳ_j, the embedding vector of the (j−1)-th word
Each embedding vector is computed as ȳ_j = E^(t) y_{j−1}, where E^(t) ∈ R^(D × V^(t)) is the embedding matrix of the decoder.
Decoder Recurrent Layer
The decoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the j-th embedding vector, the input and the output of the layer are:
• Input: ȳ_j, the embedding vector
• Output: h_j^(t), the hidden vector at the j-th position
Each hidden vector is computed as h_j^(t) = ψ^(t)(ȳ_j, h_{j−1}^(t)), where ψ^(t) can be an LSTM, a GRU, etc. The encoder's hidden vector at the last position is used as the decoder's hidden vector at the first position: h_0^(t) = z = h_I^(s).
Decoder Output Layer
The decoder output layer generates the probability of the j-th word of the output sentence from the hidden vector. When processing the j-th step, the input and the output of the layer are:
• Input: h_j^(t), the hidden vector at the j-th position
• Output: p_j, the probability of generating the one-hot vector y_j of the j-th word
Each probability is computed as p_j = P_θ(y_j | Y_{<j}) = softmax(o_j) = softmax(W^(0) h_j^(t) + b^(0)).
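A minimal PyTorch-style sketch of the three decoder layers, matching the Encoder sketch above; the class and argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size_tgt, emb_dim, hidden_dim):
        super().__init__()
        # Decoder embedding layer: E^(t), fed with the previous word y_{j-1}
        self.embedding = nn.Embedding(vocab_size_tgt, emb_dim)
        # Decoder recurrent layer ψ^(t), initialized with h_0^(t) = z = h_I^(s)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        # Decoder output layer: p_j = softmax(W^(0) h_j^(t) + b^(0))
        self.out = nn.Linear(hidden_dim, vocab_size_tgt)

    def forward(self, prev_ids, h_prev):       # prev_ids: (batch, 1) previous word index
        y_bar = self.embedding(prev_ids)       # (batch, 1, D)
        h_all, h_new = self.rnn(y_bar, h_prev) # one decoding step
        logits = self.out(h_all)               # o_j; softmax is applied inside the loss
        return logits, h_new
```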
14 Seq2Seq Model – Training Process
• MLE (Maximum Likelihood Estimation) for the following conditional probability:
P_θ(Y | X) = P_θ({y_j}_{j=0}^{J+1} | X) = ∏_{j=1}^{J+1} P_θ(y_j | Y_{<j}, X) = ∏_{j=1}^{J+1} P_θ(y_j | Y_{<j}, z)
• Equivalently, minimize the following loss function (negative log-likelihood):
−log P_θ(Y | X) = −∑_{j=1}^{J+1} log P_θ(y_j | Y_{<j}, X) = −∑_{j=1}^{J+1} log P_θ(y_j | Y_{<j}, z)
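A minimal sketch of one MLE training step with teacher forcing, reusing the Encoder and Decoder sketches above. The optimizer, the batched id tensors, and pad_id are assumed to be provided by the caller.

```python
import torch.nn as nn

def train_step(encoder, decoder, optimizer, src_ids, tgt_ids, pad_id=0):
    """src_ids: (batch, I); tgt_ids: (batch, J+2) including <SOS>/<EOS>."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)  # -log softmax(o_j)[y_j]
    optimizer.zero_grad()
    _, z = encoder(src_ids)                    # context vector z = h_I^(s)
    h = z                                      # h_0^(t) = z
    loss = 0.0
    # teacher forcing: feed the reference word y_{j-1}, predict y_j
    for j in range(1, tgt_ids.size(1)):
        logits, h = decoder(tgt_ids[:, j-1:j], h)
        loss = loss + criterion(logits.squeeze(1), tgt_ids[:, j])
    loss.backward()
    optimizer.step()
    return loss.item()
```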
15 Tips for Seq2Seq Model - Attention
(Figure: a 4 × 4 grid of attention weights α_t^i, one weight per input component i and generation step t.)
• Good attention: each input component should receive approximately the same total attention weight over the whole generation.
  (e.g.) add a regularization term on the attention weights: ∑_i ( τ − ∑_t α_t^i )², where the inner sum runs over the generation steps t for each component i, and τ is a pre-defined target value.
• Bad attention example: the generated sentence looks like "w1 w2 (woman) w3 w4 (woman)", i.e. the model attends to the "woman" component twice and never to the cooking part ⟹ no cooking information ⟹ bad attention!
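For reference, a minimal sketch of one attention step plus the regularization term above. The dot-product scoring used here is just one common choice and all names are illustrative; Xu et al. (see References) parameterize the score differently.

```python
import torch
import torch.nn.functional as F

def attention(h_dec, enc_states):
    """h_dec: (batch, H) decoder state; enc_states: (batch, I, H) encoder states.
    Returns the context vector and the attention weights alpha_i."""
    scores = torch.bmm(enc_states, h_dec.unsqueeze(-1)).squeeze(-1)   # (batch, I)
    alpha = F.softmax(scores, dim=-1)                                 # attention weights
    context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)    # (batch, H)
    return context, alpha

def attention_regularizer(alphas, tau=1.0):
    """alphas: (T, batch, I) attention weights stacked over T decoding steps.
    Encourages each input component i to receive total attention close to tau."""
    total_per_component = alphas.sum(dim=0)                 # sum over t of alpha_t^i
    return ((tau - total_per_component) ** 2).sum(dim=-1).mean()
```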
16 Tips for Seq2Seq Model - Scheduled Sampling
• Mismatch between training and testing (exposure bias):
  • Training ⟹ the decoder inputs come from the reference (ground truth).
  • Generation ⟹ the decoder inputs are the model's own outputs from the last time step.
• Scheduled sampling: at each training step, feed the word from the reference with some probability and the model's previous output otherwise; this probability is decayed as training proceeds.
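A minimal sketch of scheduled sampling on top of the Decoder sketch above. The decay schedule for teacher_forcing_prob is left to the caller, and all names are illustrative.

```python
import random
import torch

def decode_with_scheduled_sampling(decoder, tgt_ids, z, teacher_forcing_prob):
    """At each step, feed the reference word with probability
    `teacher_forcing_prob`, otherwise feed the model's own previous prediction."""
    h = z
    prev = tgt_ids[:, 0:1]                    # start from the <SOS> token
    all_logits = []
    for j in range(1, tgt_ids.size(1)):
        logits, h = decoder(prev, h)
        all_logits.append(logits)
        if random.random() < teacher_forcing_prob:
            prev = tgt_ids[:, j:j+1]          # from the reference
        else:
            prev = logits.argmax(dim=-1)      # from the model's own output
    return torch.cat(all_logits, dim=1)
```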
17 Tips for Seq2Seq Model - Beam Search
• Beam search (vs. greedy search)
• Core idea: keep the several best paths at each step.
• It is not possible to check all the paths ⟹ pre-define a parameter: beam size (here, beam size = 2).
(Figure: greedy search vs. beam search over a two-token vocabulary {A, B}; greedy search commits to the single most probable token at each step, while beam search with beam size 2 keeps the two highest-scoring partial sequences.)
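A minimal sketch of beam search on top of the Decoder sketch above (batch size 1; sos_id, eos_id, and the length handling are simplified assumptions).

```python
import torch
import torch.nn.functional as F

def beam_search(decoder, z, sos_id, eos_id, beam_size=2, max_len=20):
    """Keep the `beam_size` best partial sequences (by summed log-probability)
    at every step instead of only the single greedy best."""
    beams = [([sos_id], 0.0, z)]               # (tokens, log-prob, decoder state)
    for _ in range(max_len):
        candidates = []
        for tokens, score, h in beams:
            if tokens[-1] == eos_id:           # finished beams are kept as-is
                candidates.append((tokens, score, h))
                continue
            prev = torch.tensor([[tokens[-1]]])
            logits, h_new = decoder(prev, h)
            log_probs = F.log_softmax(logits.squeeze(), dim=-1)
            top_lp, top_id = log_probs.topk(beam_size)
            for lp, idx in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((tokens + [idx], score + lp, h_new))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]                         # best sequence found
```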
18 Appendix – Word Embedding / Representation
• Recall: there are two embedding layers in the seq2seq model.
• How do we learn their embedding matrices? ⟹ Word embedding / representation!
• History:
  • Knowledge-based representation
  • Corpus-based representation:
    • Atomic symbols (one-hot encoding)
    • Neighbor-based: high-dimensional sparse word vectors (window-based co-occurrence matrix)
    • Low-dimensional dense word vectors, obtained either by dimension reduction (SVD) or by direct learning (word2vec, GloVe)
• Issues with knowledge-based representation: (1) newly-invented words, (2) subjective, (3) annotation effort, (4) difficult to compute word similarity.
19 Appendix – Word Embedding / Representation
• Atomic symbols (one-hot encoding):
  • Issue: difficult to compute the similarity between two one-hot vectors.
  • Idea: words with similar meanings often have similar neighbors ⟹ represent a word by its neighboring words in the corpus.
20 Appendix – Word Embedding / Representation
• High-dimensional sparse word vectors (window-based co-occurrence matrix):
  • Issues: (1) the matrix size increases with the vocabulary, (2) high dimensionality, (3) sparsity ⟹ poor robustness.
  • Idea: low-dimensional dense word vectors.
21 Appendix – Word Embedding / Representation
• Low-dimensional dense word vectors via dimension reduction (SVD):
  • Issues: (1) computationally expensive: O(mn²) for an n × m matrix when n < m, (2) difficult to add new words.
  • Idea: directly learn low-dimensional word vectors (word2vec, GloVe).
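A toy illustration of the corpus-based path above: window-based co-occurrence counts followed by SVD for dimension reduction. The corpus, window size, and dimensionality k are made up for the example; word2vec/GloVe would instead learn the dense vectors directly.

```python
import numpy as np

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# High-dimensional sparse representation: window-based co-occurrence matrix
window = 1
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# Low-dimensional dense vectors via SVD (the "dimension reduction" branch)
U, S, Vt = np.linalg.svd(C)
k = 2
dense_vectors = U[:, :k] * S[:k]          # one k-dimensional vector per word
print(dict(zip(vocab, dense_vectors.round(2))))
```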
22 Reference
• Cho, Kyunghyun, et al. "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." arXiv preprint arXiv:1406.1078 (2014).
• Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks." Advances in Neural Information Processing Systems. 2014.
• Xu, Kelvin, et al. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." International Conference on Machine Learning. 2015.
• Bengio, Samy, et al. "Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks." Advances in Neural Information Processing Systems. 2015.
• Wiseman, Sam, and Alexander M. Rush. "Sequence-to-Sequence Learning as Beam-Search Optimization." arXiv preprint arXiv:1606.02960 (2016).
• Prof. Hung-yi Lee, National Taiwan University, "Machine Learning and Having It Deep and Structured" (Spring 2018).
• Prof. Yun-Nung (Vivian) Chen, National Taiwan University, "Applied Deep Learning" (Spring 2019).
23 Thank you for listening.