Sequence Modeling:
Recurrent and Recursive
Nets (part 2)
M. Sohaib Alam
17 June, 2017
Deep Learning Textbook Study
Meetup Group
Bidirectional RNNs
Motivation:
Sequences where context matters: ideally we
have knowledge about the future as well as the
past, e.g. speech and handwriting recognition.
h(t): state of the sub-RNN moving forward in time
g(t): state of the sub-RNN moving backward in time
The architecture extends to inputs spanning n
dimensions (using 2n sub-RNNs); e.g. with 2-D
images, 4 sub-RNNs can capture long-range lateral
interactions between features, though this is more
expensive to train than a convolutional neural net.
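As a rough illustration (not from the original slides), here is a minimal NumPy sketch of one bidirectional layer: a forward sub-RNN producing h(t), a backward sub-RNN producing g(t), and an output that sees both. All parameter names (Wf, Uf, Wb, Ub, V, ...) are hypothetical.

```python
import numpy as np

def bidirectional_rnn(X, Wf, Uf, bf, Wb, Ub, bb, V, c):
    """X: (T, input_dim). Returns outputs that condition on past AND future input."""
    T = X.shape[0]
    H = np.zeros((T, Wf.shape[0]))   # h(t): forward sub-RNN states
    G = np.zeros((T, Wb.shape[0]))   # g(t): backward sub-RNN states
    h = np.zeros(Wf.shape[0])
    for t in range(T):               # left-to-right pass
        h = np.tanh(Wf @ h + Uf @ X[t] + bf)
        H[t] = h
    g = np.zeros(Wb.shape[0])
    for t in reversed(range(T)):     # right-to-left pass
        g = np.tanh(Wb @ g + Ub @ X[t] + bb)
        G[t] = g
    # each output unit sees both a summary of the past (H) and of the future (G)
    return np.concatenate([H, G], axis=1) @ V.T + c
```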
Encoder-Decoder Sequence-to-Sequence
Architectures
Allow input and output sequences to have different
lengths. Applications include speech recognition,
machine translation and question answering.
C: a vector, or sequence of vectors, summarizing the
input sequence X = (x(1), …, x(n_x)).
Encoder: input RNN
Decoder: output RNN
Both RNNs are trained jointly to maximize the average of
log P(y(1), …, y(n_y) | x(1), …, x(n_x))
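A minimal sketch of the encoder-decoder idea (my own illustration, not the slides' code): the encoder compresses the input sequence into a context C, and the decoder generates n_y output distributions conditioned on C. Function and parameter names are hypothetical, and a real system would use teacher forcing and an end-of-sequence token.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(X, W, U, b):
    """Run the encoder RNN over X = (x(1), ..., x(n_x)); the final state is the context C."""
    h = np.zeros(W.shape[0])
    for x in X:
        h = np.tanh(W @ h + U @ x + b)
    return h                        # C: fixed-size summary of the input sequence

def decode(C, n_y, Wd, Ud, bd, V, c):
    """Generate n_y output distributions, conditioning every step on the context C."""
    h, outputs = C.copy(), []       # simplest choice: use C as the decoder's initial state
    y_prev = np.zeros(Ud.shape[1])
    for _ in range(n_y):
        h = np.tanh(Wd @ h + Ud @ y_prev + bd)
        p = softmax(V @ h + c)      # P(y(t) | y(<t), C)
        outputs.append(p)
        y_prev = p                  # at train time, feed the ground-truth y(t-1) instead (teacher forcing)
    return outputs
```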
Deep Recurrent Networks
Typically, RNNs can be decomposed into 3
blocks:
- Input-to-hidden
- Hidden-to-hidden
- Hidden-to-output
Basic idea here: introduce depth in each of the
above blocks.
Fig (a): lower layers transform the raw input into a
more appropriate representation
Fig (b): add extra layers within the hidden-to-hidden
recurrence
Fig (c): mitigate the longer distance from t to t+1
(due to the added depth) by adding skip connections
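As one concrete example of adding depth, here is a sketch (my own illustration, covering only the stacked-layer variant of Fig (b)) in which each layer's hidden state feeds the layer above at the same time step:

```python
import numpy as np

def deep_rnn(X, Ws, Us, bs):
    """Stacked recurrent layers: layer l sees layer l-1 at time t and itself at t-1.
    Ws, Us, bs are lists of per-layer parameters (hypothetical names)."""
    L = len(Ws)
    states = [np.zeros(W.shape[0]) for W in Ws]
    top = []
    for x in X:
        inp = x
        for l in range(L):
            states[l] = np.tanh(Ws[l] @ states[l] + Us[l] @ inp + bs[l])
            inp = states[l]        # this layer's state is the next layer's input
        top.append(states[-1])     # deepest layer's state at each time step
    return np.stack(top)
```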
Recursive Neural Networks
Generalize computational graph from chain to a
tree.
For a sequence of length T, the depth (number of
compositions of non-linear operations) can be
reduced from O(T) to O(log T); the simplest way to
see this is to solve 2^depth ≈ T for a balanced tree
with branching factor 2.
Open question: How to best structure the tree. In
practice, depends on the problem at hand.
Ideally, the learner itself infers and implements
the appropriate structure given the input.
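A minimal sketch of the balanced-tree case (my own illustration, assuming branching factor 2 and a single shared composition function):

```python
import numpy as np

def recursive_net(X, W, b):
    """Balanced binary tree over the sequence: depth ~ log2(T) instead of T.
    X: (T, d); W: (d, 2d); b: (d,). The same composition is applied at every internal node."""
    nodes = list(X)
    while len(nodes) > 1:
        nxt = []
        for i in range(0, len(nodes) - 1, 2):
            pair = np.concatenate([nodes[i], nodes[i + 1]])
            nxt.append(np.tanh(W @ pair + b))        # compose two children into a parent
        if len(nodes) % 2:                           # an odd leftover node passes up unchanged
            nxt.append(nodes[-1])
        nodes = nxt
    return nodes[0]                                  # root representation of the whole sequence
```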
Challenge of Long-Term Dependencies
Basic problem:
- Gradients propagated over several time steps tend to either vanish or explode
We can think of the recurrence relation

h(t) = W h(t-1)

as a simple RNN lacking inputs and a non-linear activation function. This can be simplified to

h(t) = W^t h(0),

so that if W admits an eigendecomposition of the form

W = Q Λ Q^T,

with Q an orthogonal matrix, the recurrence relation further simplifies to

h(t) = Q Λ^t Q^T h(0).

Thus eigenvalues λ_i with |λ_i| < 1 will tend to decay to zero, while those with |λ_i| > 1 will tend to
explode, eventually causing any component of h(0) that is not aligned with the largest eigenvector to
be discarded.
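A quick NumPy check of this behaviour (my own illustration, using an arbitrary 2×2 W with eigenvalues 0.5 and 1.1):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))   # random orthogonal Q
Lam = np.diag([0.5, 1.1])
W = Q @ Lam @ Q.T                              # W with eigenvalues 0.5 and 1.1

h0 = rng.normal(size=2)
for t in (1, 10, 50, 100):
    h_t = np.linalg.matrix_power(W, t) @ h0
    # coordinates of h(t) in the eigenbasis: they scale as 0.5**t and 1.1**t
    print(t, np.abs(Q.T @ h_t))
```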
Challenge of Long-Term Dependencies
Problem inherent to RNNs. For non-recurrent networks, we can always choose different weights
at different time-steps.
Imagine a scalar weight w being multiplied by itself at each time step.
● The product w^t will either vanish or explode depending on the magnitude of w.
● On the other hand, if each w(t) is independent and identically distributed with mean 0 and
variance v, then the state after n steps is the product of the w(t)'s, and the variance of that product is
O(v^n).
For non-recurrent deep feedforward networks, we can achieve some desired variance v* by
sampling the individual weights with variance (v*)^(1/n), and thus avoid the vanishing and exploding
gradient problem.
Open problem: allow an RNN to learn long-term dependencies while avoiding vanishing/exploding
gradients.
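A small Monte Carlo check (my own illustration) that the variance of a product of n i.i.d. zero-mean weights with variance v is about v^n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, v = 10, 1.5
a = np.sqrt(3 * v)                                # uniform on [-a, a] has mean 0, variance v
w = rng.uniform(-a, a, size=(200_000, n))         # n i.i.d. weights per sample
products = w.prod(axis=1)
print("empirical variance of the product:", products.var())
print("v**n                             :", v**n)  # ~57.7; the two agree up to sampling noise
```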
Echo State Networks
The hidden-to-hidden and input-to-hidden weights are usually the most difficult parameters to learn in an
RNN.
Echo State Networks (ESNs): Set recurrent weights such that hidden units capture history of
past inputs, and learn only the output weights.
Liquid State Machines: the same idea, except with spiking (binary-output) neurons in place of the
continuous-valued hidden units used in ESNs.
This approach is collectively referred to as reservoir computing (hidden units form a reservoir of
temporal features, capturing different aspects of input history).
Echo State Networks
Spectral radius: the largest absolute value among the eigenvalues of the Jacobian J(t) = ∂s(t)/∂s(t-1).
Suppose J has eigenvector v with eigenvalue λ. Further suppose we want to back-propagate a gradient
vector g back in time, and compare this to back-propagating the perturbed vector g + δv. After n
propagation steps, the two executions differ by δ|λ|^n, which grows exponentially large if |λ| > 1 and
vanishes if |λ| < 1. (A similar argument applies to forward propagation in a network with the
non-linearity removed.)
The strategy in ESNs is to fix the recurrent weights to have some bounded spectral radius, such that
information is carried through time but does not explode or vanish.
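A minimal reservoir-computing sketch (my own illustration, not from the slides): the recurrent and input weights are fixed after rescaling to a chosen spectral radius rho, and only the readout is fit by ridge regression. All names and the choice rho = 0.9 are assumptions for illustration.

```python
import numpy as np

def echo_state_network(X, Y, n_hidden=200, rho=0.9, ridge=1e-6, seed=0):
    """Fix random recurrent/input weights (rescaled to spectral radius rho);
    learn only the readout by ridge regression. X: (T, d_in), Y: (T, d_out)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_hidden, n_hidden))
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))   # set the spectral radius of W
    U = rng.normal(size=(n_hidden, X.shape[1]))
    h, H = np.zeros(n_hidden), []
    for x in X:
        h = np.tanh(W @ h + U @ x)                    # reservoir of temporal features
        H.append(h)
    H = np.stack(H)
    # readout weights V: solve (H^T H + ridge*I) V^T = H^T Y
    V = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ Y).T
    return W, U, V
```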
Strategies for Multiple Time Scales
Design models that operate at multiple time scales, e.g. some parts operating at fine-grained time
scales, others at more coarse-grained scales.
Adding Skip Connections Through Time:
Add direct connections from variables in distant past to variables in present, instead of just from time
t to time t+1.
Leaky Units and a Spectrum of Different Time Scales:
Design units with linear self-connections and weights near 1 on those connections. As an analogy, consider
accumulating a running average mu(t) of some variable v(t) via

mu(t) = alpha * mu(t-1) + (1 - alpha) * v(t)

When alpha is close to 1, the running average remembers the past for a long time. Hidden units with
such linear self-connections and weights close to 1 can behave similarly.
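A tiny sketch of the running-average analogy above (my own illustration):

```python
import numpy as np

def leaky_units(V, alpha):
    """mu(t) = alpha * mu(t-1) + (1 - alpha) * v(t); alpha near 1 gives long memory.
    V: (T, d) array of inputs; returns the trace of mu over time."""
    mu = np.zeros_like(V[0])
    trace = []
    for v in V:
        mu = alpha * mu + (1.0 - alpha) * v
        trace.append(mu.copy())
    return np.stack(trace)

# alpha = 0.99 remembers inputs from far in the past; alpha = 0.5 forgets quickly.
```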
Removing connections:
Remove length-one connections and replace them with longer connections, so that some units are forced
to operate on a coarser time scale.
LSTM and Other Gated RNNs
As of now, the most effective sequence models used in practical applications are gated RNNs,
including the long short-term memory (LSTM) and networks based on the gated recurrent unit (GRU).
Basic idea: create paths through time with derivatives that neither explode nor vanish, by allowing the
connection weights to become functions of time. Leaky units allow the network to accumulate
information over a long duration, but there should also be a mechanism to forget that information once
it becomes irrelevant. Ideally, we want the network itself to decide when to forget.
LSTM
The state unit s_i(t) has a linear self-loop similar to
the leaky units described in the previous section.
The self-loop weight is controlled by a forget gate
unit

f(t) = σ(b_f + U_f x(t) + W_f h(t-1)),   σ the logistic sigmoid

x(t): current input vector
h(t): current hidden layer vector
b_f: biases
U_f: input weights
W_f: recurrent weights
LSTM “cell”
LSTM
The LSTM cell internal state is then updated as
follows (products taken elementwise):

s(t) = f(t) * s(t-1) + g(t) * σ(b + U x(t) + W h(t-1))

where
b: biases
U: input weights
W: recurrent weights
g(t): external input gate, computed like the forget gate but
with its own parameters
LSTM “cell”
The output h(t) of the LSTM cell can also be shut off via
the output gate q(t):

h(t) = tanh(s(t)) * q(t),   q(t) = σ(b_o + U_o x(t) + W_o h(t-1))

One can choose to use the cell state s_i(t) as an extra
input (with its own weight) into the three gates of
the i-th unit, as shown in the figure.
LSTMs have been shown to learn long-term
dependencies more easily than simple recurrent
architectures.
LSTM
LSTM “cell”
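Putting the three gates together, here is a minimal NumPy sketch of a single LSTM step (my own illustration following the equations above; parameter names are hypothetical, and many practical implementations use tanh rather than a sigmoid for the cell-input non-linearity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM step. p holds (hypothetical) parameter names:
    forget gate (bf, Uf, Wf), input gate (bg, Ug, Wg),
    cell input (b, U, W), output gate (bo, Uo, Wo)."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)     # forget gate: self-loop weight
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)     # external input gate
    cell_in = sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)  # many implementations use tanh here
    s = f * s_prev + g * cell_in                              # linear self-loop on the cell state
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)     # output gate
    h = np.tanh(s) * q                                        # gated output of the cell
    return h, s
```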
Other Gated RNNs
Main difference from LSTM: a single gating unit simultaneously controls the forgetting factor and the
decision to update the state unit.
u: update gate
r: reset gate
Both gates can individually ignore parts of the state vector.
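For comparison, a sketch of one GRU step in the standard formulation (my own illustration; parameter names are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step (standard formulation)."""
    u = sigmoid(p["bu"] + p["Uu"] @ x + p["Wu"] @ h_prev)          # update gate
    r = sigmoid(p["br"] + p["Ur"] @ x + p["Wr"] @ h_prev)          # reset gate
    h_cand = np.tanh(p["b"] + p["U"] @ x + p["W"] @ (r * h_prev))  # candidate state
    return u * h_prev + (1.0 - u) * h_cand                         # leaky mix of old and new state
```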
Optimization for Long-Term Dependencies
Basic problem: Vanishing and exploding gradients when optimizing RNNs over many time steps.
Clipping Gradients: The cost function can have sharp cliffs as a function of the weights/biases, so the
gradient direction can change dramatically within a short distance. Solution: rescale the gradient whenever
its norm gets too large:

if ||g|| > v:  g ← g v / ||g||

where v is the norm threshold and g is the gradient.
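The norm-clipping rule above as a one-line NumPy helper (my own illustration):

```python
import numpy as np

def clip_gradient(g, v):
    """Rescale g to norm v whenever its norm exceeds the threshold v."""
    norm = np.linalg.norm(g)
    return g * (v / norm) if norm > v else g
```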
Optimization for Long-Term Dependencies
Regularizing to Encourage Information Flow: The previous technique helps with exploding gradients,
but not with vanishing gradients. Ideally, we would like the back-propagated gradient
(∂L/∂h(t)) (∂h(t)/∂h(t-1)) to be as large as ∂L/∂h(t), so that it maintains its magnitude as it gets
back-propagated. We could therefore use the following term as a regularizer to achieve this effect:

Ω = Σ_t ( || (∂L/∂h(t)) (∂h(t)/∂h(t-1)) || / || ∂L/∂h(t) || − 1 )²
