[course site]
Day 3 Lecture 1
Backpropagation
Elisa Sayrol
Acknowledgements
Kevin McGuinness
kevin.mcguinness@dcu.ie
Research Fellow
Insight Centre for Data Analytics
Dublin City University
…in our last lecture
Multilayer perceptrons
When each node in each layer is a linear
combination of all inputs from the previous
layer, the network is called a multilayer
perceptron (MLP).
Weights can be organized into matrices.
The forward pass computes the activations $\mathbf{a}^{(k)}$ at each layer.
Training MLPs
With multiple layers we need to minimize the loss function $\mathcal{L}(f_{\theta}(x), y)$ with respect to all the parameters
of the model $\theta = \{W^{(k)}, b^{(k)}\}$:
$$\theta^{*} = \arg\min_{\theta} \mathcal{L}(f_{\theta}(x), y)$$
Gradient descent: move each parameter $\theta_j$ in small steps in the direction opposite to the derivative of
the loss with respect to $\theta_j$:
$$\theta_j^{(n)} = \theta_j^{(n-1)} - \alpha^{(n-1)} \cdot \nabla_{\theta_j}\mathcal{L}(y, f(x))$$
Stochastic gradient descent (SGD): estimate the gradient with one sample, or better, with a minibatch
of examples.
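As a rough illustration of the update rule above, here is a minimal minibatch SGD loop in NumPy. The `loss_grad` function is a hypothetical stand-in for whatever computes $\nabla_{\theta}\mathcal{L}$ on a minibatch; it is not part of the lecture.

```python
import numpy as np

def sgd_step(theta, grad, alpha):
    """One gradient-descent step: move theta against the gradient."""
    return theta - alpha * grad

def sgd(theta, X, y, loss_grad, alpha=0.1, batch_size=32, epochs=10):
    """Minimal minibatch SGD loop (illustrative sketch).

    loss_grad(theta, X_batch, y_batch) is assumed to return the gradient
    of the loss on that minibatch.
    """
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            theta = sgd_step(theta, loss_grad(theta, X[batch], y[batch]), alpha)
    return theta
```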
For MLPs, gradients can be found using the chain rule of differentiation.
The calculations reveal that the gradient w.r.t. the parameters in layer k depends only on the error from the
layer above and the output from the layer below. This means that the gradients for each layer can be
computed iteratively, starting at the last layer and propagating the error back through the network. This is
known as the backpropagation algorithm.
Backpropagation algorithm
• Computational graphs
• Examples applying the chain rule in simple graphs
• Backpropagation applied to the multilayer perceptron
• Issues on backpropagation and training
Computational graphs
[Figure: four example computational graphs, from the Deep Learning Book, for the expressions below]
$z = xy$
$\hat{y} = \sigma(x^{\top}w + b)$
$\boldsymbol{H} = \max(0,\, \boldsymbol{X}\boldsymbol{W} + \boldsymbol{b})$
$\hat{y} = x^{\top}w$, together with the regularization term $\lambda \sum_i w_i^2$
From Deep Learning Book
Computational graphs
Applying the Chain Rule to Computational Graphs
Scalar case: if $y = g(x)$ and $z = f(g(x)) = f(y)$, then
$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$$
For vectors:
$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}
\qquad\Longleftrightarrow\qquad
\nabla_{\boldsymbol{x}} z = \left(\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}\right)^{\!\top} \nabla_{\boldsymbol{y}} z$$
[Figure: a graph $x_1, x_2 \xrightarrow{\,g\,} y \xrightarrow{\,f\,} z$; each edge is labelled with its local derivative, and the chain rule gives $\frac{dz}{dx_1} = \frac{dz}{dy}\frac{dy}{dx_1}$ and $\frac{dz}{dx_2} = \frac{dz}{dy}\frac{dy}{dx_2}$]
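To make the scalar rule concrete, here is a small numerical check of $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$; the functions $f$ and $g$ are arbitrary choices for illustration, not from the slides.

```python
import numpy as np

# Scalar chain rule, checked numerically for z = f(g(x)).
g  = lambda x: x**2          # y = g(x)
dg = lambda x: 2 * x         # dy/dx
f  = lambda y: np.sin(y)     # z = f(y)
df = lambda y: np.cos(y)     # dz/dy

x = 1.3
analytic = df(g(x)) * dg(x)                      # dz/dx = dz/dy * dy/dx
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
print(analytic, numeric)                         # the two values agree
```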
Computational graphs
Numerical Examples
From Stanford Course: Convolutional Neural Networks for Visual Recognition 2017
$f(x, y, z) = (x + y)\,z$, with intermediate $q = x + y$ and $f = qz$.
Local derivatives: $\dfrac{\partial q}{\partial x} = 1$, $\dfrac{\partial q}{\partial y} = 1$, $\dfrac{\partial f}{\partial q} = z$, $\dfrac{\partial f}{\partial z} = q$.
We want to compute $\dfrac{\partial f}{\partial x}$, $\dfrac{\partial f}{\partial y}$, $\dfrac{\partial f}{\partial z}$.
Example: $x = -2$, $y = 5$, $z = -4$. The forward pass gives $q = 3$ and $f = -12$.
Backward pass: $\dfrac{\partial f}{\partial f} = 1$; $\dfrac{\partial f}{\partial q} = z = -4$; $\dfrac{\partial f}{\partial z} = q = 3$;
$\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial q}\dfrac{\partial q}{\partial x} = -4 \cdot 1 = -4$;
$\dfrac{\partial f}{\partial y} = \dfrac{\partial f}{\partial q}\dfrac{\partial q}{\partial y} = -4 \cdot 1 = -4$.
[Figure: the corresponding computational graph, with the forward values and backward gradients annotated on each edge]
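The same example written out as code, a minimal sketch of the forward and backward passes (variable names are ours):

```python
# Forward and backward pass for f(x, y, z) = (x + y) * z,
# reproducing the values on the slide (x = -2, y = 5, z = -4).
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (reverse order, applying the chain rule at each node)
df_df = 1.0
df_dq = z * df_df              # -4
df_dz = q * df_df              #  3
df_dx = 1.0 * df_dq            # -4
df_dy = 1.0 * df_dq            # -4
print(f, df_dx, df_dy, df_dz)  # -12.0 -4.0 -4.0 3.0
```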
Computational graphs
Numerical Examples
$f(w, x) = \sigma(w_0 x_0 + w_1 x_1 + b)$, where $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ and
$$\frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2}
= \left(\frac{1 + e^{-x} - 1}{1 + e^{-x}}\right)\left(\frac{1}{1 + e^{-x}}\right)
= (1 - \sigma(x))\,\sigma(x)$$
Example: $w_0 = 2$, $x_0 = -1$, $w_1 = -3$, $x_1 = -2$, $b = -3$.
Forward pass: $w_0 x_0 = -2$, $w_1 x_1 = 6$, their sum is $4$, adding $b$ gives $1$, and $\sigma(1) = 0.73$.
Backward pass: starting from a gradient of $1$ at the output, the local sigmoid gradient is $(1 - 0.73)\cdot 0.73 \approx 0.2$; the sum gates pass $0.2$ on to $b$ and to each product, and the product gates give $\partial f/\partial w_0 = 0.2\,x_0 = -0.2$, $\partial f/\partial x_0 = 0.2\,w_0 = 0.4$, $\partial f/\partial w_1 = 0.2\,x_1 = -0.4$, $\partial f/\partial x_1 = 0.2\,w_1 = -0.6$.
[Figure: the computational graph of the sigmoid neuron with these forward values and backward gradients annotated]
From Stanford Course: Convolutional Neural Networks for Visual Recognition
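The sigmoid-neuron example as code, a minimal sketch reproducing the slide's numbers:

```python
import numpy as np

# Sigmoid neuron f(w, x) = sigma(w0*x0 + w1*x1 + b) with the slide's values.
w0, x0, w1, x1, b = 2.0, -1.0, -3.0, -2.0, -3.0
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

# Forward pass
a = w0 * x0 + w1 * x1 + b       # 1.0
f = sigma(a)                    # ~0.73

# Backward pass: local sigmoid gradient, then the sum and product gates
df_da = (1 - f) * f             # ~0.2
grads = {
    "b":  df_da,                # ~0.2
    "w0": df_da * x0,           # ~-0.2
    "x0": df_da * w0,           # ~0.4
    "w1": df_da * x1,           # ~-0.4
    "x1": df_da * w1,           # ~-0.6
}
print(round(f, 2), {k: round(v, 2) for k, v in grads.items()})
```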
Computational graphs
Gates. Backward Pass
$\sigma(x) = \dfrac{1}{1 + e^{-x}}$, $\quad \dfrac{d\sigma(x)}{dx} = (1 - \sigma(x))\,\sigma(x)$
$q = x + y$: $\quad \dfrac{\partial q}{\partial x} = 1$, $\dfrac{\partial q}{\partial y} = 1$
$f = qz$: $\quad \dfrac{\partial f}{\partial q} = z$, $\dfrac{\partial f}{\partial z} = q$
Sum: distributes the gradient to both branches.
Product: switches the gradient values (each input receives the upstream gradient scaled by the other input's value).
Max: routes the gradient only to the higher input branch (it is not sensitive to the lower branch).
In general: the local derivative of the gate's function.
Add branches: branches that split in the forward pass and merge in the backward pass add their gradients.
[Figure: small gate examples — a max gate with inputs 2 and 1 forwards 2, and in the backward pass the upstream gradient (0.2) is routed entirely to the larger input, with 0 to the other]
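These gate rules can be written as tiny backward functions; the sketch below (names are ours) returns the gradients of the two inputs given the upstream gradient `dout`.

```python
# Local backward rules for the three basic gates (minimal sketch).
def add_backward(dout):
    return dout, dout                              # sum: distribute to both branches

def mul_backward(x, y, dout):
    return dout * y, dout * x                      # product: switch the input values

def max_backward(x, y, dout):
    return (dout, 0.0) if x > y else (0.0, dout)   # max: route to the larger input

print(add_backward(0.2))              # (0.2, 0.2)
print(mul_backward(2.0, -3.0, 0.2))   # (-0.6, 0.4)
print(max_backward(2.0, 1.0, 0.2))    # (0.2, 0.0)
```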
Computational graphs
Numerical Examples
From Stanford Course: Convolutional Neural Networks for Visual Recognition 2017
$f(\boldsymbol{x}, \boldsymbol{W}) = \lVert \boldsymbol{W}\cdot\boldsymbol{x} \rVert^{2} = \sum_{i=1}^{n} (\boldsymbol{W}\cdot\boldsymbol{x})_i^{2} = \sum_{i=1}^{n} q_i^{2}$, with $\boldsymbol{q} = \boldsymbol{W}\cdot\boldsymbol{x}$
$$\frac{\partial f}{\partial q_i} = 2 q_i
\quad\Rightarrow\quad
\nabla_{\boldsymbol{q}} f = 2\boldsymbol{q}, \qquad
\nabla_{\boldsymbol{W}} f = 2\,\boldsymbol{q}\cdot\boldsymbol{x}^{\top}, \qquad
\nabla_{\boldsymbol{x}} f = 2\,\boldsymbol{W}^{\top}\cdot\boldsymbol{q}$$
Example: $\boldsymbol{W} = \begin{pmatrix} 0.1 & 0.5 \\ -0.3 & 0.8 \end{pmatrix}$, $\boldsymbol{x} = \begin{pmatrix} 0.2 \\ 0.4 \end{pmatrix}$.
Forward pass: $\boldsymbol{q} = \begin{pmatrix} 0.22 \\ 0.26 \end{pmatrix}$, $f = 0.116$.
Backward pass: $\nabla_{\boldsymbol{q}} f = \begin{pmatrix} 0.44 \\ 0.52 \end{pmatrix}$,
$\nabla_{\boldsymbol{W}} f = \begin{pmatrix} 0.088 & 0.176 \\ 0.104 & 0.208 \end{pmatrix}$,
$\nabla_{\boldsymbol{x}} f = \begin{pmatrix} -0.112 \\ 0.636 \end{pmatrix}$.
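The same vector example in NumPy, a short sketch reproducing the numbers above:

```python
import numpy as np

# f(x, W) = ||W x||^2, with the slide's numbers.
W = np.array([[0.1, 0.5],
              [-0.3, 0.8]])
x = np.array([0.2, 0.4])

# Forward pass
q = W @ x                     # [0.22, 0.26]
f = np.sum(q ** 2)            # 0.116

# Backward pass
grad_q = 2 * q                     # [0.44, 0.52]
grad_W = np.outer(2 * q, x)        # [[0.088, 0.176], [0.104, 0.208]]
grad_x = 2 * W.T @ q               # [-0.112, 0.636]
print(f, grad_W, grad_x)
```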
Backpropagation applied to Multilayer Perceptron
For a single neuron with its linear and non-linear part:
[Figure: a neuron takes the activations $\boldsymbol{h}^{k}$ of the layer below, forms the linear combination $\boldsymbol{a}^{k+1}$ with the weights $\boldsymbol{W}^{k}$, and applies the non-linearity $g(\cdot)$]
$$\boldsymbol{h}^{k+1} = g(\boldsymbol{W}^{k}\boldsymbol{h}^{k} + \boldsymbol{b}^{k}) = g(\boldsymbol{a}^{k+1})$$
$$\frac{\partial \boldsymbol{h}^{k}}{\partial \boldsymbol{a}^{k}} = g'(\boldsymbol{a}^{k})
\qquad
\frac{\partial \boldsymbol{a}^{k+1}}{\partial \boldsymbol{h}^{k}} = \boldsymbol{W}^{k}$$
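A single layer's forward pass and the local derivatives it contributes to backpropagation can be sketched as follows; ReLU is used for $g$ as an example, and the function names are ours.

```python
import numpy as np

# One layer of an MLP: forward pass and local derivatives (sketch).
def layer_forward(h, W, b, g=lambda a: np.maximum(0, a)):
    a = W @ h + b          # linear part a^{k+1}
    return g(a), a         # h^{k+1} = g(a^{k+1}); keep a for the backward pass

def layer_backward(delta_next, a, h, W):
    """Given dL/dh^{k+1} (delta_next), return dL/dW^k, dL/db^k, dL/dh^k."""
    delta_a = delta_next * (a > 0)   # dL/da^{k+1} = dL/dh^{k+1} * g'(a^{k+1}) for ReLU
    dW = np.outer(delta_a, h)        # gradient w.r.t. the weights W^k
    db = delta_a                     # gradient w.r.t. the bias b^k
    dh = W.T @ delta_a               # error propagated to the layer below
    return dW, db, dh
```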
Forward Pass
[Figure: an MLP with Input $x$ → Hidden ($a_2$, $h_2$) → Hidden ($a_3$, $h_3$) → Output ($a_4$, $h_4$), connected by weights $W_1$, $W_2$, $W_3$; the softmax output gives the probability of each class given the input and feeds the Loss. Figure credit: Kevin McGuinness]
The loss function is, e.g., the negative log-likelihood (good for classification), plus a regularization term (L2 norm), also known as weight decay.
Minimize the loss (plus the regularization term) w.r.t. the parameters over the whole training set.
Backward Pass
1. Find the error in the top layer.
2. Compute the weight updates.
3. Backpropagate the error to the layer below.
(To simplify, we do not consider the biases.)
[Figure: the same MLP, with the error from the Loss flowing backwards through $h_4$, $a_4$, $W_3$, $h_3$, $a_3$, $W_2$, $h_2$, $a_2$, $W_1$. Figure credit: Kevin McGuinness]
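The three steps above can be stitched together for the whole network. The following is a minimal sketch (not the lecture's code) for a 2-hidden-layer MLP with ReLU hidden units, a softmax output and negative log-likelihood loss, without biases to match the simplification on the slides; shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, W1, W2, W3):
    a2 = W1 @ x;  h2 = np.maximum(0, a2)
    a3 = W2 @ h2; h3 = np.maximum(0, a3)
    a4 = W3 @ h3; h4 = softmax(a4)           # class probabilities
    return a2, h2, a3, h3, a4, h4

def backward(x, y, W1, W2, W3):
    a2, h2, a3, h3, a4, h4 = forward(x, W1, W2, W3)
    # 1. Error in the top layer (softmax + NLL): dL/da4 = p - onehot(y)
    delta4 = h4.copy(); delta4[y] -= 1.0
    # 2. Compute weight updates and 3. backpropagate the error to the layer below
    dW3 = np.outer(delta4, h3)
    delta3 = (W3.T @ delta4) * (a3 > 0)
    dW2 = np.outer(delta3, h2)
    delta2 = (W2.T @ delta3) * (a2 > 0)
    dW1 = np.outer(delta2, x)
    return dW1, dW2, dW3

# Tiny usage example with random weights (3 inputs, 4-4 hidden units, 2 classes)
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (4, 3))
W2 = rng.normal(0, 0.1, (4, 4))
W3 = rng.normal(0, 0.1, (2, 4))
dW1, dW2, dW3 = backward(rng.normal(size=3), 1, W1, W2, W3)
```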
Issues on Backpropagation and Training
Gradient descent: move each parameter $\theta_j$ in small steps in the direction opposite to the derivative of the loss with respect to it:
$$\theta^{(n)} = \theta^{(n-1)} - \alpha^{(n-1)} \cdot \nabla_{\theta}\mathcal{L}(y, f(x)) - \lambda\,\theta^{(n-1)}$$
Weight decay: penalizes large weights and distributes values among all the parameters.
Stochastic gradient descent (SGD): estimate the gradient with one sample, or better, with a
minibatch of examples.
Momentum: the movement direction of the parameters averages the current gradient estimate with
previous ones.
Several strategies, known as optimizers, have been proposed to update the weights.
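A minimal sketch of one such update, combining weight decay and momentum; the particular "velocity" formulation below is an illustrative assumption, and other optimizers differ in the details.

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, alpha=0.01, mu=0.9, lam=1e-4):
    """One SGD step with momentum and weight decay (sketch)."""
    # Average the current gradient with previous directions, penalizing large weights.
    velocity = mu * velocity + grad + lam * theta
    theta = theta - alpha * velocity
    return theta, velocity
```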
Weight initialization
Need to pick a starting point for gradient
descent: an initial set of weights
Zero is a very bad idea!
Zero is a critical point
Error signal will not propagate
Gradients will be zero: no progress
A constant value is also a bad idea:
we need to break symmetry.
Use small random values:
e.g., zero-mean Gaussian noise with constant variance.
Ideally we want inputs to activation functions
(e.g. sigmoid, tanh, ReLU) to be mostly in the
linear area to allow larger gradients to
propagate and converge faster.
[Figure: the tanh activation — the gradient is large near 0 (good for learning) and small in the saturated tails (bad)]
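A small sketch of the suggested initialization, zero-mean Gaussian noise with constant variance (the 0.01 standard deviation is an illustrative choice, not prescribed by the slides):

```python
import numpy as np

def init_weights(fan_out, fan_in, std=0.01, rng=np.random.default_rng()):
    """Small random initialization to break symmetry (sketch)."""
    W = rng.normal(0.0, std, size=(fan_out, fan_in))  # small zero-mean Gaussian values
    b = np.zeros(fan_out)                             # biases can start at zero
    return W, b
```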
In the backward pass you might be in the flat part of the sigmoid (or of any other saturating activation
function, like tanh), so the derivative tends to zero and your training loss will not go down:
“Vanishing Gradients”
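A quick numerical illustration of why this happens: the sigmoid derivative is at most 0.25, so a product of one such factor per layer shrinks very fast (the depth of 10 below is an arbitrary choice for illustration).

```python
import numpy as np

sigma  = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigma = lambda x: sigma(x) * (1 - sigma(x))

print(dsigma(0.0), dsigma(5.0))      # 0.25 vs ~0.0066: the flat region kills the gradient
print(np.prod([dsigma(0.0)] * 10))   # ~1e-6 after 10 layers, even in the best case
```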
Note on hyperparameters
So far we have lots of hyperparameters to choose:
1. Learning rate (α)
2. Regularization constant (λ)
3. Number of epochs
4. Number of hidden layers
5. Nodes in each hidden layer
6. Weight initialization strategy
7. Loss function
8. Activation functions
9. …