Batch Normalization
Accelerating Deep Network Training by
Reducing Internal Covariate Shift
By: Seraj Alhamidi
Instructor: Associate Prof. Mohammed Alhanjouri
June 2019
About the paper
Sergey Ioffe
Google Inc.,
sioffe@google.com
Christian Szegedy
Google Inc.,
szegedy@google.com
Authors
Presented at: The 32nd International Conference on Machine Learning (ICML 2015)
Publishers:
Google AI: https://ai.google/research/pubs/pub43442
Journal of Machine Learning Research: http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
Cornell University (arXiv): https://arxiv.org/abs/1502.03167
A paper with over 6,000 citations (ICML 2015).
Outlines
Introduction
Issues with Training Deep Neural Networks
Batch Normalization
Ablation Study
Comparison with State-of-the-Art Approaches
Some notes
My work
Introduction
ILSVRC Competition in 2015
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), sponsored by Google and Facebook.
ImageNet is a dataset of over 15 million labelled high-resolution images in around 22,000 categories, used for classification and localization tasks.
ILSVRC uses a subset of ImageNet with 1000 categories.
On 6 Feb 2015, Microsoft proposed PReLU-Net, which has a 4.94% error rate, surpassing the human error rate of 5.1%.
Five days later, on 11 Feb 2015, Google proposed BN-Inception, which has a 4.8% error rate.
BN-Inception reaches the previous best accuracy in only 7% of the training time needed to reach the same accuracy without BN.
Issues with Training Deep Neural Networks
Vanishing Gradient
Saturating nonlinearities (like tanh or sigmoid) cannot be used for deep networks.
For example, consider the sigmoid function and its derivative: when the input to the sigmoid becomes very large or very small, the derivative becomes close to zero.
Figure: the sigmoid function and its derivative.

Backpropagation update rule:

$w(\kappa + 1) = w(\kappa) - \alpha \frac{\partial L}{\partial w}$

$L = 0.5\,(t - a)^2$, where $t$: target, $a$: output of the activation

$a^{l} = \sigma(x^{l})$, where $\sigma$: activation function, $x^{l}$: input to the activation

$x^{l} = w_{i,j}\, a^{l-1} + w_{i+1,j}\, a^{l-1} + \cdots$

$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial x} \cdot \frac{\partial x}{\partial w} \equiv \frac{\partial L}{\partial a} \cdot \sigma'(x^{l}) \cdot \frac{\partial x}{\partial w}$

Whenever $\sigma'(x^{l}) \approx 0$, the whole product, and hence the weight update, vanishes.
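To make the vanishing concrete, here is a small illustrative NumPy sketch (my own, not from the slides) that evaluates $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and shows it collapsing toward zero as the input grows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); its maximum is 0.25, at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigma'(x) = {sigmoid_derivative(x):.6f}")
# x =   0.0  sigma'(x) = 0.250000
# x =   2.0  sigma'(x) = 0.104994
# x =   5.0  sigma'(x) = 0.006648
# x =  10.0  sigma'(x) = 0.000045
```

Since every backward step through a sigmoid multiplies the gradient by a factor of at most 0.25, stacking many saturating layers shrinks the gradient geometrically.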
Issues with Training Deep Neural Networks
Vanishing Gradient
Figures: the sigmoid function with restricted inputs; the rectified linear unit $f(x) = x^{+} = \max(0, x)$.
Some ways around this are to use:
 Batch normalization layers (which can also resolve the issue)
 Nonlinearities that do not saturate, like rectified linear units (ReLU)
 Smaller learning rates
 Careful weight initialization
Issues with Training Deep Neural Networks
Internal Covariate shift
 Covariate – The Features of the Input Data
 Covariate Shift – the change in the distribution of the inputs to layers in the middle of a deep neural network is referred to by the technical name "internal covariate shift".
Neural networks learn efficiently when the distribution fed to their layers is somewhat:
Zero-centered
Constant through time and data
i.e., the distribution of the data being fed to the layers should not vary too much across the mini-batches fed to the network.
Issues with Training Deep Neural Networks
Internal Covariate shift in deep NN
Figure: at iterations i, i+1, and i+2, every update produces a new relation (distribution), especially in deep layers and at the beginning of training.
Issues with Training Deep Neural Networks
Take one layer from the internal layers and assume its input $x$ has the distribution given below; also suppose the function learned by the layer is represented by the dashed line.
Now suppose that, after the gradient update, the distribution of $x$ changes to something different:
the loss for this mini-batch is higher compared with the previous loss.
Issues with Training Deep Neural Networks
 Every time, we force layer $(l)$ to fit a new perceptron,
 because we give it a new distribution every time.
 In deep layers we may get a butterfly effect: even though the change in $w$ at the first layers is small, it compounds through depth and makes the network unstable.
Batch Normalization
 Batch Normalization attempts to normalize a batch of inputs before they are fed to a non-linear activation unit (like ReLU, sigmoid, etc.) during training,
 so that the input to the activation function across each training mini-batch has a mean of 0 and a variance of 1.
 Applying batch normalization to the activation $\sigma(Wx + b)$ results in $\sigma(BN(Wx + b))$, where $BN$ is the batch normalizing transform.
Batch Normalization
To make each dimension unit Gaussian, we apply:

$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$

where $E[x^{(k)}]$ and $Var[x^{(k)}]$ are respectively the mean and variance of the $k$-th feature over a batch. Then we transform $\hat{x}^{(k)}$ as:

$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$

where $\gamma$ and $\beta$ are the learnable parameters of the so-called batch normalization layer, and $k$ indexes a feature (dimension) within the mini-batch.
Batch Normalization
Transformation of inputs
Forward Propagation through Batch Normalization layer
So far we have shown the normalization of multiple samples of just one feature.

Input: values of $x$ over a mini-batch $B = \{x_1 \dots x_m\}$; parameters to be learned: $\gamma, \beta$
Output: $\mathcal{Y}_i = BN_{\gamma,\beta}(x_i)$

Flow of computation through the Batch Normalization layer:

$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$  (mini-batch mean)

$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$  (mini-batch variance)

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$  (normalize)

$\mathcal{Y}_i = \gamma \hat{x}_i + \beta = BN_{\gamma,\beta}(x_i)$  (scale and shift)

$\epsilon$ is a small value (e.g. $1 \times 10^{-8}$) so that we never divide by zero.
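As a sanity check of the four equations above, here is a minimal NumPy sketch of the forward pass (an illustrative implementation with my own variable names, not the paper's code), treating each column of x as one feature:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-8):
    """Forward pass of BN over a mini-batch x of shape (m, features)."""
    mu = x.mean(axis=0)                       # mini-batch mean, per feature
    var = x.var(axis=0)                       # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    y = gamma * x_hat + beta                  # scale and shift
    return y

x = np.random.randn(32, 100) * 3.0 + 5.0      # m = 32 samples, 100 features
gamma, beta = np.ones(100), np.zeros(100)
y = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=0)[:3], y.var(axis=0)[:3])  # ~0 and ~1 for each feature
```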
Forward Propagation through Batch Normalization layer
Figure: the same computation applied independently to two features.
THE MAGIC
Imagine that the network learns that the optimum that will minimize the cost is to cancel the BN effect! SGD can adapt $\gamma$ and $\beta$ to do exactly that:

$\beta = E[x] = \mu_B$

$\gamma = \sqrt{Var[x]} = \sqrt{\sigma_B^2 + \epsilon}$

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

$\mathcal{Y}_i = \gamma \hat{x}_i + \beta = \sqrt{\sigma_B^2 + \epsilon} \cdot \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \mu_B = x_i$

This is the identity transform: since $\gamma$ and $\beta$ are adapted by SGD, the network can recover the original activations whenever that is what minimizes the loss.
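Verifying this identity numerically (a standalone illustrative sketch; the numbers are arbitrary):

```python
import numpy as np

x = np.random.randn(32, 4)
mu, var, eps = x.mean(axis=0), x.var(axis=0), 1e-8
x_hat = (x - mu) / np.sqrt(var + eps)
gamma = np.sqrt(var + eps)          # this gamma cancels the division
beta = mu                           # this beta cancels the subtraction
y = gamma * x_hat + beta
print(np.allclose(y, x))            # True: BN reduces to the identity
```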
Backpropagation through Batch Normalization layer
$\mathcal{Y}_i = \gamma \hat{x}_i + \beta = BN_{\gamma,\beta}(x_i)$

Starting from the upstream gradient $\frac{\partial L}{\partial \mathcal{Y}_i}$, the chain rule yields the intermediate gradients $\frac{\partial L}{\partial \hat{x}_i}$, $\frac{\partial L}{\partial \sigma_B^2}$, and $\frac{\partial L}{\partial \mu_B}$, and from them $\frac{\partial L}{\partial x_i}$, $\frac{\partial L}{\partial \gamma}$, and $\frac{\partial L}{\partial \beta}$.

SGD updates:

$\beta(k + 1) = \beta(k) - \alpha \frac{\partial L}{\partial \beta}$

$\gamma(k + 1) = \gamma(k) - \alpha \frac{\partial L}{\partial \gamma}$

The forward computations being differentiated are the same as above:

$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$, $\quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$, $\quad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, $\quad \mathcal{Y}_i = \gamma \hat{x}_i + \beta = BN_{\gamma,\beta}(x_i)$
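A compact sketch of these chain-rule gradients (my own implementation following the standard derivation of the paper's Algorithm 1 derivatives; names are mine):

```python
import numpy as np

def batchnorm_backward(dy, x, gamma, eps=1e-8):
    """Gradients of the BN layer.
    dy: upstream gradient dL/dY, shape (m, features)."""
    m = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    std = np.sqrt(var + eps)
    x_hat = (x - mu) / std

    dbeta = dy.sum(axis=0)                     # dL/dbeta
    dgamma = (dy * x_hat).sum(axis=0)          # dL/dgamma
    dx_hat = dy * gamma                        # dL/dx_hat
    dvar = (dx_hat * (x - mu) * -0.5 * std**-3).sum(axis=0)   # dL/dvar
    dmu = (-dx_hat / std).sum(axis=0) + dvar * (-2.0 / m) * (x - mu).sum(axis=0)
    dx = dx_hat / std + dvar * 2.0 * (x - mu) / m + dmu / m   # dL/dx
    return dx, dgamma, dbeta
```

An SGD step then applies the two update rules above to $\gamma$ and $\beta$.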
Batch Normalization during test time

At test time we use the population statistics, rather than the mini-batch statistics. Effectively, we process mini-batches of size $m$ during training and use their statistics to compute:

$E[x^{(k)}] = E_B[\mu_B]$

$Var[x^{(k)}] = \frac{m}{m - 1} E_B[\sigma_B^2]$

Alternatively, we can use an exponential moving average to estimate the mean and variance to be used during test time; we estimate the running averages as:

$\mu_{running} = \alpha \cdot \mu_{running} + (1 - \alpha) \cdot \mu_B$

$\sigma_{running}^2 = \alpha \cdot \sigma_{running}^2 + (1 - \alpha) \cdot \sigma_B^2$

where $\alpha$ is a constant smoothing factor between 0 and 1 that represents the degree of dependence on the previous observations.
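A hedged sketch of this bookkeeping (illustrative only; note that $\alpha$ here is the smoothing factor of the moving average, not the learning rate):

```python
import numpy as np

class BNRunningStats:
    """Track running mean/variance during training; use them at test time."""
    def __init__(self, num_features, alpha=0.9, eps=1e-8):
        self.alpha, self.eps = alpha, eps
        self.mu_running = np.zeros(num_features)
        self.var_running = np.ones(num_features)

    def update(self, batch):
        # called once per training mini-batch
        mu_B, var_B = batch.mean(axis=0), batch.var(axis=0)
        self.mu_running = self.alpha * self.mu_running + (1 - self.alpha) * mu_B
        self.var_running = self.alpha * self.var_running + (1 - self.alpha) * var_B

    def normalize_test(self, x, gamma, beta):
        # test time: population estimates replace mini-batch statistics
        x_hat = (x - self.mu_running) / np.sqrt(self.var_running + self.eps)
        return gamma * x_hat + beta
```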
Ablation Study
MNIST dataset
28×28 binary images as input; 3 fully connected (FC) hidden layers with 100 activations each; the last hidden layer is followed by an output layer of 10 activations, as there are 10 digits. The loss is the cross-entropy loss.
The BN network is much more stable.
Ablation Study
ImageNet with 1000 categories on GoogLeNet/Inception (2014),
weighing 138 GB for the training images, 6.3 GB for the validation images, and 13 GB for the testing images.
Figure: the CNN architectures tested.
Comparison with State-of-the-Art Approaches
Some Notes : cross-entropy loss function
We use the cross-entropy loss function; compare two networks on the same three training items.
neural network (1)
Computed | targets | correct?
------------------------------------------------
0.3 0.3 0.4 | 0 0 1 (democrat) | yes
0.3 0.4 0.3 | 0 1 0 (republican) | yes
0.1 0.2 0.7 | 1 0 0 (other) | no
neural network (2)
Computed | targets | correct?
------------------------------------------------
0.1 0.2 0.7 | 0 0 1 (democrat) | yes
0.1 0.7 0.2 | 0 1 0 (republican) | yes
0.3 0.4 0.3 | 1 0 0 (other) | no
cross-entropy error for the first training item of network (1):
−( (ln(0.3) ∗ 0) + (ln(0.3) ∗ 0) + (ln(0.4) ∗ 1) ) = −ln(0.4)
average cross-entropy error (ACE) for network (1):
−(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38
average cross-entropy error (ACE) for network (2):
−(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64
mean squared error for the first item of network (1):
(0.3 − 0)^2 + (0.3 − 0)^2 + (0.4 − 1)^2 = 0.54
The MSE for the first network is
(0.54 + 0.54 + 1.34) / 3 = 0.81
The MSE for the second, better, network is
(0.14 + 0.14 + 0.74) / 3 = 0.34
The gap is larger for cross-entropy: (1.38 − 0.64 = 0.74) > (0.81 − 0.34 = 0.47).
The ln() function in cross-entropy takes into account how close a prediction is, not just whether it is correct.
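The arithmetic above can be reproduced directly with a few lines of NumPy (a small sketch using the slide's numbers):

```python
import numpy as np

# predictions and one-hot targets for the three items of each network
net1 = np.array([[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]])
net2 = np.array([[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]])
targets = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]])

def ace(pred, t):
    # average cross-entropy error: only the true class's log-probability counts
    return -np.mean(np.sum(t * np.log(pred), axis=1))

def mse(pred, t):
    # per-item squared error, averaged over the three items
    return np.mean(np.sum((pred - t) ** 2, axis=1))

print(ace(net1, targets), ace(net2, targets))  # ~1.38, ~0.64
print(mse(net1, targets), mse(net2, targets))  # ~0.81, ~0.34
```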
Some Notes : Convolutional Neural Network (CNN)
 It is used for image classification tasks.
 The first convolutional network that could recognize handwritten digits was developed between 1988 and 1993 at Bell Labs.
Some Notes : Convolutional Neural Network (CNN)
 Convolution Layer (Conv Layer)
 Pooling Layer
 ReLU Layer
 Fully Connected Layer (Flatten)
Some Notes : Convolutional Neural Network (CNN)
Convolution Layer (Conv Layer)
Convolution works by sliding a window (the filter) across the input.
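A minimal sketch of that sliding window (valid convolution, stride 1, single channel; illustrative only):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image`, computing a dot product at each position."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + kH, j:j + kW]
            out[i, j] = np.sum(window * kernel)
    return out

edge_filter = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])  # vertical edges
feature_map = conv2d(np.random.rand(28, 28), edge_filter)      # shape (26, 26)
```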
Some Notes : Convolutional Neural Network (CNN)
2D filters
Some Notes : Convolutional Neural Network (CNN)
3D filters
Pooling Layer (Sub-sampling or Down-sampling)
 Reduce the size of the feature maps by applying some function, such as the average or the maximum, over each window (hence called down-sampling); see the sketch after this list.
 Make the extracted features more robust by making them more invariant to scale and orientation changes.
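For example, 2×2 max-pooling with stride 2 halves each spatial dimension (an illustrative sketch):

```python
import numpy as np

def max_pool(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

pooled = max_pool(np.random.rand(26, 26))   # shape (13, 13)
```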
ReLU Layer
Remove all the black (negative) elements from the feature map, keeping only those carrying a positive value (the grey and white colours):

$Output = \max(0, Input)$

This introduces non-linearity into our ConvNet.
Fully Connected Layer (Flatten)
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
MY WORK — MNIST on Google Colab
Inputs = 28*28 = 784
Layer 1&2 = 100 nodes | Layer 3 = 10 nodes
All Activations are sigmoid
Cross-entropy loss function
The train and test sets are already split in TensorFlow.
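A hedged reconstruction of this setup in Keras (the slide does not show the actual training script; the layer sizes follow the description above, and details such as the optimizer and epoch count are my assumptions):

```python
import tensorflow as tf

# MNIST ships with TensorFlow, already split into train and test sets
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # 28*28 = 784 inputs
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),            # BN before the activation
    tf.keras.layers.Activation("sigmoid"),
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"), # 10 digit classes
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```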
Figure: the distribution over time of the inputs to the sigmoid function of the first five neurons in the second layer. Batch normalization has a visible and significant effect of removing variance/noise in these inputs.
Final accuracy: 99%.
MY WORK — Caltech dataset

We use ten (10) classes from the Caltech dataset instead of the ImageNet dataset, because of ImageNet's huge size. The task is to classify the input image; the data is split 1/3, 1/3, 1/3 into train, validation, and test sets.

With BN: #epochs = 150, LR = $1 \times 10^{-3}$
Without BN: #epochs = 150, LR = $1 \times 10^{-3}$
Without BN: #epochs = 250, LR = $1 \times 10^{-3}$

Final accuracies (three panels): 90.54% | 94.44% | 96.04%
Thanks for listening!