Batch Normalization
Accelerating Deep Network Training by
Reducing Internal Covariate Shift
By: Seraj Alhamidi
Instructor: Associate Prof. Mohammed Alhanjouri
June 2019
About the paper
Sergey Ioffe
Google Inc.,
sioffe@google.com
Christian Szegedy
Google Inc.,
szegedy@google.com
Authors
Presented at: The 32nd International Conference on Machine Learning (ICML 2015)
Publishers:
Google AI: https://ai.google/research/pubs/pub43442
Journal of Machine Learning Research: http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
Cornell University (arXiv): https://arxiv.org/abs/1502.03167
A paper with over 6,000 citations (ICML 2015).
Outlines
Introduction
Issues with Training Deep Neural Networks
Batch Normalization
Ablation Study
Comparison with State-of-the-Art Approaches
Some notes
My work
Introduction
ILSVRC Competition in 2015
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), sponsored by Google and Facebook.
ImageNet is a dataset of over 15 million labelled high-resolution images in around 22,000 categories, used for classification and localization tasks.
ILSVRC uses a subset of ImageNet with 1000 categories.
On 6 Feb 2015, Microsoft proposed PReLU-Net, which has a 4.94% error rate, surpassing the human error rate of 5.1%.
Five days later, on 11 Feb 2015, Google proposed BN-Inception, which has a 4.8% error rate.
BN-Inception reaches the previous best accuracy in only 7% of the training time needed to reach the same accuracy without BN.
Issues with Training Deep Neural Networks
Vanishing Gradient
Saturating nonlinearities (like tanh or sigmoid) cannot be used for deep networks.
For example, consider the sigmoid function and its derivative: when the input to the sigmoid becomes very large or very small, the derivative becomes close to zero.
Figure: the sigmoid function and its derivative.

Backpropagation update rule:

$w(\kappa + 1) = w(\kappa) - \alpha \frac{\partial L}{\partial w}$

$L = 0.5\,(t - a)^2$, where $t$: target, $a$: output of the activation

$a^{l} = \sigma(x^{l})$, where $\sigma$: activation function, $x^{l}$: input to the activation

$x^{l} = w_{i,j}\, a^{l-1} + w_{i+1,j}\, a^{l-1} + \cdots$

$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial x} \cdot \frac{\partial x}{\partial w} \equiv \frac{\partial L}{\partial a} \cdot \sigma'(x^{l}) \cdot \frac{\partial x}{\partial w}$

Whenever $\sigma'(x^{l}) \approx 0$, the whole product, and hence the weight update, vanishes.
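To make the vanishing concrete, here is a small illustrative NumPy sketch (my own, not from the slides) that evaluates $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and shows it collapsing toward zero as the input grows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); its maximum is 0.25, at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigma'(x) = {sigmoid_derivative(x):.6f}")
# x =   0.0  sigma'(x) = 0.250000
# x =   2.0  sigma'(x) = 0.104994
# x =   5.0  sigma'(x) = 0.006648
# x =  10.0  sigma'(x) = 0.000045
```

Since every backward step through a sigmoid multiplies the gradient by a factor of at most 0.25, stacking many saturating layers shrinks the gradient geometrically.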
Issues with Training Deep Neural Networks
Vanishing Gradient
Figures: the sigmoid function with restricted inputs; the rectified linear unit $f(x) = x^{+} = \max(0, x)$.
Some ways around this are to use:
 Batch normalization layers (which can also resolve the issue)
 Nonlinearities that do not saturate, like rectified linear units (ReLU)
 Smaller learning rates
 Careful weight initialization
Issues with Training Deep Neural Networks
Internal Covariate shift
 Covariate – The Features of the Input Data
 Covariate Shift – the change in the distribution of the inputs to layers in the middle of a deep neural network is referred to by the technical name "internal covariate shift".
Neural networks learn efficiently when the distribution fed to their layers is somewhat:
Zero-centered
Constant through time and data
i.e., the distribution of the data being fed to the layers should not vary too much across the mini-batches fed to the network.
Issues with Training Deep Neural Networks
Internal Covariate shift in deep NN
Figure: at iterations i, i+1, and i+2, every update produces a new relation (distribution), especially in deep layers and at the beginning of training.
Issues with Training Deep Neural Networks
Take one layer from the internal layers and assume its input $x$ has the distribution given below; also suppose the function learned by the layer is represented by the dashed line.
Now suppose that, after the gradient update, the distribution of $x$ changes to something different:
the loss for this mini-batch is higher compared with the previous loss.
Issues with Training Deep Neural Networks
 Every time, we force layer $(l)$ to fit a new perceptron,
 because we give it a new distribution every time.
 In deep layers we may get a butterfly effect: even though the change in $w$ at the first layers is small, it compounds through depth and makes the network unstable.
Batch Normalization
 Batch Normalization attempts to normalize a batch of inputs before they are fed to a non-linear activation unit (like ReLU, sigmoid, etc.) during training,
 so that the input to the activation function across each training mini-batch has a mean of 0 and a variance of 1.
 Applying batch normalization to the activation $\sigma(Wx + b)$ results in $\sigma(BN(Wx + b))$, where $BN$ is the batch normalizing transform.
Batch Normalization
To make each dimension unit Gaussian, we apply:

$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$

where $E[x^{(k)}]$ and $Var[x^{(k)}]$ are respectively the mean and variance of the $k$-th feature over a batch. Then we transform $\hat{x}^{(k)}$ as:

$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$

where $\gamma$ and $\beta$ are the learnable parameters of the so-called batch normalization layer, and $k$ indexes a feature (dimension) within the mini-batch.
Batch Normalization
Transformation of inputs
Forward Propagation through Batch Normalization layer
So far we have shown the normalization of multiple samples of just one feature.

Input: values of $x$ over a mini-batch $B = \{x_1 \dots x_m\}$; parameters to be learned: $\gamma, \beta$
Output: $\mathcal{Y}_i = BN_{\gamma,\beta}(x_i)$

Flow of computation through the Batch Normalization layer:

$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$  (mini-batch mean)

$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$  (mini-batch variance)

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$  (normalize)

$\mathcal{Y}_i = \gamma \hat{x}_i + \beta = BN_{\gamma,\beta}(x_i)$  (scale and shift)

$\epsilon$ is a small value (e.g. $1 \times 10^{-8}$) so that we never divide by zero.
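As a sanity check of the four equations above, here is a minimal NumPy sketch of the forward pass (an illustrative implementation with my own variable names, not the paper's code), treating each column of x as one feature:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-8):
    """Forward pass of BN over a mini-batch x of shape (m, features)."""
    mu = x.mean(axis=0)                       # mini-batch mean, per feature
    var = x.var(axis=0)                       # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    y = gamma * x_hat + beta                  # scale and shift
    return y

x = np.random.randn(32, 100) * 3.0 + 5.0      # m = 32 samples, 100 features
gamma, beta = np.ones(100), np.zeros(100)
y = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=0)[:3], y.var(axis=0)[:3])  # ~0 and ~1 for each feature
```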
Forward Propagation through Batch Normalization layer
Figure: the same computation applied independently to two features.
THE MAGIC
Imagine that the network learns that the optimum that will minimize the cost is to cancel the BN effect! SGD can adapt $\gamma$ and $\beta$ to do exactly that:

$\beta = E[x] = \mu_B$

$\gamma = \sqrt{Var[x]} = \sqrt{\sigma_B^2 + \epsilon}$

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

$\mathcal{Y}_i = \gamma \hat{x}_i + \beta = \sqrt{\sigma_B^2 + \epsilon} \cdot \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \mu_B = x_i$

This is the identity transform: since $\gamma$ and $\beta$ are adapted by SGD, the network can recover the original activations whenever that is what minimizes the loss.
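Verifying this identity numerically (a standalone illustrative sketch; the numbers are arbitrary):

```python
import numpy as np

x = np.random.randn(32, 4)
mu, var, eps = x.mean(axis=0), x.var(axis=0), 1e-8
x_hat = (x - mu) / np.sqrt(var + eps)
gamma = np.sqrt(var + eps)          # this gamma cancels the division
beta = mu                           # this beta cancels the subtraction
y = gamma * x_hat + beta
print(np.allclose(y, x))            # True: BN reduces to the identity
```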
Backpropagation through Batch Normalization layer
$\mathcal{Y}_i = \gamma \hat{x}_i + \beta = BN_{\gamma,\beta}(x_i)$

Starting from the upstream gradient $\frac{\partial L}{\partial \mathcal{Y}_i}$, the chain rule yields the intermediate gradients $\frac{\partial L}{\partial \hat{x}_i}$, $\frac{\partial L}{\partial \sigma_B^2}$, and $\frac{\partial L}{\partial \mu_B}$, and from them $\frac{\partial L}{\partial x_i}$, $\frac{\partial L}{\partial \gamma}$, and $\frac{\partial L}{\partial \beta}$.

SGD updates:

$\beta(k + 1) = \beta(k) - \alpha \frac{\partial L}{\partial \beta}$

$\gamma(k + 1) = \gamma(k) - \alpha \frac{\partial L}{\partial \gamma}$

The forward computations being differentiated are the same as above:

$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$, $\quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$, $\quad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, $\quad \mathcal{Y}_i = \gamma \hat{x}_i + \beta = BN_{\gamma,\beta}(x_i)$
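A compact sketch of these chain-rule gradients (my own implementation following the standard derivation of the paper's Algorithm 1 derivatives; names are mine):

```python
import numpy as np

def batchnorm_backward(dy, x, gamma, eps=1e-8):
    """Gradients of the BN layer.
    dy: upstream gradient dL/dY, shape (m, features)."""
    m = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    std = np.sqrt(var + eps)
    x_hat = (x - mu) / std

    dbeta = dy.sum(axis=0)                     # dL/dbeta
    dgamma = (dy * x_hat).sum(axis=0)          # dL/dgamma
    dx_hat = dy * gamma                        # dL/dx_hat
    dvar = (dx_hat * (x - mu) * -0.5 * std**-3).sum(axis=0)   # dL/dvar
    dmu = (-dx_hat / std).sum(axis=0) + dvar * (-2.0 / m) * (x - mu).sum(axis=0)
    dx = dx_hat / std + dvar * 2.0 * (x - mu) / m + dmu / m   # dL/dx
    return dx, dgamma, dbeta
```

An SGD step then applies the two update rules above to $\gamma$ and $\beta$.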
Batch Normalization during test time

At test time we use the population statistics, rather than the mini-batch statistics. Effectively, we process mini-batches of size $m$ during training and use their statistics to compute:

$E[x^{(k)}] = E_B[\mu_B]$

$Var[x^{(k)}] = \frac{m}{m - 1} E_B[\sigma_B^2]$

Alternatively, we can use an exponential moving average to estimate the mean and variance to be used during test time; we estimate the running averages as:

$\mu_{running} = \alpha \cdot \mu_{running} + (1 - \alpha) \cdot \mu_B$

$\sigma_{running}^2 = \alpha \cdot \sigma_{running}^2 + (1 - \alpha) \cdot \sigma_B^2$

where $\alpha$ is a constant smoothing factor between 0 and 1 that represents the degree of dependence on the previous observations.
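A hedged sketch of this bookkeeping (illustrative only; note that $\alpha$ here is the smoothing factor of the moving average, not the learning rate):

```python
import numpy as np

class BNRunningStats:
    """Track running mean/variance during training; use them at test time."""
    def __init__(self, num_features, alpha=0.9, eps=1e-8):
        self.alpha, self.eps = alpha, eps
        self.mu_running = np.zeros(num_features)
        self.var_running = np.ones(num_features)

    def update(self, batch):
        # called once per training mini-batch
        mu_B, var_B = batch.mean(axis=0), batch.var(axis=0)
        self.mu_running = self.alpha * self.mu_running + (1 - self.alpha) * mu_B
        self.var_running = self.alpha * self.var_running + (1 - self.alpha) * var_B

    def normalize_test(self, x, gamma, beta):
        # test time: population estimates replace mini-batch statistics
        x_hat = (x - self.mu_running) / np.sqrt(self.var_running + self.eps)
        return gamma * x_hat + beta
```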
Ablation Study
MNIST dataset
28×28 binary images as input; 3 fully connected (FC) hidden layers with 100 activations each; the last hidden layer is followed by an output layer of 10 activations, as there are 10 digits. The loss is the cross-entropy loss.
The BN network is much more stable.
Ablation Study
ImageNet with 1000 categories on GoogLeNet/Inception (2014),
weighing 138 GB for the training images, 6.3 GB for the validation images, and 13 GB for the testing images.
Figure: the CNN architectures tested.
Comparison with State-of-the-Art Approaches
Some Notes : cross-entropy loss function
We use the cross-entropy loss function; compare two networks on the same three training items.
neural network (1)
Computed | targets | correct?
------------------------------------------------
0.3 0.3 0.4 | 0 0 1 (democrat) | yes
0.3 0.4 0.3 | 0 1 0 (republican) | yes
0.1 0.2 0.7 | 1 0 0 (other) | no
neural network (2)
Computed | targets | correct?
------------------------------------------------
0.1 0.2 0.7 | 0 0 1 (democrat) | yes
0.1 0.7 0.2 | 0 1 0 (republican) | yes
0.3 0.4 0.3 | 1 0 0 (other) | no
cross-entropy error for the first training item of network (1):
−( (ln(0.3) ∗ 0) + (ln(0.3) ∗ 0) + (ln(0.4) ∗ 1) ) = −ln(0.4)
average cross-entropy error (ACE) for network (1):
−(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38
average cross-entropy error (ACE) for network (2):
−(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64
mean squared error for the first item of network (1):
(0.3 − 0)^2 + (0.3 − 0)^2 + (0.4 − 1)^2 = 0.54
The MSE for the first network is
(0.54 + 0.54 + 1.34) / 3 = 0.81
The MSE for the second, better, network is
(0.14 + 0.14 + 0.74) / 3 = 0.34
The gap is larger for cross-entropy: (1.38 − 0.64 = 0.74) > (0.81 − 0.34 = 0.47).
The ln() function in cross-entropy takes into account how close a prediction is, not just whether it is correct.
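The arithmetic above can be reproduced directly with a few lines of NumPy (a small sketch using the slide's numbers):

```python
import numpy as np

# predictions and one-hot targets for the three items of each network
net1 = np.array([[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]])
net2 = np.array([[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]])
targets = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]])

def ace(pred, t):
    # average cross-entropy error: only the true class's log-probability counts
    return -np.mean(np.sum(t * np.log(pred), axis=1))

def mse(pred, t):
    # per-item squared error, averaged over the three items
    return np.mean(np.sum((pred - t) ** 2, axis=1))

print(ace(net1, targets), ace(net2, targets))  # ~1.38, ~0.64
print(mse(net1, targets), mse(net2, targets))  # ~0.81, ~0.34
```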
Some Notes : Convolutional Neural Network (CNN)
 It is used for image classification tasks.
 The first convolutional network that could recognize handwritten digits was developed between 1988 and 1993 at Bell Labs.
Some Notes : Convolutional Neural Network (CNN)
 Convolution Layer (Conv Layer)
 Pooling Layer
 ReLU Layer
 Fully Connected Layer (Flatten)
Some Notes : Convolutional Neural Network (CNN)
Convolution Layer (Conv Layer)
Convolution works by sliding a window (the filter) across the input.
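A minimal sketch of that sliding window (valid convolution, stride 1, single channel; illustrative only):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image`, computing a dot product at each position."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + kH, j:j + kW]
            out[i, j] = np.sum(window * kernel)
    return out

edge_filter = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])  # vertical edges
feature_map = conv2d(np.random.rand(28, 28), edge_filter)      # shape (26, 26)
```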
Some Notes : Convolutional Neural Network (CNN)
2D filters
Some Notes : Convolutional Neural Network (CNN)
3D filters
Pooling Layer (Sub-sampling or Down-sampling)
 Reduce the size of the feature maps by applying some function, such as the average or the maximum, over each window (hence called down-sampling); see the sketch after this list.
 Make the extracted features more robust by making them more invariant to scale and orientation changes.
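For example, 2×2 max-pooling with stride 2 halves each spatial dimension (an illustrative sketch):

```python
import numpy as np

def max_pool(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

pooled = max_pool(np.random.rand(26, 26))   # shape (13, 13)
```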
ReLU Layer
Remove all the black (negative) elements from the feature map, keeping only those carrying a positive value (the grey and white colours):

$Output = \max(0, Input)$

This introduces non-linearity into our ConvNet.
Fully Connected Layer (Flatten)
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
MY WORK — MNIST on Google Colab
Inputs = 28*28 = 784
Layer 1&2 = 100 nodes | Layer 3 = 10 nodes
All Activations are sigmoid
Cross-entropy loss function
The train and test sets are already split in TensorFlow.
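A hedged reconstruction of this setup in Keras (the slide does not show the actual training script; the layer sizes follow the description above, and details such as the optimizer and epoch count are my assumptions):

```python
import tensorflow as tf

# MNIST ships with TensorFlow, already split into train and test sets
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # 28*28 = 784 inputs
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),            # BN before the activation
    tf.keras.layers.Activation("sigmoid"),
    tf.keras.layers.Dense(100),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"), # 10 digit classes
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```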
Figure: the distribution over time of the inputs to the sigmoid function of the first five neurons in the second layer. Batch normalization has a visible and significant effect of removing variance/noise in these inputs.
Final accuracy: 99%.
MY WORK — Caltech dataset

We use ten (10) classes from the Caltech dataset instead of the ImageNet dataset, because of ImageNet's huge size. The task is to classify the input image; the data is split 1/3, 1/3, 1/3 into train, validation, and test sets.

With BN: #epochs = 150, LR = $1 \times 10^{-3}$
Without BN: #epochs = 150, LR = $1 \times 10^{-3}$
Without BN: #epochs = 250, LR = $1 \times 10^{-3}$

Final accuracies (three panels): 90.54% | 94.44% | 96.04%
Thanks for listening!