Accelerating Stochastic Gradient Descent Using Adaptive Mini-Batch Size
Authors:
● Muayyad Alsadi <alsadi@gmail.com>
● Rawan Ghnemat <r.ghnemat@psut.edu.jo>
● Arafat Awajan <awajan@psut.edu.jo>
What if you could just fast-forward through the training process?
8x speed-up: training becomes feasible even on commodity CPUs (without GPUs), reaching high accuracy within hours.
Background
Artificial Neural Network (ANN) / Some Types and Applications
● Fully connected multi-layer Deep Neural Networks (DNN)
● Convolutional Neural Network (CNN)
○ Spatial (Image): classification/regression
○ Context (Text and NLP): classification/regression
● Recurrent Neural Network
○ Sequences (Text letters, stock events)
■ Seq2Seq: Translation, summarization, ...
■ Seq2Label
■ Seq2Value
Deep Learning / some challenges
● Massive number of trainable weights to tune
● Massive number of Multiply–Accumulate (MAC) operations
○ Low throughput (e.g., images/second)
● Vanishing/Exploding Gradients
○ Slow to converge
[Figure: a Deep Neural Network maps Input to Output, with millions of operations per item.]
[Figure: a training step: a sample (batch) is fed through the Deep Neural Network, the output is compared with the given labels, and a batch update is applied to the weights.]
Batch Learning vs. Stochastic Learning
“Stochastic Learning”, or Stochastic Gradient Descent (SGD), takes small random samples of the training data (mini-batches) instead of processing the whole training set at once (“Batch Learning”). It converges faster and handles noise and non-linearity better, which is why batch learning has been considered inefficient [1][2].
1. Y. LeCun, “Efficient BackProp.”
2. D. R. Wilson and T. R. Martinez, “The general inefficiency of batch training for gradient descent learning.”
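To make the contrast concrete, here is a minimal NumPy sketch (not from the slides) of one full-batch gradient step versus a pass of mini-batch SGD steps on a toy linear least-squares model; the data, model, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                       # toy dataset (illustrative)
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)
w = np.zeros(20)                                      # trainable weights
lr = 0.01

def grad(w, Xb, yb):
    # gradient of mean squared error for a linear model
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Batch learning: one weight update per pass over the whole training set.
w_batch = w - lr * grad(w, X, y)

# Stochastic learning (mini-batch SGD): many small, noisier updates per pass.
batch_size = 8
w_sgd = w.copy()
for start in range(0, len(X), batch_size):
    Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    w_sgd -= lr * grad(w_sgd, Xb, yb)
```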
[Figure: a stochastic training step: a small sample is fed through the Deep Neural Network, the output is compared with the given labels, and the weights are updated.]
Factors Affecting Convergence Speed
● Sample size
● Design complexity / depth / number of MAC operators
● # Classes
● Learning rate
● Momentum
● Optimization algorithm
Literature Review
● Sample size related
● Learning rate related
● Optimization algorithm related
● NN design related
● Transforming Input/Output
Literature Review / see paper
● Sample size related
○ Very large batch sizes (e.g., 8192 images per batch)
○ Increasing the batch size during training
● Learning rate related
○ Per-dimension
○ Fading
○ Momentum
○ Cyclic
○ Warm restarts...
● Optimization algorithm related
○ AdaGrad, Adam, AdaDelta, ...
● NN design related
○ SqueezeNet, MobileNet
○ Separable operators
○ Batch normalization
○ Early auxiliary (AUX) classifier branches
● Transforming Input/Output
○ Reusing an existing model (fine-tuning)
○ Knowledge transfer
Proposed Method
Start with a very high-risk initialization phase: train with an extremely small mini-batch size (e.g., 4 or 8 samples per batch). Then “Train-Measure-Adapt-Repeat”: as long as evaluation results keep improving, keep these fast-forwarding settings; when progress stalls, switch to a larger mini-batch size (for example, 32 samples per batch).
Proposed Method
The fast-forward criterion (ff_criteria) can be defined with respect to the change in evaluation accuracy, for example:
if (acc_new > acc_old) then
    mode = ff
else
    mode = normal
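A minimal Python sketch of the Train-Measure-Adapt-Repeat loop built around that criterion. The helpers train_for(steps, batch_size) and evaluate() are hypothetical placeholders for whatever framework is in use, and the step count and batch sizes are illustrative; only the switching rule follows the slides.

```python
# Hypothetical skeleton of the adaptive schedule; train_for() and evaluate()
# are placeholders for the actual training and evaluation routines.
FF_BATCH_SIZE = 8        # fast-forward setting (extremely small mini-batch)
NORMAL_BATCH_SIZE = 32   # fallback setting when progress stalls
STEPS_PER_ROUND = 1000   # how often to measure (illustrative)

def adaptive_training(train_for, evaluate, rounds=50):
    mode = "ff"                                  # start fast-forwarding (cold network)
    acc_old = evaluate()
    for _ in range(rounds):
        batch_size = FF_BATCH_SIZE if mode == "ff" else NORMAL_BATCH_SIZE
        train_for(STEPS_PER_ROUND, batch_size)   # Train
        acc_new = evaluate()                     # Measure
        # Adapt: keep fast-forwarding only while evaluation accuracy improves.
        mode = "ff" if acc_new > acc_old else "normal"
        acc_old = acc_new                        # Repeat
```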
Use extremely small mini-batch size
● Especially for cold start (initialization)
● Instead of a very large batch size like 8,192 samples per batch, use an extremely small mini-batch size like 4 or 8 samples per batch (as long as the hardware is fully utilized)
● The network is still cold: it already performs badly, so you have nothing to lose.
Why it ticks faster?
Assuming the hardware is fully utilized and has constant throughput (images/second), processing a mini-batch of 8 images is 4 times faster than processing a batch of 32 images, so the model gets 4 times more weight updates in the same time.
A good first guess for the batch size is the number of cores in your machine (the scope of the paper is training on commodity hardware).
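A back-of-the-envelope illustration of that claim, using an assumed constant throughput of 100 images/second (the number is hypothetical; only the ratio matters):

```python
THROUGHPUT = 100.0   # images/second, assumed constant (hypothetical value)

for batch_size in (8, 32):
    updates_per_hour = THROUGHPUT / batch_size * 3600
    print(f"batch_size={batch_size:>2}: {updates_per_hour:,.0f} weight updates/hour")

# batch_size= 8: 45,000 weight updates/hour
# batch_size=32: 11,250 weight updates/hour  -> 4x fewer updates
```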
It ticks faster, but does it converge faster?
By using a 4x smaller batch size, we are doing 4x more higher-risk updates.
Batch size has a linear effect on speed, but its effect on accuracy is not linear.
Don't judge accuracy by the number of steps; look at accuracy over wall-clock time.
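That last point is purely a change in how results are plotted. A minimal matplotlib sketch (with made-up log records, not the paper's measurements) of plotting evaluation accuracy against elapsed time rather than against steps:

```python
import matplotlib.pyplot as plt

# Hypothetical evaluation logs: (elapsed_hours, eval_accuracy) per batch size.
logs = {
    8:  [(0.5, 0.30), (1.0, 0.55), (2.0, 0.70)],
    32: [(0.5, 0.15), (1.0, 0.30), (2.0, 0.45)],
}

for batch_size, points in logs.items():
    hours, acc = zip(*points)
    plt.plot(hours, acc, marker="o", label=f"batch size {batch_size}")

plt.xlabel("wall-clock time (hours)")   # time on the x-axis, not training steps
plt.ylabel("evaluation accuracy")
plt.legend()
plt.show()
```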
Experiments: Fine-tuning
Inception v1 pre-trained on the ImageNet-1K task.
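For flavor, a hedged Keras sketch of this kind of fine-tuning setup. It uses InceptionV3 as a stand-in (the experiments used Inception v1, which is not bundled with tf.keras), and the head size, hyperparameters, and dataset pipeline are placeholders rather than the paper's exact configuration.

```python
import tensorflow as tf

NUM_CLASSES = 200          # e.g., a Birds-200-style task; placeholder value
BATCH_SIZE = 8             # extremely small mini-batch, per the proposed method

# Pretrained backbone (InceptionV3 as a stand-in for Inception v1).
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False     # fine-tune only the new classifier head at first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds / eval_ds are assumed tf.data pipelines of (image, label) pairs,
# resized to 299x299 and preprocessed with
# tf.keras.applications.inception_v3.preprocess_input.
# model.fit(train_ds.batch(BATCH_SIZE), validation_data=eval_ds.batch(BATCH_SIZE))
```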
Experiment: The Caltech-UCSD Birds-200-2011 Dataset
Accuracy over steps (misleading): the accuracy of batch-size=10 (in cyan) is always below the others.
Accuracy over time: batch-size=10 (in cyan) reached 56% accuracy in 2 hours, while the others lagged behind at 40%, 28%, and 10%.
Experiment: The Oxford-IIIT Pet Dataset (Pets-37)
Eval accuracy over time: a mini-batch size of 8 reached 80% accuracy within only one hour.
Experiment: Adaptive part on the Birds-200 Dataset
Eval accuracy over time: reaching ~72% accuracy within about 2 hours and 20 minutes.
Summary: Train-Measure-Adapt-Repeat
● Start with a very small mini-batch size and a large learning rate
○ BatchSize=4; LearningRate=0.1
● Let the mini-batch size be cyclic
○ Switch between two settings (batch sizes of 8 and 32)
○ Adaptive, non-periodic, based on evaluation accuracy
○ Change the bounds of the two settings as you go (see the sketch below)
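One way to read that last bullet, as a hypothetical configuration: the pair of batch-size settings the schedule switches between is itself adjusted over the course of training. The phase boundaries and values below are illustrative, not from the paper.

```python
# Hypothetical phase schedule: (fast-forward batch size, normal batch size).
# Early phases take the riskiest, smallest batches; later phases are calmer.
PHASES = [
    {"until_step": 10_000, "ff": 4,  "normal": 16},   # cold start
    {"until_step": 50_000, "ff": 8,  "normal": 32},   # main training
    {"until_step": None,   "ff": 16, "normal": 64},   # late fine-tuning
]

def batch_size_bounds(step):
    """Return the (ff, normal) batch-size pair for the current training step."""
    for phase in PHASES:
        if phase["until_step"] is None or step < phase["until_step"]:
            return phase["ff"], phase["normal"]
    return PHASES[-1]["ff"], PHASES[-1]["normal"]
```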
Q & A
Thank you
Follow me on GitHub: http://muayyad-alsadi.github.io/