Accelerating Stochastic Gradient Descent Using Adaptive Mini-Batch Size
Authors:
● Muayyad Alsadi <alsadi@gmail.com>
● Rawan Ghnemat <r.ghnemat@psut.edu.jo>
● Arafat Awajan <awajan@psut.edu.jo>
What if you could just fast-forward through the training process?
8x speed-up: training becomes feasible even on commodity CPUs (without GPUs), reaching high accuracy within hours.
Background
Artificial Neural Network (ANN) / Some Types and Applications
● Fully connected multi-layer Deep Neural Networks (DNN)
● Convolutional Neural Network (CNN)
○ Spatial (Image): classification/regression
○ Context (Text and NLP): classification/regression
● Recurrent Neural Network
○ Sequences (Text letters, stock events)
■ Seq2Seq: Translation, summarization, ...
■ Seq2Label
■ Seq2Value
Deep Learning / some challenges
● Massive number of trainable weights to tune
● Massive number of Multiply–Accumulate (MAC) operations
○ Low throughput (e.g., images/second)
● Vanishing/Exploding Gradients
○ Slow to converge
[Figure: a Deep Neural Network maps Input to Output, with millions of operations per item.]
[Figure: a training step: a sample (batch) is fed through the Deep Neural Network, the output is compared with the given labels, and a batch update is applied to the weights.]
Batch Learning vs. Stochastic Learning
“Stochastic Learning”, or Stochastic Gradient Descent (SGD), takes small random samples of the training data (mini-batches) instead of processing the whole training set at once (“Batch Learning”). It converges faster and handles noise and non-linearity better, which is why batch learning has been considered inefficient [1][2].
1. Y. LeCun, “Efficient BackProp.”
2. D. R. Wilson and T. R. Martinez, “The general inefficiency of batch training for gradient descent learning.”
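To make the contrast concrete, here is a minimal NumPy sketch (not from the slides) of one full-batch gradient step versus a pass of mini-batch SGD steps on a toy linear least-squares model; the data, model, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                       # toy dataset (illustrative)
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)
w = np.zeros(20)                                      # trainable weights
lr = 0.01

def grad(w, Xb, yb):
    # gradient of mean squared error for a linear model
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Batch learning: one weight update per pass over the whole training set.
w_batch = w - lr * grad(w, X, y)

# Stochastic learning (mini-batch SGD): many small, noisier updates per pass.
batch_size = 8
w_sgd = w.copy()
for start in range(0, len(X), batch_size):
    Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    w_sgd -= lr * grad(w_sgd, Xb, yb)
```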
[Figure: a stochastic training step: a small sample is fed through the Deep Neural Network, the output is compared with the given labels, and the weights are updated.]
Factors Affecting Convergence Speed
● Sample size
● Design complexity / depth / number of MAC operators
● # Classes
● Learning rate
● Momentum
● Optimization algorithm
Literature Review
● Sample size related
● Learning rate related
● Optimization algorithm related
● NN design related
● Transforming Input/Output
Literature Review / see paper
● Sample size related
○ Very large batch sizes (e.g., 8192 images per batch)
○ Increasing the batch size during training
● Learning rate related
○ Per-dimension
○ Fading
○ Momentum
○ Cyclic
○ Warm restarts...
● Optimization algorithm related
○ AdaGrad, Adam, AdaDelta, ...
● NN design related
○ SqueezeNet, MobileNet
○ Separable operators
○ Batch normalization
○ Early auxiliary (AUX) classifier branches
● Transforming Input/Output
○ Reusing an existing model (fine-tuning)
○ Knowledge transfer
Proposed Method
Start with a very high-risk initialization phase: train with an extremely small mini-batch size (e.g., 4 or 8 samples per batch). Then “Train-Measure-Adapt-Repeat”: as long as evaluation results keep improving, keep these fast-forwarding settings; when progress stalls, switch to a larger mini-batch size (for example, 32 samples per batch).
Proposed Method
The fast-forward criterion (ff_criteria) can be defined with respect to the change in evaluation accuracy, for example:
if (acc_new > acc_old) then
    mode = ff
else
    mode = normal
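A minimal Python sketch of the Train-Measure-Adapt-Repeat loop built around that criterion. The helpers train_for(steps, batch_size) and evaluate() are hypothetical placeholders for whatever framework is in use, and the step count and batch sizes are illustrative; only the switching rule follows the slides.

```python
# Hypothetical skeleton of the adaptive schedule; train_for() and evaluate()
# are placeholders for the actual training and evaluation routines.
FF_BATCH_SIZE = 8        # fast-forward setting (extremely small mini-batch)
NORMAL_BATCH_SIZE = 32   # fallback setting when progress stalls
STEPS_PER_ROUND = 1000   # how often to measure (illustrative)

def adaptive_training(train_for, evaluate, rounds=50):
    mode = "ff"                                  # start fast-forwarding (cold network)
    acc_old = evaluate()
    for _ in range(rounds):
        batch_size = FF_BATCH_SIZE if mode == "ff" else NORMAL_BATCH_SIZE
        train_for(STEPS_PER_ROUND, batch_size)   # Train
        acc_new = evaluate()                     # Measure
        # Adapt: keep fast-forwarding only while evaluation accuracy improves.
        mode = "ff" if acc_new > acc_old else "normal"
        acc_old = acc_new                        # Repeat
```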
Use extremely small mini-batch size
● Especially for cold start (initialization)
● Instead of a very large batch size like 8,192 samples per batch, use an extremely small mini-batch size like 4 or 8 samples per batch (as long as the hardware is fully utilized)
● The network is still cold: it already performs badly, so you have nothing to lose.
Why it ticks faster?
Assuming the hardware is fully utilized and has constant throughput (images/second), processing a mini-batch of 8 images is 4 times faster than processing a batch of 32 images, so the model gets 4 times more weight updates in the same time.
A good first guess for the batch size is the number of cores in your machine (the scope of the paper is training on commodity hardware).
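A back-of-the-envelope illustration of that claim, using an assumed constant throughput of 100 images/second (the number is hypothetical; only the ratio matters):

```python
THROUGHPUT = 100.0   # images/second, assumed constant (hypothetical value)

for batch_size in (8, 32):
    updates_per_hour = THROUGHPUT / batch_size * 3600
    print(f"batch_size={batch_size:>2}: {updates_per_hour:,.0f} weight updates/hour")

# batch_size= 8: 45,000 weight updates/hour
# batch_size=32: 11,250 weight updates/hour  -> 4x fewer updates
```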
It ticks faster, but does it converge faster?
By using a 4x smaller batch size, we are doing 4x more higher-risk updates.
Batch size has a linear effect on speed, but its effect on accuracy is not linear.
Don't judge accuracy by the number of steps; look at accuracy over wall-clock time.
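That last point is purely a change in how results are plotted. A minimal matplotlib sketch (with made-up log records, not the paper's measurements) of plotting evaluation accuracy against elapsed time rather than against steps:

```python
import matplotlib.pyplot as plt

# Hypothetical evaluation logs: (elapsed_hours, eval_accuracy) per batch size.
logs = {
    8:  [(0.5, 0.30), (1.0, 0.55), (2.0, 0.70)],
    32: [(0.5, 0.15), (1.0, 0.30), (2.0, 0.45)],
}

for batch_size, points in logs.items():
    hours, acc = zip(*points)
    plt.plot(hours, acc, marker="o", label=f"batch size {batch_size}")

plt.xlabel("wall-clock time (hours)")   # time on the x-axis, not training steps
plt.ylabel("evaluation accuracy")
plt.legend()
plt.show()
```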
Experiments: Fine-tuning
Inception v1 pre-trained on the ImageNet-1K task.
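For flavor, a hedged Keras sketch of this kind of fine-tuning setup. It uses InceptionV3 as a stand-in (the experiments used Inception v1, which is not bundled with tf.keras), and the head size, hyperparameters, and dataset pipeline are placeholders rather than the paper's exact configuration.

```python
import tensorflow as tf

NUM_CLASSES = 200          # e.g., a Birds-200-style task; placeholder value
BATCH_SIZE = 8             # extremely small mini-batch, per the proposed method

# Pretrained backbone (InceptionV3 as a stand-in for Inception v1).
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False     # fine-tune only the new classifier head at first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds / eval_ds are assumed tf.data pipelines of (image, label) pairs,
# resized to 299x299 and preprocessed with
# tf.keras.applications.inception_v3.preprocess_input.
# model.fit(train_ds.batch(BATCH_SIZE), validation_data=eval_ds.batch(BATCH_SIZE))
```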
Experiment: The Caltech-UCSD Birds-200-2011 Dataset
Accuracy over steps (misleading): the accuracy of batch-size=10 (in cyan) is always below the others.
Accuracy over time: batch-size=10 (in cyan) reached 56% accuracy in 2 hours, while the others lagged behind at 40%, 28%, and 10%.
Experiment: The Oxford-IIIT Pet Dataset (Pets-37)
Eval accuracy over time: a mini-batch size of 8 reached 80% accuracy within only one hour.
Experiment: Adaptive part on the Birds-200 Dataset
Eval accuracy over time: reaching ~72% accuracy within about 2 hours and 20 minutes.
Summary: Train-Measure-Adapt-Repeat
● Start with a very small mini-batch size and a large learning rate
○ BatchSize=4; LearningRate=0.1
● Let the mini-batch size be cyclic
○ Switch between two settings (batch sizes of 8 and 32)
○ Adaptive, non-periodic, based on evaluation accuracy
○ Change the bounds of the two settings as you go (see the sketch below)
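One way to read that last bullet, as a hypothetical configuration: the pair of batch-size settings the schedule switches between is itself adjusted over the course of training. The phase boundaries and values below are illustrative, not from the paper.

```python
# Hypothetical phase schedule: (fast-forward batch size, normal batch size).
# Early phases take the riskiest, smallest batches; later phases are calmer.
PHASES = [
    {"until_step": 10_000, "ff": 4,  "normal": 16},   # cold start
    {"until_step": 50_000, "ff": 8,  "normal": 32},   # main training
    {"until_step": None,   "ff": 16, "normal": 64},   # late fine-tuning
]

def batch_size_bounds(step):
    """Return the (ff, normal) batch-size pair for the current training step."""
    for phase in PHASES:
        if phase["until_step"] is None or step < phase["until_step"]:
            return phase["ff"], phase["normal"]
    return PHASES[-1]["ff"], PHASES[-1]["normal"]
```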
Q & A
Thank you
Follow me on GitHub: http://muayyad-alsadi.github.io/