Bag of tricks for image classification
with convolutional neural networks
Choi Dongmin
Yonsei University Severance Hospital CCIDS
Contents
• Abstract

• Introduction

• Training Procedures

• Efficient Training

• Model Tweaks

• Training Refinements

• Transfer Learning

• Conclusion
Abstract
• Focus on training refinements, such as data augmentations
and optimization methods

• Tips on how to apply these refinements

• Raised ResNet-50’s top-1 accuracy from 75.3% to 79.29%

on ImageNet

• These improvements also lead to better transfer learning
performance
Introduction
• Image classification performance has improved with a variety of
architectures, including AlexNet, VGG, ResNet, DenseNet and
NASNet.

• However, these advancements did not come solely from improved
model architectures

- Training refinements (e.g. loss functions, data preprocessing,
optimization) have also played a major role
Introduction
Training Procedures
• 1. Baseline Training Procedure

• 2. Experiment Results
Training Procedures
1. Baseline Training Procedure - During Training
• (1) Randomly sample an image and decode it into 32-bit floating point raw pixel
values in [0, 255].

• (2) Randomly crop a rectangular region whose aspect ratio is randomly sampled
in [3/4, 4/3] and area randomly sampled in [8%, 100%], then resize the cropped
region into a 224-by-224 square image.

• (3) Flip horizontally with 0.5 probability.

• (4) Scale hue, saturation, and brightness with coefficients uniformly drawn from

[0.6, 1.4].

• (5) Add PCA noise with a coefficient sampled from a normal distribution N(0, 0.1).

• (6) Normalize RGB channels by subtracting 123.68, 116.779, 103.939 and
dividing by 58.393, 57.12, 57.375, respectively.
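A rough sketch of this training-time pipeline is given below, assuming torchvision is used (the slides do not name a framework); step (5), the PCA lighting noise, is omitted, and ColorJitter only approximates the [0.6, 1.4] HSB scaling.

```python
# Hypothetical torchvision version of the baseline training augmentation (steps 2-4, 6).
import torchvision.transforms as T

train_transform = T.Compose([
    # (2) aspect ratio in [3/4, 4/3], area in [8%, 100%], resized to 224x224
    T.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    # (3) horizontal flip with probability 0.5
    T.RandomHorizontalFlip(p=0.5),
    # (4) jitter brightness / saturation / hue (approximation of the coefficient scaling)
    T.ColorJitter(brightness=0.4, saturation=0.4, hue=0.4),
    T.ToTensor(),  # float tensor scaled to [0, 1]
    # (6) per-channel normalization: the slide's means / stds divided by 255
    T.Normalize(mean=[123.68 / 255, 116.779 / 255, 103.939 / 255],
                std=[58.393 / 255, 57.12 / 255, 57.375 / 255]),
])
```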
Training Procedures
1. Baseline Training Procedure - During Validation
• Resize each image’s shorter edge to 256 pixels while keeping its aspect
ratio

• Crop out 224-by-224 region in the center

• Normalize RGB channels similar to training

• No random data augmentations
Training Procedures
1. Baseline Training Procedure - Weights Initialization
• Convolutional and fully-connected layers : Xavier algorithm

• All biases : 0

• For batch normalization : γ vectors initialized to 1 / β vectors initialized to 0
Training Procedures
1. Baseline Training Procedure - Optimizer and Hyper-parameters
• Optimizer : Nesterov Accelerated Gradient (NAG) descent

• Epochs : 120

• Batch size : 256

• GPU : 8 NVIDIA V100 GPUs

• Learning rate : 0.1 (decay at 30, 60, and 90 epochs)
Training Procedures
2. Experiment Results
Efficient Training
• 1. Large-batch Training

• 2. Low-precision Training

• 3. Experiment Results
Efficient Training
1. Large-batch Training
• It is more efficient to use a larger batch size

• But a large batch size decreases the convergence rate for convex
problems and degrades validation accuracy.

• Four heuristics that help scale the batch size up

- Linear scaling learning rate

- Learning rate warmup

- Zero γ

- No bias decay
Efficient Training
1. Large-batch Training : Linear Scaling Learning Rate
• A large batch size reduces the noise in the gradient (= reduces variance)

• It is possible to increase the learning rate with a large batch size

• In Goyal et al*, linearly increasing the learning rate with the batch size
works empirically for ResNet-50 training.

• Following He et al** (who use 0.1 for batch size 256), set the initial learning rate to 0.1 × b/256 (b = batch size)
Goyal et al*, Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017
He et al**, Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
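A minimal sketch of the linear scaling rule, assuming PyTorch SGD with Nesterov momentum as in the baseline (the model and weight decay value below are illustrative):

```python
# Hypothetical example: scale the initial learning rate linearly with the batch size.
import torch
from torchvision.models import resnet50

batch_size = 1024                      # e.g. scaled up from the baseline 256
init_lr = 0.1 * batch_size / 256       # 0.1 x b/256 linear scaling rule

model = resnet50()
optimizer = torch.optim.SGD(model.parameters(), lr=init_lr,
                            momentum=0.9, weight_decay=1e-4, nesterov=True)
```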
Efficient Training
1. Large-batch Training : Learning Rate Warmup
• Initial network parameters are typically far away from the final solution

- Using too large a learning rate may result in numerical instability

• Use a small learning rate at the beginning, then switch back to

the initial learning rate once the training process is stable

• In Goyal et al*, set the learning rate as below

- Use the first m batches to warm up

- Initial learning rate : η

- At batch i (1 ≤ i ≤ m), the learning rate is iη / m
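A small sketch of gradual warmup under this schedule (the function name and 1-indexed batch convention are illustrative):

```python
# Linearly ramp the learning rate from 0 to eta over the first m warmup batches.
def warmup_lr(i: int, m: int, eta: float) -> float:
    """Learning rate at batch i (1-indexed); returns i * eta / m during warmup."""
    return eta * i / m if i <= m else eta
```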
Efficient Training
1. Large-batch Training : Zero γ
• In batch normalization, the output is γx̂ + β

( x̂ : standardized input / γ, β : learnable parameters initialized to 1s and 0s)

• Zero γ initialization heuristic

- Initialize γ = 0 for all BN layers that sit at the end of a residual block

- All residual blocks then just return their inputs [ output = x + block(x) ]

- Mimics a network with fewer layers, which is easier to train

at the initial stage
https://mc.ai/bag-of-tricks-for-image-classification-with-convolutional-neural-networks-in-keras/
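A sketch of the zero-γ heuristic for torchvision's ResNet, assuming `bn3` / `bn2` are the final BN layers of the Bottleneck / BasicBlock residual blocks (torchvision also exposes this directly via `resnet50(zero_init_residual=True)`):

```python
# Zero the gamma (weight) of the last BN layer in every residual block.
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.resnet import BasicBlock, Bottleneck

model = resnet50()
for m in model.modules():
    if isinstance(m, Bottleneck):
        nn.init.zeros_(m.bn3.weight)   # gamma of the block's final BN -> 0
    elif isinstance(m, BasicBlock):
        nn.init.zeros_(m.bn2.weight)
```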
Efficient Training
1. Large-batch Training : No bias decay
• Weight decay is often applied to all learnable parameters in order to 

avoid overfitting

• According to Jia et al*, it is recommended to apply weight decay only to
weights (not to biases)

• That is, the other parameters (biases, and γ and β in BN layers) are left
unregularized
Jia et al*, Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205, 2018. 
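A minimal sketch of "no bias decay" with PyTorch parameter groups, using the heuristic that 1-D parameters are biases or BN γ/β (hyper-parameter values are illustrative):

```python
# Apply weight decay only to convolution / fully-connected weights.
import torch
from torchvision.models import resnet50

model = resnet50()
decay, no_decay = [], []
for name, p in model.named_parameters():
    # biases and BN gamma/beta are 1-D; conv/FC weights are 2-D or 4-D
    (no_decay if p.ndim == 1 else decay).append(p)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.1, momentum=0.9, nesterov=True)
```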

Efficient Training
2. Low-precision training
• Neural networks are commonly trained with 32-bit floating point (FP32)

• However, newer hardware supports lower-precision data types

• 2 to 3 times faster after switching

from FP32 to FP16 on V100

• Micikevicius et al* suggested

mixed precision training

( forward and backward passes in FP16,

weight updates in FP32 )
Micikevicius et al*, Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
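A sketch of mixed-precision training using PyTorch AMP rather than the paper's original setup; `GradScaler` scales the loss so FP16 gradients do not underflow, while master weights and updates stay in FP32 (the model, dummy batch, and hyper-parameters are placeholders, and a CUDA device is assumed):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
scaler = torch.cuda.amp.GradScaler()
# one dummy batch as a stand-in for a real data loader
loader = [(torch.randn(8, 3, 224, 224).cuda(), torch.randint(0, 1000, (8,)).cuda())]

for images, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # forward pass in FP16 where safe
        loss = F.cross_entropy(model(images), targets)
    scaler.scale(loss).backward()                # scaled FP16 backward pass
    scaler.step(optimizer)                       # unscale, then FP32 weight update
    scaler.update()
```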
Efficient Training
3. Experiment Results
Model Tweaks
• A minor adjustment to the network architecture, such as
changing the stride

• Such a tweak often barely changes the computational
complexity but might have a non-negligible effect on accuracy
Model Tweaks
Original ResNet-50
Input Stem
Downsampling Block
Model Tweaks
Original Downsampling Block
ResNet-B Downsampling Block
Using a 1×1 kernel with a stride of 2

→ ignores three-quarters of the input feature map
First appeared in a Torch implementation
Model Tweaks
Original Input Stem
ResNet-C Input Stem
A 7×7 convolution is 5.4 times more expensive

than a 3×3 convolution
Originally proposed in Inception-v2
Model Tweaks
Original Input Stem
ResNet-D Downsampling Block
Path B also ignores 3/4 of the input feature maps

because of the stride size
Model tweak proposed in this paper
Original Downsampling Block
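A minimal sketch of the ResNet-D shortcut (path B): average pooling handles the spatial downsampling so the following 1×1 convolution runs with stride 1 and no activations are dropped (the helper name is made up):

```python
import torch.nn as nn

def resnet_d_shortcut(in_ch: int, out_ch: int) -> nn.Sequential:
    """Hypothetical ResNet-D downsampling shortcut: AvgPool(2, 2) then a stride-1 1x1 conv."""
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=2, stride=2),                # spatial downsampling
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),  # channel projection
        nn.BatchNorm2d(out_ch),
    )
```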
Model Tweaks
Training Refinements
• 1. Cosine Learning Rate Decay

• 2. Label Smoothing

• 3. Knowledge Distillation

• 4. Mixup Training 

• 5. Experiment Results
Training Refinements
1. Cosine Learning Rate Decay
• Cosine Annealing Strategy proposed by Loshchilov et al*
η_t = η / 2 · ( 1 + cos( tπ / T ) )

- η_t : the learning rate at batch t ( η : the initial learning rate )

- T : the total number of batches
Loshchilov et al*, SGDR: stochastic gradient de- scent with restarts. CoRR, abs/1608.03983, 2016.
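A small sketch of this schedule as a function of the batch index (warmup not included; names are illustrative):

```python
import math

def cosine_lr(t: int, T: int, eta: float) -> float:
    """eta_t = eta / 2 * (1 + cos(t * pi / T)) for batch t out of T total batches."""
    return 0.5 * eta * (1 + math.cos(t * math.pi / T))
```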
Training Refinements
1. Cosine Learning Rate Decay
The learning rate decays slowly at first and remains large → potentially improves the training process
Training Refinements
2. Label Smoothing : proposed in Inception-v2*
Inception-v2* : C. Szegedy et al. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
Ground truth probability : q_i = 1 − ε if i = y, otherwise ε / (K − 1)
Optimal solution : z_i* = log( (K − 1)(1 − ε) / ε ) + α if i = y, otherwise α ( α : an arbitrary real number )
- K : the number of labels
Training Refinements
2. Label Smoothing : proposed in Inception-v2*
Inception-v2* : C. Szegedy et al. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
With label smoothing, the gap between the maximum prediction and the average of the rest
centers at the theoretical value, with fewer extreme values
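A minimal sketch of a label-smoothed cross-entropy following the construction above (the helper is illustrative; recent PyTorch versions also accept a `label_smoothing` argument in `F.cross_entropy`, with a slightly different uniform formulation):

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits: torch.Tensor, target: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Cross-entropy against q_i = 1 - eps for the true class, eps / (K - 1) elsewhere."""
    K = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    q = torch.full_like(log_probs, eps / (K - 1))
    q.scatter_(-1, target.unsqueeze(-1), 1.0 - eps)   # put 1 - eps on the true class
    return -(q * log_probs).sum(dim=-1).mean()
```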
Training Refinements
3. Knowledge Distillation
G Hinton et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. 

Image → Teacher Model (large; e.g. ResNet-152) → output r

Image → Student Model (small; e.g. ResNet-50) → output z

Knowledge transfer : train the student with the loss

loss(p, softmax(z)) + T² · loss(softmax(r/T), softmax(z/T))

( p : ground truth probability distribution )

Goal : improve the accuracy of the student model while keeping its complexity the same

※ T : the temperature hyper-parameter that makes the softmax outputs smoother
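A sketch of this distillation loss, using KL divergence for the temperature-softened term as is common in practice (function and tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(z: torch.Tensor, r: torch.Tensor, target: torch.Tensor, T: float = 20.0) -> torch.Tensor:
    """Hard loss on the ground truth plus a T^2-weighted soft loss between teacher (r) and student (z) logits."""
    hard = F.cross_entropy(z, target)                          # loss(p, softmax(z))
    soft = F.kl_div(F.log_softmax(z / T, dim=-1),
                    F.softmax(r / T, dim=-1),
                    reduction="batchmean")                     # loss(softmax(r/T), softmax(z/T))
    return hard + (T ** 2) * soft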
Training Refinements
4. Mixup Training
H Zhang et al. mixup: Beyond empirical risk minimization. CoRR, abs/1710.09412, 2017.
https://hoya012.github.io/blog/Bag-of-Tricks-for-Image-Classification-with-Convolutional-Neural-Networks-Review/
Train only on ( x̂, ŷ ), generated by a weighted linear interpolation of two examples (xi, yi) and (xj, yj) :

x̂ = λ·xi + (1 − λ)·xj ,  ŷ = λ·yi + (1 − λ)·yj ,  where λ ∈ [0, 1] is drawn from the Beta(α, α) distribution
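A minimal sketch of mixup on a batch, assuming one-hot labels (in practice the same effect is often obtained by mixing the two per-example losses instead):

```python
import numpy as np
import torch

def mixup_batch(x: torch.Tensor, y_onehot: torch.Tensor, alpha: float = 0.2):
    """Mix each example with a shuffled partner using lambda ~ Beta(alpha, alpha)."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    x_hat = lam * x + (1 - lam) * x[perm]
    y_hat = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_hat, y_hat
```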
Training Refinements
5. Experiment Results
- ε = 0.1 for label smoothing

- T = 20 for knowledge distillation

- α = 0.2 for mixup training
Transfer Learning
Object Detection
Transfer Learning
Semantic Segmentation
Thank you
Yonsei University Severance Hospital CCIDS
