Deep Learning: concepts and use cases (October 2018)

Deep Learning:
concepts and use cases
Julien Simon
Principal Technical Evangelist, AI and Machine Learning, AWS
@julsimon
October 2018

What to expect
• An introduction to Deep Learning theory
  • Neurons & Neural Networks
  • The Training Process
  • Backpropagation
  • Optimizers
• Common network architectures and use cases
  • Convolutional Neural Networks
  • Recurrent Neural Networks
  • Long Short Term Memory Networks
  • Generative Adversarial Networks
• Getting started

• Artificial Intelligence: design software applications which
exhibit human-like behavior, e.g. speech, natural language
processing, reasoning or intuition
• Machine Learning: using statistical algorithms, teach
machines to learn from featurized data without being
explicitly programmed
• Deep Learning: using neural networks, teach machines to
learn from complex data where features cannot be
explicitly expressed

An introduction
to Deep Learning theory

The neuron
$\sum_{i=1}^{I} x_i w_i + b = u$
"Multiply and Accumulate"
b: bias
Activation functions
Source: Wikipedia
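The multiply-and-accumulate step can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the talk: the function names, the example inputs and the choice of ReLU as activation function are assumptions.

```python
import math


def relu(u):
    """Rectified linear unit, one common activation function."""
    return max(0.0, u)


def neuron(inputs, weights, bias, activation=math.tanh):
    """Multiply and accumulate, then apply the activation function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # u = sum_i x_i * w_i + b
    u = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(u)


# Hypothetical values: u = 0.5 - 2.0 + 0.75 + 1.0 = 0.25
output = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], bias=1.0, activation=relu)
print(output)
```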

Neural networks
Building a simple classifier

x = [ x11, x12, …, x1I
      x21, x22, …, x2I
      …
      xm1, xm2, …, xmI ]    (m samples, I features)

y = [ 2, 0, …, 4 ]          (m labels, N2 categories)

One-hot encoding of y:
0,0,1,0,0,…,0
1,0,0,0,0,…,0
…
0,0,0,0,1,…,0

Biases are ignored for the rest of this discussion
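One-hot encoding of integer labels can be sketched as follows. The labels 2, 0 and 4 come from the slide; the helper name and the number of categories (6) are assumptions made for the example.

```python
def one_hot(labels, num_categories):
    """Turn integer labels into rows with a single 1 at the label index."""
    encoded = []
    for label in labels:
        if not 0 <= label < num_categories:
            raise ValueError(f"label {label} is outside 0..{num_categories - 1}")
        row = [0] * num_categories
        row[label] = 1
        encoded.append(row)
    return encoded


# Assumption: 6 categories, so each label becomes a vector of length 6.
encoded = one_hot([2, 0, 4], num_categories=6)
for row in encoded:
    print(row)
```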

Neural networks
Accuracy = Number of correct predictions / Total number of predictions
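Accuracy is simply the ratio of correct predictions to total predictions. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def accuracy(true_labels, predicted_labels):
    """Number of correct predictions divided by total number of predictions."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one prediction is required")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Three of the four hypothetical predictions match the true labels.
score = accuracy([2, 0, 4, 1], [2, 0, 3, 1])
print(score)
```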

Initially, the network will not predict correctly
f(X1) = Y’1
A loss function measures the difference between
the real label Y1 and the predicted label Y’1
error = loss(Y1, Y’1)
For a batch of samples:
$\sum_{i=1}^{\text{batch size}} \text{loss}(Y_i, Y'_i) = \text{batch error}$
The purpose of the training process is to
minimize error by gradually adjusting weights.
Neural networks
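The batch error is the sum of the per-sample losses. The slide does not say which loss function is used; the sketch below assumes categorical cross-entropy on one-hot labels, and all names and values are illustrative.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Assumed loss: categorical cross-entropy between a one-hot label and
    a vector of predicted probabilities (clipped to avoid log(0))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def batch_error(batch_true, batch_pred, loss=cross_entropy):
    """Sum of loss(Y_i, Y'_i) over all samples in the batch."""
    return sum(loss(y, y_hat) for y, y_hat in zip(batch_true, batch_pred))


labels = [[0, 0, 1], [1, 0, 0]]
predictions = [[0.1, 0.2, 0.7], [0.5, 0.3, 0.2]]
error = batch_error(labels, predictions)  # -ln(0.7) - ln(0.5)
print(error)
```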

Mini-batch Training
Training data set → Training (forward propagation, then backpropagation) → Trained neural network
Hyperparameters: batch size, learning rate, number of epochs
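To show how the three hyperparameters fit together, here is a minimal mini-batch training loop. It is a sketch under strong assumptions: the "network" is a single linear neuron y = w * x + b, the loss is squared error, and the data set is synthetic. None of this comes from the talk.

```python
import random


def train(samples, batch_size, learning_rate, epochs, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    samples = list(samples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the samples in a new order every epoch
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                error = (w * x + b) - y   # forward propagation
                grad_w += 2 * error * x   # backpropagation: d(error^2)/dw
                grad_b += 2 * error       # backpropagation: d(error^2)/db
            # One weight update per mini-batch, scaled by the learning rate.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b


# Synthetic, noise-free data generated from y = 3x + 1.
data = [(i / 10, 3 * (i / 10) + 1) for i in range(-10, 11)]
w, b = train(data, batch_size=4, learning_rate=0.1, epochs=200)
print(round(w, 3), round(b, 3))
```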

Validation
Validation data set (also called dev set) → Neural network in training → Validation accuracy
Prediction at the end of each epoch
This data set must have the same distribution as real-life samples,
or else validation accuracy won’t reflect real-life accuracy.

Test
Test data set → Fully trained neural network → Test accuracy
Prediction at the end of experimentation
This data set must have the same distribution as real-life samples,
or else test accuracy won’t reflect real-life accuracy.

Stochastic Gradient Descent (1951)
Imagine you stand on top of a mountain (…).
You want to get down to the valley as quickly as
possible, but there is fog and you can only see
your immediate surroundings. How can you get
down the mountain as quickly as possible?
You look around and identify the steepest path
down, go down that path for a bit, again look
around and find the new steepest path, go down
that path, and repeat—this is exactly what
gradient descent does.
Tim Dettmers, University of Lugano, 2015
https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/
The « step size » depends on
the learning rate
z=f(x,y)
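The walk down the mountain can be sketched as plain (non-stochastic) gradient descent on a surface z = f(x, y). The surface below, z = (x - 1)^2 + 2 (y + 2)^2, is a hypothetical example chosen because its minimum is known to be at (1, -2); function names and values are assumptions.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient; the step size is the
    learning rate times the local slope."""
    x, y = start
    for _ in range(steps):
        dx, dy = grad(x, y)
        x -= learning_rate * dx
        y -= learning_rate * dy
    return x, y


def grad_f(x, y):
    """Partial derivatives of the assumed surface z = (x-1)^2 + 2*(y+2)^2."""
    return 2 * (x - 1), 4 * (y + 2)


x_min, y_min = gradient_descent(grad_f, start=(5.0, 5.0),
                                learning_rate=0.1, steps=200)
print(round(x_min, 6), round(y_min, 6))
```

A larger learning rate takes bigger steps and can overshoot the valley; a smaller one is safer but needs more steps.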

Finding the slope with Derivatives
Source: Wikipedia, Oklahoma State University, Khan Academy
End-to-end example of computing backpropagation with partial derivatives:
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example
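For a single neuron, the slope of the loss with respect to a weight follows from the chain rule, and it can be checked against a numerical derivative. The sketch below assumes a sigmoid activation, a squared-error loss and no bias; these choices and all names are illustrative.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def loss(w, x, y):
    """Squared error of a single sigmoid neuron (bias ignored)."""
    return (sigmoid(w * x) - y) ** 2


def loss_gradient(w, x, y):
    """Chain rule: dL/dw = dL/da * da/du * du/dw, with a = sigmoid(u), u = w*x."""
    a = sigmoid(w * x)
    return 2 * (a - y) * a * (1 - a) * x


def numerical_gradient(w, x, y, h=1e-6):
    """Central finite difference, used only to check the analytic slope."""
    return (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)


analytic = loss_gradient(0.5, 2.0, 1.0)
numeric = numerical_gradient(0.5, 2.0, 1.0)
print(analytic, numeric)
```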

Local minima and saddle points
« Do neural networks enter and
escape a series of local minima? Do
they move at varying speed as they
approach and then pass a variety of
saddle points? Answering these
questions definitively is difficult, but
we present evidence strongly
suggesting that the answer to all of
these questions is no. »
« Qualitatively characterizing neural network optimization problems », Goodfellow et al., 2015
https://arxiv.org/abs/1412.6544

Optimizers
https://medium.com/@julsimon/tumbling-down-the-sgd-rabbit-hole-part-1-740fa402f0d7
SGD works remarkably well and is still widely used.
Adaptive optimizers use a variable learning rate.
Some even use a learning rate per dimension (Adam).
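The difference between a single learning rate and a per-dimension one can be illustrated by comparing one SGD step with one Adam step. The Adam update below follows the standard published formulas with the usual default constants; the class layout, names and gradient values are assumptions for this sketch.

```python
import math


def sgd_step(weights, grads, learning_rate):
    """Plain SGD: the same learning rate for every dimension."""
    return [w - learning_rate * g for w, g in zip(weights, grads)]


class Adam:
    """Adam keeps running averages of the gradient and its square for each
    dimension, which gives every weight its own effective step size."""

    def __init__(self, size, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.learning_rate = learning_rate
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = [0.0] * size
        self.v = [0.0] * size
        self.t = 0

    def step(self, weights, grads):
        self.t += 1
        updated = []
        for i, (w, g) in enumerate(zip(weights, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)  # bias correction
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            updated.append(w - self.learning_rate * m_hat / (math.sqrt(v_hat) + self.eps))
        return updated


grads = [100.0, 0.01]  # one very steep and one very flat dimension
sgd_weights = sgd_step([0.0, 0.0], grads, learning_rate=0.001)
adam_weights = Adam(size=2).step([0.0, 0.0], grads)
print(sgd_weights, adam_weights)
```

With SGD the two weights move by very different amounts; after the first Adam step both move by roughly the learning rate, whatever the gradient magnitude.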

Early stopping
[Figure: training accuracy, validation accuracy and loss function plotted against epochs (accuracy axis up to 100%); annotations: best epoch, OVERFITTING]
« Deep Learning ultimately is about finding a minimum
that generalizes well, with bonus points for finding one
fast and reliably », Sebastian Ruder
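Early stopping can be sketched as a scan over the validation accuracy measured at the end of each epoch. The "patience" parameter (how many epochs without improvement are tolerated) is an assumption of this sketch, as are the function name and the accuracy values.

```python
def best_epoch(validation_accuracies, patience):
    """Return (epoch, accuracy) of the best model seen before training stops.
    Training stops once `patience` epochs pass without improvement."""
    best_index, best_value = -1, float("-inf")
    for epoch, value in enumerate(validation_accuracies):
        if value > best_value:
            best_index, best_value = epoch, value
        elif epoch - best_index >= patience:
            break  # validation accuracy has stalled: stop training
    return best_index, best_value


# Hypothetical history: accuracy peaks at epoch 3, then degrades.
history = [0.61, 0.72, 0.80, 0.84, 0.83, 0.82, 0.81, 0.90]
epoch, value = best_epoch(history, patience=3)
print(epoch, value)
```

In this made-up history the later value of 0.90 is never reached, which illustrates the trade-off: a small patience saves training time but can stop too soon.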

Common network architectures
and use cases

Fully Connected Networks are nice, but…
• What if we need lots of layers in order to extract complex features?
  • The number of parameters increases very quickly with the number of layers
  • Overfitting is a constant problem
• What about large data?
  • 256x256 images = 65,536 input neurons?
• What about 2D/3D data? Won’t we lose lots of info by flattening it?
  • Images, videos, etc.
• What about sequential data, where the order of samples is important?
  • Translating text
  • Predicting time series
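A quick count shows how fast parameters pile up in a fully connected network. A flattened 256x256 single-channel image has 256 * 256 = 65,536 inputs; the hidden layer sizes and the 10 output categories below are hypothetical.

```python
def dense_parameter_count(layer_sizes, with_bias=True):
    """Weights (and optionally biases) of a fully connected network,
    given the number of neurons in each layer."""
    total = 0
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        total += inputs * outputs  # one weight per connection
        if with_bias:
            total += outputs       # one bias per neuron
    return total


input_neurons = 256 * 256  # flattened 256x256 image
# Assumed architecture: two hidden layers of 1,024 neurons, 10 outputs.
params = dense_parameter_count([input_neurons, 1024, 1024, 10])
print(input_neurons, params)
```

Almost all of the parameters sit in the first layer, which is the problem convolution addresses.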

Convolutional Neural Networks (CNN)
LeCun et al., 1998: handwritten digit recognition, 32x32 pixels
https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/

Source: http://timdettmers.com
Extracting features with convolution
Convolution extracts features automatically.
Kernel parameters are learned during the training process.
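A minimal sketch of the convolution operation itself, assuming no padding and a stride of 1. The kernel here is a hand-written vertical edge detector used only for illustration; in a CNN the kernel values would be learned during training. Like most deep learning libraries, the sketch does not flip the kernel (strictly speaking, a cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and multiply-accumulate at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 4x5 image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = convolve2d(image, vertical_edge)
print(feature_map)
```

The feature map is zero where the image is uniform and non-zero where the window overlaps the dark-to-bright edge.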

Downsampling images with pooling
Source: Stanford University
Pooling shrinks images while preserving significant information.
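Max pooling is one common way to do this: each non-overlapping window is replaced by its largest value. The window size of 2 and the sample image are assumptions for this sketch.

```python
def max_pool2d(image, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    rows = len(image) // size
    cols = len(image[0]) // size
    return [
        [
            max(image[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


image = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 8],
    [1, 0, 7, 5],
]
pooled = max_pool2d(image)  # 4x4 image shrinks to 2x2
print(pooled)
```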

Classification, detection, segmentation
https://github.com/dmlc/gluon-cv
Based on models published in 2015-2017
[electric_guitar], with probability 0.671
Gluon

Face Detection
https://github.com/tornadomeet/mxnet-face
Based on models published 2015-2016
MXNet

Face Recognition
https://github.com/deepinsight/insightface
https://arxiv.org/abs/1801.07698
January 2018
LFW 99.80%+
Megaface 98%+
with a single model
MXNet

Image Inpainting
https://github.com/MathiasGruber/PConv-Keras
April 2018
Keras

Real-Time Pose Estimation
https://github.com/dragonfly90/mxnet_Realtime_Multi-Person_Pose_Estimation
November 2016
MXNet

Real-Time Pose Estimation: DensePose
https://github.com/facebookresearch/DensePose
February 2018
Caffe2

Recurrent Neural Networks (RNN)
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• Image captioning
• Sentiment analysis
• Machine translation
• Video frame labeling

Recurrent Neural Networks
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Long Short Term Memory Networks (LSTM)
Hochreiter and Schmidhuber, 1997
• An LSTM neuron computes the output based on the input and a previous state
• LSTM neurons have « short-term memory »
• They do a better job than plain RNNs at predicting longer sequences of data
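How an LSTM neuron combines the input with a previous state can be sketched with the standard gate equations, reduced to scalars for readability. The parameter layout (one (w_x, w_h, b) triple per gate), the names and the numeric values are assumptions; real layers use weight matrices learned during training.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.
    params maps each gate name to a hypothetical (w_x, w_h, b) triple."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to let in
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # new candidate information
    c = f * c_prev + i * g             # updated cell state (the memory)
    h = o * math.tanh(c)               # output, also fed back at the next step
    return h, c


gate_names = ("forget", "input", "output", "candidate")
params = {name: (0.5, 0.5, 0.0) for name in gate_names}  # made-up weights
h, c = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:  # a short input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```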

Machine Translation – AWS Sockeye
https://github.com/awslabs/sockeye
MXNet

OCR – Tesseract 4.0 (beta)
https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00
https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/

Generative Adversarial Networks

Generative Adversarial Networks
Goodfellow et al., 2014 https://arxiv.org/abs/1406.2661
https://medium.com/@julsimon/generative-adversarial-networks-on-apache-mxnet-part-1-b6d39e6b5df1
Generator
Building images from random vectors
Discriminator (« Detector »)
Learning to tell real samples from generated ones
Gradient updates
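The two networks are trained against each other, which shows up in their loss functions. The sketch below only computes the losses for one real and one generated sample, given the probability the second network assigns to "real"; it is not a training loop. It assumes the commonly used non-saturating generator loss, and the function names are hypothetical.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Low when the real sample scores close to 1 and the generated one close to 0.
    d_real and d_fake are probabilities of 'real', strictly between 0 and 1."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Low when the generated sample is scored as real (non-saturating form)."""
    return -math.log(d_fake)


# A confident, correct second network has a low loss, and the generator a high one.
print(discriminator_loss(0.9, 0.1), generator_loss(0.1))
# When generated samples fool it, the situation is reversed.
print(discriminator_loss(0.5, 0.9), generator_loss(0.9))
```

The gradients of these losses are what flows back to each network as gradient updates.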

GAN: Welcome to the (un)real world, Neo
Generating new "celebrity" faces
https://github.com/tkarras/progressive_growing_of_gans
April 2018
TensorFlow
From semantic map to 2048x1024 picture
https://tcwang0509.github.io/pix2pixHD/
November 2017
PyTorch

GAN: Everybody dance now
https://www.youtube.com/watch?v=PCBTZh41Ris
August 2018

Resources
http://www.deeplearningbook.org/
https://gluon.mxnet.io
https://keras.io
https://medium.com/@julsimon
https://gitlab.com/juliensimon/{aws,dlnotebooks}


Thank you!
Julien Simon
Principal Technical Evangelist, AI and Machine Learning, AWS
@julsimon
