Grokking TechTalk #21: Deep Learning in Computer Vision

Deep Learning in Computer Vision
Axon@Grokking
Oct. 28, 2017

Dang Huynh
Education
• Ph.D. in Computer Science (France)
Work
• Jan 2017 – now: Axon Enterprise
• 2015 – 2016: Misfit
• 2011 – 2015: Nokia Bell Labs
Research domains
• Machine vision.
• Data science.
• Telecommunication systems.
2/43

Outline
• Refresh
• Computer vision
• Deep learning in Computer vision
• Theory vs. Reality
• Demo
4/43

Refresh
Machine learning and Deep learning
5/43

Machine learning
Input data → prediction model → output label
y = F(x)
[Figure: plot of y against x; given a new input x0, what is y0?]
6/43

Machine Learning
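y = 4x_1^3 - 2x_2^2 + 8
[Figure: the same function drawn as a small network: inputs x1, x2, and a constant +1; two units with f(x) = x^3 and f(x) = x^2; edge weights 1, 0, 0, 1, 4, -2, 8; output y]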
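The slide's example, read as y = 4·x1³ − 2·x2² + 8, can be sketched as a tiny two-unit network. This is only an illustration: the wiring (weights 1 and 0 into each unit, 4 and -2 on the output, 8 on the constant +1) is an assumed reading of the figure labels, and the function names are hypothetical.

```python
def cube(v):
    # Activation of the first unit: f(x) = x^3
    return v ** 3


def square(v):
    # Activation of the second unit: f(x) = x^2
    return v ** 2


def predict(x1, x2):
    # Assumed wiring: each unit listens to one input (weight 1)
    # and ignores the other (weight 0).
    h1 = cube(1 * x1 + 0 * x2)
    h2 = square(0 * x1 + 1 * x2)
    # Output weights 4 and -2, plus 8 on the constant +1 input (bias).
    return 4 * h1 - 2 * h2 + 8 * 1


print(predict(1, 2))  # 4*1 - 2*4 + 8 = 4
```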
7/43

Machine Learning
Challenges
• Relevant data acquisition
• Data preprocessing
• Feature selection
• Model selection: simplicity versus complexity
• Result interpretation.
8/43

Deep Learning
• Machine Learning with many (deep) hidden layers
[Figure: network with an input layer (x1, x2), hidden layers, and an output layer (y1, y2); +1 marks the bias units]
9/43

Why deep learning?
[Figure: performance plotted against amount of data, with one curve for deep learning and one for machine learning]
10/43

Computer Vision
Make computers understand images and video:
- Detection
- Recognition
- Tracking
- Extraction
[Figure: object detection]
12/43

Computer Vision
Challenges remain: an object can be…
… partly occluded
… or even fully occluded.
13/43

Challenge
We were building a human detector, and we accidentally got a future-human detector!
14/43

15/43
Traditional approach (feature engineering):
• Has two eyes?
• Has a nose below the eyes?
• OK, it's a face!
…..
Deep learning approach (NO feature engineering)

Traditional approach vs. Deep learning
16/43
ImageNet: 1.2 million images with 1,000 object categories
[Figure: results grouped as Traditional and Deep learning]
Source: http://pattern-recognition.weebly.com/

Deep Learning in Computer Vision
17/43

Computer Vision
What the computer sees
Red
43 45 21
13 34 12
23 88 55
Green
19 89 27
17 57 29
75 56 94
Blue
19 89 27
17 57 29
75 56 94
y = F(Red, Green, Blue)
3-D input array
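As a small illustration of the "3-D input array", the three channel matrices from this slide can be stacked with NumPy. The variable names are hypothetical, and the Green and Blue values are copied as printed on the slide (they are identical there).

```python
import numpy as np

# Channel values as printed on the slide.
red = np.array([[43, 45, 21],
                [13, 34, 12],
                [23, 88, 55]])
green = np.array([[19, 89, 27],
                  [17, 57, 29],
                  [75, 56, 94]])
blue = np.array([[19, 89, 27],
                 [17, 57, 29],
                 [75, 56, 94]])

# Stack along a new last axis: [height, width, # of channels].
image = np.stack([red, green, blue], axis=-1)

print(image.shape)   # (3, 3, 3)
print(image[0, 0])   # top-left pixel as (R, G, B): [43 19 19]
```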
Facial detection
18/43

Intuition
[Figure: neural network (Input, Hidden layers, Output; nodes x1, x2, +1, y1, y2) applied to facial detection, with the Red, Green, and Blue channels as input]
19/43

Convolutional Neural Network (CNN)
Idea: slide a filter over the image.
[Figure: convolution process. A filter (grey) scans the input matrix (e.g., an image) to produce the output matrix.]
Source: https://github.com/vdumoulin/conv_arithmetic
20/43

CNN – Striding and Padding
Control how the filter convolves around the input matrix.
[Figure: a filter (grey) scanning the input matrix (e.g., an image) to produce the output matrix, with stride = 2 and zero-padding = 1]
Source: https://github.com/vdumoulin/conv_arithmetic
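The effect of stride and padding on the output size can be sketched with the usual formula, floor((n + 2p - k) / s) + 1 along each axis. The function name and the 5-wide input in the second call are assumptions chosen for illustration; the slide does not state the input size of its figure.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one axis for input length n and filter length k."""
    return (n + 2 * padding - k) // stride + 1


# No padding, stride 1: a 7-wide input and 3-wide filter give 5.
print(conv_output_size(7, 3))  # 5

# Stride 2 with zero-padding 1 on a hypothetical 5-wide input.
print(conv_output_size(5, 3, stride=2, padding=1))  # 3
```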
21/43

Convolutional operation
Input (7 x 7):
0 1 1 1 0 0 0
0 0 1 1 1 0 0
0 0 0 1 1 1 0
0 0 0 1 1 0 0
0 0 1 1 0 0 0
0 1 1 0 0 0 0
1 1 0 0 0 0 0
Filter (3 x 3):
1 0 1
0 1 0
1 0 1
Output (5 x 5):
1 4 3 4 1
1 2 4 3 3
1 2 3 4 1
1 3 3 1 1
3 3 1 1 0
Input * Filter = Output
Input [height1, width1, # of channels]
Filter [height2, width2, # of channels]
Output [height3, width3, # of filters]
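The 7 x 7 example can be reproduced with a minimal single-channel sketch, assuming stride 1 and no padding. The function name is hypothetical. As is common in CNN code, the filter is not flipped (cross-correlation); this filter is symmetric, so the result is the same either way.

```python
import numpy as np


def conv2d_valid(image, kernel):
    # Slide the kernel over every position where it fits entirely
    # inside the image (stride 1, no padding).
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=image.dtype)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


image = np.array([[0, 1, 1, 1, 0, 0, 0],
                  [0, 0, 1, 1, 1, 0, 0],
                  [0, 0, 0, 1, 1, 1, 0],
                  [0, 0, 0, 1, 1, 0, 0],
                  [0, 0, 1, 1, 0, 0, 0],
                  [0, 1, 1, 0, 0, 0, 0],
                  [1, 1, 0, 0, 0, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

result = conv2d_valid(image, kernel)
print(result.shape)  # (5, 5)
print(result)
```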
22/43

Rectified Linear Unit (ReLU)
ReLU: F(y) = max(0, y)
Non-linear activation function.
Input:
-3 2 0
1 -1 0
-5 2 4
Output after ReLU:
0 2 0
1 0 0
0 2 4
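The ReLU example on this slide can be checked in a couple of lines with NumPy; the variable names are hypothetical.

```python
import numpy as np

y = np.array([[-3, 2, 0],
              [1, -1, 0],
              [-5, 2, 4]])

# ReLU keeps positive values and replaces negatives with 0.
relu_out = np.maximum(0, y)
print(relu_out)
```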
23/43

Max Pooling
Reduces dimensionality and helps avoid overfitting.
Max pool with 2 x 2 filter and stride 2:
Input (4 x 4):
1 0 2 3
4 6 6 8
3 1 1 0
1 2 2 4
Output (2 x 2):
6 8
3 4
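A minimal sketch of max pooling on the slide's 4 x 4 example, assuming a square window and no padding; the function name is hypothetical.

```python
import numpy as np


def max_pool2d(x, size=2, stride=2):
    # Take the maximum of each size x size window, moving by `stride`.
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.empty((oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out


x = np.array([[1, 0, 2, 3],
              [4, 6, 6, 8],
              [3, 1, 1, 0],
              [1, 2, 2, 4]])

pooled = max_pool2d(x, size=2, stride=2)
print(pooled)
```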
24/43

Example
Input: 24 x 24 x 3
→ Conv: 3 x 3, MP: 2 x 2 → 11 x 11 x 28
→ Conv: 3 x 3, MP: 3 x 3 → 4 x 4 x 48
→ Conv: 2 x 2 → 3 x 3 x 64
→ Fully connected: 128
→ Outputs: face/non-face (2), bounding box regression (4)
Assume every Max Pooling (MP) layer has stride 2.
First stage, step by step:
Input: 24 x 24 x 3
Conv: 3 x 3 x 3
MP: 2 x 2 (stride 2)
→ Output dimension: (24 - 3 + 1) / 2 = 11
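The spatial sizes in this example (24 → 11 → 4 → 3) can be traced with a short sketch. It assumes unpadded convolutions with stride 1 and pooling with stride 2 that rounds down, which is one reading consistent with the slide's numbers; the helper names are hypothetical.

```python
def conv_out(n, k):
    # Unpadded convolution, stride 1.
    return n - k + 1


def pool_out(n, k, stride=2):
    # Max pooling without padding; incomplete windows are dropped.
    return (n - k) // stride + 1


# (layer type, filter size) in the order shown on the slide.
stages = [("conv", 3), ("pool", 2), ("conv", 3), ("pool", 3), ("conv", 2)]

n = 24
sizes = []
for kind, k in stages:
    n = conv_out(n, k) if kind == "conv" else pool_out(n, k)
    sizes.append(n)

print(sizes)  # [22, 11, 9, 4, 3]
```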
25/43

Object scales
• Detect objects of various sizes.
[Figure: the network input window scans over the image at multiple scales]
Tradeoffs?
Source: https://www.pyimagesearch.com
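One way to picture the tradeoff is to count how many windows a detector must evaluate. The sketch below assumes an image pyramid with a sliding window, which the slide does not name explicitly. The 24-pixel window matches the example network's input, while the scale factor, step, and image size are hypothetical.

```python
def pyramid_sizes(width, height, scale=2.0, min_size=24):
    # Repeatedly shrink the image until it is smaller than the window.
    sizes = []
    w, h = width, height
    while w >= min_size and h >= min_size:
        sizes.append((w, h))
        w, h = int(w / scale), int(h / scale)
    return sizes


def count_windows(width, height, window=24, step=4):
    # Number of window positions on one pyramid level.
    per_row = (width - window) // step + 1
    per_col = (height - window) // step + 1
    return per_row * per_col


levels = pyramid_sizes(96, 96)
total = sum(count_windows(w, h) for w, h in levels)
print(levels)  # [(96, 96), (48, 48), (24, 24)]
print(total)   # 361 + 49 + 1 = 411
```

A smaller scale factor or step finds more object sizes and positions but multiplies the number of network evaluations.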
26/43

Data augmentation
• Generate more artificial data points from base data.
• Apply with care to other data types!
[Figure: the same image shown as original, with little noise, moderate noise, and heavy noise]
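A minimal sketch of noise-based augmentation, assuming additive Gaussian noise on an 8-bit image. The slide only says "noise", so the noise type, the sigma values, and all names here are assumptions for illustration.

```python
import numpy as np


def add_gaussian_noise(image, sigma, rng):
    # Add zero-mean Gaussian noise, then clip back to the valid 0..255 range.
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)

# Placeholder for a real 24 x 24 RGB training image (flat grey).
base = np.full((24, 24, 3), 128, dtype=np.uint8)

# Hypothetical noise levels for "little", "moderate", and "heavy".
levels = {"little": 5, "moderate": 20, "heavy": 50}
augmented = {name: add_gaussian_noise(base, sigma, rng)
             for name, sigma in levels.items()}

for name, img in augmented.items():
    print(name, img.shape, img.dtype)
```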
27/43

Complex data augmentation
Face rotation
28/43

Why data augmentation?
WITHOUT augmentation
AXON detection
WITH augmentation
29/43

How to benchmark?
[Figure: Facebook detection]
30/43

Deep learning in Computer Vision
Pros:
• DL reduces the need for feature engineering.
• DL outperforms classical Computer Vision approaches.
Cons:
• DL requires a huge amount of data (> 100K samples).
• DL is extremely computationally expensive to train (weeks on GPUs).
• DL model structure is a black box.
32/43

Performance vs. Portability
[Figure: theory vs. reality]
33/43

Performance vs. Power consumption
[Figure: theory vs. reality (portable battery)]
34/43

Special hardware for Deep Learning
Jetson TX2 (NVIDIA), Google TPU, Movidius Myriad
• Optimized for specific use cases.
• Not plug-and-play: good engineers are needed to make it work.
Still far from consumers…
35/43

Privacy
• The police are our customers, so data privacy is important.
• Can we “extract features” from the private data?
36/43

Facial detection with tracking
40/43

Industry perspective
Always consider the following 4Ps:
• Performance
• Power consumption
• Portability
• Price
Deep learning is not magic: tradeoffs always exist!
43/43

We are Hiring
Full Stack, Research Engineers, Security.
https://jobs.lever.co/axon
45/43
