Deep learning for Computer Vision intro

Intro to Deep Lerning
for Computer Vision
Nadav Carmel

Highlights
 Common CV tasks
 CNN’s – intro
 Filters, maxpools, simple example
 Normalization types
 Inception network
 Object detection
 R-CNN
 YOLO
 Face recognition
 One shot learning
 Siamese net

Computer Vision problems
 Image classification
 Object detection
 Face recognition
 Segmentation
 Style transfer
 But there are more!!
We’ll focus
on these

Convolutional filter concept – recap
 Each conv-layer has 4 params:
 Filter size (filter height = filter width)
 stride
 Input channels
 Output channels (number of filters)
6 convolutional
filters

Max-pool concept – recap
 Usally has 2 params:
 Filter size (filter height = filter width)
 Stride
 Operation is done per channel (thus: 𝐶𝑖𝑛 = 𝐶 𝑜𝑢𝑡)
 It is a non-learnble filter

Deep learning for Computer Vision intro

Normalization types
 We sometimes want the data at each layer to be normalized
 Improves learning speed and robustness
 There are few types of normalizations:
 H, W: image size
 N: batch size
 C: cannels

Inception network:
• There are many architectural questions when designing CNN’s:
 What filter size to choose? (larger one = better spatial representation, smaller one =
lower computational complexity)
 Add maxpool or not?
 Etc.
 One approach of handeling these question is: let’s try everything!

Object detection
 Some of the most important CV tasks include:
 Object detection = classification + localization of multiple
objects
 Detection model output: 𝑦 = 𝑝𝑐, 𝑏 𝑥 , 𝑏 𝑦 , 𝑏ℎ , 𝑏 𝑤 , 𝑐1 , 𝑐2 , 𝑐3

Region (sliding window) CNN
 A reagion proposal (selective search)
algorithm suggests regions for the bounding
box to go over (Ross Girshick et al.)
 These candidate boxes are resized to
match the CNN input size
 They are then fed into the convolutional
neural network that produces a features
vector
 The feature vector is fed into SVM to
produce the classification
 Finally, remove boxes with the highest
shared area in a process called non-max
suppression

You Only Look Once - YOLO
 Most object detection algorithms use regions to localize
the object within the image, and do not look at the
complete image
 In YOLO a convolutional network uses the entire image to
predicts the bounding boxes and the class probabilities for
these boxes

YOLO algo description
1. Split the image into grid of cells
2. Each cell is responsible for predicting a number bounding boxes (should match the
number of objects in the cell)
3. Run the model once to get all cells predictions
4. Remove boxes with the highest shared area in a process called non-max suppression

YOLO summary
 Main algorithm properties include:
 Extremely fast inference – makes predictions with a single network evaluation, unlike R-
CNN which requires thousands for a single image
 Since it looks at the whole image at once, its predictions are informed by global context
in the image
 Requires bounding-box tagging for training
 Since the model learns to predict bounding boxes (𝑏 𝑥, 𝑏 𝑦, 𝑏ℎ, 𝑏 𝑤) from the data, it
struggles to generalize to objects in new or unusual aspect ratios or configurations

Non-max suppression algorithm
 Remove all boxes with 𝑝𝑐 < 0.6
 While there are any remaining boxes:
 Pick the box with the largest Pc - output it as prediction
 Discard any box with IOU > 0.5 with the box from previous step

One shot learning
 Say we want to have a classification system to recognize faces
 We only have 1 or 2 images of each person
 One aproach can be to train a CNN which maps the inputs to a (one hot) label
vector, where each element corresponds to each person
 BUT:
 Train a neural net with only 1 or 2 imaages per class will highly overfit
 Each new person in the ‘pool’ will require a new, longer, output vector (and
system retraining)

One shot learning
 Instead of learining a multiclass classifier, we can learn a similarity function:
𝜌 𝑖𝑚𝑔1, 𝑖𝑚𝑔2
 𝐼𝑓: 𝜌 𝑖𝑚𝑔1, 𝑖𝑚𝑔2 ≥ 𝜏 → ”𝑠𝑎𝑚𝑒”
 𝑒𝑙𝑠𝑒: → ”𝑑𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑡”

Siamese network
 We want a network that predicts the
similarity between 2 faces
 We want each of the images to be encoded
in a low-dimensional representation,
then fed into the network
 The most common encoding in this case is
computed via a Siamese-network

Triplet loss
 We want to train the siamese
nets in such way that:
 Differentnt images of the same
person will have very similar
representations
 Images of different persons will have
very different representations
 We define a triplet loss
objective:
𝐿 = 𝑚𝑎𝑥 𝑑 𝑎, 𝑝 − 𝑑 𝑎, 𝑛 + 𝑚𝑎𝑟𝑔𝑖𝑛, 0

Deep learning for Computer Vision intro

More Related Content

What's hot (20)

Similar to Deep learning for Computer Vision intro (20)

Recently uploaded (20)

Deep learning for Computer Vision intro

Editor's Notes