ONE SHOT LEARNING
FOR RECOGNITION
Keren Ofek
Today’s Agenda
■ Similarity and how to measure it?
■ Embedding
■ One-Shot learning Challenges
■ Siamese Network
■ DeepFace
■ Triplets
■ FaceNet
What is one shot learning?
■ One-shot learning aims to learn information about
object categories from few training examples.
■ The idea is to understand the similarity between the
detected object and a known object.
Image Source: Google
Similarity
Image Source: Google
How can we measure similarity?
■ We will find a function that quantifies a “distance”
between every pair of elements in a set
Non-negativity: f(x, y) ≥ 0
Identity of indiscernibles: f(x, y) = 0 <=> x = y
Symmetry: f(x, y) = f(y, x)
Triangle Inequality: f(x, z) ≤ f(x, y) + f(y, z)
Which Distance Metric to choose?
■ Pre-defined Metrics
Metrics which are fully
specified without the
knowledge of data.
E.g. Euclidean distance:
f(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
■ Learned Metrics
Metrics which can only be
defined with the knowledge
of the data:
■ Un-Supervised Learning
Or
■ Supervised Learning
Based on slide from: February-16 Lecture http://slazebni.cs.illinois.edu/spring17/
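For concreteness, here is a minimal NumPy sketch (my own illustration, not part of the original slides) of the pre-defined Euclidean metric, together with a numeric spot-check of the four metric properties listed above.

```python
import numpy as np

def euclidean(x, y):
    """Pre-defined metric: f(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

x, y, z = np.array([1.0, 2.0]), np.array([4.0, 6.0]), np.array([0.0, 0.0])

assert euclidean(x, y) >= 0                                  # non-negativity
assert np.isclose(euclidean(x, x), 0)                        # identity of indiscernibles
assert np.isclose(euclidean(x, y), euclidean(y, x))          # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)  # triangle inequality

print(euclidean(x, y))  # 5.0
```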
Un-supervised distance metric:
Mahalanobis distance: f(x, y) = (x − y)ᵀ Σ⁻¹ (x − y)
where Σ is the mean-subtracted covariance matrix of all data points
From: Gene expression data clustering and
visualization based on a binary hierarchical clustering
framework
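A minimal NumPy sketch of the Mahalanobis distance defined above (an illustration only; the data matrix X below is a hypothetical sample used to estimate the covariance Σ).

```python
import numpy as np

# Hypothetical data: 100 points in 3 dimensions with unequal scales per axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 5.0, 0.5])

# Covariance of the mean-subtracted data (np.cov also centers internally).
cov = np.cov(X - X.mean(axis=0), rowvar=False)
cov_inv = np.linalg.inv(cov)

def mahalanobis(x, y):
    """f(x, y) = (x - y)^T Sigma^{-1} (x - y), with Sigma estimated from X."""
    d = x - y
    return float(d @ cov_inv @ d)

print(mahalanobis(X[0], X[1]))
```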
Supervised distance metric:
• 2-step procedure:
• Apply some supervised domain transform
• Then use one of the un-supervised metrics in the transformed space
Bellet, A., Habrard, A. and Sebban, A survey on metric learning for feature vectors and structured data
Embedding Process
■ The inputs are mapped to vectors in a low-dimensional space
■ The low-dimensional space is called the target space (latent space)
■ In the target space there is insensitivity to intra-category changes,
but sensitivity to inter-category changes
https://hackernoon.com/latent-space-visualization-deep-learning-bits-2-bd09a46920df
Linear Discriminant Analysis (LDA)
■ LDA is based on the concept of searching for a linear combination
of variables that best separates two classes:
– Minimizes variance within each class
– Maximizes variance between classes
Fisher Ratio = (variance between classes) / (variance within classes)
■ LDA is an example that illustrates the motivation for embedding
https://analyticsdefined.com/introduction-linear-discriminant-analysis/
Linear Discriminant Analysis (LDA)
https://analyticsdefined.com/introduction-linear-discriminant-analysis/
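To make the Fisher-ratio intuition concrete, here is a small scikit-learn sketch on synthetic data (my own illustration, not from the slides): LDA projects two classes onto the direction that maximizes between-class scatter relative to within-class scatter.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic Gaussian classes in 2D (hypothetical data).
rng = np.random.default_rng(1)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[3, 1], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis(n_components=1)
z = lda.fit_transform(X, y).ravel()  # 1-D embedding found by LDA

# Fisher ratio on the projected data: between-class scatter / within-class scatter.
m0, m1, m = z[y == 0].mean(), z[y == 1].mean(), z.mean()
between = 200 * (m0 - m) ** 2 + 200 * (m1 - m) ** 2
within = ((z[y == 0] - m0) ** 2).sum() + ((z[y == 1] - m1) ** 2).sum()
print("Fisher ratio:", between / within)
```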
The One-Shot learning challenge
■ There are lots of categories
■ The Number of categories is not always known
■ The number of samples in each category is small
⇒ One-shot learning is highly relevant in the field of computer vision:
recognizing objects in images from a single example.
⇒ Linear models are not sufficient in this field.
Face Recognition challenges
Verification Recognition Clustering
The Method - Using CNN:
■ The Idea is to learn a function that maps input patterns to target
space.
■ It is a non-linear mapping that can map any input vector to its
corresponding low-dimensional version.
■ The distance in the target space approximates the “semantic”
distance in the input space.
■ A discriminative training method - extract information about the
problem from the available data, without requiring specific
information about the categories.
■ The training will be on pairs of samples.
Energy-based models (EBM)
■ EBMs capture dependencies between variables by associating a
scalar energy to each configuration of the variables.
■ No requirement for proper normalization
■ Provide considerably more flexibility in the design of architectures
and training criteria than probabilistic approaches
LeCun, Chopra, Hadsell, Ranzato, Huang, A Tutorial on Energy-Based Learning, 2006
Siamese Network Architecture
Koch, Zemel, Salakhutdinov, Siamese Neural Networks for One-shot Image Recognition, 2015
Similarity Metric
■ Same category → genuine pair → minimize the energy
■ Different categories → impostor pair → maximize the energy
We seek to find a value of the parameter W such that the energy is low for genuine pairs and high for impostor pairs.
⇒ A contrastive term is needed to ensure not only that the energy for a pair of
inputs from the same category is low, but also that the energy for a pair from
different categories is large.
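A minimal PyTorch sketch of this idea (an illustrative toy, not the exact architecture or loss of the cited papers): both inputs pass through the same weight-sharing network Gw, and a contrastive loss lowers the energy for genuine pairs while pushing the energy of impostor pairs beyond a margin.

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self, in_dim=784, embed_dim=32):
        super().__init__()
        # Gw: a small embedding network; both inputs share these weights.
        self.gw = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

    def forward(self, x1, x2):
        return self.gw(x1), self.gw(x2)

def contrastive_loss(e1, e2, same, margin=1.0):
    """same = 1 for a genuine pair, 0 for an impostor pair."""
    energy = torch.norm(e1 - e2, dim=1)                            # E_W(x1, x2)
    pos = same * energy.pow(2)                                     # pull genuine pairs together
    neg = (1 - same) * torch.clamp(margin - energy, min=0).pow(2)  # push impostors beyond the margin
    return (pos + neg).mean()

# Toy usage with random tensors standing in for image pairs.
net = SiameseNet()
x1, x2 = torch.randn(8, 784), torch.randn(8, 784)
same = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(*net(x1, x2), same)
loss.backward()
```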
Siamese Neural Networks
Koch, Zemel, Salakhutdinov, Siamese Neural Networks for One-shot Image Recognition, 2015
Siamese Neural Networks
One-shot Image Recognition
Koch, Zemel, Salakhutdinov, Siamese Neural Networks for One-shot Image Recognition, 2015
Hinge Loss: Positive Pair
(Diagram: the query and positive images each pass through a CNN with shared weights into the embedding space)
L(xp, xq) = ||xp − xq||²
https://www.cs.cornell.edu/~kb/publications/SIG15ProductNet.pdf
Hinge Loss: Negative Pair (embeddings farther apart than the margin m ⇒ loss = 0)
(Diagram: the query and negative embeddings lie outside the margin in the embedding space, so no loss is incurred)
L(xp, xq) = max(0, m² − ||xp − xq||²)
https://www.cs.cornell.edu/~kb/publications/SIG15ProductNet.pdf
Hinge Loss: Negative Pair (embeddings closer than the margin m ⇒ loss > 0)
(Diagram: the query and negative embeddings fall inside the margin, so a loss is incurred)
L(xp, xq) = max(0, m² − ||xp − xq||²)
https://www.cs.cornell.edu/~kb/publications/SIG15ProductNet.pdf
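The two formulas above, transcribed into a short PyTorch function (a sketch; the variable names follow the slides and the margin value is an arbitrary placeholder).

```python
import torch

def pair_hinge_loss(x_q, x_other, is_positive, m=1.0):
    """Hinge loss from the slides.

    Positive pair:  L = ||xp - xq||^2
    Negative pair:  L = max(0, m^2 - ||xn - xq||^2)
    """
    sq_dist = torch.sum((x_q - x_other) ** 2, dim=1)
    pos_loss = sq_dist
    neg_loss = torch.clamp(m ** 2 - sq_dist, min=0.0)
    return torch.where(is_positive, pos_loss, neg_loss).mean()

# Toy embeddings, standing in for the outputs of the two CNNs.
q = torch.randn(4, 128)
other = torch.randn(4, 128)
labels = torch.tensor([True, False, True, False])
print(pair_hinge_loss(q, other, labels))
```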
Visualization of Learned Features
Ahmed, Jones, Marks, An improved deep learning architecture for person re-identification, 2015
Positive Pair Negative Pair
Example: Omniglot dataset
■ Omniglot contains examples from 50 alphabets:
• well-established international languages (Latin and
Korean)
• lesser known local dialects.
Koch, Zemel, Salakhutdinov, Siamese Neural Networks for One-shot Image Recognition, 2015
Omniglot dataset: classification task
Input:
Classes:
Results:
Koch, Zemel, Salakhutdinov, Siamese Neural Networks for One-shot Image Recognition, 2015
Omniglot dataset: Verification task
Results:
■ Training on 3 dataset sizes: 30,000, 90,000, and 150,000 pairs
■ Sampling random same and different pairs.
■ Generating 8 random affine distortions for each category
Koch, Zemel, Salakhutdinov, Siamese Neural Networks for One-shot Image Recognition, 2015
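A sketch of how such verification pairs might be generated (my own illustration; the in-memory dataset, class names, and distortion parameters below are hypothetical, and the paper's exact augmentation settings may differ).

```python
import random
from torchvision import transforms
from PIL import Image

# Hypothetical in-memory dataset: character class -> list of PIL images.
dataset = {"Latin_a": [Image.new("L", (105, 105))] * 20,
           "Latin_b": [Image.new("L", (105, 105))] * 20}

# Small random affine distortions used for augmentation.
affine = transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.8, 1.2), shear=10)

def sample_pair(same: bool):
    """Return a (possibly distorted) same-class or different-class image pair."""
    if same:
        cls = random.choice(list(dataset))
        a, b = random.sample(dataset[cls], 2)
    else:
        c1, c2 = random.sample(list(dataset), 2)
        a, b = random.choice(dataset[c1]), random.choice(dataset[c2])
    return affine(a), affine(b), int(same)

pairs = [sample_pair(same=random.random() < 0.5) for _ in range(8)]
```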
DeepFace (Facebook, 2014)
■ The conventional pipeline:
Detect ⇒ align ⇒ represent ⇒ classify
■ Face alignment: Transform a face to be in a canonical
pose
■ Face representation: Find a representation of a face
which is suitable for follow-up tasks (small size,
computationally cheap to compare, invariant to
irrelevant changes)
■ 3D face modeling
■ A nine-layer deep neural network
■ More than 120 million parameters
Taigman, Yang, Ranzato, Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, 2014
The alignment process - 'Frontalization'
Taigman, Yang, Ranzato, Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, 2014
The alignment process - 'Frontalization'
(a) 2D Alignment - detecting 6 fiducial points inside the detection
crop, centered at the center of the eyes, tip of the nose and
mouth locations
(b) 2D Alignment - aligned crop: composing the final 2D
transformation
Taigman, Yang, Ranzato, Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, 2014
The alignment process - 'Frontalization'
(c) 2D Alignment - localizing additional 67 fiducial points
(d) 3D Alignment - The reference 3D shape transformed to the 2D-
aligned crop image-plane.
Taigman, Yang, Ranzato, Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, 2014
The alignment process - 'Frontalization'
(e) 2D-3D Alignment - Triangle visibility w.r.t. the fitted 3D-2D
camera; darker triangles are less visible.
(f) 3D Alignment - The 67 fiducial points induced by the 3D model
that are used to direct the piece-wise affine warping.
(g) 2D Alignment - The final frontalized crop
Taigman, Yang, Ranzato, Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, 2014
The DeepFace architecture
C1, M2, C3 : Extract low-level features (simple edges and texture)
L4, L5, L6: Locally connected Layers
L7, L8: Fully connected Layers
Taigman, Yang, Ranzato, Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, 2014
The training process
■ We use cross-entropy as the loss function: L = −log p_K
(K is the index of the true label for a given input and p_K its predicted probability)
■ The loss is minimized over the parameters by computing the
gradient of L with respect to the parameters and by updating the
parameters using stochastic gradient descent (SGD).
■ The gradients are computed by standard backpropagation of the
error.
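A generic PyTorch sketch of this training step (not the actual DeepFace code): cross-entropy over identity scores, gradients via backpropagation, and an SGD update. The classifier dimensions and class count are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder classifier: 4096-d face representation -> K identity classes.
K = 1000                                    # number of identity classes (placeholder)
model = nn.Sequential(nn.Linear(4096, K))
criterion = nn.CrossEntropyLoss()           # L = -log p_K for the true class K
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

features = torch.randn(32, 4096)            # a toy mini-batch of representations
labels = torch.randint(0, K, (32,))         # true identity indices

logits = model(features)
loss = criterion(logits, labels)            # cross-entropy loss
optimizer.zero_grad()
loss.backward()                             # gradients by backpropagation
optimizer.step()                            # SGD parameter update
```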
Face verification
■ In order to tell whether two images of faces show the same person,
they try three different methods.
■ Each of these relies on the vector extracted by the first fully
connected layer in the network (4096d).
■ Let these vectors be f1 (image 1) and f2 (image 2). The methods
are then:
1. Inner product
2. Weighted χ² (chi-squared) distance
3. Siamese network.
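A hedged sketch of the first two scoring methods on the 4096-d vectors f1 and f2 (the weights and thresholds below are placeholders; in the paper the χ² weights are learned, e.g. with a linear SVM).

```python
import numpy as np

def inner_product_score(f1, f2):
    """Method 1: inner product between the two representations."""
    return float(np.dot(f1, f2))

def weighted_chi2(f1, f2, w, eps=1e-8):
    """Method 2: weighted chi-squared distance, sum_i w_i (f1_i - f2_i)^2 / (f1_i + f2_i)."""
    return float(np.sum(w * (f1 - f2) ** 2 / (f1 + f2 + eps)))

# Toy non-negative 4096-d representations and uniform placeholder weights.
rng = np.random.default_rng(2)
f1, f2 = rng.random(4096), rng.random(4096)
w = np.ones(4096)

same_by_ip = inner_product_score(f1, f2) > 1000.0   # placeholder threshold
same_by_chi2 = weighted_chi2(f1, f2, w) < 300.0     # placeholder threshold
print(same_by_ip, same_by_chi2)
```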
Results
■ LFW (Labeled Faces in the Wild): 13,323 web photos of
5,749 celebrities
Human Performance : 97.5% accuracy
The DeepFace Performance: 97.35% accuracy
■ YTF (YouTube Faces): 3425 YouTube videos of 1,595
subjects
The DeepFace Performance: 92.5% accuracy
(reduces the previous error by more than 50%!)
Triplets Network
■ The Triplet Loss minimizes the distance between an anchor and a
positive, both of which have the same identity, and maximizes the
distance between the anchor and a negative of a different identity.
Triplet Network
Hoffer, Ailon, Deep Metric Learning Using Triplet Network, 2015
(Diagram: the three inputs xn, xq, xp are fed through the same network Net; a comparator computes ||Net(xq) − Net(xn)||₂ and ||Net(xq) − Net(xp)||₂)
FaceNet (Google, 2015)
■ The FaceNet approach uses a Euclidean embedding space to
measure the similarity or difference between faces.
■ The method uses a deep convolutional network trained
to directly optimize the embedding itself, rather than an
intermediate bottleneck layer as in previous deep
learning approaches.
■ They used a large-scale dataset.
https://arxiv.org/pdf/1503.03832v3.pdf
The training process
The network consists of a batch input layer and a deep CNN
followed by L2 normalization, which results in the face
embedding. This is followed by the triplet loss during training.
Schroff, Kalenichenko, Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering, 2015
The training process
■ We learn an embedding f(x), from an image x into a feature space ℝᵈ,
such that the squared distance between all faces of the same
identity is small, whereas the squared distance between a pair of
face images from different identities is large.
■ Loss function (f is the embedding function, α is the margin between positive and negative pairs):
L = Σᵢ₌₁ᴺ [ ||f(x_q,i) − f(x_p,i)||₂² − ||f(x_q,i) − f(x_n,i)||₂² + α ]₊
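The same loss written as a short PyTorch function (a sketch; note that PyTorch's built-in nn.TripletMarginLoss is closely related but uses non-squared distances by default).

```python
import torch

def triplet_loss(f_q, f_p, f_n, alpha=0.2):
    """L = sum_i [ ||f(x_q,i) - f(x_p,i)||^2 - ||f(x_q,i) - f(x_n,i)||^2 + alpha ]_+"""
    pos_sq = torch.sum((f_q - f_p) ** 2, dim=1)   # squared anchor-positive distances
    neg_sq = torch.sum((f_q - f_n) ** 2, dim=1)   # squared anchor-negative distances
    return torch.clamp(pos_sq - neg_sq + alpha, min=0.0).sum()

# Toy L2-normalized embeddings for anchor (query), positive, and negative.
normalize = lambda x: torch.nn.functional.normalize(x, dim=1)
f_q, f_p, f_n = (normalize(torch.randn(16, 128)) for _ in range(3))
print(triplet_loss(f_q, f_p, f_n))
```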
Triplets Selection
■ In order to ensure fast convergence it is crucial to select triplets that violate
the triplet constraint above.
■ We could select x_p,i and x_n,i as:
(1) hard positive: argmax over x_p,i of ||f(x_q,i) − f(x_p,i)||₂²
(2) hard negative: argmin over x_n,i of ||f(x_q,i) − f(x_n,i)||₂²
■ But choosing all anchor-positive pairs in a mini-batch while selecting only
hard negatives gave better results.
■ To avoid local minima, we use “semi-hard” exemplars: x_n,i such that
||f(x_q,i) − f(x_p,i)||₂² < ||f(x_q,i) − f(x_n,i)||₂² < ||f(x_q,i) − f(x_p,i)||₂² + α
(Diagram: the embedding location of the anchor, with hard-positive and hard-negative examples and the positive/negative spaces used when forming triplets)
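A minimal sketch of semi-hard negative mining within a mini-batch, following the selection rule above (an illustration, not the FaceNet implementation; it assumes we already have embeddings and integer identity labels for the batch).

```python
import torch

def semi_hard_negatives(emb, labels, alpha=0.2):
    """For each (anchor, positive) pair, pick a negative n with
    d(a, p) < d(a, n) < d(a, p) + alpha, if one exists in the batch."""
    dist = torch.cdist(emb, emb).pow(2)   # squared pairwise distances
    triplets = []
    for a in range(len(emb)):
        for p in range(len(emb)):
            if a == p or labels[a] != labels[p]:
                continue  # (a, p) must be a distinct anchor-positive pair
            d_ap = dist[a, p]
            mask = (labels != labels[a]) & (dist[a] > d_ap) & (dist[a] < d_ap + alpha)
            candidates = mask.nonzero(as_tuple=True)[0]
            if len(candidates) > 0:
                n = candidates[torch.randint(len(candidates), (1,))].item()
                triplets.append((a, p, n))
    return triplets

emb = torch.nn.functional.normalize(torch.randn(32, 128), dim=1)
labels = torch.randint(0, 8, (32,))
print(len(semi_hard_negatives(emb, labels)))
```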
The Model structure (after training)
(Diagram: inputs are passed through the trained network and compared by the distance between their embeddings)
The FaceNet architecture (NN1)
Schroff, Kalenichenko, Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering, 2015
The FaceNet architecture (NN2)
Schroff, Kalenichenko, Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering, 2015
The results – New records!
■ Performance on YouTube Faces DB: 95.12% accuracy
■ Performance on Labeled Faces in the Wild DB: 99.63%
accuracy
■ Sensitivity to Image Quality:
(The CNN was trained on 220x220 input images)
Schroff, Kalenichenko, Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering, 2015
# Pixels Val-Rate
1,600 37.8%
6,400 79.5%
14,400 84.5%
25,600 85.7%
65,536 86.4%
FaceNet – Image clustering
Schroff, Kalenichenko, Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering, 2015
Summary
The subjects we covered:
■ Embedding - similarity in the target (latent) space
■ One-Shot learning Challenges
■ Network principles: Siamese Network and Triplets
■ Networks in use: DeepFace and FaceNet
References
■ Chopra, Hadsell, LeCun, Learning a Similarity Metric Discriminatively, with Application to
Face Verification , 2005
■ Taigman, Yang, Ranzato, Wolf, DeepFace: Closing the Gap to Human-Level Performance
in Face Verification, 2014
■ Schroff, Kalenichenko, Philbin, FaceNet: A Unified Embedding for Face Recognition and
Clustering, 2015
■ Koch, Zemel, Salakhutdinov, Siamese Neural Networks for One-shot Image Recognition,
2015
■ Hermans, Beyer, Leibe, In Defense of the Triplet Loss for Person Re-Identification, 2017
■ LeCun, Chopra, Hadsell, Ranzato, Huang, A Tutorial on Energy-Based Learning, 2006
■ Bell, Bala, Learning visual similarity for product design with convolutional neural
networks, 2015
■ Ahmed, Jones, Marks, An improved deep learning architecture for person re-
identification, 2015
■ Nando de Freitas, Max-margin learning, transfer and memory networks, 2015
■ Jagannatha Rao, Wang, Cottrell, A Deep Siamese Neural Network Learns the Human-
Perceived Similarity Structure of Facial Expressions Without Explicit Categories, 2011
■ Hoffer, Ailon, Deep Metric Learning Using Triplet Network, 2015
■ Bellet, Habrard, Sebban, A survey on metric learning for feature vectors and structured data
THANKS!
Keren Ofek
Editor's Notes
  • #10: https://hackernoon.com/latent-space-visualization-deep-learning-bits-2-bd09a46920df
  • #11: Other older methods: converting an image into a color histogram, or using L1. The problem with the older methods is that we fix the embedding and the distance function in advance, and they do not always match our goal (image recognition). For example, the same object on a completely different background, or a flipped image, will produce a completely different histogram.
  • #12: https://www.youtube.com/watch?v=azXCzI57Yfc, https://hackernoon.com/latent-space-visualization-deep-learning-bits-2-bd09a46920df
  • #13: We perform the embedding and the choice of metric together, as part of a CNN, in contrast to older methods such as color histograms with Euclidean distance.
  • #15: The model does not learn the classification categories
  • #18: The energy and the loss can be made zero by simply making Gw(x1) a constant function. Therefore our loss function needs a contrastive term to ensure not only that the energy for a pair of inputs from the same category is low, but also that the energy for a pair from different categories is large. This problem does not occur with properly normalized probabilistic models because making the probability of a particular pair high automatically makes the probability of other pairs low.
  • #19: Siamese Neural Networks One-shot Image Recognition P is a logistic function
  • #21: Euclidian Loss Bell, S. and Bala, K., 2015. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG), 34(4), p.98.
  • #22: Bell, S. and Bala, K., 2015. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG), 34(4), p.98.
  • #23: Bell, S. and Bala, K., 2015. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG), 34(4), p.98.
  • #26: 20-way one-shot classification task
  • #33: (h) 3D Alignment - A new view generated by the 3D model (not used in DeepFace).
  • #34: They use locally connected layers for the next three layers. This is like a convolutional layer but with each part of the image learning a different filter bank, which makes sense because the face has been normalized and the filters relevant for one part of the face probably is not for another. Areas between the eyes and the eyebrows exhibit very different appearance and have much higher discrimination ability compared to areas between the nose and the mouth. The use of local layers does not affect the computational burden of feature extraction, but does affect the number of parameters subject to training. Only because we have a large labeled dataset, we can afford three large locally connected layers.
  • #36: Backup after slide “The training process” – Deepface
  • #37: Let these vectors be f1 (image 1) and f2 (image 2). The methods are then: Inner product between f1 and f2. The classification (same person/not same person) is then done by a simple threshold. Weighted X^2 (chi-squared) distance. Equation, per vector component i: weight_i (f1[i] - f2[i])^2 / (f1[i] + f2[i]). The vector is then fed into an SVM. Siamese network. Means here simply that the absolute distance between f1 and f2 is calculated (|f1-f2|), each component is weighted by a learned weight and then the sum of the components is calculated. If the result is above a threshold, the faces are considered to show the same person.
  • #38: YTF : We employ the DeepFace-single representation directly by creating, for every pair of training videos, 50 pairs of frames, one from each video, and label these as same or not-same in accordance with the video training pair. Then a weighted χ 2 model is learned as in Sec. 4.1. Given a test-pair, we sample 100 random pairs of frames, one from each video, and use the mean value of the learned weighed similarity. The comparison with recent methods is shown in Table 4 and Fig. 4. We report an accuracy of 91.4% which reduces the error of the previous best methods by more than 50%. Note that there are about 100 wrong labels for video pairs, recently updated to the YTF webpage. After these are corrected, DeepFace-single actually reaches 92.5%. This experiment verifies again that the DeepFace method easily generalizes to a new target domain.
  • #39: where N is a possible set of all triplets in the dataset; f (x) - embedding, that translates image x into Euclidean space; x a j - ”anchor” image; x p j - ”positive” image, which should be close to an anchor image; x n j - ”negative” image; γ - margin between positive and negative images.
  • #43: α is a margin that is enforced between positive and negative pairs. Superscript a, p, n denote the anchor, positive and negative, respectively
  • #44: α is a margin that is enforced between positive and negative pairs. Superscript a, p, n denote the anchor, positive and negative, respectively