Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019

@DocXavi
Module 6 - Day 8 - Lecture 1
Self-supervised Learning
from Video Sequences
28th March 2019
[http://pagines.uab.cat/mcv/]
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya

2
Outline
1. Unsupervised Learning
2. Self-supervised Learning
a. Autoencoder
b. Temporal regularisations
c. Temporal verifications
d. Predictive Learning
e. Miscellaneous: optical flow, color & multiview

3
Types of machine learning
Yann LeCun’s Black Forest cake
Slide credit: Yann LeCun

4
Supervised learning
Slide credit: Kevin McGuinness
ŷ

5
Unsupervised learning
Slide credit: Kevin McGuinness
ŷ

6
We can categorize three types of learning procedures:
1. Supervised Learning: 𝐲 = ƒ(𝐱)
Predict label y corresponding to observation x.
2. Unsupervised Learning: ƒ(𝐱)
Estimate the distribution of observation x.
3. Reinforcement Learning (RL): 𝐲 = ƒ(𝐱), with reward 𝐳
Predict action y based on observation x, to maximize a future reward z.


8
Why Unsupervised Learning?
● It is how intelligent beings naturally perceive the world.
● Compared with a fully supervised approach, it can save a great deal of effort when building a human-like intelligent agent.
● Vast amounts of unlabelled data are available.

9
Assumptions for Unsupervised Learning
Slide: Kevin McGuinness (DLCV UPC 2017)
To model P(X) given data, it is necessary to make some assumptions
“You can’t do inference without making assumptions”
-- David MacKay, Information Theory, Inference, and Learning Algorithms

Typical assumptions:
● Smoothness assumption
○ Points which are close to each other are more likely to share a label.
● Cluster assumption
○ The data form discrete clusters; points in the same cluster are likely to share a label
● Manifold assumption
○ The data lie approximately on a manifold of much lower dimension than the input space.

11
The manifold hypothesis
[Figure: two plots over axes x1 and x2, one showing a linear manifold (wᵀx + b) and one showing a non-linear manifold.]

12
The manifold hypothesis
The data distribution lies close to a low-dimensional manifold.
Example: consider image data
● Very high dimensional (1,000,000D)
● A randomly generated image will almost certainly not look like any real-world scene
○ The space of images that occur in nature is almost completely empty
● Hypothesis: real-world images lie on a smooth, low-dimensional manifold
○ Manifold distance is a good measure of similarity
Similar for audio and text

13
Video lectures on Unsupervised Learning
Kevin McGuinness, UPC DLCV 2016
Xavier Giró, UPC DLAI 2017

14
Outline
2. Self-supervised Learning

15
Acknowledgements
Víctor Campos, Junting Pan, Xunyu Lin, Sebastian Palacio, Carlos Arenas

16
Self-supervised learning
Reference: Andrew Zisserman (PAISS 2018)
Self-supervised learning is a form of unsupervised learning where the data
provides the supervision.
● A surrogate task must be invented by withholding a part of the unlabeled
data and training the NN to predict it.
Unlabeled data
(X)

● By defining a proxy loss, the NN learns representations that should be valuable for the actual target task.
[Figure: the NN prediction ŷ feeds the loss; the representations are learned without labels.]
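The idea of withholding part of the unlabeled data can be illustrated with a minimal sketch. The function name `withhold` and its parameters are hypothetical, chosen only for this example; the sample is assumed to be a plain list of values.

```python
def withhold(sample, start, length, fill=0):
    """Hide `length` values of an unlabeled sample, starting at `start`.

    Returns (visible, target): the network would see `visible` and be
    trained to predict `target`, so the data provides its own supervision.
    """
    visible = list(sample)
    target = visible[start:start + length]
    visible[start:start + length] = [fill] * length
    return visible, target


visible, target = withhold([5, 6, 7, 8, 9], start=1, length=2)
print(visible, target)
```

The same pattern applies to images (hide a patch) or video (hide a frame); only the way the withheld part is chosen changes.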

18
Outline
a. Autoencoder

19
Autoencoder (AE)
Fig: “Deep Learning Tutorial” Stanford
Autoencoders:
● Predict at the output the same data given at the input.
● Do not need labels.

20
Autoencoder (AE)
What is the use of an autoencoder?

21
Autoencoder (AE)
Dimensionality reduction:
Use the hidden layer as a feature extractor of any desired size.

22
Autoencoder (AE)
[Figure: data → Encoder (W1) → latent variables h (representation/features) → Decoder (W2) → reconstruction. Loss: reconstruction error.]
Pretraining:
1. Initialize a NN by solving an autoencoding problem.

23
Autoencoder (AE)
[Figure: data → Encoder (W1) → latent variables h (representation/features) → Classifier (WC) → prediction y. Loss: cross entropy.]
Pretraining:
1. Initialize a NN by solving an autoencoding problem.
2. Train for the final task with “few” labels.
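Step 1 of the pretraining recipe can be sketched with a toy linear autoencoder in plain Python. Everything here is an assumption made for illustration: the 2D synthetic data lying near a line, the single latent variable, the learning rate, and all function names. It is not the architecture shown in the slides.

```python
import random


def make_data(n=50, seed=0):
    """2D points close to the line x2 = 2 * x1 (a 1D manifold plus noise)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = rng.uniform(-1.0, 1.0)
        data.append([t + rng.gauss(0, 0.01), 2 * t + rng.gauss(0, 0.01)])
    return data


def encode(w1, x):
    """Encoder W1: project the input onto one latent variable h."""
    return w1[0] * x[0] + w1[1] * x[1]


def decode(w2, h):
    """Decoder W2: map the latent variable back to input space."""
    return [w2[0] * h, w2[1] * h]


def reconstruction_loss(w1, w2, data):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        x_hat = decode(w2, encode(w1, x))
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)


def train_autoencoder(data, steps=500, lr=0.05):
    """Full-batch gradient descent on the reconstruction loss."""
    w1, w2 = [0.1, 0.1], [0.1, 0.1]
    for _ in range(steps):
        g1, g2 = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            h = encode(w1, x)
            err = [w2[j] * h - x[j] for j in range(2)]
            back = err[0] * w2[0] + err[1] * w2[1]  # error sent back to h
            for j in range(2):
                g2[j] += 2 * err[j] * h / len(data)
                g1[j] += 2 * back * x[j] / len(data)
        w1 = [w1[j] - lr * g1[j] for j in range(2)]
        w2 = [w2[j] - lr * g2[j] for j in range(2)]
    return w1, w2


data = make_data()
initial_loss = reconstruction_loss([0.1, 0.1], [0.1, 0.1], data)
w1, w2 = train_autoencoder(data)
final_loss = reconstruction_loss(w1, w2, data)
print(initial_loss, final_loss)
```

For step 2, the trained `encode` would be kept as the feature extractor and a classifier would be trained on `h` with the few available labels; the decoder is discarded.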

24
Outline
b. Temporal regularisations

25
Temporal regularization: ISA
Le, Quoc V., Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis." CVPR 2011
Features are learned with Independent Subspace Analysis (ISA), using convolution and pooling operations.

26
Le, Quoc V., Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis." CVPR 2011
Features are learned without supervision from 3D (space + time) video blocks.

27
Le, Quoc V., Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis." CVPR 2011
Feature visualizations.

28
Temporal regularization: Slowness
Goroshin, Ross, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. "Unsupervised learning of spatiotemporally coherent metrics." ICCV 2015.
Assumption: adjacent video frames contain semantically similar information.
An autoencoder is trained with slowness and sparsity regularizations.

29
Temporal regularization: Slowness
Jayaraman, Dinesh, and Kristen Grauman. "Slow and steady feature analysis: higher order temporal coherence in video." CVPR 2016. [video]
Slow feature analysis
● Temporal coherence assumption: features should change slowly over time in video.
Steady feature analysis
● Second-order changes are also small: changes in the past should resemble changes in the future.
Train on triplets of frames from video.
The loss encourages nearby frames to have slow and steady features, and far frames to have different features.
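The slow, steady and contrastive terms described on this slide can be sketched as follows. This is a simplified illustration and not the exact loss of the cited paper: the margin, the weighting, the use of squared Euclidean distances, and the separate `z_far` argument for a distant frame are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def slow_steady_loss(z_a, z_b, z_c, z_far, margin=1.0, steady_weight=1.0):
    """z_a, z_b, z_c: features of three consecutive frames.
    z_far: features of a temporally distant frame (hypothetical argument).
    """
    # Slow: first-order changes between neighbouring frames are small.
    slow = sq_dist(z_a, z_b) + sq_dist(z_b, z_c)
    # Steady: the change a->b should resemble the change b->c.
    d_ab = [y - x for x, y in zip(z_a, z_b)]
    d_bc = [y - x for x, y in zip(z_b, z_c)]
    steady = sq_dist(d_ab, d_bc)
    # Contrast: a far frame is pushed at least `margin` away.
    contrast = max(0.0, margin - sq_dist(z_a, z_far))
    return slow + steady_weight * steady + contrast


# Features moving at constant velocity: the steady term is zero.
print(slow_steady_loss([0, 0], [1, 0], [2, 0], [10, 0]))
# Features that change direction: the steady term is penalized.
print(slow_steady_loss([0, 0], [1, 0], [1, 1], [10, 0]))
```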

30
Outline
c. Temporal verification

31
Related work on still images
Doersch, Carl, Abhinav Gupta, and Alexei A. Efros. "Unsupervised visual representation learning by context prediction."
ICCV 2015.
A surrogate task is defined by exploiting the spatial context.

32
Related work on still images
Doersch, Carl, Abhinav Gupta, and Alexei A. Efros. "Unsupervised visual representation learning by context prediction."
ICCV 2015.
What video-specific surrogate tasks can you think of?

33
Temporal coherence
(Slides by Xunyu Lin): Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using
temporal order verification." ECCV 2016. [code]
The temporal order of frames is exploited as the supervisory signal for learning.

34
Temporal coherence
The temporal order is taken as the supervisory signal for learning.
[Figure: shuffled sequences → binary classification: in order / not in order.]
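A minimal sketch of how training examples for this binary task could be generated from frame indices. The function names, the tuple length and the 50/50 sampling are assumptions; in particular, treating only increasing order as "in order" is a simplification, and the cited paper selects its tuples more carefully.

```python
import random


def is_in_order(indices):
    """True when the frame indices are strictly increasing."""
    return all(a < b for a, b in zip(indices, indices[1:]))


def sample_order_example(num_frames, rng, tuple_len=3):
    """Return (frame_indices, label): label 1 = in order, 0 = not in order."""
    indices = sorted(rng.sample(range(num_frames), tuple_len))
    if rng.random() < 0.5:
        return indices, 1
    shuffled = indices[:]
    while is_in_order(shuffled):  # reshuffle until the order is broken
        rng.shuffle(shuffled)
    return shuffled, 0


rng = random.Random(0)
for _ in range(3):
    print(sample_order_example(num_frames=30, rng=rng))
```

No manual annotation is involved: the label comes only from whether the sampled indices were shuffled.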


37
Temporal verification
#Odd-one-out Fernando, Basura, Hakan Bilen, Efstratios Gavves, and Stephen Gould. "Self-supervised video
representation learning with odd-one-out networks." ICCV 2017
Train a network to detect which of the video sequences contains frames in the wrong order.
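A hedged sketch of how one odd-one-out training example could be assembled. The function name, the number of clips and the clip length are hypothetical, and frames are assumed to be distinct so that a shuffle is always detectable; the sampling strategy of the cited paper may differ.

```python
import random


def make_odd_one_out(frames, rng, num_clips=4, clip_len=3):
    """Return (clips, odd_index): all clips keep their temporal order
    except the one at `odd_index`, whose frames are shuffled."""
    clips = []
    for _ in range(num_clips):
        start = rng.randrange(len(frames) - clip_len + 1)
        clips.append(list(frames[start:start + clip_len]))
    odd_index = rng.randrange(num_clips)
    wrong = clips[odd_index][:]
    while wrong == clips[odd_index]:  # make sure the order really changes
        rng.shuffle(wrong)
    clips[odd_index] = wrong
    return clips, odd_index


clips, odd_index = make_odd_one_out(list(range(10)), random.Random(1))
print(clips, odd_index)
```

The network would receive all clips and be trained to output `odd_index`.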

38
Temporal coherence
Lee, Hsin-Ying, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. "Unsupervised representation learning by sorting
sequences." ICCV 2017.
Sort the sequence of frames.
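Sorting can be cast as classification over permutations, as in this minimal sketch. Treating every permutation as its own class is an assumption made here for simplicity, and the class definition in the cited paper may differ; the function name is hypothetical.

```python
import itertools
import random


def make_sorting_example(frames, rng):
    """Shuffle the frames with a random permutation.

    Returns (shuffled_frames, label, perms), where perms[label] lists,
    for each shuffled position, the original index of that frame.
    """
    perms = list(itertools.permutations(range(len(frames))))
    label = rng.randrange(len(perms))
    shuffled = [frames[i] for i in perms[label]]
    return shuffled, label, perms


frames = ["f0", "f1", "f2", "f3"]
shuffled, label, perms = make_sorting_example(frames, random.Random(0))

# Predicting `label` is enough to put the frames back in order.
restored = [None] * len(frames)
for position, source in enumerate(perms[label]):
    restored[source] = shuffled[position]
print(shuffled, restored)
```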

39
Temporal coherence
#T-CAM Wei, Donglai, Joseph J. Lim, Andrew Zisserman, and William T. Freeman. "Learning and using
the arrow of time." CVPR 2018.
Predict whether the video moves forward or backward.
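The arrow-of-time labels also come for free from the data, as this small sketch shows. The function name and the 50/50 reversal probability are assumptions made for the example.

```python
import random


def make_arrow_example(clip, rng):
    """Return (frames, label): label 1 = played forward, 0 = played backward."""
    if rng.random() < 0.5:
        return list(clip), 1
    return list(reversed(clip)), 0


rng = random.Random(0)
print(make_arrow_example(["f0", "f1", "f2"], rng))
```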

40
Outline
d. Frame Prediction

41
Predictive Learning
Slide credit: Yann LeCun

42
Frame Prediction
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video
Representations using LSTMs." In ICML 2015. [Github]
Learning video representations (features) by...

43
Frame Prediction
Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised Learning of Video Representations using LSTMs." ICML 2015. [Github]
(1) frame reconstruction (AE):

44
Frame Prediction
(2) frame prediction
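The two training signals, (1) reconstruction and (2) prediction, can both be derived from a single unlabeled clip. The sketch below assumes, following the cited paper, that the reconstruction target is the input sequence in reverse order; the function name and the split point are hypothetical.

```python
def make_targets(clip, num_input):
    """Split a clip into encoder inputs and two decoder targets."""
    inputs = clip[:num_input]
    reconstruction_target = inputs[::-1]  # (1) reconstruct the seen frames
    prediction_target = clip[num_input:]  # (2) predict the unseen frames
    return inputs, reconstruction_target, prediction_target


print(make_targets([0, 1, 2, 3, 4, 5], num_input=4))
```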

45
Frame Prediction
Features learned without supervision (lots of data) are fine-tuned for activity recognition (small data).

46
Frame Prediction
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016 [project] [code]
Video frame prediction with a ConvNet.

47
Frame Prediction
The blurry predictions obtained with MSE (ℓ2) are improved with a multi-scale architecture, adversarial training and an image gradient difference loss (GDL) function.
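A minimal sketch of a gradient difference loss on small grayscale images stored as nested lists. It compares the absolute differences between neighbouring pixels in the target and in the prediction, which penalizes blurred edges; the exponent `alpha` and the unnormalized sum are assumptions of this illustration.

```python
def gradient_difference_loss(target, pred, alpha=1):
    """Sum over pixels of | |grad target| - |grad pred| | ** alpha,
    using vertical and horizontal neighbour differences."""
    rows, cols = len(target), len(target[0])
    loss = 0.0
    for i in range(rows):
        for j in range(cols):
            if i > 0:  # vertical neighbour
                g_t = abs(target[i][j] - target[i - 1][j])
                g_p = abs(pred[i][j] - pred[i - 1][j])
                loss += abs(g_t - g_p) ** alpha
            if j > 0:  # horizontal neighbour
                g_t = abs(target[i][j] - target[i][j - 1])
                g_p = abs(pred[i][j] - pred[i][j - 1])
                loss += abs(g_t - g_p) ** alpha
    return loss


sharp = [[0, 0, 1, 1], [0, 0, 1, 1]]               # a sharp vertical edge
blurry = [[0, 0.25, 0.75, 1], [0, 0.25, 0.75, 1]]  # the same edge, blurred
print(gradient_difference_loss(sharp, blurry))
```

A prediction with a perfectly sharp edge gives a loss of zero, while the blurred edge is penalized even though its pixel values are close to the target on average.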

48
Frame Prediction

49
Frame Prediction + Disentangled features
#DrNet Denton, Emily L. "Unsupervised learning of disentangled representations from video." NIPS 2017.
The model learns to disentangle (“separate”) the visual features that correspond to the:
● Object Pose (wrt the camera)
● Object Content (class)

50
#DrNet Denton, Emily L. "Unsupervised learning of disentangled representations from video." NIPS 2017.
#MCNet R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. "Decomposing motion and content for natural video sequence prediction." ICLR 2017.
100-step video generation on KTH, where green frames indicate conditioned input and red frames indicate generations.
Generations from the MCNet of Villegas et al. (2017) are shown for comparison.

51
A CNN video architecture learns disentangled features for appearance & motion when trained for frame prediction.
[Figure: architecture diagram with the following labelled blocks.
Input clip: 20 frames.
Encoder (backbone for the two streams): 2D spatial conv over the [t-N, t] frames, giving a_feat; temporal conv over the [t-N, t] frames (w_size: N), giving m_feat.
FC layer: F(a_feat, m_feat).
Decoder: 2D spatial deconv for frame t, with pixel_loss on frame t; 2D spatial deconv for frame t+K, with pixel_loss on frame t+K.
total_loss: sum of both pixel losses.
Block gradients.]
#DisNet Carlos Arenas, Victor Campos, Sebastian Palacio, Xavier Giro-i-Nieto, “Video Understanding through the Disentanglement of Appearance and Motion” MSc thesis, ETSETB TelecomBCN 2018.

52
#DisNet Carlos Arenas, Victor Campos, Sebastian Palacio, Xavier Giro-i-Nieto, “Video Understanding through the Disentanglement of Appearance and Motion” MSc thesis, ETSETB TelecomBCN 2018.
A synthetic dataset of moving MNIST digits was built to have access to a virtually infinite amount of data.
‐ Train:
○ 5000 clips: (0) Horizontal
○ 5000 clips: (3) Vertical
‐ Validation:
○ 500 clips: (3) Horizontal
○ 500 clips: (0) Vertical
‐ Bounding angle: 180º
‐ Speed: 8 pixels/frame
‐ Size: original scale 1:1
[Figure: a clip of 20 frames, each 64 × 64 pixels.]
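As an illustration of how such clips might be synthesized, the sketch below computes the position of a digit along one axis over a clip. Several points are assumptions rather than details from the thesis: the digit is taken to be 28 pixels wide (the native MNIST size, consistent with scale 1:1), the 180º angle is read as reversing direction at the frame border, and the function name and start position are hypothetical.

```python
def bounce_positions(num_frames=20, frame_size=64, digit_size=28,
                     speed=8, start=0):
    """Top-left coordinate of the digit along one axis, frame by frame."""
    max_pos = frame_size - digit_size  # last position that fits in the frame
    pos, step = start, speed
    positions = []
    for _ in range(num_frames):
        positions.append(pos)
        nxt = pos + step
        if nxt < 0 or nxt > max_pos:  # hit a border: reverse direction
            step = -step
            nxt = pos + step
        pos = nxt
    return positions


print(bounce_positions())
```

Using the positions as x coordinates gives a horizontal clip, and using them as y coordinates gives a vertical one.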

53
Outline
e. Miscellaneous: optical flow, color & multiview

54
Noisy labels from motion (optical flow)
Pathak, Deepak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. "Learning features by watching objects move." CVPR 2017
Noisy labels can be built from optical flow computed with a handcrafted tool.
Training the NN on these labels regularizes part of the noise present in the annotations.

55
Noisy labels from color
Vondrick, Carl, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. "Tracking emerges by colorizing videos." ECCV 2018. [blog]
A NN (a CNN) is trained to colorize a video frame, given the color of the first frame of the video sequence.
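One way to colorize a frame from a reference frame is to copy colors with weights given by the similarity between learned embeddings. The sketch below illustrates that idea in plain Python; the function name is hypothetical, and the per-pixel embeddings are given by hand, whereas in practice they would come from the CNN.

```python
import math


def propagate_colors(ref_embed, ref_colors, tgt_embed):
    """For each target pixel, return a softmax-weighted mix of the
    reference colors, weighted by embedding similarity (dot product)."""
    out = []
    for f_t in tgt_embed:
        scores = [sum(a * b for a, b in zip(f_r, f_t)) for f_r in ref_embed]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        channels = len(ref_colors[0])
        out.append([sum(w / norm * c[k] for w, c in zip(weights, ref_colors))
                    for k in range(channels)])
    return out


ref_embed = [[10.0, 0.0], [0.0, 10.0]]  # two reference pixels
ref_colors = [[1.0, 0.0], [0.0, 1.0]]   # their colors
tgt_embed = [[10.0, 0.0]]               # target pixel matching the first one
print(propagate_colors(ref_embed, ref_colors, tgt_embed))
```

Because the same weights tell which reference pixel each target pixel corresponds to, they can be reused to propagate other information, such as an object mask, which is how tracking can emerge.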


57
Learned embeddings can be clustered to track objects.

Temporal + Multiview Weak Labels
Sermanet, Pierre, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain.
"Time-contrastive networks: Self-supervised learning from video." ICRA 2018.



