U-N.o.1T: A U-Net exploration, in Depth
John Luke Chuter, Geoffrey Boris Boullanger, Manuel Nieves Saez
{jchuter, gbakker, mnievess}@stanford.edu
December 18, 2018
Abstract
Hardware progress has enabled solutions which were
historically computationally intractable. This is particu-
larly true in video analysis. This technological advance has
opened a new frontier of problems. Within this expanse,
we have chosen the classic problem of depth inference from
images. Specifically, given a sequence of images captured
over time, we output depth maps corresponding one-to-one
with the input sequence. As a spatiotemporal problem, we
were motivated to model it with convolutions (spatial) and
LSTMs (temporal). These are used in a U-Net encoder-
decoder architecture. The results indicate some potential
in such an approach; the process by which we came to this
conclusion is detailed below.
1. Introduction
Solutions to the above problem would enable 3D world
generation from simple video input with applications from
VR to robotics. While there are hardware approaches to
depth-determination problems, such as lidar or multiple
lenses, software solutions provide flexibility in their appli-
cation. Furthermore, since humans are visual creatures, we
have adapted our environments to be largely determinable
through visual means, such that visual approaches suit these
environments.
1.1. In Depth
After researching this initial problem in depth, we be-
came familiar with literature on depth maps, their algo-
rithms and datasets. This presented itself as a sensible path
forward, as it seemed simpler and better scoped. This area
is a classic one, with not only history but ongoing and re-
cent progress. Concerning depth maps, there are various
families of problems; single image to depth map, depth map
alignments, from sparse to dense - but given the background
research we’d done on the image+depth map sequence, we
were naturally drawn to the most similar problem: from a
sequence of images, generate a sequence of depth maps.
There are many reasons to be excited about such a problem:
spatiotemporal models are hot stuff. We, however, wanted to
learn about RNNs and CNNs, and since space-time lends itself
to natural conceptions of convolutions (space) and recurrent
networks (time), we proceeded down that path.
Quite excited to apply modern RNN and CNN tech-
niques, we were both disappointed and relieved to find ex-
tremely relevant literature: ’DepthNet’ [?], ’Spatiotempo-
ral Modeling for Crowd Counting in Videos’ [?], ’Bidirec-
tional Recurrent Convolutional Networks for Multi-Frame
Super-Resolution’ [?], ’Cross-scene Crowd Counting via
Deep Convolutional Neural Networks’ [?], and ’Pyramid
Dilated Deeper ConvLSTM for Video Salient Object Detec-
tion’ [?]. All these papers address spatiotemporal problems
with RNNs and convolutions.
While there are people who claim "RNNs are dead, long
live convolutions/attention/whatever is hot", we wanted
to explore this avenue further, which brings us to the
literature by those who disagree. Having pursued this
approach, we have formed our own opinion, as will be
discussed at the end.
2. Related Work
It is fitting to begin with the paper that introduced the
core unit of our model, "Convolutional LSTM Network: A Machine
Learning Approach for Precipitation Nowcasting" [?]. This
paper details the convolutional LSTM cell, wherein a typical
LSTM cell performs a convolution at each of its gates. This
enables encoding of spatial information (from the convolution)
while retaining the benefits of the LSTM. The authors then
detail stacking such convLSTM layers to create a deep
convLSTM encoder. The next notable paper, "DepthNet" [?],
presents the model most similar to our own. Specifically, its
authors explore the combination of a U-Net architecture with
convLSTM layers in an encoder-decoder framework for depth
estimation. Our variations from there explore how to implement
bi-directionality, a natural and common extension of most LSTM
models, which we detail in the Methods section below.
"Spatiotemporal Modeling for Crowd Counting in Videos" [?]
demonstrates one method of implementing bidirectionality in a
spatiotemporal setting. "Pyramid Dilated Deeper ConvLSTM for
Video Salient Object Detection" [?] combines multiple advanced
techniques, but tackles a rather different problem. Within this
realm, then, there were several closely related problems to
choose from.
We chose DepthNet [?] as a baseline model to iterate
from. First, a brief description of this baseline: 8 convLSTM
layers are stacked in the encoding phase of a U-Net
encoder-decoder network. These provide connections and skip
connections to the decoding phase, which is made of 4
convolutional and transposed-convolutional pairs. For details
we cannot do justice to here, we refer the reader to the
DepthNet paper.
The DepthNet authors themselves propose several possibil-
ities for alteration, and we came up with a few ourselves.
Alternative models include: an explainability mask to better
predict depth maps for individual objects, an attention
mechanism, or bi-directionality. It was this third option we
chose to explore, as in a network like this there are
surprisingly many ways to incorporate the forward and backward
passes. While this remains an object of experimentation, there
are three principal categories of variation: full-communication,
sparse-communication, and mediation. We chose full-communication
between a left and a right pass over the input image sequence,
as sketched below.
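As one plausible rendering of full-communication (a sketch, not our exact implementation), the forward and backward passes can share their per-step hidden states by concatenation, so every output step sees both directions; fwd_cell and bwd_cell stand for ConvLSTM cells like the one sketched in Section 4.1:

import torch

def bidirectional_convlstm(frames, fwd_cell, bwd_cell, init_state):
    """frames: list of (N, C, H, W) tensors; returns per-step features."""
    # Left-to-right pass.
    h, c = init_state
    fwd_states = []
    for x in frames:
        h, c = fwd_cell(x, (h, c))
        fwd_states.append(h)
    # Right-to-left pass.
    h, c = init_state
    bwd_states = []
    for x in reversed(frames):
        h, c = bwd_cell(x, (h, c))
        bwd_states.append(h)
    bwd_states.reverse()
    # Full communication: both directions contribute to every output step.
    return [torch.cat([f, b], dim=1) for f, b in zip(fwd_states, bwd_states)]

Sparse-communication and mediation variants would instead restrict or learn how the two passes exchange information, rather than concatenating everything at every step.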
There are many great people and great ideas [?] [?] [?]
[?] [?] [?] [?] [?] [?] [?] [?] [?] [?] [?], but we now
continue to our own.
3. Dataset and Features
3.1. Descriptive Overview
In the search for a dataset with both pictures and depth
maps, we decided to use the KITTI dataset [?], which was
originally recorded from a Volkswagen station wagon for use
in mobile robotics and autonomous driving research. A few
hours of traffic scenarios were recorded using various
sensors, including a high-resolution color camera and a
Velodyne 3D laser scanner. Even though our project differs
from the dataset's intended use, we were attracted by the
large number of paired video/depth-map recordings the KITTI
dataset offers. We do not use the other measurements provided
by KITTI, e.g. GPS, timestamps, etc.
The features for an image and depth map pair are the pixels
therein, i.e. the RGB values and depth values. The depth map
ground truths are generated with LIDAR.
We use the full raw dataset from KITTI, containing 180 GB
worth of data divided into categories: Road, City,
Residential, Campus, and Person. Since training on all of it
would be impractical on either of our two machines ((a) an
NVIDIA 1080Ti; (b) an NVIDIA P100, each with circa 10 GB of
available GPU memory), we had to reduce the amount of data we
would use.
3.2. Preprocessing
First, we organized the data and store image sequences
in subfolders as it seems to simplify and speed up the train-
ing [?]. Second, we had to reduce the quality of the images
so that we could run the best model on our GPUs without
needing extremely small batch sizes. We are now using im-
ages of size 64x18px. Finally, we are lucky enough to have
Figure 1. a. Conv-LSTM b. bi-ConvLSTM Cell
a large dataset so we have randomly selected sequences of
images (subfolders) in order to split our dataset into train,
valid and test sub-datasets. These train, valid, and test sets
are optimally preselected by KITTI. The two first were used
to try different versions of models and calibrate the hyper-
parameters for the most successful model, while the test set
will only be used once, at the end, in order to report the per-
formance of our best model on the final report. We created a
bespoke data loader due to the unusual nature of our dataset
(i.e. images stored by sequences in subfolders and depth
maps linked to the sequence). This dataloader includes pre-
processing such as Unity Normalization transformation, to
quicken training. Now, on to the methods of what was to be
trained, and how.
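The following is a minimal sketch of such a sequence loader; the directory layout, filenames, and transform details here are illustrative assumptions rather than our exact implementation:

import os
from PIL import Image
import torch
from torch.utils.data import Dataset
import torchvision.transforms as T

class KittiSequenceDataset(Dataset):
    """Assumes each subfolder under `root` holds an `image` and a `depth`
    directory with matching filenames for one sequence."""
    def __init__(self, root, seq_len=6):
        self.seq_dirs = sorted(os.path.join(root, d) for d in os.listdir(root)
                               if os.path.isdir(os.path.join(root, d)))
        self.seq_len = seq_len
        # Resize to 64x18 px; ToTensor maps image pixels into [0, 1]
        # ("unity normalization"). Depth values are kept in raw units.
        self.tf = T.Compose([T.Resize((18, 64)), T.ToTensor()])

    def __len__(self):
        return len(self.seq_dirs)

    def __getitem__(self, i):
        d = self.seq_dirs[i]
        names = sorted(os.listdir(os.path.join(d, "image")))[:self.seq_len]
        x = torch.stack([self.tf(Image.open(os.path.join(d, "image", n)).convert("RGB"))
                         for n in names])
        y = torch.stack([self.tf(Image.open(os.path.join(d, "depth", n)).convert("F"))
                         for n in names])
        return x, y  # (seq_len, 3, 18, 64) images, (seq_len, 1, 18, 64) depths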
4. Methods
4.1. ConvLSTM, bi-ConvLSTM
ConvLSTMs are more than a convolutional layer feeding into
an LSTM layer; they convolve over the hidden state and the
input together. This functional difference has led to some
speculation as to the merits of one over the other, and
convLSTMs sometimes prove more effective (as in "Very Deep
Convolutional Networks for End-to-End Speech Recognition"
[?]). We were curious to explore the more recent development
of convLSTMs. Additionally, DepthNet achieved good results
with convLSTMs, which indicated potential.
The specific math for a ConvLSTM is:
i_t = σ(ReLU(W_{xi} ∗ X_t + W_{hi} ∗ H_{t-1} + W_{ci} ◦ C_{t-1} + b_i))
f_t = σ(ReLU(W_{xf} ∗ X_t + W_{hf} ∗ H_{t-1} + W_{cf} ◦ C_{t-1} + b_f))
g_t = tanh(ReLU(W_{xg} ∗ X_t + W_{hg} ∗ H_{t-1} + b_g))
C_t = f_t ◦ C_{t-1} + i_t ◦ g_t
o_t = σ(ReLU(W_{xo} ∗ X_t + W_{ho} ∗ H_{t-1} + W_{co} ◦ C_t + b_o))
H_t = o_t ◦ tanh(C_t)
where ∗ denotes the convolution operation and ◦ the
Hadamard product.
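Rendered as code, a minimal ConvLSTM cell might look as follows. This is a sketch: for brevity it folds the four gates into a single convolution and omits the peephole terms (W_c ◦ C) and the extra ReLUs of the equations above.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # preserve spatial dimensions
        # One convolution computes all four gates over [X_t, H_{t-1}].
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h_prev, c_prev = state
        stacked = torch.cat([x, h_prev], dim=1)  # convolve input and hidden together
        i, f, g, o = torch.chunk(self.gates(stacked), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c_prev + i * g                   # Hadamard products, as above
        h = o * torch.tanh(c)
        return h, c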
4.2. Architecture - U-Net Encoder-Decoder
Specific to our problem of depth map sequence prediction
from an image sequence, we arrange these cells in a U-Net
encoder-decoder: stacked (bi-)convLSTM layers encode each
frame, and a decoder of convolutional and
transposed-convolutional pairs, connected to the encoder
through skip connections, produces the corresponding depth
map for each frame, as shown in Figures 2 and 3 and sketched
in code below.

Figure 2. U-Net Encoder-Decoder

Figure 3. UNoIT Detail
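A compact sketch of this arrangement, with only two encoder stages and one decoder pair for brevity (the actual model stacks 8 convLSTM layers and 4 decoder pairs; ConvLSTMCell is the cell sketched above):

import torch
import torch.nn as nn

class UNetConvLSTM(nn.Module):
    def __init__(self, in_ch=3, ch1=16, ch2=32):
        super().__init__()
        self.ch1, self.ch2 = ch1, ch2
        self.enc1 = ConvLSTMCell(in_ch, ch1)  # cell sketched in Section 4.1
        self.enc2 = ConvLSTMCell(ch1, ch2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(ch2, ch1, kernel_size=2, stride=2)
        self.dec = nn.Conv2d(2 * ch1, ch1, kernel_size=3, padding=1)
        self.head = nn.Conv2d(ch1, 1, kernel_size=3, padding=1)

    def forward(self, frames):                # frames: (T, N, C, H, W)
        T, N, _, H, W = frames.shape
        h1 = c1 = frames.new_zeros(N, self.ch1, H, W)
        h2 = c2 = frames.new_zeros(N, self.ch2, H // 2, W // 2)
        depths = []
        for t in range(T):
            h1, c1 = self.enc1(frames[t], (h1, c1))          # full-resolution features
            h2, c2 = self.enc2(self.pool(h1), (h2, c2))      # coarser, deeper features
            u = self.up(h2)                                  # upsample back to full size
            u = torch.relu(self.dec(torch.cat([u, h1], 1)))  # U-Net skip connection
            depths.append(self.head(u))                      # one depth map per frame
        return torch.stack(depths)                           # (T, N, 1, H, W)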
5. Experiments/Results/Discussion
5.1. Experiments
HARDWARE: For this project, we used two separate machines,
each with a recent NVIDIA GPU (a 1080Ti and a P100). Our
implementation of the model uses PyTorch 0.4 [?] and CUDA 9.1.

Running times were on the order of hours per model. We
experimented with different models, different sequence lengths
(1, 3, and 6), and different image sizes: we started large at
416x128 px, then moved to 128x36, and finally to 64x18. A
minimal training-loop sketch follows.
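This sketch shows how such a model might be trained under the setup just described; the optimizer, learning rate, and plain MSE loss here are illustrative assumptions, not our exact configuration:

import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3, device="cuda"):
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for epoch in range(epochs):
        for x, y in loader:                  # x: (N, T, 3, H, W), y: (N, T, 1, H, W)
            x, y = x.to(device), y.to(device)
            pred = model(x.transpose(0, 1))  # model expects (T, N, C, H, W)
            loss = mse(pred, y.transpose(0, 1))
            opt.zero_grad()
            loss.backward()
            opt.step()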
5.2. Metrics
For each metric, a brief note on the math and the motivation:

Custom loss: our training loss, which is (supposedly) scale
invariant (equation here).

RMSE: the standard root mean squared error; easy to implement
(free in PyTorch) and used in the KITTI and DepthNet metrics.
Compared to MAE, it penalizes large differences harshly.

iRMSE: the RMSE computed over inverse depth values, as in the
KITTI benchmark.

MAE: the L1 loss; also easy to implement and used in the KITTI
and DepthNet metrics. Compared to RMSE, it penalizes large
differences gently. Since MAE ≤ RMSE always holds, the gap
between the two indicates the variance of the error
distribution: a large gap means a few large errors dominate.

a1, a2, a3: threshold accuracies, i.e. the fraction of pixels
whose predicted depth is within a factor of 1.25, 1.25², and
1.25³ of the ground truth, respectively. All are generally
used and good for baseline comparison. Losses do not give an
intuitive understanding of progress or of the "goodness" of
results, whereas accuracies are more comprehensible: the
multiples give an idea of the ballpark we are in for inference
accuracy, and how much they differ from one another shows how
much of the data falls within each threshold, a coarse picture
of the error "distribution".
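As a concrete reference, here is a sketch of these metrics under the standard KITTI-style definitions (the scale-invariant log error shown may differ in detail from our custom loss); variable names are ours:

import torch

def depth_metrics(pred, gt, eps=1e-6):
    """pred, gt: positive depth tensors of the same shape."""
    d = torch.log(pred + eps) - torch.log(gt + eps)
    silog = (d ** 2).mean() - d.mean() ** 2      # scale-invariant log error
    rmse = torch.sqrt(((pred - gt) ** 2).mean())  # penalizes large errors harshly
    mae = (pred - gt).abs().mean()                # penalizes all errors linearly
    irmse = torch.sqrt(((1.0 / (pred + eps) - 1.0 / (gt + eps)) ** 2).mean())
    ratio = torch.max(pred / gt, gt / pred)       # threshold accuracies a1, a2, a3
    a = [(ratio < 1.25 ** k).float().mean() for k in (1, 2, 3)]
    return silog, rmse, mae, irmse, a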
5.3. Baseline Measures - Other People’s Results
We compare ourselves most directly to DepthNet and other
KITTI competitors, using the corresponding loss measures. The
current two leaders are DL-61(DORN) and DL-SORD-SQ:

                SILog    sqErrorRel   absErrorRel   iRMSE
DL-61(DORN)     11.77%   2.23%        8.78%         12.98%
DL-SORD-SQ      13.00%   2.95%        10.38%        13.78%
We also compare against DepthNet's reported results.
5.4. Results - Our Results
Final metrics on the test set for our best model, trained
with sequence lengths of 6 and 1:

                    SILog   RMSE   δ < 1.25   δ < 1.25²   δ < 1.25³
6 Image Sequence    0.44    6.37   0.74       0.87        0.93
1 Image Sequence    0.54    7.4    0.705      0.84        0.91
5.5. Discussion
LSTM merits: the sequence-length comparison above speaks to
the value of the recurrent component, as the 6-image-sequence
model outperforms the 1-image-sequence model on every metric,
suggesting that temporal context helps. The merits of the
convolutions are harder to isolate, since every variant we
trained shares the convolutional structure.
6. Conclusion/Future Work
There are several areas for future work.

Data processing: alternative transformations; larger image
sizes; per sequence, overlapping or reusing frames from the
previous and following sequences.

Loss functions: our custom loss vs. MSE, or a GAN-style loss
(Goodfellow).

Model options: hyperparameters (batch size, kernel size,
number of filters, learning rate, number of encoding-decoding
layers, sequence length); architectural choices (activation
functions; MaxPool layers; a plain sequential network rather
than a U-Net; bidirectional vs. unidirectional (DepthNet) vs.
the various bidirectional options; simple RNN vs. LSTM vs.
GRU; convolutions with no RNN; 3D convolutions; 3D
convolutions inside a convLSTM, folding multiple time steps
into one).

Similar problems: from an input sequence of stereo image
pairs, generate a sequence of depth maps.

Finally, longer computation time would allow more thorough
experimentation on all of the above.
7. Appendices
Contributions
John researched the literature, designed the models, and
implemented SimpleU as a first pass. Manuel implemented an
alternative convLSTM encoder. Geoffrey set up the
infrastructure, preprocessed the data, organized the proposal
and milestone, and assisted with the model architecture.