U-N.o.1T: A U-Net exploration, in Depth
John Luke Chuter, Geoffrey Boris Boullanger, Manuel Nieves Saez
{jchuter, gbakker, mnievess}@stanford.edu
December 18, 2018
Abstract
Hardware progress has enabled solutions which were
historically computationally intractable. This is particu-
larly true in video analysis. This technological advance has
opened a new frontier of problems. Within this expanse,
we have chosen the classic problem of depth inference from
images. Specifically, given a sequence of images captured
over time, we output depth maps corresponding one-to-one
with the input sequence. As a spatiotemporal problem, we
were motivated to model it with convolutions (spatial) and
LSTMs (temporal). These are used in a U-Net encoder-
decoder architecture. The results indicate some potential
in such an approach; the process by which we came to this
conclusion is detailed below.
1. Introduction
Solutions to the above problem would enable 3D world
generation from simple video input with applications from
VR to robotics. While there are hardware approaches to
depth-determination problems, such as lidar or multiple
lenses, software solutions provide flexibility in their appli-
cation. Furthermore, since humans are visual creatures, we
have adapted our environments to be largely determinable
through visual means, such that visual approaches suit these
environments.
1.1. In Depth
After researching this initial problem in depth, we be-
came familiar with literature on depth maps, their algo-
rithms and datasets. This presented itself as a sensible path
forward, as it seemed simpler and better scoped. This area
is a classic one, with not only history but ongoing and re-
cent progress. Concerning depth maps, there are various
families of problems; single image to depth map, depth map
alignments, from sparse to dense - but given the background
research we’d done on the image+depth map sequence, we
were naturally drawn to the most similar problem: from a
sequence of images, generate a sequence of depth maps.
There are many reasons to be excited about such a problem:
spatiotemporal models are hot stuff. We, however, wanted to
learn about RNNs and CNNs, and since space-time lends itself
to natural conceptions of convolutions (space) and recurrent
networks (time), we proceeded down that path.
Quite excited to apply modern RNN and CNN tech-
niques, we were both disappointed and relieved to find ex-
tremely relevant literature: ’DepthNet’ [?], ’Spatiotempo-
ral Modeling for Crowd Counting in Videos’ [?], ’Bidirec-
tional Recurrent Convolutional Networks for Multi-Frame
Super-Resolution’ [?], ’Cross-scene Crowd Counting via
Deep Convolutional Neural Networks’ [?], and ’Pyramid
Dilated Deeper ConvLSTM for Video Salient Object Detec-
tion’ [?]. All these papers address spatiotemporal problems
with RNNs and convolutions.
While there are people who claim "RNNs are dead, long
live convolutions/attention/whatever is hot", we wanted
to explore this avenue further, which brings us to the
literature by those who disagree. Having pursued this
approach, we have formed our own opinion, as will be
discussed at the end.
2. Related Work
It is fitting to begin with the paper that introduced the
core unit of our model, "Convolutional LSTM Network: A Machine
Learning Approach for Precipitation Nowcasting" [?]. This
paper details the convolutional LSTM cell, wherein a typical
LSTM cell performs a convolution at each of its gates. This
enables encoding of spatial information (from the convolution)
while retaining the benefits of the LSTM. The authors then
detail stacking such convLSTM layers to create a deep
convLSTM encoder. The next notable paper, "DepthNet" [?],
presents the model most similar to our own. Specifically, its
authors explore the combination of a U-Net architecture with
convLSTM layers in an encoder-decoder framework for depth
estimation. Our variations from there explore how to implement
bi-directionality, a natural and common extension of most LSTM
models, which we detail in the Methods section below.
"Spatiotemporal Modeling for Crowd Counting in Videos" [?]
demonstrates one method of implementing bidirectionality in a
spatiotemporal setting. "Pyramid Dilated Deeper ConvLSTM for
Video Salient Object Detection" [?] combines multiple advanced
techniques, but tackles a rather different problem. Within this
realm, then, there were several closely related problems to
choose from.
We chose DepthNet [?] as a baseline model to iterate
from. First, a brief description of this baseline: 8 convLSTM
layers are stacked in the encoding phase of a U-Net
encoder-decoder network. These provide connections and skip
connections to the decoding phase, which is made of 4
convolutional and transposed-convolutional pairs. For details
we cannot do justice to here, we refer the reader to the
DepthNet paper.
The DepthNet authors themselves propose several possibil-
ities for alteration, and we came up with a few ourselves.
Alternative models include: an explainability mask to better
predict depth maps for individual objects, an attention
mechanism, or bi-directionality. It was this third option we
chose to explore, as in a network like this there are
surprisingly many ways to incorporate the forward and backward
passes. While this remains an object of experimentation, there
are three principal categories of variation: full-communication,
sparse-communication, and mediation. We chose full-communication
between a left and a right pass over the input image sequence,
as sketched below.
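As one plausible rendering of full-communication (a sketch, not our exact implementation), the forward and backward passes can share their per-step hidden states by concatenation, so every output step sees both directions; fwd_cell and bwd_cell stand for ConvLSTM cells like the one sketched in Section 4.1:

import torch

def bidirectional_convlstm(frames, fwd_cell, bwd_cell, init_state):
    """frames: list of (N, C, H, W) tensors; returns per-step features."""
    # Left-to-right pass.
    h, c = init_state
    fwd_states = []
    for x in frames:
        h, c = fwd_cell(x, (h, c))
        fwd_states.append(h)
    # Right-to-left pass.
    h, c = init_state
    bwd_states = []
    for x in reversed(frames):
        h, c = bwd_cell(x, (h, c))
        bwd_states.append(h)
    bwd_states.reverse()
    # Full communication: both directions contribute to every output step.
    return [torch.cat([f, b], dim=1) for f, b in zip(fwd_states, bwd_states)]

Sparse-communication and mediation variants would instead restrict or learn how the two passes exchange information, rather than concatenating everything at every step.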
There are many great people and great ideas [?] [?] [?]
[?] [?] [?] [?] [?] [?] [?] [?] [?] [?] [?], but we now
continue to our own.
3. Dataset and Features
3.1. Descriptive Overview
In the search for a dataset with both pictures and depth
maps, we decided to use the KITTI dataset [?], which was
originally recorded from a Volkswagen station wagon for use
in mobile robotics and autonomous driving research. A few
hours of traffic scenarios were recorded using various
sensors, including a high-resolution color camera and a
Velodyne 3D laser scanner. Even though our project differs
from the dataset's intended use, we were attracted by the
large number of paired video/depth-map recordings the KITTI
dataset offers. We do not use the other measurements provided
by KITTI, e.g. GPS, timestamps, etc.
The features for an image and depth map pair are the pixels
therein, i.e. the RGB values and depth values. The depth map
ground truths are generated with LIDAR.
We use the full raw dataset from KITTI, containing 180 GB
worth of data divided into categories: Road, City,
Residential, Campus, and Person. Since training on all of it
would be impractical on either of our two machines ((a) an
NVIDIA 1080Ti; (b) an NVIDIA P100, each with circa 10 GB of
available GPU memory), we had to reduce the amount of data we
would use.
3.2. Preprocessing
First, we organized the data and store image sequences
in subfolders as it seems to simplify and speed up the train-
ing [?]. Second, we had to reduce the quality of the images
so that we could run the best model on our GPUs without
needing extremely small batch sizes. We are now using im-
ages of size 64x18px. Finally, we are lucky enough to have
Figure 1. a. Conv-LSTM b. bi-ConvLSTM Cell
a large dataset so we have randomly selected sequences of
images (subfolders) in order to split our dataset into train,
valid and test sub-datasets. These train, valid, and test sets
are optimally preselected by KITTI. The two first were used
to try different versions of models and calibrate the hyper-
parameters for the most successful model, while the test set
will only be used once, at the end, in order to report the per-
formance of our best model on the final report. We created a
bespoke data loader due to the unusual nature of our dataset
(i.e. images stored by sequences in subfolders and depth
maps linked to the sequence). This dataloader includes pre-
processing such as Unity Normalization transformation, to
quicken training. Now, on to the methods of what was to be
trained, and how.
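The following is a minimal sketch of such a sequence loader; the directory layout, filenames, and transform details here are illustrative assumptions rather than our exact implementation:

import os
from PIL import Image
import torch
from torch.utils.data import Dataset
import torchvision.transforms as T

class KittiSequenceDataset(Dataset):
    """Assumes each subfolder under `root` holds an `image` and a `depth`
    directory with matching filenames for one sequence."""
    def __init__(self, root, seq_len=6):
        self.seq_dirs = sorted(os.path.join(root, d) for d in os.listdir(root)
                               if os.path.isdir(os.path.join(root, d)))
        self.seq_len = seq_len
        # Resize to 64x18 px; ToTensor maps image pixels into [0, 1]
        # ("unity normalization"). Depth values are kept in raw units.
        self.tf = T.Compose([T.Resize((18, 64)), T.ToTensor()])

    def __len__(self):
        return len(self.seq_dirs)

    def __getitem__(self, i):
        d = self.seq_dirs[i]
        names = sorted(os.listdir(os.path.join(d, "image")))[:self.seq_len]
        x = torch.stack([self.tf(Image.open(os.path.join(d, "image", n)).convert("RGB"))
                         for n in names])
        y = torch.stack([self.tf(Image.open(os.path.join(d, "depth", n)).convert("F"))
                         for n in names])
        return x, y  # (seq_len, 3, 18, 64) images, (seq_len, 1, 18, 64) depths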
4. Methods
4.1. ConvLSTM, bi-ConvLSTM
ConvLSTMs are more than a convolutional layer feeding into
an LSTM layer; they convolve over the hidden state and the
input together. This functional difference has led to some
speculation as to the merits of one over the other, and
convLSTMs sometimes prove more effective (as in "Very Deep
Convolutional Networks for End-to-End Speech Recognition"
[?]). We were curious to explore the more recent development
of convLSTMs. Additionally, DepthNet achieved good results
with convLSTMs, which indicated potential.
The specific math for a ConvLSTM is:
i_t = σ(ReLU(W_{xi} ∗ X_t + W_{hi} ∗ H_{t-1} + W_{ci} ◦ C_{t-1} + b_i))
f_t = σ(ReLU(W_{xf} ∗ X_t + W_{hf} ∗ H_{t-1} + W_{cf} ◦ C_{t-1} + b_f))
g_t = tanh(ReLU(W_{xg} ∗ X_t + W_{hg} ∗ H_{t-1} + b_g))
C_t = f_t ◦ C_{t-1} + i_t ◦ g_t
o_t = σ(ReLU(W_{xo} ∗ X_t + W_{ho} ∗ H_{t-1} + W_{co} ◦ C_t + b_o))
H_t = o_t ◦ tanh(C_t)
where ∗ denotes the convolution operation and ◦ the
Hadamard product.
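Rendered as code, a minimal ConvLSTM cell might look as follows. This is a sketch: for brevity it folds the four gates into a single convolution and omits the peephole terms (W_c ◦ C) and the extra ReLUs of the equations above.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # preserve spatial dimensions
        # One convolution computes all four gates over [X_t, H_{t-1}].
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h_prev, c_prev = state
        stacked = torch.cat([x, h_prev], dim=1)  # convolve input and hidden together
        i, f, g, o = torch.chunk(self.gates(stacked), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c_prev + i * g                   # Hadamard products, as above
        h = o * torch.tanh(c)
        return h, c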
4.2. Architecture - U-Net Encoder-Decoder
Specific to our problem of depth map sequence prediction
from an image sequence, we arrange these cells in a U-Net
encoder-decoder: stacked (bi-)convLSTM layers encode each
frame, and a decoder of convolutional and
transposed-convolutional pairs, connected to the encoder
through skip connections, produces the corresponding depth
map for each frame, as shown in Figures 2 and 3 and sketched
in code below.

Figure 2. U-Net Encoder-Decoder

Figure 3. UNoIT Detail
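A compact sketch of this arrangement, with only two encoder stages and one decoder pair for brevity (the actual model stacks 8 convLSTM layers and 4 decoder pairs; ConvLSTMCell is the cell sketched above):

import torch
import torch.nn as nn

class UNetConvLSTM(nn.Module):
    def __init__(self, in_ch=3, ch1=16, ch2=32):
        super().__init__()
        self.ch1, self.ch2 = ch1, ch2
        self.enc1 = ConvLSTMCell(in_ch, ch1)  # cell sketched in Section 4.1
        self.enc2 = ConvLSTMCell(ch1, ch2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(ch2, ch1, kernel_size=2, stride=2)
        self.dec = nn.Conv2d(2 * ch1, ch1, kernel_size=3, padding=1)
        self.head = nn.Conv2d(ch1, 1, kernel_size=3, padding=1)

    def forward(self, frames):                # frames: (T, N, C, H, W)
        T, N, _, H, W = frames.shape
        h1 = c1 = frames.new_zeros(N, self.ch1, H, W)
        h2 = c2 = frames.new_zeros(N, self.ch2, H // 2, W // 2)
        depths = []
        for t in range(T):
            h1, c1 = self.enc1(frames[t], (h1, c1))          # full-resolution features
            h2, c2 = self.enc2(self.pool(h1), (h2, c2))      # coarser, deeper features
            u = self.up(h2)                                  # upsample back to full size
            u = torch.relu(self.dec(torch.cat([u, h1], 1)))  # U-Net skip connection
            depths.append(self.head(u))                      # one depth map per frame
        return torch.stack(depths)                           # (T, N, 1, H, W)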
5. Experiments/Results/Discussion
5.1. Experiments
HARDWARE: For this project, we used two separate machines,
each with a recent NVIDIA GPU (a 1080Ti and a P100). Our
implementation of the model uses PyTorch 0.4 [?] and CUDA 9.1.

Running times were on the order of hours per model. We
experimented with different models, different sequence lengths
(1, 3, and 6), and different image sizes: we started large at
416x128 px, then moved to 128x36, and finally to 64x18. A
minimal training-loop sketch follows.
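This sketch shows how such a model might be trained under the setup just described; the optimizer, learning rate, and plain MSE loss here are illustrative assumptions, not our exact configuration:

import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3, device="cuda"):
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for epoch in range(epochs):
        for x, y in loader:                  # x: (N, T, 3, H, W), y: (N, T, 1, H, W)
            x, y = x.to(device), y.to(device)
            pred = model(x.transpose(0, 1))  # model expects (T, N, C, H, W)
            loss = mse(pred, y.transpose(0, 1))
            opt.zero_grad()
            loss.backward()
            opt.step()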
5.2. Metrics
For each metric, a brief note on the math and the motivation:

Custom loss: our training loss, which is (supposedly) scale
invariant (equation here).

RMSE: the standard root mean squared error; easy to implement
(free in PyTorch) and used in the KITTI and DepthNet metrics.
Compared to MAE, it penalizes large differences harshly.

iRMSE: the RMSE computed over inverse depth values, as in the
KITTI benchmark.

MAE: the L1 loss; also easy to implement and used in the KITTI
and DepthNet metrics. Compared to RMSE, it penalizes large
differences gently. Since MAE ≤ RMSE always holds, the gap
between the two indicates the variance of the error
distribution: a large gap means a few large errors dominate.

a1, a2, a3: threshold accuracies, i.e. the fraction of pixels
whose predicted depth is within a factor of 1.25, 1.25², and
1.25³ of the ground truth, respectively. All are generally
used and good for baseline comparison. Losses do not give an
intuitive understanding of progress or of the "goodness" of
results, whereas accuracies are more comprehensible: the
multiples give an idea of the ballpark we are in for inference
accuracy, and how much they differ from one another shows how
much of the data falls within each threshold, a coarse picture
of the error "distribution".
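As a concrete reference, here is a sketch of these metrics under the standard KITTI-style definitions (the scale-invariant log error shown may differ in detail from our custom loss); variable names are ours:

import torch

def depth_metrics(pred, gt, eps=1e-6):
    """pred, gt: positive depth tensors of the same shape."""
    d = torch.log(pred + eps) - torch.log(gt + eps)
    silog = (d ** 2).mean() - d.mean() ** 2      # scale-invariant log error
    rmse = torch.sqrt(((pred - gt) ** 2).mean())  # penalizes large errors harshly
    mae = (pred - gt).abs().mean()                # penalizes all errors linearly
    irmse = torch.sqrt(((1.0 / (pred + eps) - 1.0 / (gt + eps)) ** 2).mean())
    ratio = torch.max(pred / gt, gt / pred)       # threshold accuracies a1, a2, a3
    a = [(ratio < 1.25 ** k).float().mean() for k in (1, 2, 3)]
    return silog, rmse, mae, irmse, a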
5.3. Baseline Measures - Other People’s Results
We compare ourselves most directly to DepthNet and other
KITTI competitors, using the corresponding loss measures. The
current two leaders are DL-61(DORN) and DL-SORD-SQ:

                SILog    sqErrorRel   absErrorRel   iRMSE
DL-61(DORN)     11.77%   2.23%        8.78%         12.98%
DL-SORD-SQ      13.00%   2.95%        10.38%        13.78%
We also compare against DepthNet's reported results.
5.4. Results - Our Results
Final metrics on the test set for our best model, trained
with sequence lengths of 6 and 1:

                    SILog   RMSE   δ < 1.25   δ < 1.25²   δ < 1.25³
6 Image Sequence    0.44    6.37   0.74       0.87        0.93
1 Image Sequence    0.54    7.4    0.705      0.84        0.91
5.5. Discussion
LSTM merits: the sequence-length comparison above speaks to
the value of the recurrent component, as the 6-image-sequence
model outperforms the 1-image-sequence model on every metric,
suggesting that temporal context helps. The merits of the
convolutions are harder to isolate, since every variant we
trained shares the convolutional structure.
6. Conclusion/Future Work
There are several areas for future work.

Data processing: alternative transformations; larger image
sizes; per sequence, overlapping or reusing frames from the
previous and following sequences.

Loss functions: our custom loss vs. MSE, or a GAN-style loss
(Goodfellow).

Model options: hyperparameters (batch size, kernel size,
number of filters, learning rate, number of encoding-decoding
layers, sequence length); architectural choices (activation
functions; MaxPool layers; a plain sequential network rather
than a U-Net; bidirectional vs. unidirectional (DepthNet) vs.
the various bidirectional options; simple RNN vs. LSTM vs.
GRU; convolutions with no RNN; 3D convolutions; 3D
convolutions inside a convLSTM, folding multiple time steps
into one).

Similar problems: from an input sequence of stereo image
pairs, generate a sequence of depth maps.

Finally, longer computation time would allow more thorough
experimentation on all of the above.
7. Appendices
Contributions
John researched the literature, designed the models, and
implemented SimpleU as a first pass. Manuel implemented an
alternative convLSTM encoder. Geoffrey set up the
infrastructure, preprocessed the data, organized the proposal
and milestone, and assisted with the model architecture.