Three Ways to Improve Semantic Segmentation
with Self-Supervised Depth Estimation
Lukas Hoyer
ETH Zurich
lhoyer@student.ethz.ch
Dengxin Dai
ETH Zurich
dai@vision.ee.ethz.ch
Yuhua Chen
ETH Zurich
yuhua.chen@vision.ee.ethz.ch
Adrian Köring
University of Bonn
adrian.koering@uni-bonn.de
Suman Saha
ETH Zurich
suman.saha@vision.ee.ethz.ch
Luc Van Gool
ETH Zurich & KU Leuven
vangool@vision.ee.ethz.ch
Abstract
Training deep networks for semantic segmentation re-
quires large amounts of labeled training data, which
presents a major challenge in practice, as labeling seg-
mentation masks is a highly labor-intensive process. To
address this issue, we present a framework for semi-
supervised semantic segmentation, which is enhanced by
self-supervised monocular depth estimation from unlabeled
image sequences. In particular, we propose three key con-
tributions: (1) We transfer knowledge from features learned
during self-supervised depth estimation to semantic seg-
mentation, (2) we implement a strong data augmentation
by blending images and labels using the geometry of the
scene, and (3) we utilize the depth feature diversity as well
as the level of difficulty of learning depth in a student-
teacher framework to select the most useful samples to be
annotated for semantic segmentation. We validate the pro-
posed model on the Cityscapes dataset, where all three
modules demonstrate significant performance gains, and
we achieve state-of-the-art results for semi-supervised se-
mantic segmentation. The implementation is available at
https://github.com/lhoyer/improving_segmentation_
with_selfsupervised_depth.
1. Introduction
Convolutional Neural Networks (CNNs) [35] have
achieved state-of-the-art results for various computer vi-
sion tasks including semantic segmentation [40, 5]. How-
ever, training CNNs typically requires large-scale annotated
datasets, due to the millions of learnable parameters involved.
Collecting such training data relies primarily on manual an-
notation. For semantic segmentation, the process can be
particularly costly, due to the required dense annotations.
For example, annotating a single image in the Cityscapes
dataset took on average 1.5 hours [9].
Recently, self-supervised learning has been shown to be a
promising alternative to manually labeled data. It aims
to learn representations from the structure of unlabeled
data, instead of relying on a supervised loss, which in-
volves manual labels. The principle has been successfully
applied in depth estimation for stereo pairs [16] or im-
age sequences [73]. Additionally, semantic segmentation
is known to be tightly coupled with depth. Several works
have reported that jointly learning segmentation and super-
vised depth estimation can benefit the performance of both
tasks [61]. Motivated by these observations, we investigate
the question: How can we leverage self-supervised depth
estimation to improve semantic segmentation?
In this work, we propose a threefold approach to utilize
self-supervised monocular depth estimation (SDE) [16, 73,
17] to improve the performance of semantic segmentation
and to reduce the amount of annotation needed. Our contri-
butions span the entire learning process, from data selection
through data augmentation to cross-task representation
learning, unified by the use of SDE.
First, we employ SDE as an auxiliary task for seman-
tic image segmentation under a transfer learning and multi-
task learning framework and show that it noticeably im-
proves the performance of semantic segmentation, espe-
cially when supervision is limited. Previous works only
cover full supervision [32], pretraining [26], or improving
SDE instead of segmentation [20]. Second, we propose a
strong data augmentation strategy, DepthMix, which blends
images as well as their labels according to the geometry of
the scenes obtained from SDE. In comparison to previous
methods [69, 47], DepthMix explicitly respects the geomet-
ric structure of the scenes and generates fewer artifacts (see
Fig. 1). And third, we propose an Automatic Data Selection
for Annotation, which selects the most useful samples to be
annotated in order to maximize the gain. The selection is
iteratively driven by two criteria: diversity and uncertainty.
Both are realized through a novel use of SDE as a proxy
task in this context. While our method follows the active
learning cycle (model training → query selection → an-
notation → model training) [53, 66], it does not require a
human in the loop to provide semantic segmentation labels
as the human is replaced by a proxy-task SDE oracle. This
greatly improves flexibility, scalability, and efficiency, espe-
cially considering crowdsourcing platforms for annotation.
The main advantage of our method is that we can learn
from a large base of easily accessible unlabeled image se-
quences and utilize the learned knowledge to improve se-
mantic segmentation performance in various ways. In our
experimental evaluation on Cityscapes [9], we demonstrate
significant performance gains of all three components and
improve the previous state-of-the-art for semi-supervised
segmentation by a considerable margin. Specifically, our
method achieves 92% of the full annotation baseline performance
with only 1/30 of the available labels and even slightly
outperforms it with only 1/8 of the labels. Our contributions
are summarized as follows:
(1) To the best of our knowledge, we are the first to utilize
SDE as an auxiliary task to exploit unlabeled image
sequences and significantly improve the performance
of semi-supervised semantic segmentation.
(2) We propose DepthMix, a strong data augmentation
strategy, which respects the geometry of the scene and
achieves, in combination with (1), state-of-the-art re-
sults for semi-supervised semantic segmentation.
(3) We propose a novel Automatic Data Selection for An-
notation based on SDE to improve the flexibility of ac-
tive learning. It replaces the human annotator with an
SDE oracle and lifts the requirement of having a hu-
man in the loop of data selection.
2. Related Work
2.1. (Semi-Supervised) Semantic Segmentation
Since Convolutional Neural Networks (CNNs) [35] were
first used by Long et al. [40] for semantic segmentation,
they have become the state-of-the-art method for this prob-
lem. Most architectures are based on an encoder decoder
design such as [40, 51, 6]. Skip connections [51] and di-
lated convolutions [4, 67] preserve details in the segmen-
tation and spatial pyramid pooling [15, 71, 5] aggregates
different scales to exploit spatial context information.
Semi-supervised semantic segmentation makes use of
additional unlabeled data during training. For that purpose,
Souly et al. [59] and Hung et al. [23] utilize generative ad-
versarial networks [18]. Souly et al. [59] use that concept to
generate additional training samples, while Hung et al. [23]
train the discriminator based on the semantic segmentation
probability maps. s4GAN [43] extends this idea by adding
a multi-label classification mean teacher [60]. Another line
of work [48, 12, 47] is based on consistency training, where
perturbations are applied to unlabeled images or their inter-
mediate features and a loss term enforces consistency of the
segmentation. While Ouali et al. [48] study perturbation of
encoder features, CutMix [12] mixes crops from the input
images and their pseudo-labels to generate additional train-
ing data, and ClassMix [47] uses pseudo-label [36] class
segments to build the mix mask. Our proposed DepthMix
module is inspired by these methods but, in contrast, it also
respects the structure of the scene when mixing samples.
Commonly, several approaches [43, 12, 47, 11] include self-
training with pseudo-labels [36] and a mean teacher frame-
work [60], which is extended by Feng et al. [11] with a
class-balanced curriculum. Another related line of work is
learning useful representations for semantic segmentation
from self-supervised tasks such as tracking [63], context in-
painting [49], colorization [34], depth estimation [26] (see
Section 2.3), or optical flow prediction [37]. However, all of
these approaches are outperformed by ImageNet pretraining
and are, therefore, not relevant for semi-supervised seman-
tic segmentation in practice.
2.2. Active Learning
Another approach to reduce the number of required an-
notations is active learning. It iteratively requests the most
informative samples to be labeled by a human. On the one
hand, uncertainty-based approaches select samples with a
high uncertainty estimated based on, e.g., entropy [24, 54]
or ensemble disagreement [55, 42]. On the other hand,
diversity-based approaches select samples that most in-
crease the diversity of the labeled set [44, 52, 58]. For seg-
mentation, active learning is typically based on uncertainty
measures such as MC dropout [13, 66, 41], entropy [29, 64],
or multi-view consistency [57]. In addition to methods se-
lecting whole images [19, 66, 64], several approaches apply
a more fine-grained label request at region level [41, 29, 57]
and also include a label cost estimate [41, 29].
In contrast to these works, we perform automatic data
selection for annotation by replacing the human with SDE
as oracle. Therefore, we do not require human-in-the-
loop annotation during the active learning cycle. Previous
works performing unsupervised data selection are restricted
to shallow models [68, 70, 45, 22, 56, 39], classification
with low-dimensional inputs [38], or do not perform an it-
erative data selection [72] to dynamically adapt to the un-
certainty of the model trained on the currently labeled set.
2.3. Improving Segmentation with SDE
Self-supervised depth estimation (SDE) aims to learn
depth estimation from the geometric relations of stereo im-
age pairs [14, 16] or monocular videos [73]. Due to the bet-
ter availability of videos, we use the latter approach, where
a neural network estimates depth and camera motion of two
subsequent images and a photometric loss is computed after
a differentiable warping. The approach has been improved
by several follow-up works [17, 8, 74].
The combination of semantic segmentation and SDE was
studied in previous works with the goal of improving depth
estimation. While [50, 28, 7, 32] learn both tasks jointly,
[3, 20, 27] distill knowledge from a teacher semantic seg-
mentation network to guide SDE. To further utilize coher-
ence between semantic segmentation and SDE, [50, 7] pro-
posed additional loss terms that encourage spatial proximity
between depth discontinuities and segmentation contours.
In contrast to these works, we do not aim to improve
SDE but rather semi-supervised semantic segmentation.
The closest to our approach are [26], [46], and [32]. Jiang et
al. [26] utilize relative depth computed from optical flow
to replace ImageNet pretraining for semantic segmentation.
In contrast, we additionally study multi-task learning of
SDE and semantic segmentation and show that combining
SDE with ImageNet features can even further boost perfor-
mance. Novosel et al. [46] and Klingner et al. [32] improve
the semantic segmentation performance by jointly learning
SDE. However, they focus on the fully-supervised setting,
while our work explicitly addresses the challenges of semi-
supervised semantic segmentation by using the depth es-
timates to generate additional training data and an auto-
matic data selection mechanism based on SDE. Another
work supporting the usefulness of SDE for semantic seg-
mentation from another viewpoint is [31], which demonstrates
improved noise and attack robustness.
3. Methods
In this section, we present our three ways to improve the
performance of semantic segmentation with self-supervised
depth estimation (SDE). They focus on three different as-
pects of semantic segmentation, covering data selection
for annotation, data augmentation, and multi-task learning.
Given N images and K image sequences from the same
domain, our first method, Automatic Data Selection for An-
notation, uses SDE learned on the K (unlabeled) sequences
to select NA images out of the N images for human annota-
tion (see Alg. 1). Our second approach, termed DepthMix,
leverages the learned SDE to create geometrically-sound
‘virtual’ training samples from pairs of labeled images and
their annotations (see Fig. 1). Our third method learns se-
mantic segmentation with SDE as an auxiliary task under a
multi-tasking framework (see Fig. 2). The learning is rein-
forced by a multi-task pretraining process combining SDE
with image classification.
For SDE, we follow the method of Godard et al. [17],
which we briefly introduce in the following. We first train
a depth estimation network fD to predict the depth of a tar-
get image and a pose estimation network fT to estimate the
camera motion from the target image and the source im-
age. Depth and pose are used to produce a differentiable
warping to transform the source image into the target im-
age. The photometric error between the target image and
multiple warped source frames is combined by a pixel-wise
minimum. In addition, stationary pixels are masked out and an
edge-aware depth smoothness term is applied, resulting in
the final self-supervised depth loss LD. We refer the reader
to the original paper [17] for more details.
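To make the per-pixel minimum of the reprojection error concrete, the following is a minimal PyTorch sketch (not the authors' implementation): it assumes the warped source frames have already been produced by the differentiable warping, omits the SSIM term, the stationary-pixel mask, and the smoothness term of the full SDE loss, and the function name is ours.

```python
import torch

def min_reprojection_loss(target, warped_sources):
    """Per-pixel minimum photometric (L1) error over warped source frames.

    `target` is the target frame (B, 3, H, W); `warped_sources` is a list of
    source frames warped into the target view using the predicted depth and
    camera motion. Taking the per-pixel minimum over the source frames
    suppresses occlusion and out-of-view artifacts. SSIM, the stationary-pixel
    mask, and the smoothness term of the full SDE loss are omitted here.
    """
    errors = [(target - w).abs().mean(dim=1, keepdim=True) for w in warped_sources]
    per_pixel_min, _ = torch.cat(errors, dim=1).min(dim=1)
    return per_pixel_min.mean()
```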
3.1. Automatic Data Selection for Annotation
We use SDE as proxy task for selecting NA samples out
of a set of N unlabeled samples for a human to create se-
mantic segmentation labels. The selection is conducted pro-
gressively in multiple steps, similar to the standard active
learning cycle (model training → query selection → anno-
tation → model training). However, our data selection is
fully automatic and does not require a human in the loop as
the annotation is done by a proxy-task SDE oracle.
Let’s denote by G, GA, and GU , the whole image set, the
selected sub-set for annotation, and the un-selected sub-set.
Initially, we have GA = ∅ and GU = G. The selection is
driven by two criteria: diversity and uncertainty. Diversity
sampling encourages that selected images are diverse and
cover different scenes. Uncertainty sampling favors adding
unlabeled images that are near a decision boundary (with
high uncertainties) of the model trained on the current GA.
For uncertainty sampling, we need to train and update the
model with GA. It is inefficient to repeat this every time a
new image is added. For the sake of efficiency, we divide
the selection into T steps and only train the model T times.
In each step t, nt images are selected and moved from GU
to G_A, so we have ∑_{t=1}^{T} n_t = N_A. After each step t, a
model is trained on G_A and evaluated on G_U to get updated
uncertainties for step t + 1.
Diversity Sampling: To ensure that the chosen annotated
samples are diverse enough to represent the entire dataset
well, we use an iterative farthest point sampling based on
the L2 distance over features Φ^SDE computed by an intermediate
layer of the SDE network. At step t, for each of the
n_t samples, we choose the one in G_U with the largest distance
to the current annotation set G_A. The set of selected
samples G_A is iteratively extended by moving one image at
a time from G_U to G_A until the n_t images are collected:

G_U = G_U \ {I_i}   and   G_A = G_A ∪ {I_i},    (1)

i = arg max_{I_i ∈ G_U} min_{I_j ∈ G_A} ||Φ_i^SDE − Φ_j^SDE||_2.    (2)
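A minimal NumPy sketch of this farthest point sampling is given below; it assumes the SDE features have already been pooled into one vector per image and that G_A contains at least one seed image (Alg. 1 seeds it with one random image), and the function name is ours rather than from the released code.

```python
import numpy as np

def diversity_sampling(features, selected, n_new):
    """Iterative farthest point sampling over SDE features (Eqs. 1-2).

    `features` is an (N, D) array of pooled SDE features for all images and
    `selected` a non-empty list of indices already in G_A. Returns the indices
    of `n_new` images that maximize the minimum L2 distance to the current
    annotation set.
    """
    selected = list(selected)
    unselected = [i for i in range(len(features)) if i not in selected]
    picked = []
    for _ in range(n_new):
        # distance of every candidate to its nearest already-selected sample
        dists = np.linalg.norm(
            features[unselected][:, None] - features[selected][None], axis=-1)
        best = unselected[int(dists.min(axis=1).argmax())]
        picked.append(best)
        selected.append(best)
        unselected.remove(best)
    return picked
```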
Uncertainty Sampling: While Diversity Sampling is able
to select diverse new samples, it is unaware of the uncer-
tainties of a semantic segmentation model over these sam-
ples. Uncertainty Sampling aims to select difficult samples,
Algorithm 1: Automatic Data Selection
1: t = 1
2: i ← uniform(1, N)
3: G_A = {I_i} and G_U = G_U \ {I_i}
4: for k = 2 to N_A do
5:     if k == ∑_{t'=1}^{t} n_{t'} then
6:         Train depth student f_SIDE on the current G_A
7:         Calculate E(i) ∀ I_i ∈ G_U
8:         t = t + 1
9:     end if
10:    if t == 1 then
11:        Obtain index i according to Eq. 2
12:    else
13:        Obtain index i according to Eq. 4
14:    end if
15:    G_A = G_A ∪ {I_i} and G_U = G_U \ {I_i}
16: end for
i.e., samples in GU that the model trained on the current
GA cannot handle well. In order to train this model, ac-
tive learning typically uses a human-in-the-loop strategy to
add annotations for selected samples. In this work, we use
a proxy task based on self-supervised annotations, which
can run automatically, to make the method more flexible
and efficient. Since our target task is single-image semantic
segmentation, we choose to use single-image depth estima-
tion (SIDE) as the proxy task. Importantly, due to our SDE
framework, depth pseudo-labels are available for G. Us-
ing these pseudo-labels, we train a SIDE method on GA and
measure the uncertainty of its depth predictions on GU . Due
to the high correlation of single-image semantic segmenta-
tion and SIDE, the generated uncertainties are informative
and can be used to guide our sampling procedure. As the
depth student model is trained only on GA, it can specifi-
cally approximate the difficulty of candidate samples with
respect to the already selected samples in GA. The student
is trained from scratch in each step t, instead of being fine-
tuned from t−1, to avoid getting stuck in the previous local
minimum. Note that the SDE method is trained on a much
larger unlabeled dataset, i.e., the K image sequences, and
can provide good guidance for the SIDE method.
In particular, the uncertainty is signaled by the dispar-
ity error between the student network fSIDE and the teacher
network fSDE in the log-scale space under L1 distance:
E(i) = || log(1 + fSDE(Ii)) − log(1 + fSIDE(Ii))||1. (3)
As the disparity difference of far-away objects is small, the
log-scale is used to avoid the loss being dominated by close-
range objects. This criterion can be added into Eq. 2 to
also select samples with higher uncertainties for the dataset
Figure 1. Concept of the proposed DepthMix augmentation (refer
to Sec. 3.2) and its baseline ClassMix [47]: ClassMix builds its mix
mask from a random class choice, DepthMix from a depth comparison.
By utilizing SDE, DepthMix mitigates geometric artifacts.
update in Eq. 1:

i = arg max_{I_i ∈ G_U} ( min_{I_j ∈ G_A} ||Φ_i^SDE − Φ_j^SDE||_2 + λ_E E(i) ),    (4)
where λE is a parameter to balance the contribution of the
two terms. For diversity sampling, we still use SDE features
instead of SIDE student features as SDE is trained on the
entire dataset, which provides better features for diversity
estimation. When nt images have been selected according
to Eq. 1 and Eq. 4 at step t, a new SIDE model will be
trained on the current GA in order to continue further. As
presented previously, our selection proceeds progressively
in T steps until we collect all NA images. The algorithm
of this selection is summarized in Alg. 1, where ∑_{t'=1}^{t} n_{t'}
describes the desired size of G_A at the end of step t.
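As an illustration of how Eq. 3 and Eq. 4 interact, the following hedged NumPy sketch scores candidates by their feature diversity plus the weighted student depth error; the names and the per-image aggregation of the error are assumptions, not the authors' exact implementation.

```python
import numpy as np

def student_depth_error(disp_teacher, disp_student):
    """Log-scale L1 disparity error between SDE teacher and SIDE student (Eq. 3)."""
    return np.abs(np.log1p(disp_teacher) - np.log1p(disp_student)).mean()

def select_next_index(features, errors, selected, unselected, lambda_e=1000.0):
    """Return the next image index to annotate according to Eq. 4.

    `features` are pooled SDE features (N, D), `errors[i]` is the student
    depth error E(i) for image i, and `selected`/`unselected` are index lists.
    """
    scores = []
    for i in unselected:
        diversity = min(np.linalg.norm(features[i] - features[j]) for j in selected)
        scores.append(diversity + lambda_e * errors[i])
    return unselected[int(np.argmax(scores))]
```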
3.2. DepthMix Data Augmentation
Inspired by the recent success of data augmentation ap-
proaches that mixup pairs of images and their (pseudo) la-
bels to generate more training samples for semantic seg-
mentation [69, 12, 47], we propose an algorithm, termed
DepthMix, to utilize self-supervised depth estimates to
maintain the integrity of the scene structure during mixing.
Given two images I_i and I_j of the same size, we would
like to copy some regions from I_i and paste them directly
into I_j to get a virtual sample I′. The copied regions are
indicated by a mask M, which is a binary image of the same
size as the two images. The image creation is done as

I′ = M ⊙ I_i + (1 − M) ⊙ I_j,    (5)

where ⊙ denotes the element-wise product. The label maps
of the two images S_i and S_j are mixed up with the same
mask M to generate S′. The mixing can be applied to labeled
data and unlabeled data using human ground truths
or pseudo-labels, respectively. Existing methods generate
Figure 2. Architecture for learning semantic segmentation with
SDE as auxiliary task according to Sec. 3.3. The dashed paths
are only used during training and only if image sequences and/or
segmentation ground truth are available for a training sample.
(Diagram components: images I_t and I_{t+1}, shared encoder f_E,
depth decoder f_D, semantic decoder f_S, pose CNN with camera
motion T_{t,t+1}, ImageNet encoder f_I; losses: SDE loss L_D,
segmentation loss L_ce, feature distance loss L_F.)
this mask M in different ways, e.g., randomly sampled rect-
angular regions [69, 12] or randomly selected object seg-
ments [47]. In those methods, the structure of the scene is
not considered and foreground and background are not dis-
tinguished. We find images synthesized by these methods
often violate the geometric relationships between objects.
For instance, a distant object can be copied onto a close-
range object or only unoccluded parts of mid-range objects
are copied onto the other image. Imagine how strange it is
to see a pedestrian standing on top of a car or to see sky
through a hole in a building (just as shown in Fig. 1 left).
Our DepthMix is designed to mitigate this issue. It uses
the estimated depth D̂i and D̂j of the two images to gen-
erate the mix mask M that respects the notion of geometry.
It is implemented by selecting only pixels from Ii whose
depth values are smaller than the depth values of the pixels
at the same locations in Ij:
M(a, b) = 1 if D̂_i(a, b) < D̂_j(a, b) + ε, and 0 otherwise,    (6)

where a and b are pixel indices, and ε is a small value to
avoid conflicts of objects that are naturally at the same depth
plane such as road or sky.
spects the depth of objects in both images, such that only
closer objects can occlude further-away objects. We illus-
trate this advantage of DepthMix with an example in Fig. 1.
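A minimal PyTorch sketch of this mixing is shown below; the tensor shapes and the function name are assumptions, and the scale of ε depends on the depth (or disparity) representation used, so the value from Sec. 4.1 is not hard-coded here.

```python
import torch

def depthmix(img_i, img_j, seg_i, seg_j, depth_i, depth_j, eps):
    """Geometry-aware mixing of two samples (Eqs. 5-6), a minimal sketch.

    `img_*` are (3, H, W) images, `seg_*` are (H, W) label maps (ground truth
    or pseudo-labels), `depth_*` are (H, W) depth estimates obtained via SDE,
    and `eps` avoids conflicts between surfaces at the same depth plane.
    """
    # Eq. 6: copy only those pixels of image i that are closer than in image j
    mask = depth_i < depth_j + eps
    mixed_img = torch.where(mask.unsqueeze(0), img_i, img_j)   # Eq. 5
    mixed_seg = torch.where(mask, seg_i, seg_j)                 # same mask for labels
    return mixed_img, mixed_seg, mask
```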
3.3. Semi-Supervised Semantic Segmentation
In this section, we train a semantic segmentation model
utilizing the labeled image dataset GA, the unlabeled image
dataset GU , and K unlabeled image sequences. We first dis-
cuss how to exploit SDE on the image sequences to improve
our semantic segmentation. We then show how to use GU
to further improve the performance.
Learning with Auxiliary Tasks: For learning semantic
segmentation and SDE jointly, we use a network with a
shared encoder f_θ^E and separate depth and segmentation
decoders f_θ^D and f_θ^S (see Fig. 2). The depth branch is trained
using the SDE loss L_D and the segmentation branch
g_θ^S = f_θ^S ∘ f_θ^E is trained using the pixel-wise cross-entropy L_ce.
In order to initialize the pose estimation network and the
depth decoder properly, the architecture is first trained on
K unlabeled image sequences for SDE. As a common practice,
we initialize the encoder with ImageNet weights as
they provide useful semantic features learned during image
classification. To avoid forgetting semantic features during
the SDE pretraining, we utilize a feature distance loss between
the current bottleneck features f_θ^E and the bottleneck
features of the encoder with ImageNet weights f_I^E:

L_F = ||f_θ^E − f_I^E||_2.    (7)

The loss for the depth pretraining is the weighted sum of the
SDE loss and the ImageNet feature distance loss:

L_P = L_D + λ_F L_F.    (8)

To additionally incorporate transfer learning from depth
estimation to semantic segmentation, the weights of f_θ^D are
used to initialize f_θ^S. For effective multi-task learning, we
use an attention-guided distillation module [65] to exchange
useful intermediate features between both decoders.
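As a sketch of how the pretraining objective in Eq. 8 can be assembled, assuming the SDE loss has already been computed and that a frozen ImageNet-initialized copy of the encoder provides the reference bottleneck features (the names below are ours):

```python
import torch

def pretraining_loss(sde_loss, feat_current, feat_imagenet, lambda_f=1e-2):
    """SDE pretraining objective with ImageNet feature distance (Eqs. 7-8).

    `feat_current` are bottleneck features of the trainable encoder f_E,
    `feat_imagenet` those of a frozen ImageNet-initialized copy on the same
    batch. The distance term discourages forgetting semantic features while
    the encoder adapts to self-supervised depth estimation.
    """
    l_f = torch.norm(feat_current - feat_imagenet.detach(), p=2)  # Eq. 7
    return sde_loss + lambda_f * l_f                              # Eq. 8
```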
Learning with Unlabeled Images: In order to further uti-
lize the unlabeled dataset GU , we generate pseudo-labels
using the mean teacher algorithm [60], which is commonly
used in semi-supervised learning [1, 62, 12, 47]. For that
purpose, an exponential moving average is applied to the
weights of the semantic segmentation model g_θ^S to obtain
the weights of the mean teacher θ_T:

θ′_T = α θ_T + (1 − α) θ.    (9)

To generate the pseudo-labels, an argmax over the classes
C is applied to the prediction of the mean teacher:

S_U = arg max_{c ∈ C} (g_{θ_T}^S(I_U)).    (10)
The mean teacher can be considered as a temporal ensem-
ble, resulting in stable predictions for the pseudo-labels,
while the argmax ensures confident predictions [47].
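A minimal PyTorch sketch of the mean teacher update (Eq. 9) and the pseudo-label generation (Eq. 10) follows; it assumes `student` and `teacher` share the same architecture and that the teacher returns per-class logits, and the function names are ours.

```python
import torch

@torch.no_grad()
def update_mean_teacher(student, teacher, alpha=0.99):
    """Exponential moving average of the segmentation weights (Eq. 9)."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1 - alpha)

@torch.no_grad()
def pseudo_labels(teacher, images):
    """Argmax pseudo-labels from the mean teacher's predictions (Eq. 10)."""
    logits = teacher(images)        # (B, C, H, W)
    return logits.argmax(dim=1)     # (B, H, W) hard labels
```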
For the semi-supervised setting, the segmentation net-
work is trained with labeled samples (IA, SA) and pseudo-
labeled samples (IU , SU ):
L_SSL = L_ce(g_θ^S(I_A), S_A) + λ_P(S_U) L_ce(g_θ^S(I_U), S_U),    (11)

where λ_P(S_U) is chosen to reflect the quality of the pseudo-label,
represented by the fraction of pixels exceeding a threshold τ
for the predicted probability of the most confident
class max_{c ∈ C}(g_{θ_T}^S(I_U)), as suggested in [47]. We
incorporate DepthMix samples (I′, S′), which are obtained
from the combined labeled and pseudo-labeled data pool
I_i, I_j ∈ G_A ∪ G_U (see Eq. 5), into Eq. 11 to replace the
unlabeled samples (I_U, S_U). Our semi-supervised learning objective
is now changed to:

L_SSL = L_ce(g_θ^S(I_A), S_A) + λ_P(S′) L_ce(g_θ^S(I′), S′).    (12)
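The combined objective of Eq. 12 with the confidence-based weight λ_P could be sketched as follows; the threshold τ follows the description above and [47], its exact value and all names are assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, img_lab, seg_lab, img_mix, seg_mix,
                         teacher_probs, tau):
    """Semi-supervised segmentation loss with DepthMix samples (Eq. 12).

    `teacher_probs` are the mean teacher's softmax outputs (B, C, H, W) for
    the mixed images; the pseudo-label weight lambda_P is the fraction of
    pixels whose most confident class probability exceeds the threshold tau.
    `seg_lab`/`seg_mix` are (B, H, W) class-index label maps.
    """
    loss_labeled = F.cross_entropy(model(img_lab), seg_lab)
    lambda_p = (teacher_probs.max(dim=1).values > tau).float().mean()
    loss_mixed = F.cross_entropy(model(img_mix), seg_mix)
    return loss_labeled + lambda_p * loss_mixed
```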
4. Experiments
4.1. Implementation Details
Dataset: We evaluate our method on the Cityscapes
dataset [9], which consists of 2975 training and 500 vali-
dation images with semantic segmentation labels from Eu-
ropean street scenes. We downsample the images to 1024×
512 pixels. In addition, random cropping to a size of 512×512
and random horizontal flipping are used in the training. Im-
portantly, Cityscapes provides 20 unlabeled frames before
and 10 after the labeled image, which are used for SDE
training. During the semi-supervised segmentation, only
the original 2975 labeled training images are used. They
are randomly split into a labeled and an unlabeled subset.
Network Architecture: Our network consists of a shared
ResNet101 [21] encoder with output stride 16 and a sepa-
rate decoder for segmentation and SDE. The decoder con-
sists of an ASPP [5] block to aggregate features from mul-
tiple scales and another four upsampling blocks with skip
connections [51]. For SDE, the upsampling blocks have a
disparity side output at the respective scale. For effective
multi-task learning, we additionally follow PAD-Net [65]
and deploy an attention-guided distillation module after the
third decoder block. It serves the purpose of exchanging
useful features between segmentation and depth estimation.
Training: For the SDE pretraining, the depth and pose network
are trained using Adam [30], a batch size of 4, and
an initial learning rate of 1 × 10^-4, which is divided by 10
after 160k iterations. The SDE loss is calculated on four
scales with three subsequent images. During the first 300k
iterations, only the depth decoder and the pose network are
trained. Afterwards, the depth encoder is fine-tuned with
an ImageNet feature distance weight λ_F = 1 × 10^-2 for another
50k iterations. The encoder is initialized with ImageNet
weights, either before depth pretraining or before semantic
segmentation if depth pretraining is ablated.
For the multi-task setting, we train the network using
SGD with a learning rate of 1 × 10^-3 for the encoder and
depth decoder, 1 × 10^-2 for the segmentation decoder, and
1 × 10^-6 for the pose network. The learning rate is reduced
by a factor of 10 after 30k iterations and training continues for
another 10k iterations. A momentum of 0.9, a weight decay of 5 × 10^-4, and
a gradient norm clipping to 10 are used. The loss for seg-
mentation and SDE are weighted equally. The mean teacher
Figure 3. Example semantic segmentations of our method for 100
labeled samples in comparison with ClassMix [47].
has α = 0.99 and within an iteration, the network is trained
on a clean labeled and an augmented mixed batch with size
2, respectively. The latter uses DepthMix with ε = 0.03,
color jitter, and Gaussian blur.
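For illustration, the per-module learning rates above might be wired up as SGD parameter groups roughly as in the following sketch; the module attribute names (encoder, depth_decoder, seg_decoder, pose_net) are assumptions about how the model is organized, while the learning rates, momentum, and weight decay follow the values stated in the text.

```python
import torch

def build_optimizer(model):
    # Separate parameter groups with the per-module learning rates
    # described in the training paragraph above.
    return torch.optim.SGD(
        [
            {"params": model.encoder.parameters(), "lr": 1e-3},
            {"params": model.depth_decoder.parameters(), "lr": 1e-3},
            {"params": model.seg_decoder.parameters(), "lr": 1e-2},
            {"params": model.pose_net.parameters(), "lr": 1e-6},
        ],
        momentum=0.9,
        weight_decay=5e-4,
    )
```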
Data Selection for Annotation: In the data selection ex-
periment, we use a slimmed network architecture with a
ResNet50 encoder and fewer decoder channels for fSIDE.
It is trained using Adam with a learning rate of 1 × 10^-4 and
polynomial decay with exponent 0.9 for faster convergence.
For calculating the depth feature diversity, we use the output
of the second depth decoder block after SDE pretraining. It
is downsampled by average pooling to a size of 8x4 pix-
els and the feature channels are normalized to zero-mean
unit-variance over the dataset. The student depth error is
weighted by λ_E = 1000. The number of selected samples
(∑_{t'=1}^{t} n_{t'}) is iteratively increased to 25, 50, 100, 200,
372, and 744. For each subset, a student depth network is
trained from scratch for 4k, 8k, 12k, 16k, and 20k iterations,
respectively, to calculate the student depth error.
4.2. Semi-Supervised Semantic Segmentation
First, we compare our approach with several state-of-the-
art semi-supervised learning approaches. We summarize
the results in Tab. 1. The performance (mIoU in %) of the
semi-supervised methods and their baselines (only trained
on the labeled dataset) are shown for a different number of
labeled samples. As the performance of the baselines dif-
fers, there are columns showing the absolute improvement
for better comparability. As our baseline utilizes a more ca-
pable network architecture due to the U-Net decoder with
ASPP as opposed to a DeepLabv2 decoder used by most
previous works, we also reimplemented the state-of-the-art
method, ClassMix [47] with our network architecture and
training parameters to ensure a direct comparison.
As shown in Tab. 1, our method (without data selection)
Table 1. Performance on the Cityscapes validation set (mIoU in %, standard deviation over 3 random seeds). Improvements over the respective baseline are given in parentheses.

Labeled Samples         | 1/30 (100)            | 1/8 (372)             | 1/4 (744)             | Full (2975)
Baseline [23]           | –                     | 55.50                 | 59.90                 | 66.40
Adversarial [23]        | –                     | 58.80 (+3.30)         | 62.30 (+2.40)         | –
Baseline [43]           | –                     | 56.20                 | 60.20                 | 66.00
s4GAN [43]              | –                     | 59.30 (+3.10)         | 61.90 (+1.70)         | 65.80 (–0.20)
Baseline [12]           | 44.41 ±1.11           | 55.25 ±0.66           | 60.57 ±1.13           | 67.53 ±0.35
CutMix [12]             | 51.20 ±2.29 (+6.79)   | 60.34 ±1.24 (+5.09)   | 63.87 ±0.71 (+3.30)   | 67.68 ±0.37 (+0.15)
Baseline [11]           | 45.50                 | 56.70                 | 61.10                 | 66.90
DST–CBC [11]            | 48.70 (+3.20)         | 60.50 (+3.80)         | 64.40 (+3.30)         | –
Baseline [47]           | 43.84 ±0.71           | 54.84 ±1.14           | 60.08 ±0.62           | 66.19 ±0.11
ClassMix [47]           | 54.07 ±1.61 (+10.23)  | 61.35 ±0.62 (+6.51)   | 63.63 ±0.33 (+3.55)   | –
Baseline                | 48.75 ±1.61           | 59.14 ±1.02           | 63.46 ±0.38           | 67.77 ±0.13
ClassMix [47]¹          | 56.82 ±1.65 (+8.07)   | 63.86 ±0.41 (+4.72)   | 65.57 ±0.71 (+2.11)   | –
ClassMix [47] (+Video)  | 56.79 ±1.98 (+8.04)   | 63.22 ±0.84 (+4.08)   | 65.72 ±0.18 (+2.26)   | 68.23 ±0.70 (+0.46)
Ours                    | 58.40 ±1.36 (+9.65)   | 66.66 ±1.05 (+7.52)   | 68.43 ±0.06 (+4.98)   | 71.16 ±0.16 (+3.40)
Ours (+Data Selection)  | 62.09 ±0.39 (+13.34)  | 68.01 ±0.83 (+8.87)   | 69.38 ±0.33 (+5.92)   | –
outperforms all other approaches on each labeled subset
size for both the absolute performance as well as the im-
provement to the baseline. The only exception is the abso-
lute improvement of the original results of ClassMix for 100
labeled samples. However, if we consider ClassMix trained
in our setting, our method outperforms it also in this case.
This can be explained by the considerably higher baseline
performance in our setting, which makes it more difficult to
achieve a high improvement. Adding data selection even
further increases the performance by a significant margin,
so that our method, trained with only 1/8 of the labels, even
slightly outperforms the fully-supervised baseline.
To identify whether the improvement originates from ac-
cess to more unlabeled data or from the effectiveness of
our approach, we compare to another baseline “ClassMix
(+Video)”. More specifically, we also provide all unla-
beled image sequences to ClassMix and see how much it
can benefit from this additional amount of unlabeled data.
Experimental results show no significant difference. This is
probably due to the high correlation of the Cityscapes im-
age dataset and the video dataset (the images are the 20th
frames of the video clips).
The adequacy of our approach is also reflected in the ex-
ample predictions in Fig. 3. We can observe that the con-
tours of classes are more precise. Moreover, difficult ob-
jects such as bus, train, rider, or truck can be better distin-
guished. This observation is also quantitatively confirmed
by the class-wise IoU improvement shown in Fig. 4.
4.3. Ablation Study
Next, we analyze the individual contribution of each
component of the proposed method. For this purpose, we
1 Results of the reimplementation in our experiment setting.
Table 2. Ablation of the architecture components (D-T: SDE Transfer Learning, D-M: SDE Transfer and Multi-Task Learning, F: ImageNet Feature Distance Loss, P: Pseudo-Labeling, X-C: Mix Class, X-D: Mix Depth, S: Data Selection). mIoU in %, standard deviation over 3 seeds.

D | F | P | X | S | 372 Samples          | 2975 Samples
– | – | – | – | – | 59.14 ±1.02          | 67.77 ±0.13
T | – | – | – | – | 60.46 ±0.64 (+1.31)  | 69.00 ±0.70 (+1.23)
T | ✓ | – | – | – | 60.80 ±0.69 (+1.66)  | 69.47 ±0.38 (+1.71)
M | ✓ | – | – | – | 61.25 ±0.55 (+2.10)  | 69.76 ±0.39 (+1.99)
– | – | ✓ | – | – | 62.39 ±0.86 (+3.24)  | –
– | – | ✓ | C | – | 63.16 ±0.89 (+4.02)  | 69.60 ±0.32 (+1.83)
– | – | ✓ | D | – | 64.14 ±1.34 (+5.00)  | 69.83 ±0.36 (+2.06)
M | ✓ | ✓ | D | – | 66.66 ±1.05 (+7.52)  | 71.16 ±0.16 (+3.40)
– | – | – | – | ✓ | 64.25 ±0.18 (+5.11)  | –
M | ✓ | ✓ | D | ✓ | 68.01 ±0.83 (+8.87)  | –
Figure 4. Improvement of the class-wise IoU over the baseline performance
for 372 labeled samples (DM: SDE Multi-Task Learning,
XD: DepthMix with Pseudo-Labels, S: Data Selection). (Bar chart
over the 19 Cityscapes classes for the configurations DM, XD,
DM+XD, and DM+XD+S.)
test several ablated versions of our model for both the cases
of 372 and 2975 labeled samples. We summarize the re-
sults in Tab. 2. It can be seen that each contribution adds
a significant performance improvement over the baseline.
For 372 (2975) annotated samples, transfer and multi-task
Figure 5. DepthMix applied to Cityscapes crops. (Two examples a) and b),
each showing Image i, Image j, Depth i, Depth j, and the Mixed Image I′.)
learning improve the performance by +2.10 (+1.99), Depth-
Mix with pseudo-labels by +5.00 (+2.06), and automatic
data selection by +5.11 (–) mIoU percentage points. As our
components are orthogonal, combining them even further
increases performance. SDE Multi-Tasking and DepthMix
achieve +7.52 (+3.40) and all three components +8.87 (–)
mIoU percentage points improvement. Note that the high
variance for few labeled samples is mostly due to the high
influence of the randomly selected labeled subset. The cho-
sen subset affects all configurations equally and the reported
improvements are consistent for each subset.
Furthermore, we compare DepthMix with ClassMix as a
standalone. For a fair comparison, we additionally include
mixing labeled samples with their ground truth to ClassMix.
It can be seen that DepthMix outperforms ClassMix by
0.98 (0.23) percentage points for 372 (2975) annotated sam-
ples, which shows the effect of the geometry-aware augmentation.
Fig. 5 shows DepthMix examples demonstrating that
SDE allows us to correctly model occlusions and to produce
synthetic samples with a realistic appearance.
For more insights into possible reasons for these im-
provements, we visualize the improvement of the architec-
ture components over the baseline for each class separately
in Fig. 4. It can be seen that depth multi-task learning (DM)
improves mostly the classes fence, traffic light, traffic sign,
rider, truck, and motorcycle, which is possibly due to their
characteristic depth profile learned during SDE. For exam-
ple, a good depth estimation performance requires correctly
segmenting poles or traffic signs as missing them can cause
large depth errors. This can also be seen in Fig. 3. Depth-
Mix (XD) further improves the performance of wall, truck,
bus, and train. This might be caused by the fact that Depth-
Mix presents those rather difficult objects in another con-
text, which might help the network to generalize better.
In the suppl. materials, we further show that our method
is still applicable if SDE is trained on a different dataset
than semantic segmentation within a similar visual domain.
4.4. Automatic Data Selection for Annotation
Finally, we evaluate the proposed automatic data selec-
tion. Tab. 3 shows a comparison of our method with a base-
line and a competing method. The baseline selects the la-
Table 3. Comparison of data selection methods (DS: Diversity
Sampling based on depth features, US: Uncertainty Sampling
based on depth student error). mIoU in %, std. dev. over 3 seeds.
# Labeled     | 1/30 (100)  | 1/8 (372)   | 1/4 (744)
Random        | 48.75 ±1.61 | 59.14 ±1.02 | 63.46 ±0.38
Entropy       | 53.63 ±0.77 | 63.51 ±0.68 | 66.18 ±0.50
Ours (US)     | 51.75 ±1.12 | 62.77 ±0.46 | 66.76 ±0.45
Ours (DS)     | 53.00 ±0.51 | 63.23 ±0.69 | 66.37 ±0.20
Ours (DS+US)  | 54.37 ±0.36 | 64.25 ±0.18 | 66.94 ±0.59
beled samples randomly, while the second, strong competi-
tor uses active learning and iteratively chooses the samples
with the highest segmentation entropy. In contrast to our
method, this requires a human in the loop to create the se-
mantic labels for iteratively selected images. It can be seen
that our method with the combined Diversity Sampling and
Uncertainty Sampling (DS+US) outperforms both compar-
ison methods, demonstrating the effectiveness of ensuring
diversity and exploiting difficult samples based on depth. It
also supports the assumption that depth estimation and se-
mantic segmentation are correlated in terms of sample dif-
ficulty. The class-wise analysis (see the last row of Fig. 4)
shows that data selection significantly improves the perfor-
mance of truck, bus, and train, which are usually difficult
to distinguish in a semi-supervised setting. We would like
to note that our automatic data selection method can be ap-
plied to any semantic segmentation method.
5. Conclusion
In this work, we have studied how self-supervised depth
estimation (SDE) can be utilized to improve semantic
segmentation in both the semi-supervised and the fully-
supervised setting. We introduced three effective strategies
capable of leveraging the knowledge learned from SDE.
First, we show that the SDE feature representation can be
transferred to semantic segmentation, by means of SDE pre-
training and joint learning of segmentation and depth. Sec-
ond, we demonstrate that the proposed DepthMix strategy
outperforms related mixing strategies by avoiding inconsis-
tent geometry of the generated images. Third, we present
an automatic data selection for annotation algorithm based
on SDE, which does not require human-in-the-loop anno-
tations. We validate the benefits of the three components
by extensive experiments on Cityscapes, where we demon-
strate significant gains over the baselines and competing
methods. By using SDE, our approach achieves state-of-
the-art performance, suggesting that SDE can be a valuable
self-supervision for semantic segmentation.
Acknowledgements: This work is funded by Toyota Motor
Europe via the research project TRACE-Zurich and by a
research project from armasuisse.
References
[1] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas
Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A
holistic approach to semi-supervised learning. In Adv. Neural
Inform. Process. Syst., pages 5049–5059, 2019. 5
[2] Gabriel J Brostow, Julien Fauqueur, and Roberto Cipolla.
Semantic object classes in video: A high-definition ground
truth database. Pattern Recognition Letters, pages 88–97,
2009. 12
[3] Vincent Casser, Soeren Pirk, Reza Mahjourian, and Anelia
Angelova. Depth prediction without the sensors: Leveraging
structure for unsupervised learning from monocular videos.
In AAAI Conf. Artif. Intell., pages 8001–8008, 2019. 3, 13
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
Kevin Murphy, and Alan L Yuille. Semantic image segmen-
tation with deep convolutional nets and fully connected crfs.
In Int. Conf. Learn. Represent., pages 834–848, 2015. 2
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image
segmentation with deep convolutional nets, atrous convolu-
tion, and fully connected crfs. IEEE Trans. Pattern Anal.
Mach. Intell., pages 834–848, 2017. 1, 2, 6, 12
[6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian
Schroff, and Hartwig Adam. Encoder-decoder with atrous
separable convolution for semantic image segmentation. In
Eur. Conf. Comput. Vis., pages 801–818, 2018. 2
[7] Po-Yi Chen, Alexander H Liu, Yen-Cheng Liu, and Yu-
Chiang Frank Wang. Towards scene understanding: Un-
supervised monocular depth estimation with semantic-aware
representation. In IEEE Conf. Comput. Vis. Pattern Recog.,
pages 2624–2632, 2019. 3
[8] Yuhua Chen, Cordelia Schmid, and Cristian Sminchis-
escu. Self-supervised learning with geometric constraints in
monocular video: Connecting flow, depth, and camera. In
Int. Conf. Comput. Vis., pages 7063–7072, 2019. 3
[9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo
Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe
Franke, Stefan Roth, and Bernt Schiele. The cityscapes
dataset for semantic urban scene understanding. In IEEE
Conf. Comput. Vis. Pattern Recog., pages 3213–3223, 2016.
1, 2, 6
[10] Qi Dai, Vaishakh Patil, Simon Hecker, Dengxin Dai, Luc
Van Gool, and Konrad Schindler. Self-supervised object mo-
tion and depth estimation from video. In IEEE Conf. Com-
put. Vis. Pattern Recog. Workshops, pages 1004–1005, 2020.
13
[11] Zhengyang Feng, Qianyu Zhou, Guangliang Cheng, Xin
Tan, Jianping Shi, and Lizhuang Ma. Semi-supervised se-
mantic segmentation via dynamic self-training and class-
balanced curriculum. arXiv preprint arXiv:2004.08514,
2020. 2, 7
[12] Geoffrey French, Samuli Laine, Timo Aila, Michal Mack-
iewicz, and Graham Finlayson. Semi-supervised semantic
segmentation needs strong, varied perturbations. In Brit.
Mach. Vis. Conf., 2020. 2, 4, 5, 7
[13] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian
approximation: Representing model uncertainty in deep
learning. In Int. Conf. Mach. Learning, pages 1050–1059,
2016. 2
[14] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian
Reid. Unsupervised cnn for single view depth estimation:
Geometry to the rescue. In Eur. Conf. Comput. Vis., pages
740–756, 2016. 3
[15] Golnaz Ghiasi and Charless C Fowlkes. Laplacian pyramid
reconstruction and refinement for semantic segmentation. In
Eur. Conf. Comput. Vis., pages 519–534, 2016. 2
[16] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow.
Unsupervised monocular depth estimation with left-right
consistency. In IEEE Conf. Comput. Vis. Pattern Recog.,
pages 270–279, 2017. 1, 3
[17] Clément Godard, Oisin Mac Aodha, Michael Firman, and
Gabriel J Brostow. Digging into self-supervised monocular
depth estimation. In Int. Conf. Comput. Vis., pages 3828–
3838, 2019. 1, 3, 12
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Adv. Neural
Inform. Process. Syst., pages 2672–2680, 2014. 2
[19] Marc Górriz, Xavier Giró Nieto, Axel Carlier, and Em-
manuel Faure. Cost-effective active learning for melanoma
segmentation. In Adv. Neural Inform. Process. Syst. Work-
shop ML4H: Machine Learning for Health, pages 1–5, 2017.
2
[20] Vitor Guizilini, Rui Hou, Jie Li, Rares Ambrus, and Adrien
Gaidon. Semantically-guided representation learning for
self-supervised monocular depth. In Int. Conf. Learn. Rep-
resent., 2020. 1, 3
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In IEEE Conf.
Comput. Vis. Pattern Recog., pages 770–778, 2016. 6
[22] Yao Hu, Debing Zhang, Zhongming Jin, Deng Cai, and Xi-
aofei He. Active learning via neighborhood reconstruction.
In Int. Joint Conf. Artif. Intell., pages 1415–1421, 2013. 2
[23] Wei Chih Hung, Yi Hsuan Tsai, Yan Ting Liou, Yen-Yu
Lin, and Ming Hsuan Yang. Adversarial learning for semi-
supervised semantic segmentation. In Brit. Mach. Vis. Conf.,
2018. 2, 7
[24] Rebecca Hwa. Sample selection for statistical parsing. Com-
putational linguistics, pages 253–276, 2004. 2
[25] Sergey Ioffe and Christian Szegedy. Batch normalization:
Accelerating deep network training by reducing internal co-
variate shift. arXiv preprint arXiv:1502.03167, 2015. 12
[26] Huaizu Jiang, Gustav Larsson, Michael Maire,
Greg Shakhnarovich, and Erik Learned-Miller. Self-
supervised relative depth learning for urban scene under-
standing. In Eur. Conf. Comput. Vis., pages 19–35, 2018. 1,
2, 3
[27] Huaizu Jiang, Deqing Sun, Varun Jampani, Zhaoyang Lv,
Erik Learned-Miller, and Jan Kautz. Sense: A shared en-
coder network for scene-flow estimation. In Int. Conf. Com-
put. Vis., pages 3195–3204, 2019. 3
[28] Jianbo Jiao, Ying Cao, Yibing Song, and Rynson Lau. Look
deeper into depth: Monocular depth estimation with seman-
tic booster and attention-driven loss. In Eur. Conf. Comput.
Vis., pages 53–69, 2018. 3
[29] Tejaswi Kasarla, Gattigorla Nagendar, Guruprasad M Hegde,
Vineeth Balasubramanian, and CV Jawahar. Region-based
active learning for efficient labeling in semantic segmenta-
tion. In IEEE Winter Conf. Appl. of Comput. Vis., pages
1109–1117, 2019. 2
[30] Diederik P Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980,
2014. 6
[31] Marvin Klingner, Andreas Bar, and Tim Fingscheidt. Im-
proved noise and attack robustness for semantic segmenta-
tion by using multi-task training with self-supervised depth
estimation. In IEEE Conf. Comput. Vis. Pattern Recog. Work-
shops, pages 320–321, 2020. 3
[32] Marvin Klingner, Jan-Aike Termöhlen, Jonas Mikolajczyk,
and Tim Fingscheidt. Self-supervised monocular depth es-
timation: Solving the dynamic object problem by semantic
guidance. In Eur. Conf. Comput. Vis., pages 582–600, 2020.
1, 3, 13
[33] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Fed-
erico Tombari, and Nassir Navab. Deeper depth prediction
with fully convolutional residual networks. In Int. Conf. 3D
Vision, pages 239–248, 2016. 12
[34] Gustav Larsson, Michael Maire, and Gregory
Shakhnarovich. Colorization as a proxy task for visual
understanding. In IEEE Conf. Comput. Vis. Pattern Recog.,
pages 6874–6883, 2017. 2
[35] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick
Haffner. Gradient-based learning applied to document recog-
nition. Proceedings of the IEEE, pages 2278–2324, 1998. 1,
2
[36] Dong-Hyun Lee. Pseudo-label: The simple and efficient
semi-supervised learning method for deep neural networks.
In Int. Conf. Mach. Learning, 2013. 2
[37] Seokju Lee, Junsik Kim, Tae-Hyun Oh, Yongseop Jeong,
Donggeun Yoo, Stephen Lin, and In So Kweon. Visuomotor
understanding for representation learning of driving scenes.
In Brit. Mach. Vis. Conf., 2019. 2
[38] Changsheng Li, Handong Ma, Zhao Kang, Ye Yuan, Xiao-
Yu Zhang, and Guoren Wang. On deep unsupervised active
learning. Int. Joint Conf. Artif. Intell., 2020. 2
[39] Changsheng Li, Xiangfeng Wang, Weishan Dong, Junchi
Yan, Qingshan Liu, and Hongyuan Zha. Joint active learning
with feature selection via cur matrix decomposition. IEEE
Trans. Pattern Anal. Mach. Intell., pages 1382–1396, 2018.
2
[40] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
convolutional networks for semantic segmentation. In IEEE
Conf. Comput. Vis. Pattern Recog., pages 3431–3440, 2015.
1, 2
[41] Radek Mackowiak, Philip Lenz, Omair Ghori, Ferran Diego,
Oliver Lange, and Carsten Rother. Cereals-cost-effective
region-based active learning for semantic segmentation. In
Brit. Mach. Vis. Conf., 2018. 2
[42] Andrew Kachites McCallumzy and Kamal Nigamy. Employ-
ing em and pool-based active learning for text classification.
In Int. Conf. Mach. Learning, pages 359–367, 1998. 2
[43] Sudhanshu Mittal, Maxim Tatarchenko, and Thomas Brox.
Semi-supervised semantic segmentation with high-and low-
level consistency. IEEE Trans. Pattern Anal. Mach. Intell.,
2019. 2, 7
[44] Hieu T Nguyen and Arnold Smeulders. Active learning using
pre-clustering. In Int. Conf. Mach. Learning, page 79, 2004.
2
[45] Feiping Nie, Hua Wang, Heng Huang, and Chris Ding. Early
active learning via robust representation and structured spar-
sity. In Int. Joint Conf. Artif. Intell., 2013. 2
[46] Jelena Novosel, Prashanth Viswanath, and Bruno Arse-
nali. Boosting semantic segmentation with multi-task self-
supervised learning for autonomous driving applications. In
Int. Conf. Comput. Vis. Workshops, 2019. 3
[47] Viktor Olsson, Wilhelm Tranheden, Juliano Pinto, and
Lennart Svensson. Classmix: Segmentation-based data aug-
mentation for semi-supervised learning. In IEEE Winter
Conf. on Applications of Comput. Vis., pages 1369–1378,
2021. 1, 2, 4, 5, 6, 7, 13
[48] Yassine Ouali, Céline Hudelot, and Myriam Tami. Semi-
supervised semantic segmentation with cross-consistency
training. In IEEE Conf. Comput. Vis. Pattern Recog., pages
12674–12684, 2020. 2
[49] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor
Darrell, and Alexei A Efros. Context encoders: Feature
learning by inpainting. In IEEE Conf. Comput. Vis. Pattern
Recog., pages 2536–2544, 2016. 2
[50] Pierluigi Zama Ramirez, Matteo Poggi, Fabio Tosi, Stefano
Mattoccia, and Luigi Di Stefano. Geometry meets semantics
for semi-supervised monocular depth estimation. In Asian
Conf. Comput. Vis., pages 298–313, 2018. 3
[51] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net:
Convolutional networks for biomedical image segmenta-
tion. In Int. Conf. Medical Image Computing and Computer-
assisted Intervention, pages 234–241, 2015. 2, 6, 12
[52] Ozan Sener and Silvio Savarese. Active learning for convo-
lutional neural networks: A core-set approach. In Int. Conf.
Learn. Represent., 2018. 2
[53] Burr Settles. Active learning literature survey. Technical re-
port, University of Wisconsin-Madison Department of Com-
puter Sciences, 2009. 2
[54] Burr Settles and Mark Craven. An analysis of active learning
strategies for sequence labeling tasks. In Conf. Empirical
Methods Natural Language Processing, pages 1070–1079,
2008. 2
[55] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky.
Query by committee. In Annual Workshop Computational
Learning Theory, pages 287–294, 1992. 2
[56] Lei Shi and Yi-Dong Shen. Diversifying convex transduc-
tive experimental design for active learning. In IJCAI, pages
1997–2003, 2016. 2
[57] Yawar Siddiqui, Julien Valentin, and Matthias Nießner.
Viewal: Active learning with viewpoint entropy for semantic
segmentation. In IEEE Conf. Comput. Vis. Pattern Recog.,
pages 9433–9443, 2020. 2
[58] Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Varia-
tional adversarial active learning. In Int. Conf. Comput. Vis.,
pages 5972–5981, 2019. 2
[59] Nasim Souly, Concetto Spampinato, and Mubarak Shah.
Semi supervised semantic segmentation using generative ad-
versarial network. In Int. Conf. Comput. Vis., pages 5688–
5696, 2017. 2
[60] Antti Tarvainen and Harri Valpola. Mean teachers are better
role models: Weight-averaged consistency targets improve
semi-supervised deep learning results. In Adv. Neural In-
form. Process. Syst., pages 1195–1204, 2017. 2, 5
[61] Simon Vandenhende, Stamatios Georgoulis, Wouter
Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc
Van Gool. Multi-task learning for dense prediction tasks: A
survey. IEEE Trans. Pattern Anal. Mach. Intell., 2021. 1
[62] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio,
and David Lopez-Paz. Interpolation consistency training for
semi-supervised learning. In Int. Joint Conf. Artif. Intell.,
pages 3635–3641, 2019. 5
[63] Xiaolong Wang and Abhinav Gupta. Unsupervised learning
of visual representations using videos. In Int. Conf. Comput.
Vis., pages 2794–2802, 2015. 2
[64] Shuai Xie, Zunlei Feng, Ying Chen, Songtao Sun, Chao Ma,
and Mingli Song. Deal: Difficulty-aware active learning for
semantic segmentation. In Asian Conf. Comput. Vis., 2020.
2
[65] Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe.
Pad-net: Multi-tasks guided prediction-and-distillation net-
work for simultaneous depth estimation and scene parsing.
In IEEE Conf. Comput. Vis. Pattern Recog., pages 675–684,
2018. 5, 6, 12
[66] Lin Yang, Yizhe Zhang, Jianxu Chen, Siyuan Zhang, and
Danny Z Chen. Suggestive annotation: A deep active learn-
ing framework for biomedical image segmentation. In Int.
Conf. Medical Image Computing and Computer-assisted In-
tervention, pages 399–407, 2017. 2
[67] Fisher Yu and Vladlen Koltun. Multi-scale context
aggregation by dilated convolutions. arXiv preprint
arXiv:1511.07122, 2015. 2
[68] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via trans-
ductive experimental design. In Int. Conf. Mach. Learning,
pages 1081–1088, 2006. 2
[69] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk
Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regu-
larization strategy to train strong classifiers with localizable
features. In Int. Conf. Comput. Vis., pages 6023–6032, 2019.
1, 4, 5
[70] Lijun Zhang, Chun Chen, Jiajun Bu, Deng Cai, Xiaofei He,
and Thomas S Huang. Active learning based on locally lin-
ear reconstruction. IEEE Trans. Pattern Anal. Mach. Intell.,
pages 2026–2038, 2011. 2
[71] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang
Wang, and Jiaya Jia. Pyramid scene parsing network. In
IEEE Conf. Comput. Vis. Pattern Recog., pages 2881–2890,
2017. 2
[72] Hao Zheng, Lin Yang, Jianxu Chen, Jun Han, Yizhe Zhang,
Peixian Liang, Zhuo Zhao, Chaoli Wang, and Danny Z Chen.
Biomedical image segmentation via representative annota-
tion. In Proceedings of the AAAI Conference on Artificial
Intelligence, pages 5901–5908, 2019. 2
[73] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G
Lowe. Unsupervised learning of depth and ego-motion from
video. In IEEE Conf. Comput. Vis. Pattern Recog., pages
1851–1858, 2017. 1, 3
[74] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. Df-net: Un-
supervised joint learning of depth and flow using cross-task
consistency. In Eur. Conf. Comput. Vis., pages 36–53, 2018.
3
[75] Laurent Zwald and Sophie Lambert-Lacroix. The
berhu penalty and the grouped effect. arXiv preprint
arXiv:1207.6868, 2012. 12
A. Further Implementation Details
In the following paragraphs, a more detailed de-
scription of the network architecture and the training
is provided. The reference implementation is available
at https://github.com/lhoyer/improving_
segmentation_with_selfsupervised_depth.
Network Architecture The neural network combines a
DeepLabv3 [5] with a U-Net [51] decoder for depth and
segmentation prediction each. As encoder, a ResNet101
with dilated (instead of strided) convolutions in the last
block is used, following [5]. Features from multiple scales
are aggregated by an ASPP [5] block with dilation rates
of 6, 12, and 18. Similar to U-Net [51], the decoder has
five upsampling blocks with skip connections. Each up-
sampling block consists of a 3x3 convolution layer (except
the first block, which is the ASPP), a bilinear upsampling
operation, a concatenation with the encoder features of the
corresponding size (skip connection), and another 3x3 con-
volution layer. Both convolutional layers are followed by
an ELU non-linearity. The number of output channels for
the blocks are 256, 256, 128, 128, and 64. The last four
blocks also have another 3x3 convolutional layer followed
by a sigmoid activation attached to their output for the pur-
pose of predicting the disparity at the respective scale. For
effective multi-task learning, we additionally follow PAD-
Net [65] and deploy an attention-guided multi-modal dis-
tillation module with additional side output for semantic
segmentation after the third decoder block. In experiments
without multi-task learning, only the semantic segmentation
decoder is used. For pose estimation, we use a lightweight
ResNet18 encoder followed by four convolutions to pro-
duce the translation and the rotation in angle-axis represen-
tation as suggested in [17].
Runtime To give an impression of the computational
complexity of our architecture, we provide the training time
per iteration and the inference time per image on an Nvidia
Tesla P100 in Tab. S4. The values are averaged over 100 it-
erations or 500 images, respectively. Please note that these
timings include the computational overhead of the training
framework such as logging and validation metric calcula-
tion.
Data Selection In the data selection experiment, we use a
slimmed network architecture for fSIDE with a ResNet50
backbone, 256, 128, 128, 64, and 64 decoder channels, and
BatchNorm [25] in the decoder for efficiency and faster
convergence. The depth student network is trained using
a berHu loss [75, 33]. The quality of the selected subset
with annotations GA is evaluated for semantic segmentation
using our default architecture and training hyperparameters.
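The berHu (reverse Huber) loss mentioned above combines an L1 regime for small errors with a quadratic regime for large ones; a hedged sketch is given below, where the per-batch choice of the threshold c is an assumption following common practice [33, 75] rather than the exact setting used here.

```python
import torch

def berhu_loss(pred, target, c=None):
    """Reverse Huber (berHu) loss for the depth student, a minimal sketch.

    Below the threshold c the loss is L1; above it, it grows quadratically as
    (e^2 + c^2) / (2c), which is continuous at |e| = c. If c is not given, it
    is set per batch to a fraction of the maximum absolute error (assumption).
    """
    err = (pred - target).abs()
    if c is None:
        c = 0.2 * err.max().detach()
    quadratic = (err ** 2 + c ** 2) / (2 * c)
    return torch.where(err <= c, err, quadratic).mean()
```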
Table S4. Training and inference time on an Nvidia Tesla P100 averaged over 100 iterations or 500 images, respectively. D-T: SDE Transfer Learning, D-M: SDE Transfer and Multi-Task Learning, P: Pseudo-Labelling, X-D: Mix Depth.

D | P | X | Training Time | Inference Time
T | – | – | 188 ms/it     | 66 ms/img
T | ✓ | – | 466 ms/it     | 67 ms/img
T | ✓ | D | 476 ms/it     | 66 ms/img
M | ✓ | D | 1215 ms/it    | 160 ms/img
B. Cross-Dataset Transfer Learning
In this section, we show that the unlabeled image sequences and the labeled segmentations can also originate from different datasets within similar visual domains. For that purpose, we train the SDE on Cityscapes sequences and learn the semi-supervised semantic segmentation on the CamVid dataset [2], which contains 367 train, 101 validation, and 233 test images with dense semantic segmentation labels for 11 classes from street scenes in Cambridge. To ensure a similar feature resolution, we upsample the CamVid images from 480 × 360 to 672 × 512 pixels and randomly crop to a size of 512 × 512.
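For illustration, a minimal torchvision-based sketch of this preprocessing is shown below. It assumes PIL image/label pairs and is not taken from the reference implementation, whose data pipeline may differ.

```python
import random
from torchvision.transforms import InterpolationMode
from torchvision.transforms import functional as TF

def camvid_train_transform(image, label):
    """Upsample a CamVid image/label pair from 480x360 to 672x512
    (width x height) and take a joint random 512x512 crop."""
    image = TF.resize(image, (512, 672))  # size is given as (H, W)
    label = TF.resize(label, (512, 672), interpolation=InterpolationMode.NEAREST)
    # The height is already 512 after resizing, so only the horizontal
    # crop position is random.
    left = random.randint(0, 672 - 512)
    image = TF.crop(image, 0, left, 512, 512)
    label = TF.crop(label, 0, left, 512, 512)
    return image, label
```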
Table S5 shows that the results on CamVid are similar to our main results on Cityscapes. For 50 labeled training samples, SDE pretraining improves the mIoU by 3.6 percentage points, pseudo-labels and DepthMix add another 4.07 percentage points, and data selection another 1.41 percentage points. Overall, our proposed method outperforms ClassMix by a significant margin of 2.34 percentage points for 50 labeled samples and 2.14 percentage points for 100 labeled samples. For the fully labeled dataset, our method still improves the performance by 3.29 percentage points.
C. Further Example Predictions
Further examples for semantic segmentation and SDE are shown in Fig. S6. In general, the same observations as in the main paper can be made. Our method provides clearer segmentation contours for objects that are bordered by pronounced depth discontinuities, such as pole, traffic sign, or traffic light. We also observe an improved differentiation between similar classes such as truck, bus, and train. On the downside, SDE sometimes fails for cars driving directly in front of the camera (see the 7th row in Fig. S6), as they violate the reconstruction assumptions: these cars are observed at the exact same location across the image sequence and cannot be correctly reconstructed during SDE training, even with correct depth and pose estimates. However, this different behavior for moving and non-moving cars does not hinder the transfer of SDE-learned features to semantic segmentation, but it can cause problems with DepthMix (see Section D).
  • 1. Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation Lukas Hoyer ETH Zurich [email protected] Dengxin Dai ETH Zurich [email protected] Yuhua Chen ETH Zurich [email protected] Adrian Köring University of Bonn [email protected] Suman Saha ETH Zurich [email protected] Luc Van Gool ETH Zurich & KU Leuven [email protected] Abstract Training deep networks for semantic segmentation re- quires large amounts of labeled training data, which presents a major challenge in practice, as labeling seg- mentation masks is a highly labor-intensive process. To address this issue, we present a framework for semi- supervised semantic segmentation, which is enhanced by self-supervised monocular depth estimation from unlabeled image sequences. In particular, we propose three key con- tributions: (1) We transfer knowledge from features learned during self-supervised depth estimation to semantic seg- mentation, (2) we implement a strong data augmentation by blending images and labels using the geometry of the scene, and (3) we utilize the depth feature diversity as well as the level of difficulty of learning depth in a student- teacher framework to select the most useful samples to be annotated for semantic segmentation. We validate the pro- posed model on the Cityscapes dataset, where all three modules demonstrate significant performance gains, and we achieve state-of-the-art results for semi-supervised se- mantic segmentation. The implementation is available at https://github.com/lhoyer/improving_segmentation_ with_selfsupervised_depth. 1. Introduction Convolutional Neural Networks (CNNs) [35] have achieved state-of-the-art results for various computer vi- sion tasks including semantic segmentation [40, 5]. How- ever, training CNNs typically requires large-scale annotated datasets, due to millions of learnable parameters involved. Collecting such training data relies primarily on manual an- notation. For semantic segmentation, the process can be particularly costly, due to the required dense annotations. For example, annotating a single image in the Cityscapes dataset took on average 1.5 hours [9]. Recently, self-supervised learning has shown to be a promising replacement for manually labeled data. It aims to learn representations from the structure of unlabeled data, instead of relying on a supervised loss, which in- volves manual labels. The principle has been successfully applied in depth estimation for stereo pairs [16] or im- age sequences [73]. Additionally, semantic segmentation is known to be tightly coupled with depth. Several works have reported that jointly learning segmentation and super- vised depth estimation can benefit the performance of both tasks [61]. Motivated by these observations, we investigate the question: How can we leverage self-supervised depth estimation to improve semantic segmentation? In this work, we propose a threefold approach to utilize self-supervised monocular depth estimation (SDE) [16, 73, 17] to improve the performance of semantic segmentation and to reduce the amount of annotation needed. Our contri- butions span across the holistic learning process from data selection, over data augmentation, up to cross-task repre- sentation learning, while being unified by the use of SDE. 
First, we employ SDE as an auxiliary task for seman- tic image segmentation under a transfer learning and multi- task learning framework and show that it noticeably im- proves the performance of semantic segmentation, espe- cially when supervision is limited. Previous works only cover full supervision [32], pretraining [26], or improving SDE instead of segmentation [20]. Second, we propose a strong data augmentation strategy, DepthMix, which blends images as well as their labels according to the geometry of the scenes obtained from SDE. In comparison to previous methods [69, 47], DepthMix explicitly respects the geomet- ric structure of the scenes and generates fewer artifacts (see Fig. 1). And third, we propose an Automatic Data Selection for Annotation, which selects the most useful samples to be 1 arXiv:2012.10782v2 [cs.CV] 5 Apr 2021
  • 2. annotated in order to maximize the gain. The selection is iteratively driven by two criteria: diversity and uncertainty. Both of them are conducted by a novel use of SDE as proxy task in this context. While our method follows the active learning cycle (model training → query selection → an- notation → model training) [53, 66], it does not require a human in the loop to provide semantic segmentation labels as the human is replaced by a proxy-task SDE oracle. This greatly improves flexibility, scalability, and efficiency, espe- cially considering crowdsourcing platforms for annotation. The main advantage of our method is that we can learn from a large base of easily accessible unlabeled image se- quences and utilize the learned knowledge to improve se- mantic segmentation performance in various ways. In our experimental evaluation on Cityscapes [9], we demonstrate significant performance gains of all three components and improve the previous state-of-the-art for semi-supervised segmentation by a considerable margin. Specifically, our method achieves 92% of the full annotation baseline per- formance with only 1/30 available labels and even slightly outperforms it with only 1/8 labels. Our contributions sum- marize as follows: (1) To the best of our knowledge, we are the first to utilize SDE as an auxiliary task to exploit unlabeled image sequences and significantly improve the performance of semi-supervised semantic segmentation. (2) We propose DepthMix, a strong data augmentation strategy, which respects the geometry of the scene and achieves, in combination with (1), state-of-the-art re- sults for semi-supervised semantic segmentation. (3) We propose a novel Automatic Data Selection for An- notation based on SDE to improve the flexibility of ac- tive learning. It replaces the human annotator with an SDE oracle and lifts the requirement of having a hu- man in the loop of data selection. 2. Related Work 2.1. (Semi-Supervised) Semantic Segmentation Since Convolutional Neural Networks (CNNs) [35] were first used by Long et al. [40] for semantic segmentation, they have become the state-of-the-art method for this prob- lem. Most architectures are based on an encoder decoder design such as [40, 51, 6]. Skip connections [51] and di- lated convolutions [4, 67] preserve details in the segmen- tation and spatial pyramid pooling [15, 71, 5] aggregates different scales to exploit spatial context information. Semi-supervised semantic segmentation makes use of additional unlabeled data during training. For that purpose, Souly et al. [59] and Hung et al. [23] utilize generative ad- versarial networks [18]. Souly et al. [59] use that concept to generate additional training samples, while Hung et al. [23] train the discriminator based on the semantic segmentation probability maps. s4GAN [43] extends this idea by adding a multi-label classification mean teacher [60]. Another line of work [48, 12, 47] is based on consistency training, where perturbations are applied to unlabeled images or their inter- mediate features and a loss term enforces consistency of the segmentation. While Ouali et al. [48] study perturbation of encoder features, CutMix [12] mixes crops from the input images and their pseudo-labels to generate additional train- ing data, and ClassMix [47] uses pseudo-label [36] class segments to build the mix mask. Our proposed DepthMix module is inspired by these methods but, in contrast, it also respects the structure of the scene when mixing samples. 
Commonly, several approaches [43, 12, 47, 11] include self- training with pseudo-labels [36] and a mean teacher frame- work [60], which is extended by Feng et al. [11] with a class-balanced curriculum. Another related line of work is learning useful representations for semantic segmentation from self-supervised tasks such as tracking [63], context in- painting [49], colorization [34], depth estimation [26] (see Section 2.3), or optical flow prediction [37]. However, all of these approaches are outperformed by ImageNet pretraining and are, therefore, not relevant for semi-supervised seman- tic segmentation in practice. 2.2. Active Learning Another approach to reduce the number of required an- notations is active learning. It iteratively requests the most informative samples to be labeled by a human. On the one side, uncertainty-based approaches select samples with a high uncertainty estimated based on, e.g., entropy [24, 54] or ensemble disagreement [55, 42]. On the other side, diversity-based approaches select samples, which most in- crease the diversity of the labeled set [44, 52, 58]. For seg- mentation, active learning is typically based on uncertainty measures such as MC dropout [13, 66, 41], entropy [29, 64], or multi-view consistency [57]. In addition to methods se- lecting whole images [19, 66, 64], several approaches apply a more fine-grained label request at region level [41, 29, 57] and also include a label cost estimate [41, 29]. In contrast to these works, we perform automatic data selection for annotation by replacing the human with SDE as oracle. Therefore, we do not require human-in-the- loop annotation during the active learning cycle. Previous works performing unsupervised data selection are restricted to shallow models [68, 70, 45, 22, 56, 39], classification with low-dimensional inputs [38], or do not perform an it- erative data selection [72] to dynamically adapt to the un- certainty of the model trained on the currently labeled set. 2.3. Improving Segmentation with SDE Self-supervised depth estimation (SDE) aims to learn depth estimation from the geometric relations of stereo im- 2
  • 3. age pairs [14, 16] or monocular videos [73]. Due to the bet- ter availability of videos, we use the latter approach, where a neural network estimates depth and camera motion of two subsequent images and a photometric loss is computed after a differentiable warping. The approach has been improved by several follow-up works [17, 8, 74]. The combination of semantic segmentation and SDE was studied in previous works with the goal of improving depth estimation. While [50, 28, 7, 32] learn both tasks jointly, [3, 20, 27] distill knowledge from a teacher semantic seg- mentation network to guide SDE. To further utilize coher- ence between semantic segmentation and SDE, [50, 7] pro- posed additional loss terms that encourage spatial proximity between depth discontinuities and segmentation contours. In contrast to these works, we do not aim to improve SDE but rather semi-supervised semantic segmentation. The closest to our approach are [26], [46], and [32]. Jiang et al. [26] utilizes relative depth computed from optical flow to replace ImageNet pretraining for semantic segmentation. In contrast, we additionally study multi-task learning of SDE and semantic segmentation and show that combining SDE with ImageNet features can even further boost perfor- mance. Novosel et al. [46] and Klingner et al. [32] improve the semantic segmentation performance by jointly learning SDE. However, they focus on the fully-supervised setting, while our work explicitly addresses the challenges of semi- supervised semantic segmentation by using the depth es- timates to generate additional training data and an auto- matic data selection mechanism based on SDE. Another work supporting the usefulness of SDE for semantic seg- mentation from another viewpoint is [31] demonstrating an improved noise and attack robustness. 3. Methods In this section, we present our three ways to improve the performance of semantic segmentation with self-supervised depth estimation (SDE). They focus on three different as- pects of semantic segmentation, covering data selection for annotation, data augmentation, and multi-task learning. Given N images and K image sequences from the same domain, our first method, Automatic Data Selection for An- notation, uses SDE learned on the K (unlabeled) sequences to select NA images out of the N images for human annota- tion (see Alg. 1). Our second approach, termed DepthMix, leverages the learned SDE to create geometrically-sound ‘virtual’ training samples from pairs of labeled images and their annotations (see Fig. 1). Our third method learns se- mantic segmentation with SDE as an auxiliary task under a multi-tasking framework (see Fig. 2). The learning is rein- forced by a multi-task pretraining process combining SDE with image classification. For SDE, we follow the method of Godard et al. [17], which we briefly introduce in the following. We first train a depth estimation network fD to predict the depth of a tar- get image and a pose estimation network fT to estimate the camera motion from the target image and the source im- age. Depth and pose are used to produce a differentiable warping to transform the source image into the target im- age. The photometric error between the target image and multiple warped source frames is combined by a pixel-wise minimum. Besides, stationary pixels are masked out and an edge-aware depth smoothness term is applied resulting in the final self-supervised depth loss LD. We refer the reader to the original paper [17] for more details. 3.1. 
Automatic Data Selection for Annotation We use SDE as proxy task for selecting NA samples out of a set of N unlabeled samples for a human to create se- mantic segmentation labels. The selection is conducted pro- gressively in multiple steps, similar to the standard active learning cycle (model training → query selection → anno- tation → model training). However, our data selection is fully automatic and does not require a human in the loop as the annotation is done by a proxy-task SDE oracle. Let’s denote by G, GA, and GU , the whole image set, the selected sub-set for annotation, and the un-selected sub-set. Initially, we have GA = ∅ and GU = G. The selection is driven by two criteria: diversity and uncertainty. Diversity sampling encourages that selected images are diverse and cover different scenes. Uncertainty sampling favors adding unlabeled images that are near a decision boundary (with high uncertainties) of the model trained on the current GA. For uncertainty sampling, we need to train and update the model with GA. It is inefficient to repeat this every time a new image is added. For the sake of efficiency, we divide the selection into T steps and only train the model T times. In each step t, nt images are selected and moved from GU to GA, so we have PT t=1 nt = NA. After each step t, a model is trained on GA and evaluated on GU to get updated uncertainties for step t + 1. Diversity Sampling: To ensure that the chosen annotated samples are diverse enough to represent the entire dataset well, we use an iterative farthest point sampling based on the L2 distance over features ΦSDE computed by an inter- mediate layer of the SDE network. At step t, for each of the nt samples, we choose the one in GU with the largest dis- tance to the current annotation set GA. The set of selected samples GA is iteratively extended by moving one image at a time from GU to GA until the nt images are collected: GU = GU {Ii} and GA = GA ∪ {Ii}, (1) i = arg max Ii∈GU min Ij ∈GA ||ΦSDE i − ΦSDE j ||2. (2) Uncertainty Sampling: While Diversity Sampling is able to select diverse new samples, it is unaware of the uncer- tainties of a semantic segmentation model over these sam- ples. Uncertainty Sampling aims to select difficult samples, 3
  • 4. Algorithm 1: Automatic Data Selection 1: t = 1 2: i ← uniform(1, N) 3: GA = {Ii} and GU = GU {Ii} 4: for k = 2 to NA do 5: if k == Pt t0=1 nt0 then 6: Train depth student ΦSIDE on the current GA 7: Calculate E(i) ∀Ii ∈ GU 8: t = t + 1 9: end if 10: if t == 1 then 11: Obtain index i according to Eq. 2 12: else 13: Obtain index i according to Eq. 4 14: end if 15: GA = GA ∪ {Ii} and GU = GU {Ii} 16: end for i.e., samples in GU that the model trained on the current GA cannot handle well. In order to train this model, ac- tive learning typically uses a human-in-the-loop strategy to add annotations for selected samples. In this work, we use a proxy task based on self-supervised annotations, which can run automatically, to make the method more flexible and efficient. Since our target task is single-image semantic segmentation, we choose to use single-image depth estima- tion (SIDE) as the proxy task. Importantly, due to our SDE framework, depth pseudo-labels are available for G. Us- ing these pseudo-labels, we train a SIDE method on GA and measure the uncertainty of its depth predictions on GU . Due to the high correlation of single-image semantic segmenta- tion and SIDE, the generated uncertainties are informative and can be used to guide our sampling procedure. As the depth student model is trained only on GA, it can specifi- cally approximate the difficulty of candidate samples with respect to the already selected samples in GA. The student is trained from scratch in each step t, instead of being fine- tuned from t−1, to avoid getting stuck in the previous local minimum. Note that the SDE method is trained on a much larger unlabeled dataset, i.e., the K image sequences, and can provide good guidance for the SIDE method. In particular, the uncertainty is signaled by the dispar- ity error between the student network fSIDE and the teacher network fSDE in the log-scale space under L1 distance: E(i) = || log(1 + fSDE(Ii)) − log(1 + fSIDE(Ii))||1. (3) As the disparity difference of far-away objects is small, the log-scale is used to avoid the loss being dominated by close- range objects. This criterion can be added into Eq. 2 to also select samples with higher uncertainties for the dataset Random Class Choice Depth Comparison ClassMix (Baseline) DepthMix (Ours) Si Sj MD ⊙Si MCl ⊙Si S‘Cl S‘D Figure 1. Concept of the proposed DepthMix augmentation (refer to Sec. 3.2) and its baseline ClassMix [47]. By utilizing SDE, DepthMix mitigates geometric artifacts. update in Eq. 1: i = arg max Ii∈GU min Ij ∈GA ||ΦSDE i − ΦSDE j ||2 + λEE(i), (4) where λE is a parameter to balance the contribution of the two terms. For diversity sampling, we still use SDE features instead of SIDE student features as SDE is trained on the entire dataset, which provides better features for diversity estimation. When nt images have been selected according to Eq. 1 and Eq. 4 at step t, a new SIDE model will be trained on the current GA in order to continue further. As presented previously, our selection proceeds progressively in T steps until we collect all NA images. The algorithm of this selection is summarized in Alg. 1, where Pt t0=1 nt0 describes the desired size of GA at the end of step t. 3.2. 
DepthMix Data Augmentation Inspired by the recent success of data augmentation ap- proaches that mixup pairs of images and their (pseudo) la- bels to generate more training samples for semantic seg- mentation [69, 12, 47], we propose an algorithm, termed DepthMix, to utilize self-supervised depth estimates to maintain the integrity of the scene structure during mixing. Given two images Ii and Ij of the same size, we would like to copy some regions from Ii and paste them directly into Ij to get a virtual sample I0 . The copied regions are indicated by a mask M, which is a binary image of the same size as the two images. The image creation is done as I0 = M
  • 5. Ii + (1 − M)
  • 7. denotes the element-wise product. The label maps of the two images Si and Sj are mixed up with the same mask M to generate S0 . The mixing can be applied to la- beled data and unlabeled data using human ground truths or pseudo-labels, respectively. Existing methods generate 4
  • 8. ImageNet Encoder fI Feature Distance Loss LF Seg. Loss Lce Camera Motion Tt,t+1 SDE Loss LD Pose CNN Shared Encoder fE Depth Decoder fD Semantic Decoder fS Image It Image It+1 Depth Dt Segmentation St Ground Truth St ^ ^ Figure 2. Architecture for learning semantic segmentation with SDE as auxiliary task according to Sec. 3.3. The dashed paths are only used during training and only if image sequences and/or segmentation ground truth are available for a training sample. this mask M in different ways, e.g., randomly sampled rect- angular regions [69, 12] or randomly selected object seg- ments [47]. In those methods, the structure of the scene is not considered and foreground and background are not dis- tinguished. We find images synthesized by these methods often violate the geometric relationships between objects. For instance, a distant object can be copied onto a close- range object or only unoccluded parts of mid-range objects are copied onto the other image. Imagine how strange it is to see a pedestrian standing on top of a car or to see sky through a hole in a building (just as shown in Fig. 1 left). Our DepthMix is designed to mitigate this issue. It uses the estimated depth D̂i and D̂j of the two images to gen- erate the mix mask M that respects the notion of geometry. It is implemented by selecting only pixels from Ii whose depth values are smaller than the depth values of the pixels at the same locations in Ij: M(a, b) = 1 if D̂i(a, b) D̂j(a, b) + 0 otherwise (6) where a and b are pixel indices, and is a small value to avoid conflicts of objects that are naturally at the same depth plane such as road or sky. By using this M, DepthMix re- spects the depth of objects in both images, such that only closer objects can occlude further-away objects. We illus- trate this advantage of DepthMix with an example in Fig. 1. 3.3. Semi-Supervised Semantic Segmentation In this section, we train a semantic segmentation model utilizing the labeled image dataset GA, the unlabeled image dataset GU , and K unlabeled image sequences. We first dis- cuss how to exploit SDE on the image sequences to improve our semantic segmentation. We then show how to use GU to further improve the performance. Learning with Auxiliary Tasks: For learning semantic segmentation and SDE jointly, we use a network with shared encoder fE θ and a separate depth fD θ and segmen- tation decoder fS θ (see Fig. 2). The depth branch is trained using the SDE loss LD and the segmentation branch gS θ = fS θ ◦ fE θ is trained using the pixel-wise cross-entropy Lce. In order to initialize the pose estimation network and the depth decoder properly, the architecture is first trained on K unlabeled image sequences for SDE. As a common prac- tice, we initialize the encoder with ImageNet weights as they provide useful semantic features learned during image classification. To avoid forgetting semantic features during the SDE pretraining, we utilize a feature distance loss be- tween the current bottleneck features fE θ and the bottleneck features of the encoder with ImageNet weights fE I : LF = ||fE θ − fE I ||2. (7) The loss for the depth pretraining is the weighted sum of the SDE loss and the ImageNet feature distance loss: LP = LD + λF LF . (8) To additionally incorporate transfer learning from depth estimation to semantic segmentation, the weights of fD θ are used to initialize fS θ . For effective multi-task learning, we use an attention-guided distillation module [65] to exchange useful intermediate features between both decoders. 
Learning with Unlabeled Images: In order to further uti- lize the unlabeled dataset GU , we generate pseudo-labels using the mean teacher algorithm [60], which is commonly used in semi-supervised learning [1, 62, 12, 47]. For that purpose, an exponential moving average is applied to the weights of the semantic segmentation model gS θ to obtain the weights of the mean teacher θT : θ0 T = αθT + (1 − α)θ. (9) To generate the pseudo-labels, an argmax over the classes C is applied to the prediction of the mean teacher. SU = arg max c∈C (gS θT (IU )). (10) The mean teacher can be considered as a temporal ensem- ble, resulting in stable predictions for the pseudo-labels, while the argmax ensures confident predictions [47]. For the semi-supervised setting, the segmentation net- work is trained with labeled samples (IA, SA) and pseudo- labeled samples (IU , SU ): LSSL = Lce(gS θ (IA), SA) + λP (SU )Lce(gS θ (IU ), SU )) (11) λP (SU ) is chosen to reflect the quality of the pseudo-label represented by the fraction of pixels exceeding a thresh- old τ for the predicted probability of the most confident 5
  • 9. class maxc∈C(gS θT (IU )), as suggested in [47]. We in- corporate DepthMix samples (I0 , S0 ), which are obtained from the combined labeled and pseudo-labeled data pool Ii, Ij ∈ GA ∪ GU (see Eq. 5), into Eq. 11 to replace the unlabeled samples (SU , LU ). Our semi-supervised learning is now changed to: LSSL = Lce(gS θ (IA), SA) + λP (S0 )Lce(gS θ (I0 ), S0 )). (12) 4. Experiments 4.1. Implementation Details Dataset: We evaluate our method on the Cityscapes dataset [9], which consists of 2975 training and 500 vali- dation images with semantic segmentation labels from Eu- ropean street scenes. We downsample the images to 1024× 512 pixels. Besides, random cropping to a size of 512×512 and random horizontal flipping are used in the training. Im- portantly, Cityscapes provides 20 unlabeled frames before and 10 after the labeled image, which are used for SDE training. During the semi-supervised segmentation, only the originally 2975 labeled training images are used. They are randomly split into a labeled and an unlabeled subset. Network Architecture: Our network consists of a shared ResNet101 [21] encoder with output stride 16 and a sepa- rate decoder for segmentation and SDE. The decoder con- sists of an ASPP [5] block to aggregate features from mul- tiple scales and another four upsampling blocks with skip connections [51]. For SDE, the upsampling blocks have a disparity side output at the respective scale. For effective multi-task learning, we additionally follow PAD-Net [65] and deploy an attention-guided distillation module after the third decoder block. It serves the purpose of exchanging useful features between segmentation and depth estimation. Training: For the SDE pretraining, the depth and pose net- work are trained using Adam [30], a batch size of 4, and an initial learning rate of 1 × 10−4 , which is divided by 10 after 160k iterations. The SDE loss is calculated on four scales with three subsequent images. During the first 300k iterations, only the depth decoder and the pose network are trained. Afterwards, the depth encoder is fine-tuned with an ImageNet feature distance λF = 1 × 10−2 for another 50k iterations. The encoder is initialized with ImageNet weights, either before depth pretraining or before semantic segmentation if depth pretraining is ablated. For the multi-task setting, we train the network using SGD with a learning rate of 1 × 10−3 for the encoder and depth decoder, 1 × 10−2 for the segmentation decoder, and 1 × 10−6 for the pose network. The learning rate is reduced by 10 after 30k iterations and trained for another 10k itera- tions. A momentum of 0.9, a weight decay of 5×10−4 , and a gradient norm clipping to 10 are used. The loss for seg- mentation and SDE are weighted equally. The mean teacher Figure 3. Example semantic segmentations of our method for 100 labeled samples in comparison with ClassMix [47]. has α = 0.99 and within an iteration, the network is trained on a clean labeled and an augmented mixed batch with size 2, respectively. The latter uses DepthMix with = 0.03, color jitter, and Gaussian blur. Data Selection for Annotation: In the data selection ex- periment, we use a slimmed network architecture with a ResNet50 encoder and fewer decoder channels for fSIDE. It is trained using Adam with 1 × 10−4 learning rate and polynomial decay with exponent 0.9 for faster convergence. For calculating the depth feature diversity, we use the output of the second depth decoder block after SDE pretraining. 
It is downsampled by average pooling to a size of 8x4 pix- els and the feature channels are normalized to zero-mean unit-variance over the dataset. The student depth error is weighted by λE = 1000. The number of the selected sam- ples ( Pt t0=1 nt0 ) is iteratively increased to 25, 50, 100, 200, 372, and 744. For each subset, a student depth network is trained from scratch for 4k, 8k, 12k, 16k, and 20k iterations, respectively, to calculate the student depth error. 4.2. Semi-Supervised Semantic Segmentation First, we compare our approach with several state-of-the- art semi-supervised learning approaches. We summarize the results in Tab. 1. The performance (mIoU in %) of the semi-supervised methods and their baselines (only trained on the labeled dataset) are shown for a different number of labeled samples. As the performance of the baselines dif- fers, there are columns showing the absolute improvement for better comparability. As our baseline utilizes a more ca- pable network architecture due to the U-Net decoder with ASPP as opposed to a DeepLabv2 decoder used by most previous works, we also reimplemented the state-of-the-art method, ClassMix [47] with our network architecture and training parameters to ensure a direct comparison. As shown in Tab. 1, our method (without data selection) 6
  • 10. Table 1. Performance on the Cityscapes validation set (mIoU in %, standard deviation over 3 random seeds). Labeled Samples 1/30 (100) 1/8 (372) 1/4 (744) Full (2975) Baseline [23] – 55.50 59.90 66.40 Adversarial [23] – 58.80 +3.30 62.30 +2.40 – Baseline [43] – 56.20 60.20 66.00 s4GAN [43] – 59.30 +3.10 61.90 +1.70 65.80 –0.20 Baseline [12] 44.41 ±1.11 55.25 ±0.66 60.57 ±1.13 67.53 ±0.35 CutMix [12] 51.20 ±2.29 +6.79 60.34 ±1.24 +5.09 63.87 ±0.71 +3.30 67.68 ±0.37 +0.15 Baseline [11] 45.50 56.70 61.10 66.90 DST–CBC [11] 48.70 +3.20 60.50 +3.80 64.40 +3.30 – Baseline [47] 43.84 ±0.71 54.84 ±1.14 60.08 ±0.62 66.19 ±0.11 ClassMix [47] 54.07 ±1.61 +10.23 61.35 ±0.62 +6.51 63.63 ±0.33 +3.55 – Baseline 48.75 ±1.61 59.14 ±1.02 63.46 ±0.38 67.77 ±0.13 ClassMix [47]1 56.82 ±1.65 +8.07 63.86 ±0.41 +4.72 65.57 ±0.71 +2.11 – ClassMix [47] (+Video) 56.79 ±1.98 +8.04 63.22 ±0.84 +4.08 65.72 ±0.18 +2.26 68.23 ±0.70 +0.46 Ours 58.40 ±1.36 +9.65 66.66 ±1.05 +7.52 68.43 ±0.06 +4.98 71.16 ±0.16 +3.40 Ours (+Data Selection) 62.09 ±0.39 +13.34 68.01 ±0.83 +8.87 69.38 ±0.33 +5.92 – outperforms all other approaches on each labeled subset size for both the absolute performance as well as the im- provement to the baseline. The only exception is the abso- lute improvement of the original results of ClassMix for 100 labeled samples. However, if we consider ClassMix trained in our setting, our method outperforms it also in this case. This can be explained by the considerably higher baseline performance in our setting, which increases the difficulty to achieve an high improvement. Adding data selection even further increases the performance by a significant margin, so that our method, trained with only 1/8 of the labels, even slightly outperforms the fully-supervised baseline. To identify whether the improvement originates from ac- cess to more unlabeled data or from the effectiveness of our approach, we compare to another baseline “ClassMix (+Video)”. More specifically, we also provide all unla- beled image sequences to ClassMix and see how much it can benefit from this additional amount of unlabeled data. Experimental results show no significant difference. This is probably due to the high correlation of the Cityscapes im- age dataset and the video dataset (the images are the 20th frames of the video clips). The adequacy of our approach is also reflected in the ex- ample predictions in Fig. 3. We can observe that the con- tours of classes are more precise. Moreover, difficult ob- jects such as bus, train, rider, or truck can be better distin- guished. This observation is also quantitatively confirmed by the class-wise IoU improvement shown in Fig. 4. 4.3. Ablation Study Next, we analyze the individual contribution of each component of the proposed method. For this purpose, we 1 Results of the reimplementation in our experiment setting. Table 2. Ablation of the architecture components (D-T: SDE Transfer Learning, D-M SDE Transfer and Multi-Task Learning, F: ImageNet Feature Distance Loss, P: Pseudo-Labeling, X-C: Mix Class, X-D: Mix Depth, S - Data Selection). mIoU in %, standard deviation over 3 seeds. 
D F P X S 372 Samples 2975 Samples 59.14 ±1.02 67.77 ±0.13 T 60.46 ±0.64 +1.31 69.00 ±0.70 +1.23 T X 60.80 ±0.69 +1.66 69.47 ±0.38 +1.71 M X 61.25 ±0.55 +2.10 69.76 ±0.39 +1.99 X 62.39 ±0.86 +3.24 – X C 63.16 ±0.89 +4.02 69.60 ±0.32 +1.83 X D 64.14 ±1.34 +5.00 69.83 ±0.36 +2.06 M X X D 66.66 ±1.05 +7.52 71.16 ±0.16 +3.40 X 64.25 ± 0.18 +5.11 – M X X D X 68.01 ±0.83 +8.87 – Road S.walk Building Wall Fence Pole Tr. Light Tr. Sign Veget. Terrain Sky Person Rider Car Truck Bus Train M.cycle Bicycle Baseline DM XD DM+XD DM+XD+S 0.0 0.2 Figure 4. Improvement of the class-wise IoU over the baseline per- formance for 372 labeled samples (DM: SDE Multi-Task Learn- ing, XD: DepthMix with Pseudo-Labels, S: Data Selection). test several ablated versions of our model for both the cases of 372 and 2975 labeled samples. We summarize the re- sults in Tab. 2. It can be seen that each contribution adds a significant performance improvement over the baseline. For 372 (2975) annotated samples, transfer and multi-task 7
  • 11. Image i Image j Depth i Depth j Mixed Image I’ a) b) Figure 5. DepthMix applied to Cityscapes crops. learning improve the performance by +2.10 (+1.99), Depth- Mix with pseudo-labels by +5.00 (+2.06), and automatic data selection by +5.11 (–) mIoU percentage points. As our components are orthogonal, combining them even further increases performance. SDE Multi-Tasking and DepthMix achieve +7.52 (+3.40) and all three components +8.87 (–) mIoU percentage points improvement. Note that the high variance for few labeled samples is mostly due to the high influence of the randomly selected labeled subset. The cho- sen subset affects all configurations equally and the reported improvements are consistent for each subset. Furthermore, we compare DepthMix with ClassMix as a standalone. For a fair comparison, we additionally include mixing labeled samples with their ground truth to ClassMix. It can be seen that DepthMix outperforms the ClassMix by 0.98 (0.23) percentage points for 372 (2975) annotated sam- ples, which shows the effect of the geometry aware augmen- tation. Fig. 5 shows DepthMix examples demonstrating that SDE allows to correctly model occlusions and to produce synthetic samples with a realistic appearance. For more insights into possible reasons for these im- provements, we visualize the improvement of the architec- ture components over the baseline for each class separately in Fig. 4. It can be seen that depth multi-task learning (DM) improves mostly the classes fence, traffic light, traffic sign, rider, truck, and motorcycle, which is possibly due to their characteristic depth profile learned during SDE. For exam- ple, a good depth estimation performance requires correctly segmenting poles or traffic signs as missing them can cause large depth errors. This can also be seen in Fig. 3. Depth- Mix (XD) further improves the performance of wall, truck, bus, and train. This might be caused by the fact the Depth- Mix presents those rather difficult objects in another con- text, which might help the network to generalize better. In the suppl. materials, we further show that our method is still applicable if SDE is trained on a different dataset than semantic segmentation within a similar visual domain. 4.4. Automatic Data Selection for Annotation Finally, we evaluate the proposed automatic data selec- tion. Tab. 3 shows a comparison of our method with a base- line and a competing method. The baseline selects the la- Table 3. Comparison of data selection methods (DS: Diversity Sampling based on depth features, US: Uncertainty Sampling based on depth student error). mIoU in %, std. dev. over 3 seeds. # Labeled 1/30 (100) 1/8 (372) 1/4 (744) Random 48.75 ±1.61 59.14 ±1.02 63.46 ±0.38 Entropy 53.63 ±0.77 63.51 ±0.68 66.18 ±0.50 Ours (US) 51.75 ±1.12 62.77 ±0.46 66.76 ±0.45 Ours (DS) 53.00 ±0.51 63.23 ±0.69 66.37 ±0.20 Ours (DS+US) 54.37 ±0.36 64.25 ±0.18 66.94 ±0.59 beled samples randomly, while the second, strong competi- tor uses active learning and iteratively chooses the samples with the highest segmentation entropy. In contrast to our method, this requires a human in the loop to create the se- mantic labels for iteratively selected images. It can be seen that our method with the combined Diversity Sampling and Uncertainty Sampling (DS+US) outperforms both compar- ison methods, demonstrating the effectiveness of ensuring diversity and exploiting difficult samples based on depth. 
It also supports the assumption that depth estimation and se- mantic segmentation are correlated in terms of sample dif- ficulty. The class-wise analysis (see the last row of Fig. 4) shows that data selection significantly improves the perfor- mance of truck, bus, and train, which are usually difficult to distinguish in a semi-supervised setting. We would like to note that our automatic data selection method can be ap- plied to any semantic segmentation method. 5. Conclusion In this work, we have studied how self-supervised depth estimation (SDE) can be utilized to improve semantic segmentation in both the semi-supervised and the fully- supervised setting. We introduced three effective strategies capable of leveraging the knowledge learned from SDE. First, we show that the SDE feature representation can be transferred to semantic segmentation, by means of SDE pre- training and joint learning of segmentation and depth. Sec- ond, we demonstrate that the proposed DepthMix strategy outperforms related mixing strategies by avoiding inconsis- tent geometry of the generated images. Third, we present an automatic data selection for annotation algorithm based on SDE, which does not require human-in-the-loop anno- tations. We validate the benefits of the three components by extensive experiments on Cityscapes, where we demon- strate significant gains over the baselines and competing methods. By using SDE, our approach achieves state-of- the-art performance, suggesting that SDE can be a valuable self-supervision for semantic segmentation. Acknowledgements: This work is funded by Toyota Motor Europe via the research project TRACE-Zurich and by a research project from armasuisse. 8
  • 12. References [1] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. In Adv. Neural Inform. Process. Syst., pages 5049–5059, 2019. 5 [2] Gabriel J Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, pages 88–97, 2009. 12 [3] Vincent Casser, Soeren Pirk, Reza Mahjourian, and Anelia Angelova. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In AAAI Conf. Artif. Intell., pages 8001–8008, 2019. 3, 13 [4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmen- tation with deep convolutional nets and fully connected crfs. In Int. Conf. Learn. Represent., pages 834–848, 2015. 2 [5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolu- tion, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., pages 834–848, 2017. 1, 2, 6, 12 [6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Eur. Conf. Comput. Vis., pages 801–818, 2018. 2 [7] Po-Yi Chen, Alexander H Liu, Yen-Cheng Liu, and Yu- Chiang Frank Wang. Towards scene understanding: Un- supervised monocular depth estimation with semantic-aware representation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2624–2632, 2019. 3 [8] Yuhua Chen, Cordelia Schmid, and Cristian Sminchis- escu. Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In Int. Conf. Comput. Vis., pages 7063–7072, 2019. 3 [9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3213–3223, 2016. 1, 2, 6 [10] Qi Dai, Vaishakh Patil, Simon Hecker, Dengxin Dai, Luc Van Gool, and Konrad Schindler. Self-supervised object mo- tion and depth estimation from video. In IEEE Conf. Com- put. Vis. Pattern Recog. Workshops, pages 1004–1005, 2020. 13 [11] Zhengyang Feng, Qianyu Zhou, Guangliang Cheng, Xin Tan, Jianping Shi, and Lizhuang Ma. Semi-supervised se- mantic segmentation via dynamic self-training and class- balanced curriculum. arXiv preprint arXiv:2004.08514, 2020. 2, 7 [12] Geoffrey French, Samuli Laine, Timo Aila, Michal Mack- iewicz, and Graham Finlayson. Semi-supervised semantic segmentation needs strong, varied perturbations. In Brit. Mach. Vis. Conf., 2020. 2, 4, 5, 7 [13] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Int. Conf. Mach. Learning, pages 1050–1059, 2016. 2 [14] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In Eur. Conf. Comput. Vis., pages 740–756, 2016. 3 [15] Golnaz Ghiasi and Charless C Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In Eur. Conf. Comput. Vis., pages 519–534, 2016. 2 [16] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conf. Comput. Vis. 
Pattern Recog., pages 270–279, 2017. 1, 3 [17] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Int. Conf. Comput. Vis., pages 3828– 3838, 2019. 1, 3, 12 [18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Adv. Neural Inform. Process. Syst., pages 2672–2680, 2014. 2 [19] Marc Górriz, Xavier Giró Nieto, Axel Carlier, and Em- manuel Faure. Cost-effective active learning for melanoma segmentation. In Adv. Neural Inform. Process. Syst. Work- shop ML4H: Machine Learning for Health, pages 1–5, 2017. 2 [20] Vitor Guizilini, Rui Hou, Jie Li, Rares Ambrus, and Adrien Gaidon. Semantically-guided representation learning for self-supervised monocular depth. In Int. Conf. Learn. Rep- resent., 2020. 1, 3 [21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 770–778, 2016. 6 [22] Yao Hu, Debing Zhang, Zhongming Jin, Deng Cai, and Xi- aofei He. Active learning via neighborhood reconstruction. In Int. Joint Conf. Artif. Intell., pages 1415–1421, 2013. 2 [23] Wei Chih Hung, Yi Hsuan Tsai, Yan Ting Liou, Yen-Yu Lin, and Ming Hsuan Yang. Adversarial learning for semi- supervised semantic segmentation. In Brit. Mach. Vis. Conf., 2018. 2, 7 [24] Rebecca Hwa. Sample selection for statistical parsing. Com- putational linguistics, pages 253–276, 2004. 2 [25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal co- variate shift. arXiv preprint arXiv:1502.03167, 2015. 12 [26] Huaizu Jiang, Gustav Larsson, Michael Maire Greg Shakhnarovich, and Erik Learned-Miller. Self- supervised relative depth learning for urban scene under- standing. In Eur. Conf. Comput. Vis., pages 19–35, 2018. 1, 2, 3 [27] Huaizu Jiang, Deqing Sun, Varun Jampani, Zhaoyang Lv, Erik Learned-Miller, and Jan Kautz. Sense: A shared en- coder network for scene-flow estimation. In Int. Conf. Com- put. Vis., pages 3195–3204, 2019. 3 [28] Jianbo Jiao, Ying Cao, Yibing Song, and Rynson Lau. Look deeper into depth: Monocular depth estimation with seman- tic booster and attention-driven loss. In Eur. Conf. Comput. Vis., pages 53–69, 2018. 3 9
  • 13. [29] Tejaswi Kasarla, Gattigorla Nagendar, Guruprasad M Hegde, Vineeth Balasubramanian, and CV Jawahar. Region-based active learning for efficient labeling in semantic segmenta- tion. In IEEE Winter Conf. Appl. of Comput. Vis., pages 1109–1117, 2019. 2 [30] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 6 [31] Marvin Klingner, Andreas Bar, and Tim Fingscheidt. Im- proved noise and attack robustness for semantic segmenta- tion by using multi-task training with self-supervised depth estimation. In IEEE Conf. Comput. Vis. Pattern Recog. Work- shops, pages 320–321, 2020. 3 [32] Marvin Klingner, Jan-Aike Termöhlen, Jonas Mikolajczyk, and Tim Fingscheidt. Self-supervised monocular depth es- timation: Solving the dynamic object problem by semantic guidance. In Eur. Conf. Comput. Vis., pages 582–600, 2020. 1, 3, 13 [33] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Fed- erico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In Int. Conf. 3D Vision, pages 239–248, 2016. 12 [34] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In IEEE Conf. Comput. Vis. Pattern Recog., pages 6874–6883, 2017. 2 [35] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recog- nition. Proceedings of the IEEE, pages 2278–2324, 1998. 1, 2 [36] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Int. Conf. Mach. Learning, 2013. 2 [37] Seokju Lee, Junsik Kim, Tae-Hyun Oh, Yongseop Jeong, Donggeun Yoo, Stephen Lin, and In So Kweon. Visuomotor understanding for representation learning of driving scenes. In Brit. Mach. Vis. Conf., 2019. 2 [38] Changsheng Li, Handong Ma, Zhao Kang, Ye Yuan, Xiao- Yu Zhang, and Guoren Wang. On deep unsupervised active learning. Int. Joint Conf. Artif. Intell., 2020. 2 [39] Changsheng Li, Xiangfeng Wang, Weishan Dong, Junchi Yan, Qingshan Liu, and Hongyuan Zha. Joint active learning with feature selection via cur matrix decomposition. IEEE Trans. Pattern Anal. Mach. Intell., pages 1382–1396, 2018. 2 [40] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3431–3440, 2015. 1, 2 [41] Radek Mackowiak, Philip Lenz, Omair Ghori, Ferran Diego, Oliver Lange, and Carsten Rother. Cereals-cost-effective region-based active learning for semantic segmentation. In Brit. Mach. Vis. Conf., 2018. 2 [42] Andrew Kachites McCallumzy and Kamal Nigamy. Employ- ing em and pool-based active learning for text classification. In Int. Conf. Mach. Learning, pages 359–367, 1998. 2 [43] Sudhanshu Mittal, Maxim Tatarchenko, and Thomas Brox. Semi-supervised semantic segmentation with high-and low- level consistency. IEEE Trans. Pattern Anal. Mach. Intell., 2019. 2, 7 [44] Hieu T Nguyen and Arnold Smeulders. Active learning using pre-clustering. In Int. Conf. Mach. Learning, page 79, 2004. 2 [45] Feiping Nie, Hua Wang, Heng Huang, and Chris Ding. Early active learning via robust representation and structured spar- sity. In Int. Joint Conf. Artif. Intell., 2013. 2 [46] Jelena Novosel, Prashanth Viswanath, and Bruno Arse- nali. Boosting semantic segmentation with multi-task self- supervised learning for autonomous driving applications. In Int. Conf. Comput. Vis. Workshops, 2019. 
[47] Viktor Olsson, Wilhelm Tranheden, Juliano Pinto, and Lennart Svensson. ClassMix: Segmentation-based data augmentation for semi-supervised learning. In IEEE Winter Conf. Appl. of Comput. Vis., pages 1369–1378, 2021. 1, 2, 4, 5, 6, 7, 13
[48] Yassine Ouali, Céline Hudelot, and Myriam Tami. Semi-supervised semantic segmentation with cross-consistency training. In IEEE Conf. Comput. Vis. Pattern Recog., pages 12674–12684, 2020. 2
[49] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2536–2544, 2016. 2
[50] Pierluigi Zama Ramirez, Matteo Poggi, Fabio Tosi, Stefano Mattoccia, and Luigi Di Stefano. Geometry meets semantics for semi-supervised monocular depth estimation. In Asian Conf. Comput. Vis., pages 298–313, 2018. 3
[51] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Int. Conf. Medical Image Computing and Computer-assisted Intervention, pages 234–241, 2015. 2, 6, 12
[52] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In Int. Conf. Learn. Represent., 2018. 2
[53] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009. 2
[54] Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks. In Conf. Empirical Methods Natural Language Processing, pages 1070–1079, 2008. 2
[55] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Annual Workshop Computational Learning Theory, pages 287–294, 1992. 2
[56] Lei Shi and Yi-Dong Shen. Diversifying convex transductive experimental design for active learning. In Int. Joint Conf. Artif. Intell., pages 1997–2003, 2016. 2
[57] Yawar Siddiqui, Julien Valentin, and Matthias Nießner. ViewAL: Active learning with viewpoint entropy for semantic segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 9433–9443, 2020. 2
[58] Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In Int. Conf. Comput. Vis., pages 5972–5981, 2019. 2
[59] Nasim Souly, Concetto Spampinato, and Mubarak Shah. Semi-supervised semantic segmentation using generative adversarial network. In Int. Conf. Comput. Vis., pages 5688–5696, 2017. 2
[60] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Adv. Neural Inform. Process. Syst., pages 1195–1204, 2017. 2, 5
[61] Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 2021. 1
[62] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In Int. Joint Conf. Artif. Intell., pages 3635–3641, 2019. 5
[63] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Int. Conf. Comput. Vis., pages 2794–2802, 2015. 2
[64] Shuai Xie, Zunlei Feng, Ying Chen, Songtao Sun, Chao Ma, and Mingli Song. DEAL: Difficulty-aware active learning for semantic segmentation. In Asian Conf. Comput. Vis., 2020. 2
[65] Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In IEEE Conf. Comput. Vis. Pattern Recog., pages 675–684, 2018. 5, 6, 12
[66] Lin Yang, Yizhe Zhang, Jianxu Chen, Siyuan Zhang, and Danny Z Chen. Suggestive annotation: A deep active learning framework for biomedical image segmentation. In Int. Conf. Medical Image Computing and Computer-assisted Intervention, pages 399–407, 2017. 2
[67] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015. 2
[68] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In Int. Conf. Mach. Learning, pages 1081–1088, 2006. 2
[69] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Int. Conf. Comput. Vis., pages 6023–6032, 2019. 1, 4, 5
[70] Lijun Zhang, Chun Chen, Jiajun Bu, Deng Cai, Xiaofei He, and Thomas S Huang. Active learning based on locally linear reconstruction. IEEE Trans. Pattern Anal. Mach. Intell., pages 2026–2038, 2011. 2
[71] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2881–2890, 2017. 2
[72] Hao Zheng, Lin Yang, Jianxu Chen, Jun Han, Yizhe Zhang, Peixian Liang, Zhuo Zhao, Chaoli Wang, and Danny Z Chen. Biomedical image segmentation via representative annotation. In AAAI Conf. Artif. Intell., pages 5901–5908, 2019. 2
[73] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1851–1858, 2017. 1, 3
[74] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In Eur. Conf. Comput. Vis., pages 36–53, 2018. 3
[75] Laurent Zwald and Sophie Lambert-Lacroix. The BerHu penalty and the grouped effect. arXiv preprint arXiv:1207.6868, 2012. 12
A. Further Implementation Details

In the following paragraphs, a more detailed description of the network architecture and the training is provided. The reference implementation is available at https://github.com/lhoyer/improving_segmentation_with_selfsupervised_depth.

Network Architecture The neural network combines a DeepLabv3 [5] encoder with one U-Net [51] decoder each for depth and for segmentation prediction. As encoder, a ResNet101 with dilated (instead of strided) convolutions in the last block is used, following [5]. Features from multiple scales are aggregated by an ASPP [5] block with dilation rates of 6, 12, and 18. Similar to U-Net [51], the decoder has five upsampling blocks with skip connections. Each upsampling block consists of a 3×3 convolution layer (except for the first block, which is the ASPP), a bilinear upsampling operation, a concatenation with the encoder features of the corresponding size (skip connection), and another 3×3 convolution layer. Both convolutional layers are followed by an ELU non-linearity. The numbers of output channels of the blocks are 256, 256, 128, 128, and 64. The last four blocks additionally have a 3×3 convolutional layer followed by a sigmoid activation attached to their output to predict the disparity at the respective scale (see the first sketch after Tab. S4). For effective multi-task learning, we additionally follow PAD-Net [65] and deploy an attention-guided multi-modal distillation module with an additional side output for semantic segmentation after the third decoder block. In experiments without multi-task learning, only the semantic segmentation decoder is used. For pose estimation, we use a lightweight ResNet18 encoder followed by four convolutions to produce the translation and the rotation in angle-axis representation, as suggested in [17].

Runtime To give an impression of the computational complexity of our architecture, we provide the training time per iteration and the inference time per image on an Nvidia Tesla P100 in Tab. S4. The values are averaged over 100 iterations or 500 images, respectively. Please note that these timings include the computational overhead of the training framework, such as logging and validation metric calculation.

Data Selection In the data selection experiment, we use a slimmed network architecture for the SDE student network with a ResNet50 backbone; 256, 128, 128, 64, and 64 decoder channels; and BatchNorm [25] in the decoder for efficiency and faster convergence. The depth student network is trained using a berHu loss [75, 33] (see the second sketch after Tab. S4). The quality of the selected subset with annotations G_A is evaluated for semantic segmentation using our default architecture and training hyperparameters.

Table S4. Training and inference time on an Nvidia Tesla P100, averaged over 100 iterations or 500 images, respectively. D-T: SDE Transfer Learning, D-M: SDE Transfer and Multi-Task Learning, P: Pseudo-Labeling, X-D: DepthMix.

D   P   X   Training Time   Inference Time
T   -   -    188 ms/it       66 ms/img
T   X   -    466 ms/it       67 ms/img
T   X   D    476 ms/it       66 ms/img
M   X   D   1215 ms/it      160 ms/img
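For concreteness, the following PyTorch sketch shows one possible implementation of a single decoder upsampling block as described above: a 3×3 convolution, bilinear upsampling, skip concatenation, a second 3×3 convolution, ELU activations, and an optional sigmoid disparity side output. This is a minimal sketch under our reading of the architecture description; the class name, argument names, and defaults are ours and may differ from the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsamplingBlock(nn.Module):
    """One decoder block: 3x3 conv -> bilinear upsample -> skip concat -> 3x3 conv.

    Hypothetical sketch of the decoder block described in Appendix A; names and
    defaults are assumptions, not the reference implementation.
    """

    def __init__(self, in_ch, skip_ch, out_ch, predict_disparity=False):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        # Optional side output predicting the disparity at this scale.
        self.disp_head = (
            nn.Conv2d(out_ch, 1, kernel_size=3, padding=1) if predict_disparity else None
        )

    def forward(self, x, skip):
        x = F.elu(self.conv1(x))
        # Upsample to the spatial size of the encoder skip features.
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        x = torch.cat([x, skip], dim=1)
        x = F.elu(self.conv2(x))
        disp = torch.sigmoid(self.disp_head(x)) if self.disp_head is not None else None
        return x, disp
```

A full decoder would stack five such blocks with 256, 256, 128, 128, and 64 output channels, with the first block replaced by the ASPP module and the last four attaching disparity side outputs at their respective scales.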
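The berHu (reverse Huber) loss [75, 33] used to train the depth student network can be sketched as follows. The threshold c = 0.2 · max|r| over the batch follows the common convention of [33]; the function name and the optional masking of invalid pixels are assumptions of this sketch.

```python
import torch

def berhu_loss(pred, target, mask=None):
    """berHu (reverse Huber) loss, sketched after the formulation in [33].

    L(r) = |r|                 if |r| <= c
         = (r^2 + c^2) / (2c)  otherwise,   with c = 0.2 * max|r| over the batch.
    """
    residual = (pred - target).abs()
    if mask is not None:
        # Keep only pixels with valid target depth.
        residual = residual[mask]
    c = 0.2 * residual.max().clamp(min=1e-6)
    loss = torch.where(residual <= c, residual, (residual ** 2 + c ** 2) / (2 * c))
    return loss.mean()
```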
B. Cross-Dataset Transfer Learning

In this section, we show that the unlabeled image sequences and the labeled segmentations can also originate from different datasets within similar visual domains. For that purpose, we train the SDE on Cityscapes sequences and learn the semi-supervised semantic segmentation on the CamVid dataset [2], which contains 367 train, 101 validation, and 233 test images with dense semantic segmentation labels for 11 classes from street scenes in Cambridge. To ensure a similar feature resolution, we upsample the CamVid images from 480 × 360 to 672 × 512 pixels and randomly crop them to a size of 512 × 512 (see the preprocessing sketch at the end of this section).

Table S5 shows that the results on CamVid are similar to our main results on Cityscapes. For 50 labeled training samples, SDE pretraining improves the mIoU by 3.60 percentage points, pseudo-labels and DepthMix by another 4.07 percentage points, and data selection by another 1.41 percentage points. In the end, our proposed method significantly outperforms ClassMix by 2.34 percentage points for 50 labeled samples and by 2.14 percentage points for 100 labeled samples. Also for the fully labeled dataset, our method improves the performance by 3.29 percentage points.
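A minimal sketch of the CamVid preprocessing referenced above is given below. The torchvision-based pipeline, the function name, and the use of nearest-neighbor interpolation for the label map (to keep class indices intact) are our assumptions, not necessarily the reference implementation.

```python
import random
import torchvision.transforms.functional as TF

def preprocess_camvid(image, label):
    """Upsample a CamVid image/label pair to 672x512 and take a random 512x512 crop.

    Sketch only; interpolation modes and crop sampling are assumptions.
    """
    # torchvision expects (height, width); CamVid is 480x360 (W x H) -> 672x512.
    image = TF.resize(image, [512, 672], interpolation=TF.InterpolationMode.BILINEAR)
    label = TF.resize(label, [512, 672], interpolation=TF.InterpolationMode.NEAREST)
    # Random 512x512 crop, applied identically to image and label; since the
    # resized height is already 512, only the horizontal offset is sampled.
    left = random.randint(0, 672 - 512)
    image = TF.crop(image, top=0, left=left, height=512, width=512)
    label = TF.crop(label, top=0, left=left, height=512, width=512)
    return image, label
```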
C. Further Example Predictions

Further examples for semantic segmentation and SDE are shown in Fig. S6. In general, the same observations as in the main paper can be made. Our method provides clearer segmentation contours for objects that are bordered by pronounced depth discontinuities, such as pole, traffic sign, or traffic light. We also show improved differentiation between similar classes such as truck, bus, and train. On the downside, SDE sometimes fails for cars driving directly in front of the camera (see the 7th row in Fig. S6), as they violate the reconstruction assumptions. Those cars are observed at the exact same location across the image sequence and cannot be correctly reconstructed during SDE training, even with correct depth and pose estimates. However, this failure on cars moving with the camera does not hinder the transfer of SDE-learned features to semantic segmentation, but it can cause problems with DepthMix (see Section D).

Table S5. Performance on the CamVid test set (mIoU in %, standard deviation over 3 random seeds). The SDE is trained on Cityscapes sequences. DT: SDE Transfer Learning, XD: DepthMix, S: Data Selection.

# Labeled        50                     100                    367 (Full)
Baseline         59.16 ±1.79            63.05 ±0.59            68.18 ±0.13
Ours (DT)        62.75 ±2.32 (+3.60)    66.19 ±0.96 (+3.15)    70.45 ±0.35 (+2.27)
ClassMix [47]    65.89 ±0.33 (+6.73)    67.48 ±1.02 (+4.43)    -
Ours (DT+XD)     66.82 ±1.16 (+7.66)    68.91 ±0.62 (+5.86)    71.46 ±0.22 (+3.29)
Ours (DT+XD+S)   68.23 ±0.39 (+9.07)    69.62 ±0.64 (+6.57)    -

Figure S6. Further example predictions for 100 annotated training samples, including the self-supervised disparity estimate of the multi-task learning framework.

D. DepthMix Real-World Examples

In Fig. S7, we show examples of DepthMix applied to Cityscapes crops. Generally, it can be seen that DepthMix works well in most cases. The self-supervised depth estimates allow occlusions to be modeled correctly, and the produced synthetic samples have a realistic appearance (a schematic sketch of the mixing operation is given below).

In Fig. S8, we show a selection of typical failure cases of DepthMix. First, the SDE can be inaccurate for dynamic objects (see Sec. C), which can cause an inaccurate structure within the mixed image (Fig. S8 a, b, and c). However, this type of failure case is common in ClassMix, and its frequency is greatly reduced with DepthMix. A remedy might be SDE extensions that incorporate the motion of dynamic objects [3, 10, 32]. Second, in some cases, the SDE can be imprecise and the depth discontinuities do not appear at the same location as the class border. This can cause artifacts in the mixed image (Fig. S8 d and e) but also in the mixed segmentation (Fig. S8 e: sky within the building). Note that the same can happen for ClassMix when using pseudo-labels to create the mix mask.
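To make the figures easier to interpret, the following sketch illustrates the mixing operation as it is visualized in Fig. S7: a binary mask M selects the pixels of sample i whose SDE estimate indicates that they lie in front of the scene in sample j, and image, depth, and segmentation are blended with the same mask. The comparison in disparity space, the tolerance eps, and the function signature are assumptions of this illustration, not necessarily the exact formulation of DepthMix in the main paper.

```python
import torch

def depthmix(img_i, img_j, disp_i, disp_j, seg_i, seg_j, eps=0.01):
    """Blend two samples with a geometry-aware mask (sketch of the DepthMix idea).

    The mask selects pixels from sample i that are estimated to be closer to the
    camera (larger disparity) than the scene in sample j; the tolerance `eps` and
    the disparity-space comparison are assumptions of this illustration.
    """
    # M = 1 where sample i occludes sample j according to the SDE estimates.
    mask = (disp_i > disp_j + eps).float()          # shape (B, 1, H, W)
    mixed_img = mask * img_i + (1.0 - mask) * img_j
    mixed_disp = mask * disp_i + (1.0 - mask) * disp_j
    # Segmentation maps hold integer class indices, so mix with the boolean mask.
    mixed_seg = torch.where(mask.bool().squeeze(1), seg_i, seg_j)
    return mixed_img, mixed_disp, mixed_seg
```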
Figure S7. DepthMix applied to Cityscapes crops. From left to right, each row shows the source images with their SDE estimates, the mixed image I' overlaid with the border of the mix mask M in blue/orange depending on the adjacent source image (i: orange, j: blue), the mixed image I' without visual guidance, the mixed depth D', and the mixed segmentation S'. For simplicity, the source segmentations for the mixed segmentation S' originate from the ground truth labels.
Figure S8. DepthMix failure cases. From left to right, each row shows the source images with their SDE estimates, the mixed image I' overlaid with the border of the mix mask M in blue/orange depending on the adjacent source image (i: orange, j: blue), the mixed image I' without visual guidance, the mixed depth D', and the mixed segmentation S'. For simplicity, the source segmentations for the mixed segmentation S' originate from the ground truth labels.