This paper has been submitted for publication on November 15, 2016.
Learning from Simulated and Unsupervised Images through Adversarial
Training
Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, Russ Webb
Apple Inc.
{a_shrivastava, tpf, otuzel, jsusskind, wenda_wang, rwebb}@apple.com
Abstract
With recent progress in graphics, it has become more
tractable to train models on synthetic images, poten-
tially avoiding the need for expensive annotations. How-
ever, learning from synthetic images may not achieve the
desired performance due to a gap between synthetic and
real image distributions. To reduce this gap, we pro-
pose Simulated+Unsupervised (S+U) learning, where
the task is to learn a model to improve the realism of
a simulator’s output using unlabeled real data, while
preserving the annotation information from the simula-
tor. We develop a method for S+U learning that uses
an adversarial network similar to Generative Adversar-
ial Networks (GANs), but with synthetic images as in-
puts instead of random vectors. We make several key
modifications to the standard GAN algorithm to pre-
serve annotations, avoid artifacts and stabilize training:
(i) a ‘self-regularization’ term, (ii) a local adversarial
loss, and (iii) updating the discriminator using a history
of refined images. We show that this enables genera-
tion of highly realistic images, which we demonstrate
both qualitatively and with a user study. We quantita-
tively evaluate the generated images by training mod-
els for gaze estimation and hand pose estimation. We
show a significant improvement over using synthetic im-
ages, and achieve state-of-the-art results on the MPI-
IGaze dataset without any labeled real data.
1. Introduction
Large labeled training datasets are becoming increas-
ingly important with the recent rise in high capacity deep
neural networks [4, 18, 44, 1, 15]. However, labeling
such large datasets is expensive and time-consuming.
Thus the idea of training on synthetic instead of real im-
ages has become appealing because the annotations are
automatically available. Human pose estimation with
Kinect [32] and, more recently, a plethora of other tasks
have been tackled using synthetic data [40, 39, 26, 31].
Figure 1. Simulated+Unsupervised (S+U) learning. The task is
to learn a model that improves the realism of synthetic images
from a simulator using unlabeled real data, while preserving
the annotation information.
However, learning from synthetic images can be prob-
lematic due to a gap between synthetic and real im-
age distributions – synthetic data is often not realistic
enough, leading the network to learn details only present
in synthetic images and fail to generalize well on real
images. One solution to closing this gap is to improve
the simulator. However, increasing the realism is often
computationally expensive, the renderer design takes a
lot of hard work, and even top renderers may still fail to
model all the characteristics of real images. This lack
of realism may cause models to overfit to ‘unrealistic’
details in the synthetic images.
In this paper, we propose Simulated+Unsupervised
(S+U) learning, where the goal is to improve the real-
ism of synthetic images from a simulator using unla-
beled real data. The improved realism enables the train-
ing of better machine learning models on large datasets
without any data collection or human annotation effort.
In addition to adding realism, S+U learning should pre-
serve annotation information for training of machine
learning models – e.g. the gaze direction in Figure 1
should be preserved. Moreover, since machine learning
models can be sensitive to artifacts in the synthetic data,
S+U learning should generate images without artifacts.
We develop a method for S+U learning, which we
term SimGAN, that refines synthetic images from a sim-
ulator using a neural network which we call the ‘refiner
network’. Figure 2 gives an overview of our method: a
synthetic image is generated with a black box simulator
and is refined using the refiner network. To add real-
ism – the first requirement of an S+U learning algorithm
– we train our refiner network using an adversarial loss,
similar to Generative Adversarial Networks (GANs) [7],
such that the refined images are indistinguishable from
real ones using a discriminative network. Second, to
preserve the annotations of synthetic images, we com-
plement the adversarial loss with a self-regularization
loss that penalizes large changes between the synthetic
and refined images. Moreover, we propose to use a
fully convolutional neural network that operates on a
pixel level and preserves the global structure, rather than
holistically modifying the image content as in e.g. a fully
connected encoder network. Third, the GAN framework
requires training two neural networks with competing
goals, which is known to be unstable and tends to in-
troduce artifacts [29]. To avoid drifting and introduc-
ing spurious artifacts while attempting to fool a single
stronger discriminator, we limit the discriminator’s re-
ceptive field to local regions instead of the whole image,
resulting in multiple local adversarial losses per image.
Moreover, we introduce a method for improving the sta-
bility of training by updating the discriminator using a
history of refined images rather than the ones from the
current refiner network.
Contributions:
1. We propose S+U learning that uses unlabeled real
data to refine the synthetic images generated by a
simulator.
2. We train a refiner network to add realism to syn-
thetic images using a combination of an adversarial
loss and a self-regularization loss.
3. We make several key modifications to the GAN
training framework to stabilize training and prevent
the refiner network from producing artifacts.
4. We present qualitative, quantitative, and user study
experiments showing that the proposed framework
significantly improves the realism of the simulator
output. We achieve state-of-the-art results, without
any human annotation effort, by training deep neu-
ral networks on the refined output images.
1.1. Related Work
The GAN framework learns two networks (a gener-
ator and a discriminator) with competing losses. The
Figure 2. Overview of SimGAN. We refine the output of
the simulator with a refiner neural network, R, that mini-
mizes the combination of a local adversarial loss and a ‘self-
regularization’ term. The adversarial loss fools a discrimi-
nator network, D, that classifies an image as real or refined.
The self-regularization term minimizes the image difference
between the synthetic and the refined images. This preserves
the annotation information (e.g. gaze direction), making the
refined images useful for training a machine learning model.
The refiner network R and the discriminator network D are
updated alternately.
goal of the generator network is to map a random vector
to a realistic image, whereas the goal of the discrimina-
tor is to distinguish the generated and the real images.
The GAN framework was first introduced by Goodfel-
low et al. [7] to generate visually realistic images and,
since then, many improvements and interesting applica-
tions have been proposed [29]. Wang and Gupta [38]
use a Structured GAN to learn surface normals and then
combine it with a Style GAN to generate natural indoor
scenes. Im et al. [12] propose a recurrent generative
model trained using adversarial training. The recently
proposed iGAN [45] enables users to change the im-
age interactively on a natural image manifold. CoGAN
by Liu et al. [19] uses coupled GANs to learn a joint
distribution over images from multiple modalities with-
out requiring tuples of corresponding images, achiev-
ing this by a weight-sharing constraint that favors the
joint distribution solution. Chen et al. [2] propose Info-
GAN, an information-theoretic extension of GAN, that
allows learning of meaningful representations. Tuzel et
al. [36] tackled image superresolution for face images
with GANs. Li and Wand [17] propose a Markovian
GAN for efficient texture synthesis. Lotter et al. [20] use
adversarial loss in an LSTM network for visual sequence
prediction. Yu et al. [41] propose the SeqGAN frame-
work that uses GANs for reinforcement learning. Many
recent works have explored related problems in the do-
main of generative models, such as PixelRNN [37] that
predicts pixels sequentially with an RNN with a softmax
loss. The generative networks focus on generating im-
ages using a random noise vector; thus, in contrast to our
method, the generated images do not have any annota-
tion information that can be used for training a machine
learning model.
Many efforts have explored using synthetic data for
various prediction tasks, including gaze estimation [40],
text detection and classification in RGB images [8, 14],
font recognition [39], object detection [9, 24], hand
pose estimation in depth images [35, 34], scene recog-
nition in RGB-D [10], semantic segmentation of urban
scenes [28], and human pose estimation [23, 3, 16, 13,
25, 27]. Gaidon et al. [5] show that pre-training a deep
neural network on synthetic data leads to improved per-
formance. Our work is complementary to these ap-
proaches, where we improve the realism of the simulator
using unlabeled real data.
Ganin and Lempitsky [6] use synthetic data in a
domain adaptation setting where the learned features
are invariant to the domain shift between synthetic and
real images. Wang et al. [39] train a Stacked Con-
volutional Auto-Encoder on synthetic and real data to
learn the lower-level representations of their font detec-
tor ConvNet. Zhang et al. [42] learn a Multichannel Au-
toencoder to reduce the domain shift between real and
synthetic data. In contrast to classical domain adaptation
methods that adapt the features with respect to a specific
prediction task, we bridge the gap between image dis-
tributions through adversarial training. This approach
allows us to generate very realistic images which can be
used to train any machine learning model, potentially for
multiple tasks.
2. S+U Learning with SimGAN
The goal of Simulated+Unsupervised learning is to
use a set of unlabeled real images yi ∈ Y to learn a
refiner Rθ(x) that refines a synthetic image x, where θ
are the function parameters. Let the refined image be
denoted by ˜x, then
˜x := Rθ(x).
The key requirement for S+U learning is that the re-
fined image ˜x should look like a real image in appear-
ance while preserving the annotation information from
the simulator.
To this end, we propose to learn θ by minimizing a
combination of two losses:
$$\mathcal{L}_R(\theta) = \sum_i \ell_{\text{real}}(\theta; \tilde{x}_i, \mathcal{Y}) + \lambda\,\ell_{\text{reg}}(\theta; \tilde{x}_i, x_i), \quad (1)$$
where xi is the i-th synthetic training image, and ˜xi is
the corresponding refined image. The first part of the
cost, ℓreal, adds realism to the synthetic images, while the
second part, ℓreg, preserves the annotation information
by minimizing the difference between the synthetic and
the refined images. In the following sections, we expand
this formulation and provide an algorithm to optimize
for θ.
2.1. Adversarial Loss with Self-Regularization
To add realism to the synthetic image, we need to
bridge the gap between the distributions of synthetic and
real images. An ideal refiner will make it impossible to
classify a given image as real or refined with high confi-
dence. This motivates the use of an adversarial discrim-
inator network, Dφ, that is trained to classify images
as real vs refined, where φ are the parameters of
the discriminator network. The adversarial loss used in
training the refiner network, R, is responsible for ‘fool-
ing’ the network D into classifying the refined images
as real. Following the GAN approach [7], we model this
as a two-player minimax game, and update the refiner
network, Rθ, and the discriminator network, Dφ, alter-
nately. Next, we describe this intuition more precisely.
The discriminator network updates its parameters by
minimizing the following loss:
$$\mathcal{L}_D(\phi) = -\sum_i \log\bigl(D_\phi(\tilde{x}_i)\bigr) - \sum_j \log\bigl(1 - D_\phi(y_j)\bigr). \quad (2)$$
This is equivalent to the cross-entropy error for a two-class
classification problem where Dφ(.) is the probability of
the input being a synthetic image, and 1 − Dφ(.) that of
a real one. We implement Dφ as a ConvNet whose last
a real one. We implement Dφ as a ConvNet whose last
layer outputs the probability of the sample being a re-
fined image. For training this network, each mini-batch
consists of randomly sampled refined synthetic images
˜xi’s and real images yj’s. The target labels for the cross-
entropy loss layer are 0 for every yj, and 1 for every ˜xi.
Then φ for a mini-batch is updated by taking a stochas-
tic gradient descent (SGD) step on the mini-batch loss
gradient.
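To make the update in (2) concrete, the following is a minimal PyTorch sketch of the discriminator loss, assuming a discriminator that returns per-image two-class logits (channel 1 for the 'refined' class, matching the target labels in the text). The function name and tensor shapes are illustrative placeholders, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def discriminator_loss(d_logits_refined, d_logits_real):
    """Mini-batch version of L_D in Eq. (2) as a two-class cross-entropy.

    Both inputs are assumed to be (b, 2) logits from D_phi, with channel 1
    the 'refined' class, matching the target labels described in the text
    (1 for every refined image x_tilde_i, 0 for every real image y_j).
    """
    targets_refined = torch.ones(d_logits_refined.size(0), dtype=torch.long)
    targets_real = torch.zeros(d_logits_real.size(0), dtype=torch.long)
    return (F.cross_entropy(d_logits_refined, targets_refined) +
            F.cross_entropy(d_logits_real, targets_real))
```

A global, per-image discriminator output is assumed here for brevity; the local, per-patch variant described in Section 2.2 replaces this with a cross-entropy summed over a probability map.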
In our implementation, the realism loss function ℓreal
in (1) uses the trained discriminator D as follows:
$$\ell_{\text{real}}(\theta; \tilde{x}_i, \mathcal{Y}) = -\sum_i \log\bigl(1 - D_\phi(R_\theta(x_i))\bigr). \quad (3)$$
By minimizing this loss function, the refiner makes it
hard for the discriminator to classify the refined images as
synthetic. In addition to generating realistic images, the
refiner network should preserve the annotation informa-
tion of the simulator. For example, for gaze estimation
the learned transformation should not change the gaze
direction, and for hand pose estimation the location of
the joints should not change. This is an essential ingredi-
ent to enable training a machine learning model that uses
the refined images with the simulator’s annotations. To
enforce this, we propose using a self-regularization loss
that minimizes the image difference between the syn-
thetic and the refined image. Thus, the overall refiner
Algorithm 1: Adversarial training of refiner net-
work Rθ
Input: Sets of synthetic images xi ∈ X , and real
images yj ∈ Y, max number of steps (T),
number of discriminator network updates
per step (Kd), number of generative
network updates per step (Kg).
Output: ConvNet model Rθ.
for t = 1, . . . , T do
for k = 1, . . . , Kg do
1. Sample a mini-batch of synthetic images
xi.
2. Update θ by taking an SGD step on the
mini-batch loss LR(θ) in (4).
end
for k = 1, . . . , Kd do
1. Sample a mini-batch of synthetic images
xi, and real images yj.
2. Compute ˜xi = Rθ(xi) with current θ.
3. Update φ by taking an SGD step on the
mini-batch loss LD(φ) in (2).
end
end
Figure 3. Illustration of local adversarial loss. The discrimina-
tor network outputs a w × h probability map. The adversarial
loss function is the sum of the cross-entropy losses over the
local patches.
loss function (1) used in our implementation is:
$$\mathcal{L}_R(\theta) = -\sum_i \log\bigl(1 - D_\phi(R_\theta(x_i))\bigr) + \lambda\,\lVert R_\theta(x_i) - x_i \rVert_1, \quad (4)$$
where ‖·‖1 is the ℓ1 norm. We implement Rθ as a fully con-
volutional neural net without striding or pooling. This
modifies the synthetic image on a pixel level, rather
than holistically modifying the image content as in e.g.
a fully connected encoder network, and preserves the
global structure and the annotations. We learn the refiner
and discriminator parameters by minimizing LR(θ) and
LD(φ) alternately. While updating the parameters of
Rθ, we keep φ fixed, and while updating Dφ, we fix θ.
We summarize this training procedure in Algorithm 1.
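The sketch below puts (4) and Algorithm 1 together in PyTorch, under the same assumption as the earlier sketch (a global, per-image discriminator output; the history buffer of Section 2.3 is omitted). Names such as `refiner_loss` and `train_simgan` are placeholders, and the losses are averaged over the mini-batch rather than summed.

```python
from itertools import cycle

import torch
import torch.nn.functional as F


def refiner_loss(discriminator, refined, synthetic, lam):
    """L_R in Eq. (4): adversarial realism term plus lambda-weighted L1 self-regularization.

    The discriminator is assumed to return per-image logits of shape (b, 2),
    with channel 1 the 'refined' class, so channel 0 plays the role of
    1 - D_phi(.) in Eq. (3).
    """
    log_probs = F.log_softmax(discriminator(refined), dim=1)
    l_real = -log_probs[:, 0].mean()               # -log(1 - D_phi(R_theta(x))), batch-averaged
    l_reg = torch.abs(refined - synthetic).mean()  # ||R_theta(x) - x||_1, batch-averaged
    return l_real + lam * l_reg


def train_simgan(refiner, discriminator, r_opt, d_opt,
                 synthetic_loader, real_loader,
                 steps, k_g, k_d, lam):
    """Alternating updates of R_theta and D_phi as in Algorithm 1."""
    # cycle() keeps the sketch simple; a real training loop would re-create the iterators.
    synthetic_iter, real_iter = cycle(synthetic_loader), cycle(real_loader)
    for _ in range(steps):
        for _ in range(k_g):                       # K_g refiner updates (phi held fixed)
            x = next(synthetic_iter)
            r_opt.zero_grad()
            loss_r = refiner_loss(discriminator, refiner(x), x, lam)
            loss_r.backward()
            r_opt.step()
        for _ in range(k_d):                       # K_d discriminator updates (theta held fixed)
            x, y = next(synthetic_iter), next(real_iter)
            with torch.no_grad():
                refined = refiner(x)               # x_tilde = R_theta(x) with the current theta
            d_opt.zero_grad()
            t_refined = torch.ones(refined.size(0), dtype=torch.long)
            t_real = torch.zeros(y.size(0), dtype=torch.long)
            loss_d = (F.cross_entropy(discriminator(refined), t_refined) +
                      F.cross_entropy(discriminator(y), t_real))
            loss_d.backward()
            d_opt.step()
```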
2.2. Local Adversarial Loss
Another key requirement for the refiner network is
that it should learn to model the real image characteris-
tics without introducing any artifacts. When we train a
Figure 4. Illustration of using a history of refined images. See
text for details.
single strong discriminator network, the refiner network
tends to over-emphasize certain image features to fool
the current discriminator network, leading to drifting
and producing artifacts. A key observation is that any
local patch sampled from the refined image should
have similar statistics to a real image patch. Therefore,
rather than defining a global discriminator network, we
can define a discriminator network that classifies all local
image patches separately. This not only limits the re-
ceptive field, and hence the capacity of the discriminator
network, but also provides many samples per image for
learning the discriminator network. This also improves
training of the refiner network because we have multiple
‘realism loss’ values per image.
In our implementation, we design the discriminator
D to be a fully convolutional network that outputs a w ×
h probability map of patches belonging to the
fake class, where w × h is the number of local patches
in the image. While training the refiner network, we sum
the cross-entropy loss values over w × h local patches,
as illustrated in Figure 3.
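Assuming the discriminator emits a (b, 2, h, w) map of per-patch logits, the local adversarial loss is just a cross-entropy summed over the w × h patches. The sketch below is one possible PyTorch rendering under that assumption, not the authors' code.

```python
import torch
import torch.nn.functional as F


def local_adversarial_loss(patch_logits, target_is_refined):
    """Cross-entropy summed over the w x h patch map of Figure 3.

    patch_logits: (b, 2, h, w) per-patch logits from the fully convolutional
    discriminator, with channel 1 the 'refined' class. target_is_refined
    selects the label applied to every patch of every image in the batch.
    For the discriminator update, call this with True on refined batches and
    False on real batches; for the refiner's realism term (Eq. 3), call it on
    refined images with target_is_refined=False.
    """
    b, _, h, w = patch_logits.shape
    label = 1 if target_is_refined else 0
    targets = torch.full((b, h, w), label, dtype=torch.long)
    # F.cross_entropy handles the extra spatial dimensions; summing and
    # dividing by b gives the per-image sum over the w x h local patches.
    return F.cross_entropy(patch_logits, targets, reduction='sum') / b
```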
2.3. Updating Discriminator using a History of
Refined Images
Another problem of adversarial training is that the
discriminator network only focuses on the latest refined
images. This may cause (i) divergence of the adversar-
ial training, and (ii) the refiner network re-introducing
the artifacts that the discriminator has forgotten about.
Any refined image generated by the refiner network at
any time during the entire training procedure is a ‘fake’
image for the discriminator. Hence, the discriminator
should be able to classify all these images as fake. Based
on this observation, we introduce a method to improve
the stability of adversarial training by updating the dis-
criminator using a history of refined images, rather than
only the ones in the current mini-batch. We slightly
modify Algorithm 1 to have a buffer of refined images
generated by previous networks. Let B be the size of the
buffer and b be the mini-batch size used in Algorithm 1.
Figure 5. Example output of SimGAN for the UnityEyes gaze estimation dataset [40]. (Left) real images from MPIIGaze [43]. Our
refiner network does not use any label information from the MPIIGaze dataset at training time. (Right) refinement results on UnityEyes.
The skin texture and the iris region in the refined synthetic images are qualitatively significantly more similar to the real images
than to the synthetic images. More examples are included in the supplementary material.
Figure 6. A ResNet block with two n×n convolutional layers,
each with f feature maps.
At each iteration of discriminator training, we compute
the discriminator loss function by sampling b/2 images
from the current refiner network, and sampling an addi-
tional b/2 images from the buffer to update parameters
φ. We keep the size of the buffer, B, fixed. After each
training iteration, we randomly replace b/2 samples in
the buffer with the newly generated refined images. This
procedure is illustrated in Figure 4.
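A minimal sketch of such a history buffer is shown below (PyTorch tensors assumed; the class name and the behavior while the buffer is still filling are our own choices, not taken from the paper).

```python
import random
import torch


class ImageHistoryBuffer:
    """Buffer of refined images produced by previous refiner versions (Section 2.3).

    Half of each discriminator mini-batch comes from the current refiner and
    half from this buffer; after each step, b/2 buffer entries are replaced at
    random by newly refined images, keeping the buffer size B fixed.
    """

    def __init__(self, capacity):
        self.capacity = capacity          # B in the text
        self.images = []                  # list of (C, H, W) tensors

    def sample_and_update(self, refined_batch):
        """Return a mini-batch for D: b/2 current images plus b/2 from history."""
        half = refined_batch.size(0) // 2
        new_images = list(refined_batch.detach())   # detach: no gradients flow through the buffer
        if len(self.images) + len(new_images) <= self.capacity:
            # Fill phase: store the new images and train on the current batch as-is.
            self.images.extend(new_images)
            return refined_batch
        idx = random.sample(range(len(self.images)), half)
        history = torch.stack([self.images[i] for i in idx])
        mixed = torch.cat([refined_batch[:half], history], dim=0)
        # Randomly replace b/2 buffer entries with newly generated refined images.
        for slot, j in zip(idx, range(half)):
            self.images[slot] = new_images[j]
        return mixed
```

In the discriminator step of Algorithm 1, the refined batch would be passed through this buffer before computing LD(φ).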
3. Experiments
We evaluate our method for appearance-based gaze
estimation in the wild on the MPIIGaze dataset [40, 43],
and hand pose estimation on the NYU hand pose dataset
of depth images [35]. We use a fully convolutional refiner
network with ResNet blocks (Figure 6) for all our exper-
iments.
3.1. Appearance-based Gaze Estimation
Gaze estimation is a key ingredient for many human
computer interaction (HCI) tasks. However, estimat-
ing the gaze direction from an eye image is challeng-
ing, especially when the image is of low quality, e.g.
from a laptop or a mobile phone camera – annotating the
eye images with a gaze direction vector is challenging
even for humans. Therefore, to generate large amounts
of annotated data, several recent approaches [40, 43]
train their models on large amounts of synthetic data.
Here, we show that training with the refined synthetic
images generated by SimGAN significantly outperforms
the state-of-the-art for this task.
The gaze estimation dataset consists of 1.2M syn-
thetic images from eye gaze synthesizer UnityEyes [40]
and 214K real images from the MPIIGaze dataset [43]
– samples shown in Figure 5. MPIIGaze is a very chal-
lenging eye gaze estimation dataset captured under ex-
treme illumination conditions. For UnityEyes we use a
single generic rendering environment to generate train-
ing data without any dataset-specific targeting.
Qualitative Results: Figure 5 shows examples of syn-
thetic, real and refined images from the eye gaze dataset.
As shown, we observe a significant qualitative improve-
ment of the synthetic images: SimGAN successfully
captures the skin texture, sensor noise and the appear-
ance of the iris region in the real images. Note that our
method preserves the annotation information (gaze di-
rection) while improving the realism.
‘Visual Turing Test’: To quantitatively evaluate the
visual quality of the refined images, we designed a sim-
ple user study where subjects were asked to classify
images as real or refined synthetic. Each subject was
shown a random selection of 50 real images and 50 re-
fined images in a random order, and was asked to label
the images as either real or refined. The subjects were
constantly shown 20 examples of real and refined im-
ages while performing the task. The subjects found it
very hard to tell the difference between the real images
and the refined images. In our aggregate analysis, 10
subjects chose the correct label 517 times out of 1000
trials (p = 0.148), which is not significantly better than
chance. Table 1 shows the confusion matrix. In con-
trast, when testing on original synthetic images vs real
images, we showed 10 real and 10 synthetic images per
subject, and the subjects chose correctly 162 times out
of 200 trials (p ≤ 10⁻⁸), which is significantly better
than chance.
Quantitative Results: We train a simple convolu-
tional neural network (CNN) similar to [43] to predict
the eye gaze direction (encoded by a 3-dimensional vec-
tor for x, y, z) with l2 loss. We train on UnityEyes and
test on MPIIGaze. Figure 7 and Table 2 compare the
performance of a gaze estimation CNN trained on syn-
thetic data to that of another CNN trained on refined
                     Selected as real   Selected as synt
Ground truth real          224                276
Ground truth synt          207                293
Table 1. Results of the ‘Visual Turing test’ user study for clas-
sifying real vs refined images. Subjects were asked to dis-
tinguish between refined synthetic images (output from our
method) and real images (from MPIIGaze). The average hu-
man classification accuracy was 51.7%, demonstrating that the
automatically generated refined images are visually very hard
to distinguish from real images.
[Figure 7 plot: percentage of images (y-axis) vs. distance from ground truth in degrees (x-axis), with curves for Synthetic Data, Synthetic Data 4x, Refined Synthetic Data, and Refined Synthetic Data 4x.]
Figure 7. Quantitative results for appearance-based gaze esti-
mation on the MPIIGaze dataset with real eye images. The
plot shows cumulative curves as a function of degree error as
compared to the ground truth eye gaze direction, for differ-
ent numbers of training examples of synthetic and refined syn-
thetic data. Gaze estimation using the refined images instead
of the synthetic images results in significantly improved per-
formance.
synthetic data, the output of SimGAN. We observe a
large improvement in performance from training on the
SimGAN output, a 22.3% absolute percentage improve-
ment. We also observe a large improvement from train-
ing on more training data – here 4x refers to 100% of the
training dataset. The quantitative evaluation confirms
the value of the qualitative improvements observed in
Figure 5, and shows that machine learning models gen-
eralize significantly better using SimGAN.
Table 3 shows a comparison to the state-of-the-art.
Training the CNN on the refined images outperforms the
state-of-the-art on the MPIIGaze dataset, with a relative
improvement of 21%. This large improvement shows
the practical value of our method in many HCI tasks.
Implementation Details: The refiner network, Rθ, is
a residual network (ResNet) [11]. Each ResNet block
consists of two convolutional layers containing 64 fea-
ture maps as shown in Figure 6. An input image of size
55 × 35 is convolved with 3 × 3 filters that output 64
Training data % of images within d
Synthetic Data 62.3
Synthetic Data 4x 64.9
Refined Synthetic Data 69.4
Refined Synthetic Data 4x 87.2
Table 2. Comparison of a gaze estimator trained on synthetic
data and the output of SimGAN. The results are at distance
d = 7 degrees from ground truth. Training on the refined
synthetic output of SimGAN outperforms training on synthetic
data by 22.3%, without requiring supervision for the real data.
Method R/S Error
Support Vector Regression (SVR) [30] R 16.5
Adaptive Linear Regression (ALR) [21] R 16.4
Random Forest (RF) [33] R 15.4
kNN with UT Multiview [43] R 16.2
CNN with UT Multiview [43] R 13.9
k-NN with UnityEyes [40] S 9.9
CNN with UnityEyes Synthetic Images S 11.2
CNN with UnityEyes Refined Images S 7.8
Table 3. Comparison of SimGAN to the state-of-the-art on the
MPIIGaze dataset of real eyes. The second column indicates
whether the methods are trained on Real/Synthetic data. The
error is the mean eye gaze estimation error in degrees. Train-
ing on refined images results in a 2.1 degree improvement, a
relative 21% improvement compared to the state-of-the-art.
feature maps. The output is passed through 4 ResNet
blocks. The output of the last ResNet block is passed
to a 1 × 1 convolutional layer producing 1 feature map
corresponding to the refined synthetic image.
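A minimal PyTorch sketch of this refiner architecture follows; the padding choices and the absence of a final nonlinearity are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn


class ResnetBlock(nn.Module):
    """Figure 6: two n x n convolutions with f feature maps and a skip connection."""
    def __init__(self, features=64, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.block = nn.Sequential(
            nn.Conv2d(features, features, kernel_size, padding=pad),
            nn.ReLU(inplace=True),
            nn.Conv2d(features, features, kernel_size, padding=pad),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))


class Refiner(nn.Module):
    """Fully convolutional refiner R_theta for 55 x 35 eye images:
    a 3x3 convolution to 64 feature maps, 4 ResNet blocks, and a
    1x1 convolution back to 1 output channel."""
    def __init__(self, channels=1, features=64, num_blocks=4):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [ResnetBlock(features) for _ in range(num_blocks)]
        layers += [nn.Conv2d(features, channels, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

For the hand pose experiments, the same structure is used with 7 × 7 filters and 10 ResNet blocks (see the implementation details in Section 3.2).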
The discriminator network, Dφ, contains 5 con-
volution layers and 2 max-pooling layers as follows:
(1) Conv3x3, stride=2, feature maps=96, (2) Conv3x3,
stride=2, feature maps=64, (3) MaxPool3x3, stride=1,
(4) Conv3x3, stride=1, feature maps=32, (5) Conv1x1,
stride=1, feature maps=32, (6) Conv1x1, stride=1, fea-
ture maps=2, (7) Softmax.
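The layer list above translates into roughly the following PyTorch module; the ReLU nonlinearities and padding values are assumptions not stated in the paper, and the 2-channel output is kept as logits so that a softmax (or the patch-wise cross-entropy of Section 2.2) can be applied afterwards.

```python
import torch.nn as nn


class Discriminator(nn.Module):
    """Fully convolutional D_phi for gaze images, following the layer list above.

    The 2-channel output is a per-patch logit map; a softmax over the channel
    dimension yields the w x h probability map of Figure 3.
    """
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 96, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(96, 64, 3, stride=2, padding=1),       nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(64, 32, 3, stride=1, padding=1),       nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 1),                            nn.ReLU(inplace=True),
            nn.Conv2d(32, 2, 1),   # per-patch logits; softmax applied in the loss
        )

    def forward(self, x):
        return self.net(x)
```

Because there is no fully connected layer, the output retains spatial extent, which is exactly what the local adversarial loss consumes.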
Our adversarial network is fully convolutional, and
has been designed such that the receptive fields of the
last layer neurons in Rθ and Dφ are similar. We first
train the Rθ network with just self-regularization loss
for 1,000 steps, and Dφ for 200 steps. Then, for each
update of Dφ, we update Rθ twice, i.e. Kd is set to 1,
and Kg is set to 50 in Algorithm 1.
The eye gaze estimation network is similar to [43],
with some changes to enable it to better exploit our
large synthetic dataset. The input is a 35 × 55
grayscale image that is passed through 5 convolu-
tional layers followed by 3 fully connected layers,
the last one encoding the 3-dimensional gaze vector:
(1) Conv3x3, feature maps=32, (2) Conv3x3, feature
maps=32, (3) Conv3x3, feature maps=64, (4) Max-
Pool3x3, stride=2, (5) Conv3x3, feature maps=80,
(6) Conv3x3, feature maps=192, (7) MaxPool2x2,
Figure 8. Importance of using a local adversarial loss. (Left)
an example image that has been generated with a standard
‘global’ adversarial loss on the whole image. The noise around
the edge of the hand contains obvious unrealistic depth bound-
ary artifacts. (Right) the same image generated with a local
adversarial loss that looks significantly more realistic.
Figure 9. Using a history of refined images for updating the
discriminator. (Left) synthetic images; (middle) result of us-
ing the history of refined images; (right) result without using
a history of refined images (instead using only the most re-
cent refined images). We observe obvious unrealistic artifacts,
especially around the corners of the eyes.
stride=2, (8) FC9600, (9) FC1000, (10) FC3, (11) Eu-
clidean loss. All networks are trained with a constant learning rate of 0.001 and a batch size of 512, until the validation error converges.
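Read literally, the gaze network can be sketched as below; the ReLUs, padding, and the use of nn.LazyLinear (to avoid hand-computing the flattened feature size) are assumptions layered on top of the layer list in the text.

```python
import torch
import torch.nn as nn


class GazeEstimator(nn.Module):
    """Gaze CNN sketch: 5 conv layers, 2 pooling layers, and 3 fully connected
    layers ending in a 3-D gaze vector, trained with an L2 (Euclidean) loss."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),   nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1),  nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1),  nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 80, 3, padding=1),  nn.ReLU(inplace=True),
            nn.Conv2d(80, 192, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, stride=2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(9600), nn.ReLU(inplace=True),   # FC9600
            nn.Linear(9600, 1000), nn.ReLU(inplace=True), # FC1000
            nn.Linear(1000, 3),                           # FC3: (x, y, z) gaze vector
        )

    def forward(self, x):                  # x: (b, 1, 35, 55) grayscale eye crops
        return self.regressor(self.features(x))
```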
3.2. Hand Pose Estimation from Depth Images
Next, we evaluate our method for hand pose esti-
mation in depth images. We use the NYU hand pose
dataset [35] that contains 72,757 training frames and
8,251 testing frames captured by 3 Kinect cameras –
one frontal and 2 side views. Each depth frame is labeled
with hand pose information that has been used to create
Figure 10. NYU hand pose dataset. (Left) depth frame; (right)
corresponding synthetic image.
a synthetic depth image. Figure 10 shows one such ex-
ample frame. We pre-process the data by cropping the
pixels from real images using the synthetic images. The
images are resized to 224 × 224 before passing them to
the ConvNet. The background depth values are set to
zero and the foreground values are set to original depth
value minus 2000 (assuming that the background is at
2000 millimeters).
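One plausible reading of this preprocessing is sketched below with NumPy and OpenCV; treating the synthetic rendering as a foreground mask and the choice of resizing routine are our assumptions, not details given in the paper.

```python
import numpy as np
import cv2  # one possible choice for resizing; any image library would do


def preprocess_depth(real_depth, synthetic_depth, background_mm=2000, size=224):
    """Depth preprocessing sketch for the NYU hand pose data (Section 3.2).

    Background depth values are set to zero, foreground values are shifted by
    the assumed background depth of 2000 mm, and the result is resized to
    size x size before being passed to the ConvNet.
    """
    foreground = synthetic_depth > 0                  # synthetic rendering as a foreground mask
    out = np.zeros_like(real_depth, dtype=np.float32)
    out[foreground] = real_depth[foreground].astype(np.float32) - background_mm
    return cv2.resize(out, (size, size), interpolation=cv2.INTER_NEAREST)
```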
Qualitative Results: Figure 11 shows example output
of SimGAN on the NYU hand pose test set. As is ap-
parent from the figure, the main source of noise in real
depth images is from depth discontinuity at the edges.
SimGAN is able to learn to model this kind of noise
without requiring any label information for the real im-
ages, resulting in more realistic-looking images for this
domain as well.
Quantitative Results: We train a fully convolutional
hand pose estimator CNN similar to Stacked Hourglass
Net [22] on real, synthetic and refined synthetic images
of the NYU hand pose training set, and evaluate each
model on all real images in the NYU hand pose test set.
We train on the same 14 hand joints as in [35]. Many
state-of-the-art hand pose estimation methods are cus-
tomized pipelines that consist of several steps. We use
only a single deep neural network to analyze the effect
of improving the synthetic images to avoid bias due to
other factors. Figure 12 and Table 4 present quantitative
results on NYU hand pose. Training on refined synthetic
data – the output of SimGAN which does not require any
labeling for the real images – significantly outperforms
the model trained on real images with supervision, by
8.8%. The proposed method also outperforms training
on synthetic data. We also observe a large improvement
as the number of training examples is increased, which
comes with zero annotation cost to us as we train on the
output of a simulator – here 3x corresponds to training
on all views.
Implementation Details: The architecture is the same
as for eye gaze estimation, except the input image size
is 224 × 224, filter size is 7 × 7, and 10 ResNet blocks
are used. The discriminative net Dφ is: (1) Conv7x7,
stride=4, feature maps=96, (2) Conv5x5, stride=2, fea-
ture maps=64, (3) MaxPool3x3, stride=2, (4) Conv3x3,
Figure 11. Example refined test images for the NYU hand pose dataset [35]. (Left) real images, (right) synthetic images and the
corresponding refined output images from the refiner network. The major source of noise in the real images is the non-smooth depth
boundaries. The refiner network learns to model the noise present in the real images, importantly without requiring any labels for
the real images.
[Figure 12 plot: percentage of images (y-axis) vs. distance from ground truth in pixels (x-axis), with curves for Synthetic Data, Refined Synthetic Data, Real Data, Synthetic Data 3x, and Refined Synthetic Data 3x.]
Figure 12. Quantitative results for hand pose estimation on the
NYU hand pose test set of real depth images [35]. The plot
shows cumulative curves as a function of distance from ground
truth keypoint locations, for different numbers of training ex-
amples of synthetic and refined images. Training a pose esti-
mator on the output of SimGAN significantly outperforms the
same network trained on real images. Importantly, our refiner
generative model does not require labeling for the real images.
Training data % of images within d
Synthetic Data 69.7
Refined Synthetic Data 72.4
Real Data 74.5
Synthetic Data 3x 77.7
Refined Synthetic Data 3x 83.3
Table 4. Comparison of a hand pose estimator trained on syn-
thetic data, real data, and the output of SimGAN. The results
are at distance d = 5 pixels from ground truth. Training on
the output of SimGAN outperforms training on supervised real
data by 8.8%, without requiring any supervision.
stride=2, feature maps=32, (5) Conv1x1, stride=1, fea-
ture maps=32, (6) Conv1x1, stride=1, feature maps=2,
(7) Softmax. We train the Rθ network first with just self-
regularization loss for 500 steps and Dφ for 200 steps;
then, for each update of Dφ we update Rθ twice, i.e. Kd
is set to 1, and Kg is set to 2 in Algorithm 1.
For hand pose estimation, we use the Stacked Hour-
glass Net of [22] with 2 hourglass blocks and an output
heatmap of size 64 × 64. We augment at training time with
random [−20, 20] degree rotations and crops. All net-
works are trained until the validation error converges.
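For illustration, the rotation-and-crop augmentation could be expressed with torchvision transforms as below; the crop size (200, resized back to 224) is a placeholder the paper does not specify, and in practice the same geometric transform must also be applied to the joint/heatmap annotations.

```python
import torchvision.transforms as T

# Hypothetical augmentation pipeline for the hand pose estimator: random
# rotations in [-20, 20] degrees plus random crops, resized back to 224 x 224.
augment = T.Compose([
    T.RandomRotation(degrees=20),   # samples an angle uniformly from [-20, 20]
    T.RandomCrop(200),              # crop size is an assumption
    T.Resize(224),
])
```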
3.3. Analysis of Modifications to Adversarial
Training
First we compare local vs global adversarial loss dur-
ing training. A global adversarial loss uses a fully con-
nected layer in the discriminator network, classifying
the whole image as real vs refined. The local adversar-
ial loss removes the artifacts and makes the generated
image significantly more realistic, as seen in Figure 8.
Next, in Figure 9, we show the result of using a history of
refined images, and compare it with standard adversarial
training for gaze estimation. As shown in the figure, us-
ing the buffer of refined images prevents the severe artifacts
that appear in standard training, e.g. around the corners of the eyes.
4. Conclusions and Future Work
We have proposed Simulated+Unsupervised learning
to refine a simulator’s output with unlabeled real data.
S+U learning adds realism to the simulator and pre-
serves the global structure and the annotations of the
synthetic images. We described SimGAN, our method
for S+U learning, that uses an adversarial network and
demonstrated state-of-the-art results without any labeled
real data. In the future, we intend to explore modeling the
noise distribution to generate more than one refined im-
age for each synthetic image, and investigate refining
videos rather than single images.
References
[1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Nat-
sev, G. Toderici, B. Varadarajan, and S. Vi-
jayanarasimhan. Youtube-8m: A large-scale
video classification benchmark. arXiv preprint
arXiv:1609.08675, 2016.
[2] X. Chen, Y. Duan, R. Houthooft, J. Schulman,
I. Sutskever, and P. Abbeel. InfoGAN: Inter-
pretable representation learning by information
maximizing generative adversarial nets. arXiv
preprint arXiv:1606.03657, 2016.
[3] T. Darrell, P. Viola, and G. Shakhnarovich. Fast
pose estimation with parameter sensitive hashing.
In Proc. CVPR, 2015.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and
L. Fei-Fei. ImageNet: A Large-Scale Hierarchical
Image Database. In Proc. CVPR, 2009.
[5] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual
worlds as proxy for multi-object tracking analysis.
In Proc. CVPR, 2016.
[6] Y. Ganin and V. Lempitsky. Unsupervised domain
adaptation by backpropagation. arXiv preprint
arXiv:1409.7495, 2014.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,
D. Warde-Farley, S. Ozair, A. Courville, and
Y. Bengio. Generative adversarial nets. In Proc.
NIPS, 2014.
[8] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic
data for text localisation in natural images. Proc.
CVPR, 2016.
[9] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik.
Learning rich features from rgb-d images for ob-
ject detection and segmentation. In Proc. ECCV,
2014.
[10] A. Handa, V. Patraucean, V. Badrinarayanan,
S. Stent, and R. Cipolla. SceneNet: Understand-
ing real world indoor scenes with synthetic data.
In Proc. CVPR, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep resid-
ual learning for image recognition. arXiv preprint
arXiv:1512.03385, 2015.
[12] D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic.
Generating images with recurrent adversarial net-
works. http://arxiv.org/abs/1602.05110, 2016.
[13] C. Ionescu, D. Papava, V. Olaru, and C. Sminchis-
escu. Human3.6m: Large scale datasets and pre-
dictive methods for 3d human sensing in natural
environments. PAMI, 36(7):1325–1339, 2014.
[14] M. Jaderberg, K. Simonyan, A. Vedaldi, and
A. Zisserman. Reading text in the wild with con-
volutional neural networks. IJCV, 116(1):1–20,
2016.
[15] I. Krasin, T. Duerig, N. Alldrin, A. Veit, S. Abu-
El-Haija, S. Belongie, D. Cai, Z. Feng, V. Fer-
rari, V. Gomes, A. Gupta, D. Narayanan, C. Sun,
G. Chechik, and K. Murphy. OpenImages: A pub-
lic dataset for large-scale multi-label and multi-
class image classification. Dataset available from
https://github.com/openimages, 2016.
[16] Y. LeCun, F. Huang, and L. Bottou. Learning
methods for generic object recognition with invari-
ance to pose and lighting. In Proc. CVPR, 2004.
[17] C. Li and M. Wand. Precomputed real-time tex-
ture synthesis with markovian generative adversar-
ial networks. In Proc. ECCV, 2016.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Per-
ona, D. Ramanan, P. Dollár, and C. L. Zitnick. Mi-
crosoft COCO: Common objects in context. In
Proc. ECCV, 2014.
[19] M.-Y. Liu and O. Tuzel. Coupled generative adver-
sarial networks. In Proc. NIPS, 2016.
[20] W. Lotter, G. Kreiman, and D. Cox. Unsupervised
learning of visual structure using predictive gener-
ative networks. arXiv preprint arXiv:1511.06380,
2015.
[21] F. Lu, Y. Sugano, T. Okabe, and Y. Sato. Adaptive
linear regression for appearance-based gaze esti-
mation. PAMI, 36(10):2033–2046, 2014.
[22] A. Newell, K. Yang, and J. Deng. Stacked hour-
glass networks for human pose estimation. arXiv
preprint arXiv:1603.06937, 2016.
[23] D. Park and D. Ramanan. Articulated pose esti-
mation with tiny synthetic videos. In Proc. CVPR,
2015.
[24] X. Peng, B. Sun, K. Ali, and K. Saenko. Learning
deep object detectors from 3d models. In Proc.
ICCV, 2015.
[25] L. Pishchulin, A. Jain, M. Andriluka, T. Thor-
mählen, and B. Schiele. Articulated people detec-
tion and pose estimation: Reshaping the future. In
Proc. CVPR, 2012.
[26] W. Qiu and A. Yuille. UnrealCV: Connecting
computer vision to Unreal Engine. arXiv preprint
arXiv:1609.01326, 2016.
[27] G. Rogez and C. Schmid. MoCap-guided data aug-
mentation for 3d pose estimation in the wild. arXiv
preprint arXiv:1607.02046, 2016.
[28] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and
A. M. Lopez. The SYNTHIA Dataset: A large col-
lection of synthetic images for semantic segmenta-
tion of urban scenes. In Proc. CVPR, 2016.
[29] T. Salimans, I. Goodfellow, W. Zaremba, V. Che-
ung, A. Radford, and X. Chen. Improved
techniques for training gans. arXiv preprint
arXiv:1606.03498, 2016.
[30] T. Schneider, B. Schauerte, and R. Stiefelha-
gen. Manifold alignment for person independent
appearance-based gaze estimation. In Proc. ICPR,
2014.
[31] A. Shafaei, J. Little, and M. Schmidt. Play and
learn: Using video games to train computer vision
models. In Proc. BMVC, 2016.
[32] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp,
M. Cook, M. Finocchio, R. Moore, P. Kohli,
A. Criminisi, A. Kipman, and A. Blake. Efficient
human pose estimation from single depth images.
PAMI, 35(12):2821–2840, 2013.
[33] Y. Sugano, Y. Matsushita, and Y. Sato. Learning-
by-synthesis for appearance-based 3d gaze estima-
tion. In Proc. CVPR, 2014.
[34] J. Supancic, G. Rogez, Y. Yang, J. Shotton, and
D. Ramanan. Depth-based hand pose estimation:
data, methods, and challenges. In Proc. CVPR,
2015.
[35] J. Tompson, M. Stein, Y. Lecun, and K. Per-
lin. Real-time continuous pose recovery of human
hands using convolutional networks. ACM Trans.
Graphics, 2014.
[36] O. Tuzel, Y. Taguchi, and J. Hershey. Global-
local face upsampling network. arXiv preprint
arXiv:1603.07235, 2016.
[37] A. van den Oord, N. Kalchbrenner, and
K. Kavukcuoglu. Pixel recurrent neural net-
works. arXiv preprint arXiv:1601.06759, 2016.
[38] X. Wang and A. Gupta. Generative image model-
ing using style and structure adversarial networks.
In Proc. ECCV, 2016.
[39] Z. Wang, J. Yang, H. Jin, E. Shechtman, A. Agar-
wala, J. Brandt, and T. Huang. Deepfont: Identify
your font from an image. In Proc. ACMM, 2015.
[40] E. Wood, T. Baltrušaitis, L. Morency, P. Robin-
son, and A. Bulling. Learning an appearance-based
gaze estimator from one million synthesised im-
ages. In Proc. ACM Symposium on Eye Tracking
Research & Applications, 2016.
[41] L. Yu, W. Zhang, J. Wang, and Y. Yu. Seqgan:
Sequence generative adversarial nets with policy
gradient. arXiv preprint arXiv:1609.05473, 2016.
[42] X. Zhang, Y. Fu, A. Zang, L. Sigal, and
G. Agam. Learning classifiers from synthetic data
using a multichannel autoencoder. arXiv preprint
arXiv:1503.03163, 2015.
[43] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling.
Appearance-based gaze estimation in the wild. In
Proc. CVPR, 2015.
[44] Y. Zhang, K. Lee, and H. Lee. Augmenting su-
pervised neural networks with unsupervised objec-
tives for large-scale image classification. In Proc.
ICML, 2016.
[45] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and
A. Efros. Generative visual manipulation on the
natural image manifold. In Proc. ECCV, 2016.
Additional Experiments
Qualitative Experiments for Appearance-based
Gaze Estimation
Dataset: The gaze estimation dataset consists of
1.2M synthetic images from eye gaze synthesizer Uni-
tyEyes [40] and 214K real images from the MPIIGaze
dataset [43] – samples shown in Figure 13. MPIIGaze is
a very challenging eye gaze estimation dataset captured
under extreme illumination conditions. For UnityEyes
we use a single generic rendering environment to gener-
ate training data without any dataset-specific targeting.
Qualitative Results: In Figure 14, we show many
examples of synthetic and refined images from the eye
gaze dataset, arranged in multiple pairs of rows. The top
row of each pair contains synthetic
images, and the bottom row contains the corresponding re-
fined images. As shown, we observe a significant qual-
itative improvement of the synthetic images: SimGAN
successfully captures the skin texture, sensor noise and
the appearance of the iris region in the real images. Note
that our method preserves the annotation information
(gaze direction) while improving the realism.
Qualitative Experiments for Hand Pose Estima-
tion
Dataset: Next, we evaluate our method for hand pose
estimation in depth images. We use the NYU hand pose
dataset [35] that contains 72,757 training frames and
8,251 testing frames. Each depth frame is labeled with
hand pose information that has been used to create a syn-
thetic depth image. We pre-process the data by cropping
the pixels from real images using the synthetic images.
Figure 15 shows example real depth images from the
dataset. The images are resized to 224 × 224 before
passing them to the refiner network.
Qualitative Results: We show examples of synthetic
and refined hand depth images in Figure 16 from the test
set. We show our results in multiple pairs of rows. The
top row in each pair contains the synthetic depth image, and
the bottom row shows the corresponding refined image
produced by the proposed SimGAN approach. Note the real-
ism added to the depth boundaries in the refined images,
compared to the real images in Figure 15.
Convergence Experiment
To investigate the convergence of our method, we vi-
sualize intermediate results as training progresses. As
shown in Figure 17, in the beginning, the refiner network
learns to predict very smooth edges using only the self-
regularization loss. As the adversarial loss is enabled,
the network starts adding artifacts at the depth bound-
aries. However, as these artifacts are not the same as
real images, the discriminator easily learns to differenti-
ate between the real and refined images. Slowly the net-
work starts adding realistic noise, and after many steps,
the refiner generates very realistic-looking images. We
found it helpful to train the network with a low learn-
ing rate and for a large number of steps. For NYU hand
pose we used lr=0.0002 in the beginning, and reduced
to 0.00005 after 600, 000 steps.
Figure 13. Example real images from MPIIGaze dataset.
Figure 14. Qualitative results for automatic refinement of simulated eyes. The top row (in each set of two rows) shows the synthetic
eye image, and the bottom row shows the corresponding refined image.
Figure 15. Example real test images in the NYU hand dataset.
Figure 16. Qualitative results for automatic refinement of NYU hand depth images. The top row (in each set of two rows) shows
the synthetic hand image, and the bottom row is the corresponding refined image. Note how realistic the depth boundaries are
compared to real images in Figure 15.
Figure 17. SimGAN output as a function of training iterations for NYU hand pose. Columns correspond to increasing training
iterations. First row shows synthetic images, and the second row shows corresponding refined images. The first column is the result
of training with the ℓ1 image difference only for 300 steps; the later columns show the result when trained on top of this model. In the beginning,
the adversarial part of the cost introduces different kinds of unrealistic noise in an attempt to beat the adversarial network Dφ. As the dueling
between Rθ and Dφ progresses, Rθ learns to model the right kind of noise.

More Related Content

PPTX
brief Introduction to Different Kinds of GANs
Parham Zilouchian
 
PPTX
Cat and dog classification
omaraldabash
 
PPTX
Deep Advances in Generative Modeling
indico data
 
PDF
Tutorial on Deep Generative Models
MLReview
 
PDF
Generative Adversarial Networks and Their Applications
Artifacia
 
PDF
Deep Learning for Computer Vision: Generative models and adversarial training...
Universitat Politècnica de Catalunya
 
PDF
Generative Models and Adversarial Training (D3L4 2017 UPC Deep Learning for ...
Universitat Politècnica de Catalunya
 
PDF
Variants of GANs - Jaejun Yoo
JaeJun Yoo
 
brief Introduction to Different Kinds of GANs
Parham Zilouchian
 
Cat and dog classification
omaraldabash
 
Deep Advances in Generative Modeling
indico data
 
Tutorial on Deep Generative Models
MLReview
 
Generative Adversarial Networks and Their Applications
Artifacia
 
Deep Learning for Computer Vision: Generative models and adversarial training...
Universitat Politècnica de Catalunya
 
Generative Models and Adversarial Training (D3L4 2017 UPC Deep Learning for ...
Universitat Politècnica de Catalunya
 
Variants of GANs - Jaejun Yoo
JaeJun Yoo
 

What's hot (20)

PDF
PR-315: Taming Transformers for High-Resolution Image Synthesis
Hyeongmin Lee
 
PPTX
Angular and Deep Learning
Oswald Campesato
 
PDF
Generative adversarial network_Ayadi_Alaeddine
Deep Learning Italia
 
PDF
Deep Generative Models
Chia-Wen Cheng
 
PPTX
Generative Adversarial Networks and Their Applications in Medical Imaging
Sanghoon Hong
 
PDF
An introduction to Deep Learning
Julien SIMON
 
PPTX
Deep learning based recommender systems (lab seminar paper review)
hyunsung lee
 
PPTX
An introduction to Machine Learning (and a little bit of Deep Learning)
Thomas da Silva Paula
 
PDF
Deep Generative Models - Kevin McGuinness - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
PDF
210610 SSIIi2021 Computer Vision x Trasnformer
exwzds
 
PPTX
Dssg talk CNN intro
Vincent Tatan
 
PDF
Transformer 動向調査 in 画像認識(修正版)
Kazuki Maeno
 
PPTX
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Turi, Inc.
 
PDF
SSII2021 [OS2-03] 自己教師あり学習における対照学習の基礎と応用
SSII
 
PDF
Matching Network
SuwhanBaek
 
PDF
Deep Learning and Reinforcement Learning
Renārs Liepiņš
 
PDF
Deep image generating models
Luba Elliott
 
PPTX
Artificial Intelligence, Machine Learning and Deep Learning
Sujit Pal
 
PPTX
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
PyData
 
PDF
Introduction to ambient GAN
JaeJun Yoo
 
PR-315: Taming Transformers for High-Resolution Image Synthesis
Hyeongmin Lee
 
Angular and Deep Learning
Oswald Campesato
 
Generative adversarial network_Ayadi_Alaeddine
Deep Learning Italia
 
Deep Generative Models
Chia-Wen Cheng
 
Generative Adversarial Networks and Their Applications in Medical Imaging
Sanghoon Hong
 
An introduction to Deep Learning
Julien SIMON
 
Deep learning based recommender systems (lab seminar paper review)
hyunsung lee
 
An introduction to Machine Learning (and a little bit of Deep Learning)
Thomas da Silva Paula
 
Deep Generative Models - Kevin McGuinness - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
210610 SSIIi2021 Computer Vision x Trasnformer
exwzds
 
Dssg talk CNN intro
Vincent Tatan
 
Transformer 動向調査 in 画像認識(修正版)
Kazuki Maeno
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Turi, Inc.
 
SSII2021 [OS2-03] 自己教師あり学習における対照学習の基礎と応用
SSII
 
Matching Network
SuwhanBaek
 
Deep Learning and Reinforcement Learning
Renārs Liepiņš
 
Deep image generating models
Luba Elliott
 
Artificial Intelligence, Machine Learning and Deep Learning
Sujit Pal
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
PyData
 
Introduction to ambient GAN
JaeJun Yoo
 
Ad

Similar to Learning from Simulated and Unsupervised Images through Adversarial Training. Apple Inc. (20)

PDF
Cartoonization of images using machine Learning
IRJET Journal
 
PDF
Обучение нейросетей компьютерного зрения в видеоиграх
Anatol Alizar
 
PDF
Intel ILS: Enhancing Photorealism Enhancement
Alejandro Franceschi
 
PDF
deep_stereo_arxiv_2015
Ivan Neulander
 
PPTX
Face-GAN project report.pptx
AndleebFatima16
 
PPTX
Face-GAN project report
AndleebFatima16
 
PDF
ADVANCED SINGLE IMAGE RESOLUTION UPSURGING USING A GENERATIVE ADVERSARIAL NET...
sipij
 
PDF
IMAGE GENERATION FROM CAPTION
ijscai
 
PDF
Image Generation from Caption
IJSCAI Journal
 
PDF
CNN FEATURES ARE ALSO GREAT AT UNSUPERVISED CLASSIFICATION
cscpconf
 
PDF
IMAGE GENERATION WITH GANS-BASED TECHNIQUES: A SURVEY
ijcsit
 
PDF
Image Generation with Gans-based Techniques: A Survey
AIRCC Publishing Corporation
 
PDF
Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...
Universitat Politècnica de Catalunya
 
DOC
Implementing Neural Style Transfer
Tahsin Mayeesha
 
PDF
Обучение нейросети машинного зрения в видеоиграх
Anatol Alizar
 
PDF
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Universitat Politècnica de Catalunya
 
PPTX
One shot learning
Vuong Ho Ngoc
 
PDF
A Literature Survey on Image Linguistic Visual Question Answering
IRJET Journal
 
PDF
IRJET- Comparative Study of Different Techniques for Text as Well as Object D...
IRJET Journal
 
PDF
s41598-023-28094-1.pdf
archurssu
 
Cartoonization of images using machine Learning
IRJET Journal
 
Обучение нейросетей компьютерного зрения в видеоиграх
Anatol Alizar
 
Intel ILS: Enhancing Photorealism Enhancement
Alejandro Franceschi
 
deep_stereo_arxiv_2015
Ivan Neulander
 
Face-GAN project report.pptx
AndleebFatima16
 
Face-GAN project report
AndleebFatima16
 
ADVANCED SINGLE IMAGE RESOLUTION UPSURGING USING A GENERATIVE ADVERSARIAL NET...
sipij
 
IMAGE GENERATION FROM CAPTION
ijscai
 
Image Generation from Caption
IJSCAI Journal
 
CNN FEATURES ARE ALSO GREAT AT UNSUPERVISED CLASSIFICATION
cscpconf
 
IMAGE GENERATION WITH GANS-BASED TECHNIQUES: A SURVEY
ijcsit
 
Image Generation with Gans-based Techniques: A Survey
AIRCC Publishing Corporation
 
Generative Models and Adversarial Training (D2L3 Insight@DCU Machine Learning...
Universitat Politècnica de Catalunya
 
Implementing Neural Style Transfer
Tahsin Mayeesha
 
Обучение нейросети машинного зрения в видеоиграх
Anatol Alizar
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Universitat Politècnica de Catalunya
 
One shot learning
Vuong Ho Ngoc
 
A Literature Survey on Image Linguistic Visual Question Answering
IRJET Journal
 
IRJET- Comparative Study of Different Techniques for Text as Well as Object D...
IRJET Journal
 
s41598-023-28094-1.pdf
archurssu
 
Ad

More from eraser Juan José Calderón (20)

PDF
Evaluación de t-MOOC universitario sobre competencias digitales docentes medi...
eraser Juan José Calderón
 
PDF
Call for paper 71. Revista Comunicar
eraser Juan José Calderón
 
PDF
Editorial of the JBBA Vol 4, Issue 1, May 2021. Naseem Naqvi,
eraser Juan José Calderón
 
PDF
REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL LAYING DOWN HARMONIS...
eraser Juan José Calderón
 
PDF
Predicting Big Data Adoption in Companies With an Explanatory and Predictive ...
eraser Juan José Calderón
 
PDF
Innovar con blockchain en las ciudades: Ideas para lograrlo, casos de uso y a...
eraser Juan José Calderón
 
PDF
Innovar con blockchain en las ciudades: Ideas para lograrlo, casos de uso y a...
eraser Juan José Calderón
 
PDF
Ética y Revolución Digital . revista Diecisiete nº 4. 2021
eraser Juan José Calderón
 
PDF
#StopBigTechGoverningBigTech . More than 170 Civil Society Groups Worldwide O...
eraser Juan José Calderón
 
PDF
PACTO POR LA CIENCIA Y LA INNOVACIÓN 8 de febrero de 2021
eraser Juan José Calderón
 
PDF
Expert Panel of the European Blockchain Observatory and Forum
eraser Juan José Calderón
 
PDF
Desigualdades educativas derivadas del COVID-19 desde una perspectiva feminis...
eraser Juan José Calderón
 
PDF
"Experiencias booktuber: Más allá del libro y de la pantalla"
eraser Juan José Calderón
 
PDF
The impact of digital influencers on adolescent identity building.
eraser Juan José Calderón
 
PDF
Open educational resources (OER) in the Spanish universities
eraser Juan José Calderón
 
PDF
El modelo flipped classroom: un reto para una enseñanza centrada en el alumno
eraser Juan José Calderón
 
PDF
Pensamiento propio e integración transdisciplinaria en la epistémica social. ...
eraser Juan José Calderón
 
PDF
Escuela de Robótica de Misiones. Un modelo de educación disruptiva.
eraser Juan José Calderón
 
PDF
La Universidad española Frente a la pandemia. Actuaciones de Crue Universidad...
eraser Juan José Calderón
 
PDF
Covid-19 and IoT: Some Perspectives on the Use of IoT Technologies in Prevent...
eraser Juan José Calderón
 
Evaluación de t-MOOC universitario sobre competencias digitales docentes medi...
eraser Juan José Calderón
 
Call for paper 71. Revista Comunicar
eraser Juan José Calderón
 
Editorial of the JBBA Vol 4, Issue 1, May 2021. Naseem Naqvi,
eraser Juan José Calderón
 
REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL LAYING DOWN HARMONIS...
eraser Juan José Calderón
 
Predicting Big Data Adoption in Companies With an Explanatory and Predictive ...
eraser Juan José Calderón
 
Innovar con blockchain en las ciudades: Ideas para lograrlo, casos de uso y a...
eraser Juan José Calderón
 
Innovar con blockchain en las ciudades: Ideas para lograrlo, casos de uso y a...
eraser Juan José Calderón
 
Ética y Revolución Digital . revista Diecisiete nº 4. 2021
eraser Juan José Calderón
 
#StopBigTechGoverningBigTech . More than 170 Civil Society Groups Worldwide O...
eraser Juan José Calderón
 
PACTO POR LA CIENCIA Y LA INNOVACIÓN 8 de febrero de 2021
eraser Juan José Calderón
 
Expert Panel of the European Blockchain Observatory and Forum
eraser Juan José Calderón
 
Desigualdades educativas derivadas del COVID-19 desde una perspectiva feminis...
eraser Juan José Calderón
 
"Experiencias booktuber: Más allá del libro y de la pantalla"
eraser Juan José Calderón
 
The impact of digital influencers on adolescent identity building.
eraser Juan José Calderón
 
Open educational resources (OER) in the Spanish universities
eraser Juan José Calderón
 
El modelo flipped classroom: un reto para una enseñanza centrada en el alumno
eraser Juan José Calderón
 
Pensamiento propio e integración transdisciplinaria en la epistémica social. ...
eraser Juan José Calderón
 
Escuela de Robótica de Misiones. Un modelo de educación disruptiva.
eraser Juan José Calderón
 
La Universidad española Frente a la pandemia. Actuaciones de Crue Universidad...
eraser Juan José Calderón
 
Covid-19 and IoT: Some Perspectives on the Use of IoT Technologies in Prevent...
eraser Juan José Calderón
 

Recently uploaded (20)

PPTX
INTESTINALPARASITES OR WORM INFESTATIONS.pptx
PRADEEP ABOTHU
 
PPTX
Measures_of_location_-_Averages_and__percentiles_by_DR SURYA K.pptx
Surya Ganesh
 
PPTX
An introduction to Dialogue writing.pptx
drsiddhantnagine
 
PPTX
Artificial Intelligence in Gastroentrology: Advancements and Future Presprec...
AyanHossain
 
DOCX
SAROCES Action-Plan FOR ARAL PROGRAM IN DEPED
Levenmartlacuna1
 
PPTX
CONCEPT OF CHILD CARE. pptx
AneetaSharma15
 
PPTX
Python-Application-in-Drug-Design by R D Jawarkar.pptx
Rahul Jawarkar
 
PPTX
Care of patients with elImination deviation.pptx
AneetaSharma15
 
PPTX
CARE OF UNCONSCIOUS PATIENTS .pptx
AneetaSharma15
 
PDF
What is CFA?? Complete Guide to the Chartered Financial Analyst Program
sp4989653
 
PPTX
HISTORY COLLECTION FOR PSYCHIATRIC PATIENTS.pptx
PoojaSen20
 
PPTX
Kanban Cards _ Mass Action in Odoo 18.2 - Odoo Slides
Celine George
 
PPTX
20250924 Navigating the Future: How to tell the difference between an emergen...
McGuinness Institute
 
PPTX
CDH. pptx
AneetaSharma15
 
PDF
Module 2: Public Health History [Tutorial Slides]
JonathanHallett4
 
PDF
Biological Classification Class 11th NCERT CBSE NEET.pdf
NehaRohtagi1
 
PPTX
Applications of matrices In Real Life_20250724_091307_0000.pptx
gehlotkrish03
 
DOCX
pgdei-UNIT -V Neurological Disorders & developmental disabilities
JELLA VISHNU DURGA PRASAD
 
DOCX
Action Plan_ARAL PROGRAM_ STAND ALONE SHS.docx
Levenmartlacuna1
 
PPTX
Artificial-Intelligence-in-Drug-Discovery by R D Jawarkar.pptx
Rahul Jawarkar
 
INTESTINALPARASITES OR WORM INFESTATIONS.pptx
PRADEEP ABOTHU
 
Measures_of_location_-_Averages_and__percentiles_by_DR SURYA K.pptx
Surya Ganesh
 
An introduction to Dialogue writing.pptx
drsiddhantnagine
 
Artificial Intelligence in Gastroentrology: Advancements and Future Presprec...
AyanHossain
 
SAROCES Action-Plan FOR ARAL PROGRAM IN DEPED
Levenmartlacuna1
 
CONCEPT OF CHILD CARE. pptx
AneetaSharma15
 
Python-Application-in-Drug-Design by R D Jawarkar.pptx
Rahul Jawarkar
 
Care of patients with elImination deviation.pptx
AneetaSharma15
 
CARE OF UNCONSCIOUS PATIENTS .pptx
AneetaSharma15
 
What is CFA?? Complete Guide to the Chartered Financial Analyst Program
sp4989653
 
HISTORY COLLECTION FOR PSYCHIATRIC PATIENTS.pptx
PoojaSen20
 
Kanban Cards _ Mass Action in Odoo 18.2 - Odoo Slides
Celine George
 
20250924 Navigating the Future: How to tell the difference between an emergen...
McGuinness Institute
 
CDH. pptx
AneetaSharma15
 
Module 2: Public Health History [Tutorial Slides]
JonathanHallett4
 
Biological Classification Class 11th NCERT CBSE NEET.pdf
NehaRohtagi1
 
Applications of matrices In Real Life_20250724_091307_0000.pptx
gehlotkrish03
 
pgdei-UNIT -V Neurological Disorders & developmental disabilities
JELLA VISHNU DURGA PRASAD
 
Action Plan_ARAL PROGRAM_ STAND ALONE SHS.docx
Levenmartlacuna1
 
Artificial-Intelligence-in-Drug-Discovery by R D Jawarkar.pptx
Rahul Jawarkar
 

Learning from Simulated and Unsupervised Images through Adversarial Training. Apple Inc.

In addition to adding realism, S+U learning should preserve annotation information for training of machine learning models – e.g. the gaze direction in Figure 1 should be preserved. Moreover, since machine learning models can be sensitive to artifacts in the synthetic data, S+U learning should generate images without artifacts.
We develop a method for S+U learning, which we term SimGAN, that refines synthetic images from a simulator using a neural network which we call the 'refiner network'. Figure 2 gives an overview of our method: a synthetic image is generated with a black box simulator and is refined using the refiner network. To add realism – the first requirement of an S+U learning algorithm – we train our refiner network using an adversarial loss, similar to Generative Adversarial Networks (GANs) [7], such that the refined images are indistinguishable from real ones using a discriminative network. Second, to preserve the annotations of synthetic images, we complement the adversarial loss with a self-regularization loss that penalizes large changes between the synthetic and refined images. Moreover, we propose to use a fully convolutional neural network that operates on a pixel level and preserves the global structure, rather than holistically modifying the image content as in e.g. a fully connected encoder network. Third, the GAN framework requires training two neural networks with competing goals, which is known to be unstable and tends to introduce artifacts [29]. To avoid drifting and introducing spurious artifacts while attempting to fool a single stronger discriminator, we limit the discriminator's receptive field to local regions instead of the whole image, resulting in multiple local adversarial losses per image. Moreover, we introduce a method for improving the stability of training by updating the discriminator using a history of refined images rather than the ones from the current refiner network.

Figure 2. Overview of SimGAN. We refine the output of the simulator with a refiner neural network, R, that minimizes the combination of a local adversarial loss and a 'self-regularization' term. The adversarial loss fools a discriminator network, D, that classifies an image as real or refined. The self-regularization term minimizes the image difference between the synthetic and the refined images. This preserves the annotation information (e.g. gaze direction), making the refined images useful for training a machine learning model. The refiner network R and the discriminator network D are updated alternately.

Contributions:
1. We propose S+U learning that uses unlabeled real data to refine the synthetic images generated by a simulator.
2. We train a refiner network to add realism to synthetic images using a combination of an adversarial loss and a self-regularization loss.
3. We make several key modifications to the GAN training framework to stabilize training and prevent the refiner network from producing artifacts.
4. We present qualitative, quantitative, and user study experiments showing that the proposed framework significantly improves the realism of the simulator output. We achieve state-of-the-art results, without any human annotation effort, by training deep neural networks on the refined output images.

1.1. Related Work
The GAN framework learns two networks (a generator and a discriminator) with competing losses. The goal of the generator network is to map a random vector to a realistic image, whereas the goal of the discriminator is to distinguish the generated and the real images. The GAN framework was first introduced by Goodfellow et al. [7] to generate visually realistic images and, since then, many improvements and interesting applications have been proposed [29].
Wang and Gupta [38] use a Structured GAN to learn surface normals and then combine it with a Style GAN to generate natural indoor scenes. Im et al. [12] propose a recurrent generative model trained using adversarial training. The recently proposed iGAN [45] enables users to change the image interactively on a natural image manifold. CoGAN by Liu et al. [19] uses coupled GANs to learn a joint distribution over images from multiple modalities without requiring tuples of corresponding images, achieving this by a weight-sharing constraint that favors the joint distribution solution. Chen et al. [2] propose InfoGAN, an information-theoretic extension of GAN, that allows learning of meaningful representations. Tuzel et al. [36] tackled image superresolution for face images with GANs. Li and Wand [17] propose a Markovian GAN for efficient texture synthesis. Lotter et al. [20] use adversarial loss in an LSTM network for visual sequence prediction. Yu et al. [41] propose the SeqGAN framework that uses GANs for reinforcement learning. Many recent works have explored related problems in the domain of generative models, such as PixelRNN [37] that predicts pixels sequentially with an RNN with a softmax loss. The generative networks focus on generating images using a random noise vector; thus, in contrast to our method, the generated images do not have any annotation information that can be used for training a machine learning model.
Many efforts have explored using synthetic data for various prediction tasks, including gaze estimation [40], text detection and classification in RGB images [8, 14], font recognition [39], object detection [9, 24], hand pose estimation in depth images [35, 34], scene recognition in RGB-D [10], semantic segmentation of urban scenes [28], and human pose estimation [23, 3, 16, 13, 25, 27]. Gaidon et al. [5] show that pre-training a deep neural network on synthetic data leads to improved performance. Our work is complementary to these approaches, where we improve the realism of the simulator using unlabeled real data.

Ganin and Lempitsky [6] use synthetic data in a domain adaptation setting where the learned features are invariant to the domain shift between synthetic and real images. Wang et al. [39] train a Stacked Convolutional Auto-Encoder on synthetic and real data to learn the lower-level representations of their font detector ConvNet. Zhang et al. [42] learn a Multichannel Autoencoder to reduce the domain shift between real and synthetic data. In contrast to classical domain adaptation methods that adapt the features with respect to a specific prediction task, we bridge the gap between image distributions through adversarial training. This approach allows us to generate very realistic images which can be used to train any machine learning model, potentially for multiple tasks.

2. S+U Learning with SimGAN
The goal of Simulated+Unsupervised learning is to use a set of unlabeled real images yi ∈ Y to learn a refiner Rθ(x) that refines a synthetic image x, where θ are the function parameters. Let the refined image be denoted by x̃; then x̃ := Rθ(x). The key requirement for S+U learning is that the refined image x̃ should look like a real image in appearance while preserving the annotation information from the simulator. To this end, we propose to learn θ by minimizing a combination of two losses:

\mathcal{L}_R(\theta) = \sum_i \ell_{\mathrm{real}}(\theta; \tilde{x}_i, \mathcal{Y}) + \lambda\,\ell_{\mathrm{reg}}(\theta; \tilde{x}_i, x_i),   (1)

where xi is the i-th synthetic training image and x̃i is the corresponding refined image. The first part of the cost, ℓ_real, adds realism to the synthetic images, while the second part, ℓ_reg, preserves the annotation information by minimizing the difference between the synthetic and the refined images. In the following sections, we expand this formulation and provide an algorithm to optimize for θ.

2.1. Adversarial Loss with Self-Regularization
To add realism to the synthetic image, we need to bridge the gap between the distributions of synthetic and real images. An ideal refiner will make it impossible to classify a given image as real or refined with high confidence. This motivates the use of an adversarial discriminator network, Dφ, that is trained to classify images as real vs refined, where φ are the parameters of the discriminator network. The adversarial loss used in training the refiner network, R, is responsible for 'fooling' the network D into classifying the refined images as real. Following the GAN approach [7], we model this as a two-player minimax game, and update the refiner network, Rθ, and the discriminator network, Dφ, alternately. Next, we describe this intuition more precisely. The discriminator network updates its parameters by minimizing the following loss:

\mathcal{L}_D(\phi) = -\sum_i \log\big(D_\phi(\tilde{x}_i)\big) - \sum_j \log\big(1 - D_\phi(y_j)\big),   (2)

which is equivalent to the cross-entropy error for a two-class classification problem, where Dφ(·) is the probability of the input being a synthetic image and 1 − Dφ(·) that of a real one.
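As a rough illustration of equation (2), the discriminator objective is a standard binary cross-entropy over a mini-batch. The sketch below is not the authors' code; the array names and the clipping constant are assumptions added for numerical stability.

import numpy as np

def discriminator_loss(d_refined, d_real, eps=1e-8):
    # d_refined: D_phi(x~_i) for the refined images in the mini-batch (target class: synthetic).
    # d_real:    D_phi(y_j) for the real images in the mini-batch (target class: real).
    d_refined = np.clip(d_refined, eps, 1.0 - eps)
    d_real = np.clip(d_real, eps, 1.0 - eps)
    # Equation (2): -sum_i log D(x~_i) - sum_j log(1 - D(y_j))
    return -np.sum(np.log(d_refined)) - np.sum(np.log(1.0 - d_real))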
We implement Dφ as a ConvNet whose last layer outputs the probability of the sample being a refined image. For training this network, each mini-batch consists of randomly sampled refined synthetic images x̃i and real images yj. The target labels for the cross-entropy loss layer are 0 for every yj and 1 for every x̃i. Then φ for a mini-batch is updated by taking a stochastic gradient descent (SGD) step on the mini-batch loss gradient. In our implementation, the realism loss function ℓ_real in (1) uses the trained discriminator D as follows:

\ell_{\mathrm{real}}(\theta; \tilde{x}_i, \mathcal{Y}) = -\sum_i \log\big(1 - D_\phi(R_\theta(x_i))\big).   (3)

By minimizing this loss function, the refiner forces the discriminator to fail to classify the refined images as synthetic. In addition to generating realistic images, the refiner network should preserve the annotation information of the simulator. For example, for gaze estimation the learned transformation should not change the gaze direction, and for hand pose estimation the location of the joints should not change. This is an essential ingredient for training a machine learning model that uses the refined images with the simulator's annotations. To enforce this, we propose using a self-regularization loss that minimizes the image difference between the synthetic and the refined image.
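To make the combination concrete, here is a minimal NumPy sketch of a per-mini-batch refiner objective coupling the realism term of equation (3) with an L1 self-regularization term; this mirrors the overall loss formalized in equation (4) below. The names and the weight value are assumptions, not the authors' implementation.

import numpy as np

def refiner_loss(d_refined, refined_imgs, synthetic_imgs, lam=0.1):
    # d_refined:      D_phi(R_theta(x_i)) for the refined mini-batch.
    # refined_imgs:   R_theta(x_i), shape (batch, H, W).
    # synthetic_imgs: x_i, same shape.
    # lam:            weight of the self-regularization term (lambda); the value is illustrative.
    eps = 1e-8
    realism = -np.sum(np.log(1.0 - np.clip(d_refined, eps, 1.0 - eps)))
    self_reg = np.sum(np.abs(refined_imgs - synthetic_imgs))  # per-pixel L1 difference
    return realism + lam * self_reg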
Thus, the overall refiner loss function (1) used in our implementation is:

\mathcal{L}_R(\theta) = -\sum_i \log\big(1 - D_\phi(R_\theta(x_i))\big) + \lambda\,\big\lVert R_\theta(x_i) - x_i \big\rVert_1,   (4)

where ‖·‖1 is the ℓ1 norm. We implement Rθ as a fully convolutional neural net without striding or pooling. This modifies the synthetic image on a pixel level, rather than holistically modifying the image content as in e.g. a fully connected encoder network, and preserves the global structure and the annotations. We learn the refiner and discriminator parameters by minimizing LR(θ) and LD(φ) alternately. While updating the parameters of Rθ, we keep φ fixed, and while updating Dφ, we fix θ. We summarize this training procedure in Algorithm 1.

Algorithm 1: Adversarial training of refiner network Rθ
Input: Sets of synthetic images xi ∈ X and real images yj ∈ Y, max number of steps (T), number of discriminator network updates per step (Kd), number of generative network updates per step (Kg).
Output: ConvNet model Rθ.
for t = 1, ..., T do
    for k = 1, ..., Kg do
        1. Sample a mini-batch of synthetic images xi.
        2. Update θ by taking an SGD step on the mini-batch loss LR(θ) in (4).
    end
    for k = 1, ..., Kd do
        1. Sample a mini-batch of synthetic images xi and real images yj.
        2. Compute x̃i = Rθ(xi) with the current θ.
        3. Update φ by taking an SGD step on the mini-batch loss LD(φ) in (2).
    end
end

2.2. Local Adversarial Loss
Another key requirement for the refiner network is that it should learn to model the real image characteristics without introducing any artifacts. When we train a single strong discriminator network, the refiner network tends to over-emphasize certain image features to fool the current discriminator network, leading to drifting and producing artifacts. A key observation is that any local patch sampled from the refined image should have similar statistics to a real image patch. Therefore, rather than defining a global discriminator network, we can define a discriminator network that classifies all local image patches separately. This not only limits the receptive field, and hence the capacity of the discriminator network, but also provides many samples per image for learning the discriminator network. This also improves training of the refiner network because we have multiple 'realism loss' values per image. In our implementation, we design the discriminator D to be a fully convolutional network that outputs a w × h probability map of patches belonging to the fake class, where w × h is the number of local patches in the image. While training the refiner network, we sum the cross-entropy loss values over the w × h local patches, as illustrated in Figure 3.

Figure 3. Illustration of local adversarial loss. The discriminator network outputs a w × h probability map. The adversarial loss function is the sum of the cross-entropy losses over the local patches.
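The following sketch illustrates the local adversarial loss on the refiner side, assuming the discriminator returns a w × h map of per-patch probabilities of being refined; the map shape and variable names are assumptions, not the paper's code.

import numpy as np

def local_adversarial_loss(patch_probs_refined, eps=1e-8):
    # patch_probs_refined: array of shape (batch, w, h); each entry is the
    # discriminator's probability that the local patch comes from a refined image.
    # The refiner wants every patch to be judged 'real', i.e. probability -> 0,
    # so we sum -log(1 - p) over all patches (cf. equations (3) and (4)).
    p = np.clip(patch_probs_refined, eps, 1.0 - eps)
    return -np.sum(np.log(1.0 - p))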
2.3. Updating Discriminator using a History of Refined Images
Another problem of adversarial training is that the discriminator network only focuses on the latest refined images. This may cause (i) divergence of the adversarial training, and (ii) the refiner network re-introducing artifacts that the discriminator has forgotten about. Any refined image generated by the refiner network at any time during the entire training procedure is a 'fake' image for the discriminator. Hence, the discriminator should be able to classify all these images as fake. Based on this observation, we introduce a method to improve the stability of adversarial training by updating the discriminator using a history of refined images, rather than only the ones in the current mini-batch. We slightly modify Algorithm 1 to have a buffer of refined images generated by previous networks. Let B be the size of the buffer and b be the mini-batch size used in Algorithm 1. At each iteration of discriminator training, we compute the discriminator loss function by sampling b/2 images from the current refiner network, and sampling an additional b/2 images from the buffer to update the parameters φ. We keep the size of the buffer, B, fixed. After each training iteration, we randomly replace b/2 samples in the buffer with the newly generated refined images. This procedure is illustrated in Figure 4.

Figure 4. Illustration of using a history of refined images. See text for details.
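A minimal sketch of such an image history buffer is shown below. The class name, the fill-up behavior, and the details of the random replacement policy are assumptions consistent with the description above, not the authors' implementation.

import numpy as np

class RefinedImageHistory:
    """Keeps a fixed-size buffer B of previously refined images."""

    def __init__(self, capacity):
        self.capacity = capacity          # B in the text
        self.images = []                  # stored refined images

    def sample(self, k):
        # Draw k images from the buffer for the discriminator mini-batch (b/2 in the text).
        idx = np.random.choice(len(self.images), size=k, replace=False)
        return [self.images[i] for i in idx]

    def update(self, new_refined, k):
        # Randomly replace k buffer entries (b/2 in the text) with newly refined images;
        # until the buffer is full, simply append.
        for img in new_refined[:k]:
            if len(self.images) < self.capacity:
                self.images.append(img)
            else:
                self.images[np.random.randint(self.capacity)] = img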
3. Experiments
We evaluate our method for appearance-based gaze estimation in the wild on the MPIIGaze dataset [40, 43], and hand pose estimation on the NYU hand pose dataset of depth images [35]. We use a fully convolutional refiner network with ResNet blocks (Figure 6) for all our experiments.

Figure 6. A ResNet block with two n×n convolutional layers, each with f feature maps.

3.1. Appearance-based Gaze Estimation
Gaze estimation is a key ingredient for many human computer interaction (HCI) tasks. However, estimating the gaze direction from an eye image is challenging, especially when the image is of low quality, e.g. from a laptop or a mobile phone camera – annotating the eye images with a gaze direction vector is challenging even for humans. Therefore, to generate large amounts of annotated data, several recent approaches [40, 43] train their models on large amounts of synthetic data. Here, we show that training with the refined synthetic images generated by SimGAN significantly outperforms the state-of-the-art for this task.
The gaze estimation dataset consists of 1.2M synthetic images from the eye gaze synthesizer UnityEyes [40] and 214K real images from the MPIIGaze dataset [43] – samples are shown in Figure 5. MPIIGaze is a very challenging eye gaze estimation dataset captured under extreme illumination conditions. For UnityEyes we use a single generic rendering environment to generate training data without any dataset-specific targeting.

Figure 5. Example output of SimGAN for the UnityEyes gaze estimation dataset [40]. (Left) real images from MPIIGaze [43]. Our refiner network does not use any label information from the MPIIGaze dataset at training time. (Right) refinement results on UnityEyes. The skin texture and the iris region in the refined synthetic images are qualitatively significantly more similar to the real images than to the synthetic images. More examples are included in the supplementary material.

Qualitative Results: Figure 5 shows examples of synthetic, real and refined images from the eye gaze dataset. As shown, we observe a significant qualitative improvement of the synthetic images: SimGAN successfully captures the skin texture, sensor noise and the appearance of the iris region in the real images. Note that our method preserves the annotation information (gaze direction) while improving the realism.
'Visual Turing Test': To quantitatively evaluate the visual quality of the refined images, we designed a simple user study where subjects were asked to classify images as real or refined synthetic. Each subject was shown a random selection of 50 real images and 50 refined images in a random order, and was asked to label the images as either real or refined. The subjects were constantly shown 20 examples of real and refined images while performing the task. The subjects found it very hard to tell the difference between the real images and the refined images.
In our aggregate analysis, 10 subjects chose the correct label 517 times out of 1000 trials (p = 0.148), which is not significantly better than chance. Table 1 shows the confusion matrix. In contrast, when testing on original synthetic images vs real images, we showed 10 real and 10 synthetic images per subject, and the subjects chose correctly 162 times out of 200 trials (p ≤ 10^−8), which is significantly better than chance.
Quantitative Results: We train a simple convolutional neural network (CNN) similar to [43] to predict the eye gaze direction (encoded by a 3-dimensional vector for x, y, z) with an ℓ2 loss. We train on UnityEyes and test on MPIIGaze.
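For context, the degree errors reported in Figure 7 and Table 2 below are angular differences between predicted and ground-truth 3D gaze vectors; one common way to compute such an error, and the fraction of images within d degrees, is sketched here. This is our illustration under that assumption, not code from the paper.

import numpy as np

def angular_error_degrees(pred, gt):
    # pred, gt: arrays of shape (N, 3) with predicted and ground-truth gaze vectors.
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def fraction_within(errors_deg, d=7.0):
    # Fraction of test images whose gaze error is within d degrees (cf. Table 2).
    return float(np.mean(errors_deg <= d))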
Table 1. Results of the 'Visual Turing test' user study for classifying real vs refined images. Subjects were asked to distinguish between refined synthetic images (output from our method) and real images (from MPIIGaze). The average human classification accuracy was 51.7%, demonstrating that the automatically generated refined images are visually very hard to distinguish from real images.
                            Selected as real    Selected as synthetic
Ground truth real                 224                   276
Ground truth synthetic            207                   293

Figure 7. Quantitative results for appearance-based gaze estimation on the MPIIGaze dataset with real eye images. The plot shows cumulative curves as a function of degree error as compared to the ground truth eye gaze direction, for different numbers of training examples of synthetic and refined synthetic data. Gaze estimation using the refined images instead of the synthetic images results in significantly improved performance.

Figure 7 and Table 2 compare the performance of a gaze estimation CNN trained on synthetic data to that of another CNN trained on refined synthetic data, the output of SimGAN. We observe a large improvement in performance from training on the SimGAN output, a 22.3% absolute percentage improvement. We also observe a large improvement from training on more training data – here 4x refers to 100% of the training dataset. The quantitative evaluation confirms the value of the qualitative improvements observed in Figure 5, and shows that machine learning models generalize significantly better using SimGAN.

Table 2. Comparison of a gaze estimator trained on synthetic data and the output of SimGAN. The results are at distance d = 7 degrees from ground truth. Training on the refined synthetic output of SimGAN outperforms training on synthetic data by 22.3%, without requiring supervision for the real data.
Training data                  % of images within d
Synthetic Data                        62.3
Synthetic Data 4x                     64.9
Refined Synthetic Data                69.4
Refined Synthetic Data 4x             87.2

Table 3 shows a comparison to the state-of-the-art. Training the CNN on the refined images outperforms the state-of-the-art on the MPIIGaze dataset, with a relative improvement of 21%. This large improvement shows the practical value of our method in many HCI tasks.

Table 3. Comparison of SimGAN to the state-of-the-art on the MPIIGaze dataset of real eyes. The second column indicates whether the methods are trained on Real/Synthetic data. The error is the mean eye gaze estimation error in degrees. Training on refined images results in a 2.1 degree improvement, a relative 21% improvement compared to the state-of-the-art.
Method                                      R/S    Error
Support Vector Regression (SVR) [30]         R     16.5
Adaptive Linear Regression (ALR) [21]        R     16.4
Random Forest (RF) [33]                      R     15.4
kNN with UT Multiview [43]                   R     16.2
CNN with UT Multiview [43]                   R     13.9
k-NN with UnityEyes [40]                     S      9.9
CNN with UnityEyes Synthetic Images          S     11.2
CNN with UnityEyes Refined Images            S      7.8

Implementation Details: The refiner network, Rθ, is a residual network (ResNet) [11]. Each ResNet block consists of two convolutional layers containing 64 feature maps, as shown in Figure 6. An input image of size 55 × 35 is convolved with 3 × 3 filters that output 64 feature maps. The output is passed through 4 ResNet blocks. The output of the last ResNet block is passed to a 1 × 1 convolutional layer producing 1 feature map corresponding to the refined synthetic image.
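A sketch of a refiner with this structure is given below in PyTorch, assuming 1-channel eye images; padding choices, activation placement, and the absence of an output non-linearity are assumptions not specified in the text above.

import torch
import torch.nn as nn

class ResNetBlock(nn.Module):
    # Two n x n convolutions with f feature maps and a skip connection (cf. Figure 6).
    def __init__(self, f=64, n=3):
        super().__init__()
        self.conv1 = nn.Conv2d(f, f, kernel_size=n, padding=n // 2)
        self.conv2 = nn.Conv2d(f, f, kernel_size=n, padding=n // 2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)

class Refiner(nn.Module):
    # 3x3 conv to 64 maps, 4 ResNet blocks, 1x1 conv back to 1 map; no striding or pooling.
    def __init__(self, in_channels=1, f=64, num_blocks=4):
        super().__init__()
        self.head = nn.Conv2d(in_channels, f, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResNetBlock(f) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(f, in_channels, kernel_size=1)

    def forward(self, x):  # x: (batch, 1, 35, 55) synthetic eye images
        return self.tail(self.blocks(self.head(x)))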
The discriminator network, Dφ, contains 5 convolution layers and 2 max-pooling layers, as follows: (1) Conv3x3, stride=2, feature maps=96, (2) Conv3x3, stride=2, feature maps=64, (3) MaxPool3x3, stride=1, (4) Conv3x3, stride=1, feature maps=32, (5) Conv1x1, stride=1, feature maps=32, (6) Conv1x1, stride=1, feature maps=2, (7) Softmax. Our adversarial network is fully convolutional, and has been designed such that the receptive fields of the last layer neurons in Rθ and Dφ are similar. We first train the Rθ network with just the self-regularization loss for 1,000 steps, and Dφ for 200 steps. Then, for each update of Dφ, we update Rθ twice, i.e. Kd is set to 1 and Kg is set to 50 in Algorithm 1.
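A possible PyTorch rendering of this discriminator is sketched below. The ReLU activations between layers, the padding values, and the input channel count are assumptions, since the layer list above does not state them.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    # Follows the layer list above; padding and activations are assumed.
    def __init__(self, in_channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(96, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=1, stride=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 2, kernel_size=1, stride=1),
        )

    def forward(self, x):
        # Returns a (batch, 2, w, h) map; a softmax over the channel dimension gives
        # per-patch probabilities of 'refined' vs 'real' (the local adversarial loss).
        logits = self.net(x)
        return torch.softmax(logits, dim=1)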
The eye gaze estimation network is similar to [43], with some changes to enable it to better exploit our large synthetic dataset. The input is a 35 × 55 grayscale image that is passed through 5 convolutional layers followed by 3 fully connected layers, the last one encoding the 3-dimensional gaze vector: (1) Conv3x3, feature maps=32, (2) Conv3x3, feature maps=32, (3) Conv3x3, feature maps=64, (4) MaxPool3x3, stride=2, (5) Conv3x3, feature maps=80, (6) Conv3x3, feature maps=192, (7) MaxPool2x2, stride=2, (8) FC9600, (9) FC1000, (10) FC3, (11) Euclidean loss. All networks are trained with a constant 0.001 learning rate and a batch size of 512, until the validation error converges.

Figure 8. Importance of using a local adversarial loss. (Left) an example image that has been generated with a standard 'global' adversarial loss on the whole image. The noise around the edge of the hand contains obvious unrealistic depth boundary artifacts. (Right) the same image generated with a local adversarial loss that looks significantly more realistic.

Figure 9. Using a history of refined images for updating the discriminator. (Left) synthetic images; (middle) result of using the history of refined images; (right) result without using a history of refined images (instead using only the most recent refined images). We observe obvious unrealistic artifacts, especially around the corners of the eyes.

3.2. Hand Pose Estimation from Depth Images
Next, we evaluate our method for hand pose estimation in depth images. We use the NYU hand pose dataset [35] that contains 72,757 training frames and 8,251 testing frames captured by 3 Kinect cameras – one frontal and 2 side views. Each depth frame is labeled with hand pose information that has been used to create a synthetic depth image. Figure 10 shows one such example frame. We pre-process the data by cropping the pixels from real images using the synthetic images. The images are resized to 224 × 224 before passing them to the ConvNet. The background depth values are set to zero and the foreground values are set to the original depth value minus 2000 (assuming that the background is at 2000 millimeters); a sketch of this preprocessing follows at the end of this subsection.

Figure 10. NYU hand pose dataset. (Left) depth frame; (right) corresponding synthetic image.

Qualitative Results: Figure 11 shows example output of SimGAN on the NYU hand pose test set. As is apparent from the figure, the main source of noise in real depth images is depth discontinuity at the edges. SimGAN is able to learn to model this kind of noise without requiring any label information for the real images, resulting in more realistic-looking images for this domain as well.
Quantitative Results: We train a fully convolutional hand pose estimator CNN similar to the Stacked Hourglass Net [22] on real, synthetic and refined synthetic images of the NYU hand pose training set, and evaluate each model on all real images in the NYU hand pose test set. We train on the same 14 hand joints as in [35]. Many state-of-the-art hand pose estimation methods are customized pipelines that consist of several steps. We use only a single deep neural network to analyze the effect of improving the synthetic images, to avoid bias due to other factors. Figure 12 and Table 4 present quantitative results on NYU hand pose. Training on refined synthetic data – the output of SimGAN, which does not require any labeling for the real images – significantly outperforms the model trained on real images with supervision, by 8.8%. The proposed method also outperforms training on synthetic data. We also observe a large improvement as the number of training examples is increased, which comes at zero annotation cost to us as we train on the output of a simulator – here 3x corresponds to training on all views.
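The depth preprocessing described earlier in this subsection might look roughly like the following. Deriving the foreground mask from the non-zero pixels of the synthetic rendering, omitting the crop to the hand region, and using a nearest-neighbour resize are simplifying assumptions for illustration.

import numpy as np

def preprocess_depth(real_depth, synthetic_depth, background_mm=2000.0, size=224):
    # Foreground mask taken from the synthetic rendering (non-zero pixels).
    mask = synthetic_depth > 0
    out = np.zeros_like(real_depth, dtype=np.float32)
    # Foreground: original depth minus the assumed background distance (2000 mm);
    # background pixels stay at zero.
    out[mask] = real_depth[mask] - background_mm
    return resize_nearest(out, size, size)

def resize_nearest(img, h, w):
    # Simple nearest-neighbour resize to avoid extra dependencies.
    ys = (np.arange(h) * img.shape[0] / h).astype(int)
    xs = (np.arange(w) * img.shape[1] / w).astype(int)
    return img[np.ix_(ys, xs)]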
Implementation Details: The architecture is the same as for eye gaze estimation, except that the input image size is 224 × 224, the filter size is 7 × 7, and 10 ResNet blocks are used.
The discriminative net Dφ is: (1) Conv7x7, stride=4, feature maps=96, (2) Conv5x5, stride=2, feature maps=64, (3) MaxPool3x3, stride=2, (4) Conv3x3, stride=2, feature maps=32, (5) Conv1x1, stride=1, feature maps=32, (6) Conv1x1, stride=1, feature maps=2, (7) Softmax. We train the Rθ network first with just the self-regularization loss for 500 steps and Dφ for 200 steps; then, for each update of Dφ we update Rθ twice, i.e. Kd is set to 1 and Kg is set to 2 in Algorithm 1. For hand pose estimation, we use the Stacked Hourglass Net of [22] with 2 hourglass blocks, and an output heatmap size of 64 × 64. We augment at training time with random [−20, 20] degree rotations and crops. All networks are trained until the validation error converges.

Figure 11. Example refined test images for the NYU hand pose dataset [35]. (Left) real images, (right) synthetic images and the corresponding refined output images from the refiner network. The major source of noise in the real images is the non-smooth depth boundaries. The refiner network learns to model the noise present in the real images, importantly without requiring any labels for the real images.

Figure 12. Quantitative results for hand pose estimation on the NYU hand pose test set of real depth images [35]. The plot shows cumulative curves as a function of distance from ground truth keypoint locations, for different numbers of training examples of synthetic and refined images. Training a pose estimator on the output of SimGAN significantly outperforms the same network trained on real images. Importantly, our refiner generative model does not require labeling for the real images.

Table 4. Comparison of a hand pose estimator trained on synthetic data, real data, and the output of SimGAN. The results are at distance d = 5 pixels from ground truth. Training on the output of SimGAN outperforms training on supervised real data by 8.8%, without requiring any supervision.
Training data                  % of images within d
Synthetic Data                        69.7
Refined Synthetic Data                72.4
Real Data                             74.5
Synthetic Data 3x                     77.7
Refined Synthetic Data 3x             83.3

3.3. Analysis of Modifications to Adversarial Training
First we compare local vs global adversarial loss during training. A global adversarial loss uses a fully connected layer in the discriminator network, classifying the whole image as real vs refined. The local adversarial loss removes the artifacts and makes the generated image significantly more realistic, as seen in Figure 8. Next, in Figure 9, we show the result of using a history of refined images, and compare it with standard adversarial training for gaze estimation. As shown in the figure, using the buffer of refined images prevents the severe artifacts observed in standard training, e.g. around the corner of the eyes.

4. Conclusions and Future Work
We have proposed Simulated+Unsupervised learning to refine a simulator's output with unlabeled real data. S+U learning adds realism to the simulator and preserves the global structure and the annotations of the synthetic images. We described SimGAN, our method for S+U learning, that uses an adversarial network, and demonstrated state-of-the-art results without any labeled real data. In the future, we intend to explore modeling the noise distribution to generate more than one refined image for each synthetic image, and to investigate refining videos rather than single images.
References
[1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
[2] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv preprint arXiv:1606.03657, 2016.
[3] T. Darrell, P. Viola, and G. Shakhnarovich. Fast pose estimation with parameter sensitive hashing. In Proc. CVPR, 2015.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
[5] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proc. CVPR, 2016.
[6] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. NIPS, 2014.
[8] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In Proc. CVPR, 2016.
[9] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In Proc. ECCV, 2014.
[10] A. Handa, V. Patraucean, V. Badrinarayanan, S. Stent, and R. Cipolla. SceneNet: Understanding real world indoor scenes with synthetic data. In Proc. CVPR, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[12] D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic. Generating images with recurrent adversarial networks. http://arxiv.org/abs/1602.05110, 2016.
[13] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. PAMI, 36(7):1325–1339, 2014.
[14] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. IJCV, 116(1):1–20, 2016.
[15] I. Krasin, T. Duerig, N. Alldrin, A. Veit, S. Abu-El-Haija, S. Belongie, D. Cai, Z. Feng, V. Ferrari, V. Gomes, A. Gupta, D. Narayanan, C. Sun, G. Chechik, and K. Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2016.
[16] Y. LeCun, F. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proc. CVPR, 2004.
[17] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In Proc. ECCV, 2016.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Proc. ECCV, 2014.
[19] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Proc. NIPS, 2016.
[20] W. Lotter, G. Kreiman, and D. Cox. Unsupervised learning of visual structure using predictive generative networks. arXiv preprint arXiv:1511.06380, 2015.
[21] F. Lu, Y. Sugano, T. Okabe, and Y. Sato. Adaptive linear regression for appearance-based gaze estimation. PAMI, 36(10):2033–2046, 2014.
[22] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. arXiv preprint arXiv:1603.06937, 2016.
[23] D. Park and D. Ramanan. Articulated pose estimation with tiny synthetic videos. In Proc. CVPR, 2015.
[24] X. Peng, B. Sun, K. Ali, and K. Saenko. Learning deep object detectors from 3D models. In Proc. ICCV, 2015.
[25] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In Proc. CVPR, 2012.
[26] W. Qiu and A. Yuille. UnrealCV: Connecting computer vision to Unreal Engine. arXiv preprint arXiv:1609.01326, 2016.
[27] G. Rogez and C. Schmid. MoCap-guided data augmentation for 3D pose estimation in the wild. arXiv preprint arXiv:1607.02046, 2016.
[28] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proc. CVPR, 2016.
[29] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
[30] T. Schneider, B. Schauerte, and R. Stiefelhagen. Manifold alignment for person independent appearance-based gaze estimation. In Proc. ICPR, 2014.
[31] A. Shafaei, J. Little, and M. Schmidt. Play and learn: Using video games to train computer vision models. In Proc. BMVC, 2016.
[32] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, and A. Blake. Efficient human pose estimation from single depth images. PAMI, 35(12):2821–2840, 2013.
[33] Y. Sugano, Y. Matsushita, and Y. Sato. Learning-by-synthesis for appearance-based 3D gaze estimation. In Proc. CVPR, 2014.
[34] J. Supancic, G. Rogez, Y. Yang, J. Shotton, and D. Ramanan. Depth-based hand pose estimation: data, methods, and challenges. In Proc. CVPR, 2015.
[35] J. Tompson, M. Stein, Y. Lecun, and K. Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graphics, 2014.
[36] O. Tuzel, Y. Taguchi, and J. Hershey. Global-local face upsampling network. arXiv preprint arXiv:1603.07235, 2016.
[37] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
[38] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In Proc. ECCV, 2016.
[39] Z. Wang, J. Yang, H. Jin, E. Shechtman, A. Agarwala, J. Brandt, and T. Huang. DeepFont: Identify your font from an image. In Proc. ACMM, 2015.
[40] E. Wood, T. Baltrušaitis, L. Morency, P. Robinson, and A. Bulling. Learning an appearance-based gaze estimator from one million synthesised images. In Proc. ACM Symposium on Eye Tracking Research & Applications, 2016.
[41] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473, 2016.
[42] X. Zhang, Y. Fu, A. Zang, L. Sigal, and G. Agam. Learning classifiers from synthetic data using a multichannel autoencoder. arXiv preprint arXiv:1503.03163, 2015.
[43] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling. Appearance-based gaze estimation in the wild. In Proc. CVPR, 2015.
[44] Y. Zhang, K. Lee, and H. Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In Proc. ICML, 2016.
[45] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. Efros. Generative visual manipulation on the natural image manifold. In Proc. ECCV, 2016.
Additional Experiments

Qualitative Experiments for Appearance-based Gaze Estimation
Dataset: The gaze estimation dataset consists of 1.2M synthetic images from the eye gaze synthesizer UnityEyes [40] and 214K real images from the MPIIGaze dataset [43] – samples are shown in Figure 13. MPIIGaze is a very challenging eye gaze estimation dataset captured under extreme illumination conditions. For UnityEyes we use a single generic rendering environment to generate training data without any dataset-specific targeting.
Qualitative Results: In Figure 14, we show many examples of synthetic and refined images from the eye gaze dataset, arranged in pairs of rows: the top row contains synthetic images, and the bottom row contains the corresponding refined images. As shown, we observe a significant qualitative improvement of the synthetic images: SimGAN successfully captures the skin texture, sensor noise and the appearance of the iris region in the real images. Note that our method preserves the annotation information (gaze direction) while improving the realism.

Qualitative Experiments for Hand Pose Estimation
Dataset: Next, we evaluate our method for hand pose estimation in depth images. We use the NYU hand pose dataset [35] that contains 72,757 training frames and 8,251 testing frames. Each depth frame is labeled with hand pose information that has been used to create a synthetic depth image. We pre-process the data by cropping the pixels from real images using the synthetic images. Figure 15 shows example real depth images from the dataset. The images are resized to 224 × 224 before passing them to the refiner network.
Qualitative Results: We show examples of synthetic and refined hand depth images from the test set in Figure 16, again in pairs of rows: the top row in each pair contains the synthetic depth image, and the bottom row shows the corresponding refined image using the proposed SimGAN approach. Note the realism added to the depth boundary in the refined images, compared to the real images in Figure 15.

Convergence Experiment
To investigate the convergence of our method, we visualize intermediate results as training progresses. As shown in Figure 17, in the beginning the refiner network learns to predict very smooth edges using only the self-regularization loss. As the adversarial loss is enabled, the network starts adding artifacts at the depth boundaries. However, as these artifacts are not the same as in real images, the discriminator easily learns to differentiate between the real and refined images. Slowly the network starts adding realistic noise, and after many steps, the refiner generates very realistic-looking images. We found it helpful to train the network with a low learning rate and for a large number of steps. For NYU hand pose we used a learning rate of 0.0002 in the beginning, and reduced it to 0.00005 after 600,000 steps.
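As a small illustration of the schedule just described, a step learning-rate function might look like this; the function name and its use inside a training loop are assumptions, with the values taken from the NYU hand pose settings above.

def learning_rate(step, base_lr=2e-4, reduced_lr=5e-5, drop_step=600_000):
    # Constant learning rate, reduced once after drop_step refiner updates.
    return base_lr if step < drop_step else reduced_lr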
Figure 13. Example real images from MPIIGaze dataset.

Figure 14. Qualitative results for automatic refinement of simulated eyes. The top row (in each set of two rows) shows the synthetic eye image, and the bottom row shows the corresponding refined image.

Figure 15. Example real test images in the NYU hand dataset.

Figure 16. Qualitative results for automatic refinement of NYU hand depth images. The top row (in each set of two rows) shows the synthetic hand image, and the bottom row is the corresponding refined image. Note how realistic the depth boundaries are compared to real images in Figure 15.
Figure 17. SimGAN output as a function of training iterations for NYU hand pose. Columns correspond to increasing training iterations. The first row shows synthetic images, and the second row shows the corresponding refined images. The first column is the result of training with the ℓ1 image difference loss for 300 steps; the later columns show the result when trained on top of this model. In the beginning, the adversarial part of the cost introduces different kinds of unrealistic noise to try to beat the adversarial network Dφ. As the dueling between Rθ and Dφ progresses, Rθ learns to model the right kind of noise.