Visualizing and Understanding Convolutional Networks
Matthew D. Zeiler zeiler@cs.nyu.edu
Dept. of Computer Science, Courant Institute, New York University
Rob Fergus fergus@cs.nyu.edu
Dept. of Computer Science, Courant Institute, New York University
Abstract
Large Convolutional Network models have
recently demonstrated impressive classifica-
tion performance on the ImageNet bench-
mark (Krizhevsky et al., 2012). However
there is no clear understanding of why they
perform so well, or how they might be im-
proved. In this paper we address both issues.
We introduce a novel visualization technique
that gives insight into the function of inter-
mediate feature layers and the operation of
the classifier. Used in a diagnostic role, these
visualizations allow us to find model architec-
tures that outperform Krizhevsky et al. on
the ImageNet classification benchmark. We
also perform an ablation study to discover
the performance contribution from different
model layers. We show our ImageNet model
generalizes well to other datasets: when the
softmax classifier is retrained, it convincingly
beats the current state-of-the-art results on
Caltech-101 and Caltech-256 datasets.
1. Introduction
Since their introduction by (LeCun et al., 1989) in
the early 1990’s, Convolutional Networks (convnets)
have demonstrated excellent performance at tasks such
as hand-written digit classification and face detec-
tion. In the last year, several papers have shown
that they can also deliver outstanding performance on
more challenging visual classification tasks. (Ciresan
et al., 2012) demonstrate state-of-the-art performance
on NORB and CIFAR-10 datasets. Most notably,
(Krizhevsky et al., 2012) show record beating perfor-
mance on the ImageNet 2012 classification benchmark,
with their convnet model achieving an error rate of
16.4%, compared to the 2nd place result of 26.1%.
Several factors are responsible for this renewed inter-
est in convnet models: (i) the availability of much
larger training sets, with millions of labeled exam-
ples; (ii) powerful GPU implementations, making the
training of very large models practical and (iii) bet-
ter model regularization strategies, such as Dropout
(Hinton et al., 2012).
Despite this encouraging progress, there is still lit-
tle insight into the internal operation and behavior
of these complex models, or how they achieve such
good performance. From a scientific standpoint, this
is deeply unsatisfactory. Without clear understanding
of how and why they work, the development of better
models is reduced to trial-and-error. In this paper we
introduce a visualization technique that reveals the in-
put stimuli that excite individual feature maps at any
layer in the model. It also allows us to observe the
evolution of features during training and to diagnose
potential problems with the model. The visualization
technique we propose uses a multi-layered Deconvo-
lutional Network (deconvnet), as proposed by (Zeiler
et al., 2011), to project the feature activations back to
the input pixel space. We also perform a sensitivity
analysis of the classifier output by occluding portions
of the input image, revealing which parts of the scene
are important for classification.
Using these tools, we start with the architecture of
(Krizhevsky et al., 2012) and explore different archi-
tectures, discovering ones that outperform their results
on ImageNet. We then explore the generalization abil-
ity of the model to other datasets, just retraining the
softmax classifier on top. As such, this is a form of su-
pervised pre-training, which contrasts with the unsu-
pervised pre-training methods popularized by (Hinton
et al., 2006) and others (Bengio et al., 2007; Vincent
et al., 2008). The generalization ability of convnet fea-
tures is also explored in concurrent work by (Donahue
et al., 2013).
1.1. Related Work
Visualizing features to gain intuition about the net-
work is common practice, but mostly limited to the 1st
layer where projections to pixel space are possible. In
higher layers this is not the case, and there are limited
methods for interpreting activity. (Erhan et al., 2009)
find the optimal stimulus for each unit by perform-
ing gradient descent in image space to maximize the
unit’s activation. This requires a careful initialization
and does not give any information about the unit’s in-
variances. Motivated by the latter’s shortcoming, (Le
et al., 2010) (extending an idea by (Berkes & Wiskott,
2006)) show how the Hessian of a given unit may be
computed numerically around the optimal response,
giving some insight into invariances. The problem is
that for higher layers, the invariances are extremely
complex so are poorly captured by a simple quadratic
approximation. Our approach, by contrast, provides a
non-parametric view of invariance, showing which pat-
terns from the training set activate the feature map.
(Donahue et al., 2013) show visualizations that iden-
tify patches within a dataset that are responsible for
strong activations at higher layers in the model. Our
visualizations differ in that they are not just crops of
input images, but rather top-down projections that
reveal structures within each patch that stimulate a
particular feature map.
2. Approach
We use standard fully supervised convnet models
throughout the paper, as defined by (LeCun et al.,
1989) and (Krizhevsky et al., 2012). These models
map a color 2D input image $x_i$, via a series of layers, to a probability vector $\hat{y}_i$ over the $C$ different
classes. Each layer consists of (i) convolution of the
previous layer output (or, in the case of the 1st layer,
the input image) with a set of learned filters; (ii) pass-
ing the responses through a rectified linear function
(relu(x) = max(x, 0)); (iii) [optionally] max pooling
over local neighborhoods and (iv) [optionally] a lo-
cal contrast operation that normalizes the responses
across feature maps. For more details of these opera-
tions, see (Krizhevsky et al., 2012) and (Jarrett et al.,
2009). The top few layers of the network are conven-
tional fully-connected networks and the final layer is
a softmax classifier. Fig. 3 shows the model used in
many of our experiments.
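As a rough illustration (not taken from the paper itself), one such stage can be sketched in Python/PyTorch; the filter count and sizes below echo the first layer of Fig. 3 but are otherwise arbitrary placeholders:

import torch
import torch.nn as nn

# One convnet stage as described above: (i) convolution with learned filters,
# (ii) relu non-linearity, (iii) optional max pooling over local neighborhoods,
# (iv) optional contrast (response) normalization across feature maps.
stage = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=96, kernel_size=7, stride=2),  # (i) filtering
    nn.ReLU(),                                                           # (ii) relu(x) = max(x, 0)
    nn.MaxPool2d(kernel_size=3, stride=2),                               # (iii) max pooling
    nn.LocalResponseNorm(size=5),                                        # (iv) normalization across maps
)

x = torch.randn(1, 3, 224, 224)   # a dummy 224x224 color input image
print(stage(x).shape)             # feature maps produced by this stage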
We train these models using a large set of $N$ labeled images $\{x, y\}$, where label $y_i$ is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare $\hat{y}_i$ and $y_i$. The parameters of the network (fil-
ters in the convolutional layers, weight matrices in the
fully-connected layers and biases) are trained by back-
propagating the derivative of the loss with respect to
the parameters throughout the network, and updating
the parameters via stochastic gradient descent. Full
details of training are given in Section 3.
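A minimal, hedged sketch of this training step (cross-entropy loss between prediction and true label, gradients by backpropagation, update by stochastic gradient descent); the model and mini-batch are placeholders, and the learning rate and momentum quoted in the comment are the values given in Section 3:

import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, labels):
    # Forward pass: class scores over the C classes for this mini-batch.
    logits = model(images)
    # Cross-entropy loss comparing the predicted distribution with the true labels y_i.
    loss = F.cross_entropy(logits, labels)
    # Backpropagate the derivative of the loss with respect to all parameters,
    # then take one stochastic gradient descent step.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)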
2.1. Visualization with a Deconvnet
Understanding the operation of a convnet requires in-
terpreting the feature activity in intermediate layers.
We present a novel way to map these activities back to
the input pixel space, showing what input pattern orig-
inally caused a given activation in the feature maps.
We perform this mapping with a Deconvolutional Net-
work (deconvnet) (Zeiler et al., 2011). A deconvnet
can be thought of as a convnet model that uses the
same components (filtering, pooling) but in reverse, so
instead of mapping pixels to features does the oppo-
site. In (Zeiler et al., 2011), deconvnets were proposed
as a way of performing unsupervised learning. Here,
they are not used in any learning capacity, just as a
probe of an already trained convnet.
To examine a convnet, a deconvnet is attached to each
of its layers, as illustrated in Fig. 1(top), providing a
continuous path back to image pixels. To start, an
input image is presented to the convnet and features
computed throughout the layers. To examine a given
convnet activation, we set all other activations in the
layer to zero and pass the feature maps as input to
the attached deconvnet layer. Then we successively
(i) unpool, (ii) rectify and (iii) filter to reconstruct
the activity in the layer beneath that gave rise to the
chosen activation. This is then repeated until input
pixel space is reached.
Unpooling: In the convnet, the max pooling opera-
tion is non-invertible, however we can obtain an ap-
proximate inverse by recording the locations of the
maxima within each pooling region in a set of switch
variables. In the deconvnet, the unpooling operation
uses these switches to place the reconstructions from
the layer above into appropriate locations, preserving
the structure of the stimulus. See Fig. 1(bottom) for
an illustration of the procedure.
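In a framework that exposes pooling indices, the switch mechanism can be sketched directly; a minimal example (not the authors' code) using PyTorch's max_pool2d with return_indices together with max_unpool2d, with arbitrary tensor sizes:

import torch
import torch.nn.functional as F

x = torch.randn(1, 96, 55, 55)   # feature maps entering a pooling layer

# Convnet side: max pool and record the location of each maximum ("switches").
pooled, switches = F.max_pool2d(x, kernel_size=3, stride=2, return_indices=True)

# Deconvnet side: place values from the layer above back at the recorded
# locations, leaving other positions zero, which preserves the stimulus structure.
unpooled = F.max_unpool2d(pooled, switches, kernel_size=3, stride=2, output_size=x.shape[2:])

print(pooled.shape, unpooled.shape)   # the unpooled maps regain the original spatial size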
Rectification: The convnet uses relu non-linearities,
which rectify the feature maps thus ensuring the fea-
ture maps are always positive. To obtain valid fea-
ture reconstructions at each layer (which also should
be positive), we pass the reconstructed signal through
a relu non-linearity.
Filtering: The convnet uses learned filters to con-
volve the feature maps from the previous layer. To
invert this, the deconvnet uses transposed versions of
the same filters, but applied to the rectified maps, not
the output of the layer beneath. In practice this means
flipping each filter vertically and horizontally.
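A hedged sketch of this filtering step: applying the transposed (horizontally and vertically flipped) filters is what a transposed convolution computes, so a call such as PyTorch's conv_transpose2d can stand in for it; the random weights below are stand-ins for learned first-layer filters:

import torch
import torch.nn.functional as F

weights = torch.randn(96, 3, 7, 7)             # learned 1st-layer filters: 96 filters over 3 channels
rectified_maps = torch.randn(1, 96, 109, 109)  # rectified feature maps being projected down

# Deconvnet filtering: convolve with transposed versions of the same filters.
# conv_transpose2d with the convnet's own weight tensor performs this projection
# from 96 feature channels back towards the 3-channel input pixel space.
reconstruction = F.conv_transpose2d(rectified_maps, weights, stride=2)

print(reconstruction.shape)                    # (1, 3, H, W): approximate input-space signal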
Projecting down from higher layers uses the switch
settings generated by the max pooling in the convnet
on the way up. As these switch settings are peculiar
to a given input image, the reconstruction obtained
from a single activation thus resembles a small piece
of the original input image, with structures weighted
according to their contribution toward to the feature
activation. Since the model is trained discriminatively,
they implicitly show which parts of the input image
are discriminative. Note that these projections are not
samples from the model, since there is no generative
process involved.
[Figure 1 diagram. Convnet (bottom-up): pooled maps from the layer below → convolutional filtering {F} → feature maps → rectified linear function → rectified feature maps → max pooling (switch locations recorded) → pooled maps. Deconvnet (top-down): reconstruction from the layer above → max unpooling (using the switches) → unpooled maps → rectified linear function → convolutional filtering {F^T} → reconstruction.]
Figure 1. Top: A deconvnet layer (left) attached to a con-
vnet layer (right). The deconvnet will reconstruct an ap-
proximate version of the convnet features from the layer
beneath. Bottom: An illustration of the unpooling oper-
ation in the deconvnet, using switches which record the
location of the local max in each pooling region (colored
zones) during pooling in the convnet.
3. Training Details
We now describe the large convnet model that will be
visualized in Section 4. The architecture, shown in
Fig. 3, is similar to that used by (Krizhevsky et al.,
2012) for ImageNet classification. One difference is
that the sparse connections used in Krizhevsky’s lay-
ers 3,4,5 (due to the model being split across 2 GPUs)
are replaced with dense connections in our model.
Other important differences relating to layers 1 and
2 were made following inspection of the visualizations
in Fig. 6, as described in Section 4.1.
The model was trained on the ImageNet 2012 train-
ing set (1.3 million images, spread over 1000 different
classes). Each RGB image was preprocessed by resiz-
ing the smallest dimension to 256, cropping the center
256x256 region, subtracting the per-pixel mean (across
all images) and then using 10 different sub-crops of size
224x224 (corners + center with(out) horizontal flips).
Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of $10^{-2}$, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to $10^{-2}$ and biases are set to 0.
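A hedged sketch of the image preprocessing described above, using torchvision's TenCrop for the corner/center crops and their horizontal flips; the file path and per-pixel mean image are placeholders (the paper subtracts a mean computed over all training images):

import torch
from PIL import Image
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize(256),        # resize so the smallest dimension is 256
    transforms.CenterCrop(256),    # crop the central 256x256 region
    transforms.TenCrop(224),       # 4 corners + center, with and without horizontal flips
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

image = Image.open("example.jpg").convert("RGB")   # placeholder path
crops = ten_crop(image)                            # shape (10, 3, 224, 224)

mean_image = torch.zeros(3, 224, 224)              # stand-in for the per-pixel training-set mean
crops = crops - mean_image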
Visualization of the first layer filters during training
reveals that a few of them dominate, as shown in
Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of $10^{-1}$ to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in
(Krizhevsky et al., 2012), we produce multiple differ-
ent crops and flips of each training example to boost
training set size. We stopped training after 70 epochs,
which took around 12 days on a single GTX580 GPU,
using an implementation based on (Krizhevsky et al.,
2012).
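A minimal sketch of the RMS renormalization step described above, under the assumption that the filters live in a 4-D weight tensor; the fixed radius of 0.1 is the value quoted in the text:

import torch

def renormalize_filters(weight, radius=0.1):
    # weight has shape (num_filters, in_channels, height, width); RMS is per filter.
    rms = weight.pow(2).mean(dim=(1, 2, 3), keepdim=True).sqrt()
    # Rescale only those filters whose RMS exceeds the fixed radius.
    scale = torch.where(rms > radius, radius / rms, torch.ones_like(rms))
    return weight * scale

filters = torch.randn(96, 3, 7, 7) * 0.05   # stand-in for learned first-layer filters
filters = renormalize_filters(filters)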
4. Convnet Visualization
Using the model described in Section 3, we now use
the deconvnet to visualize the feature activations on
the ImageNet validation set.
Feature Visualization: Fig. 2 shows feature visu-
alizations from our model once training is complete.
However, instead of showing the single strongest ac-
tivation for a given feature map, we show the top 9
activations. Projecting each separately down to pixel
space reveals the different structures that excite a
given feature map, hence showing its invariance to in-
put deformations. Alongside these visualizations we
show the corresponding image patches. These have
greater variation than visualizations as the latter solely
focus on the discriminant structure within each patch.
For example, in layer 5, row 1, col 2, the patches ap-
pear to have little in common, but the visualizations
reveal that this particular feature map focuses on the
grass in the background, not the foreground objects.
Figure 2. Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations in a random subset
of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach.
Our reconstructions are not samples from the model: they are reconstructed patterns from the validation set that cause
high activations in a given feature map. For each feature map we also show the corresponding image patches. Note:
(i) the strong grouping within each feature map, (ii) greater invariance at higher layers and (iii) exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1, cols 1). Best viewed in electronic form.
The projections from each layer show the hierarchi-
cal nature of the features in the network. Layer 2 re-
sponds to corners and other edge/color conjunctions.
Layer 3 has more complex invariances, capturing sim-
ilar textures (e.g. mesh patterns (Row 1, Col 1); text
(R2,C4)). Layer 4 shows significant variation, but
is more class-specific: dog faces (R1,C1); bird’s legs
(R4,C2). Layer 5 shows entire objects with significant
pose variation, e.g. keyboards (R1,C11) and dogs (R4).
Feature Evolution during Training: Fig. 4 visu-
alizes the progression during training of the strongest
activation (across all training examples) within a given
feature map projected back to pixel space. Sudden
jumps in appearance result from a change in the image
from which the strongest activation originates. The
lower layers of the model can be seen to converge
within a few epochs. However, the upper layers only
develop after a considerable number of epochs
(40-50), demonstrating the need to let the models train
until fully converged.
Feature Invariance: Fig. 5 shows 5 sample images
being translated, rotated and scaled by varying degrees
while looking at the changes in the feature vectors from
the top and bottom layers of the model, relative to the
untransformed feature. Small transformations have a
dramatic effect in the first layer of the model, but a
lesser impact at the top feature layer, being quasi-
linear for translation & scaling. The network output
is stable to translations and scalings. In general, the
output is not invariant to rotation, except for objects with rotational symmetry (e.g. entertainment center).
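The underlying measurement is simple; a hedged sketch follows, where the feature extractor and transform functions are placeholders and the "canonical distance" of Fig. 5 is taken to be the Euclidean distance to the untransformed image's features:

import torch

def feature_distance(feature_extractor, image, transform):
    # Euclidean distance between the feature vectors of the original image
    # and its transformed (translated, rotated or scaled) version.
    with torch.no_grad():
        f_orig = feature_extractor(image.unsqueeze(0)).flatten()
        f_trans = feature_extractor(transform(image).unsqueeze(0)).flatten()
    return torch.norm(f_orig - f_trans).item()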
4.1. Architecture Selection
While visualization of a trained model gives insight
into its operation, it can also assist with selecting good
architectures in the first place. By visualizing the first
and second layers of Krizhevsky et al.’s architecture
(Fig. 6(b) & (d)), various problems are apparent. The
first layer filters are a mix of extremely high and low
frequency information, with little coverage of the mid
frequencies. Additionally, the 2nd layer visualization
shows aliasing artifacts caused by the large stride 4
used in the 1st layer convolutions. To remedy these
problems, we (i) reduced the 1st layer filter size from
11x11 to 7x7 and (ii) made the stride of the convolu-
tion 2, rather than 4. This new architecture retains
much more information in the 1st and 2nd layer fea-
tures, as shown in Fig. 6(c) & (e). More importantly, it
also improves the classification performance as shown
in Section 5.1.
4.2. Occlusion Sensitivity
With image classification approaches, a natural ques-
tion is whether the model is truly identifying the location of
the object in the image, or just using the surround-
ing context. Fig. 7 attempts to answer this question
by systematically occluding different portions of the
input image with a grey square, and monitoring the
output of the classifier. The examples clearly show
the model is localizing the objects within the scene,
as the probability of the correct class drops signifi-
cantly when the object is occluded. Fig. 7 also shows
visualizations from the strongest feature map of the
top convolution layer, in addition to activity in this
map (summed over spatial locations) as a function of
occluder position. When the occluder covers the im-
age region that appears in the visualization, we see a
strong drop in activity in the feature map. This shows
that the visualization genuinely corresponds to the im-
age structure that stimulates that feature map, hence
validating the other visualizations shown in Fig. 4 and
Fig. 2.
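A hedged sketch of such an occlusion sweep; the model, image tensor, class index, occluder size and gray value below are placeholders, not the exact settings used in the paper:

import torch

def occlusion_map(model, image, true_class, patch=64, stride=16, gray=0.5):
    # Slide a gray square over the image and record the probability of the
    # true class at each occluder position; low values mark image regions
    # the classifier depends on.
    model.eval()
    _, height, width = image.shape
    heatmap = []
    with torch.no_grad():
        for top in range(0, height - patch + 1, stride):
            row = []
            for left in range(0, width - patch + 1, stride):
                occluded = image.clone()
                occluded[:, top:top + patch, left:left + patch] = gray
                probs = torch.softmax(model(occluded.unsqueeze(0)), dim=1)
                row.append(probs[0, true_class].item())
            heatmap.append(row)
    return torch.tensor(heatmap)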
4.3. Correspondence Analysis
Deep models differ from many existing recognition ap-
proaches in that there is no explicit mechanism for
establishing correspondence between specific object
parts in different images (e.g. faces have a particular
spatial configuration of the eyes and nose). However,
an intriguing possibility is that deep models might be
implicitly computing them. To explore this, we take 5
randomly drawn dog images with frontal pose and sys-
tematically mask out the same part of the face in each
image (e.g. all left eyes, see Fig. 8). For each image i,
we then compute: $\epsilon^l_i = x^l_i - \tilde{x}^l_i$, where $x^l_i$ and $\tilde{x}^l_i$ are the feature vectors at layer $l$ for the original and occluded images respectively. We then measure the consistency of this difference vector between all related image pairs $(i, j)$: $\Delta_l = \sum_{i,j=1,\, i \neq j}^{5} \mathcal{H}\left(\mathrm{sign}(\epsilon^l_i), \mathrm{sign}(\epsilon^l_j)\right)$, where $\mathcal{H}$ is the Hamming distance. A lower value indicates greater consistency in the change resulting from
the masking operation, hence tighter correspondence
between the same object parts in different images
(i.e. blocking the left eye changes the feature repre-
sentation in a consistent way). In Table 1 we compare
the ∆ score for three parts of the face (left eye, right
eye and nose) to random parts of the object, using fea-
tures from layer l = 5 and l = 7. The lower score for
these parts, relative to random object regions, for the
layer 5 features shows the model does establish some
degree of correspondence.
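A minimal sketch of this consistency measure, assuming the layer-l feature vectors for the 5 original and occluded images have already been extracted and stacked row-wise; normalizing the Hamming distance by the feature dimension and the number of pairs is an assumption made to mirror the "mean feature sign change" reported in Table 1:

import torch

def correspondence_score(original_feats, occluded_feats):
    # eps_i = x_i - x_tilde_i: change in the layer-l feature vector caused by occlusion.
    eps_sign = torch.sign(original_feats - occluded_feats)   # shape (5, D)
    n = eps_sign.shape[0]
    score = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # Hamming distance between sign patterns, normalized by the dimension D.
                score += (eps_sign[i] != eps_sign[j]).float().mean().item()
    return score / (n * (n - 1))   # averaged over all ordered pairs i != j

# Lower scores indicate a more consistent feature change across images,
# i.e. tighter correspondence between the same object part in different images.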
[Figure 3 diagram: 224x224x3 input image → Layer 1: 96 filters of size 7x7, stride 2 (110x110), 3x3 max pool, stride 2, contrast norm. (55x55) → Layer 2: 256 maps, filter size 5, stride 2 (26x26), 3x3 max pool, stride 2, contrast norm. (13x13) → Layer 3: 384 maps, 3x3, stride 1 (13x13) → Layer 4: 384 maps, 3x3, stride 1 (13x13) → Layer 5: 256 maps, 3x3, stride 1, 3x3 max pool, stride 2 (6x6) → Layer 6: 4096 units → Layer 7: 4096 units → Output: C-class softmax.]
Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as
the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y.
The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within
3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature
maps. Similar operations are repeated in layers 2,3,4,5. The last two layers are fully connected, taking features from
the top convolutional layer as input in vector form (6 · 6 · 256 = 9216 dimensions). The final layer is a C-way softmax
function, C being the number of classes. All filters and feature maps are square in shape.
Figure 4. Evolution of a randomly chosen subset of model features through training. Each layer’s features are displayed
in a different block. Within each block, we show a randomly chosen subset of features at epochs [1,2,5,10,20,30,40,64].
The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to
pixel space using our deconvnet approach. Color contrast is artificially enhanced and the figure is best viewed in electronic
form.
[Figure 5 plots (panels a1-c4): rows correspond to vertical translation, scale and rotation; columns show the Euclidean (canonical) distance in layers 1 and 7 and P(true class), for five example images: Lawn Mower, Shih Tzu, African Crocodile, African Grey, Entertainment Center.]
Figure 5. Analysis of vertical translation, scale, and rotation invariance within the model (rows a-c respectively). Col 1: 5
example images undergoing the transformations. Col 2 & 3: Euclidean distance between feature vectors from the original
and transformed images in layers 1 and 7 respectively. Col 4: the probability of the true label for each image, as the
image is transformed.
Figure 6. (a): 1st layer features without feature scale clipping. Note that one feature dominates. (b): 1st layer features
from (Krizhevsky et al., 2012). (c): Our 1st layer features. The smaller stride (2 vs 4) and filter size (7x7 vs 11x11)
result in more distinctive features and fewer “dead” features. (d): Visualizations of 2nd layer features from (Krizhevsky et al., 2012). (e): Visualizations of our 2nd layer features. These are cleaner, without the aliasing artifacts that are visible in (d).
[Figure 7 panels, for three examples with true labels Pomeranian, Car Wheel and Afghan Hound: (a) input image; (b) layer 5, strongest feature map; (c) layer 5, strongest feature map projections; (d) classifier, probability of correct class; (e) classifier, most probable class.]
Figure 7. Three test examples where we systematically cover up different portions of the scene with a gray square (1st
column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) change. (b): for each position of the gray square, we record the total activation in one layer 5 feature map (the one with the strongest response
in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square),
along with visualizations of this map from other images. The first row example shows the strongest feature to be the
dog’s face. When this is covered-up the activity in the feature map decreases (blue area in (b)). (d): a map of correct
class probability, as a function of the position of the gray square. E.g. when the dog’s face is obscured, the probability
for “pomeranian” drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row,
for most locations it is “pomeranian”, but if the dog’s face is obscured but not the ball, then it predicts “tennis ball”. In
the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The
3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive
to the dog (blue region in (d)), since it uses multiple feature maps.
Figure 8. Images used for correspondence experiments.
Col 1: Original image. Col 2,3,4: Occlusion of the right
eye, left eye, and nose respectively. Other columns show
examples of random occlusions.
Occlusion Location    Mean Feature Sign Change, Layer 5    Mean Feature Sign Change, Layer 7
Right Eye             0.067 ± 0.007                        0.069 ± 0.015
Left Eye              0.069 ± 0.007                        0.068 ± 0.013
Nose                  0.079 ± 0.017                        0.069 ± 0.011
Random                0.107 ± 0.017                        0.073 ± 0.014
Table 1. Measure of correspondence for different object
parts in 5 different dog images. The lower scores for the
eyes and nose (compared to random object parts) show the
model implicitly establishing some form of correspondence
of parts at layer 5 in the model. At layer 7, the scores
are more similar, perhaps due to upper layers trying to
discriminate between the different breeds of dog.
5. Experiments
5.1. ImageNet 2012
This dataset consists of 1.3M/50k/100k train-
ing/validation/test examples, spread over 1000 cate-
gories. Table 2 shows our results on this dataset.
Using the exact architecture specified in (Krizhevsky
et al., 2012), we attempt to replicate their result on the
validation set. We achieve an error rate within 0.1% of
their reported value on the ImageNet 2012 validation
set.
Next we analyze the performance of our model with
the architectural changes outlined in Section 4.1 (7×7
filters in layer 1 and stride 2 convolutions in layers 1
& 2). This model, shown in Fig. 3, significantly out-
performs the architecture of (Krizhevsky et al., 2012),
beating their single model result by 1.7% (test top-5).
When we combine multiple models, we obtain a test error of 14.8%, the best published performance on this dataset (despite only using the 2012 training set). This performance has been surpassed in the recent ImageNet 2013 competition (http://www.image-net.org/challenges/LSVRC/2013/results.php). We note that this error is almost half that of the top non-convnet entry in the ImageNet 2012 classification challenge, which obtained 26.2% error (Gunji et al., 2012).
Error %                                                            Val Top-1   Val Top-5   Test Top-5
(Gunji et al., 2012)                                                   -           -          26.2
(Krizhevsky et al., 2012), 1 convnet                                 40.7        18.2          -
(Krizhevsky et al., 2012), 5 convnets                                38.1        16.4         16.4
(Krizhevsky et al., 2012)∗, 1 convnet                                39.0        16.6          -
(Krizhevsky et al., 2012)∗, 7 convnets                               36.7        15.4         15.3
Our replication of (Krizhevsky et al., 2012), 1 convnet              40.5        18.1          -
1 convnet as per Fig. 3                                              38.4        16.5          -
5 convnets as per Fig. 3 – (a)                                       36.7        15.3         15.3
1 convnet as per Fig. 3, layers 3,4,5: 512,1024,512 maps – (b)       37.5        16.0         16.1
6 convnets, (a) & (b) combined                                       36.0        14.7         14.8
Table 2. ImageNet 2012 classification error rates. The ∗
indicates models that were trained on both ImageNet 2011
and 2012 training sets.
Varying ImageNet Model Sizes: In Table 3, we
first explore the architecture of (Krizhevsky et al.,
2012) by adjusting the size of layers, or removing
them entirely. In each case, the model is trained from
scratch with the revised architecture. Removing the
fully connected layers (6,7) only gives a slight increase
in error. This is surprising, given that they contain
the majority of model parameters. Removing two of
the middle convolutional layers also makes a relatively
small difference to the error rate. However, removing
both the middle convolution layers and the fully con-
nected layers yields a model with only 4 layers whose
performance is dramatically worse. This would sug-
gest that the overall depth of the model is important
for obtaining good performance. In Table 3, we modify
our model, shown in Fig. 3. Changing the size of the
fully connected layers makes little difference to perfor-
mance (same for model of (Krizhevsky et al., 2012)).
However, increasing the size of the middle convolution
layers does give a useful gain in performance. But increasing these, while also enlarging the fully connected layers, results in over-fitting.
5.2. Feature Generalization
The experiments above show the importance of the
convolutional part of our ImageNet model in obtain-
ing state-of-the-art performance. This is supported by
the visualizations of Fig. 2 which show the complex in-
variances learned in the convolutional layers. We now
explore the ability of these feature extraction layers to
generalize to other datasets, namely Caltech-101 (Fei-
fei et al., 2006), Caltech-256 (Griffin et al., 2006) and
PASCAL VOC 2012. To do this, we keep layers 1-7
of our ImageNet-trained model fixed and train a new
Error %                                                             Train Top-1   Val Top-1   Val Top-5
Our replication of (Krizhevsky et al., 2012), 1 convnet                 35.1         40.5        18.1
Removed layers 3,4                                                      41.8         45.4        22.1
Removed layer 7                                                         27.4         40.0        18.4
Removed layers 6,7                                                      27.4         44.8        22.4
Removed layers 3,4,6,7                                                  71.1         71.3        50.1
Adjust layers 6,7: 2048 units                                           40.3         41.7        18.8
Adjust layers 6,7: 8192 units                                           26.8         40.0        18.1
Our Model (as per Fig. 3)                                               33.1         38.4        16.5
Adjust layers 6,7: 2048 units                                           38.2         40.2        17.6
Adjust layers 6,7: 8192 units                                           22.0         38.8        17.0
Adjust layers 3,4,5: 512,1024,512 maps                                  18.8         37.5        16.0
Adjust layers 6,7: 8192 units and layers 3,4,5: 512,1024,512 maps       10.0         38.3        16.9
Table 3. ImageNet 2012 classification error rates with var-
ious architectural changes to the model of (Krizhevsky
et al., 2012) and our model (see Fig. 3).
softmax classifier on top (for the appropriate number
of classes) using the training images of the new dataset.
Since the softmax contains relatively few parameters,
it can be trained quickly from a relatively small num-
ber of examples, as is the case for certain datasets.
The classifiers used by our model (a softmax) and
other approaches (typically a linear SVM) are of simi-
lar complexity, thus the experiments compare our fea-
ture representation, learned from ImageNet, with the
hand-crafted features used by other methods. It is im-
portant to note that both our feature representation
and the hand-crafted features are designed using im-
ages beyond the Caltech and PASCAL training sets.
For example, the hyper-parameters in HOG descrip-
tors were determined through systematic experiments
on a pedestrian dataset (Dalal & Triggs, 2005). We
also try a second strategy of training a model from
scratch, i.e. resetting layers 1-7 to random values and
training them, as well as the softmax, on the training
images of the dataset.
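A hedged sketch of this transfer setup: keep the pre-trained feature layers (1-7) fixed and train only a fresh softmax classifier for the target dataset's classes; pretrained_features, feature_dim and num_classes are placeholders, as are the optimizer settings:

import torch
import torch.nn as nn

def build_transfer_model(pretrained_features, feature_dim, num_classes):
    # Layers 1-7 of the ImageNet-trained model stay fixed.
    for p in pretrained_features.parameters():
        p.requires_grad = False
    # New softmax classifier (a linear layer trained with cross-entropy).
    classifier = nn.Linear(feature_dim, num_classes)
    model = nn.Sequential(pretrained_features, classifier)
    # Only the classifier's parameters are updated.
    optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-2, momentum=0.9)
    return model, optimizer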
One complication is that some of the Caltech datasets
have some images that are also in the ImageNet train-
ing data. Using normalized correlation, we identified
these few “overlap” images (for Caltech-101, we found 44 images in common out of 9,144 total images, with a maximum overlap of 10 for any given class; for Caltech-256, 243 images in common out of 30,607, with a maximum overlap of 18 for any given class) and removed them from our ImageNet training set and then retrained our ImageNet models, thus avoiding the possibility of train/test contamination.
Caltech-101: We follow the procedure of (Fei-fei
et al., 2006) and randomly select 15 or 30 images per
class for training and test on up to 50 images per class, reporting the average of the per-class accuracies in Table 4, using 5 train/test folds.
[Figure 9 plot: Caltech-256 accuracy (%) as a function of training images per class, comparing Our Model with Bo et al. and Sohn et al.]
Figure 9. Caltech-256 classification performance as the
number of training images per class is varied. Using only
6 training examples per class with our pre-trained feature
extractor, we surpass the best reported result by (Bo et al.,
2013).
Training took 17 minutes for 30 images/class. The pre-trained model beats the best reported result for 30 images/class from (Bo et al., 2013) by 2.2%. The convnet model trained from scratch, however, does terribly, only achieving 46.5%.
# Train                          Acc % (15/class)   Acc % (30/class)
(Bo et al., 2013)                      -              81.4 ± 0.33
(Jianchao et al., 2009)               73.2              84.3
Non-pretrained convnet            22.8 ± 1.5         46.5 ± 1.7
ImageNet-pretrained convnet       83.8 ± 0.5         86.5 ± 0.5
Table 4. Caltech-101 classification accuracy for our con-
vnet models, against two leading alternate approaches.
Caltech-256: We follow the procedure of (Griffin
et al., 2006), selecting 15, 30, 45, or 60 training im-
ages per class, reporting the average of the per-class
accuracies in Table 5. Our ImageNet-pretrained model
beats the current state-of-the-art results obtained by
Bo et al. (Bo et al., 2013) by a significant margin:
74.2% vs 55.2% for 60 training images/class. However,
as with Caltech-101, the model trained from scratch
does poorly. In Fig. 9, we explore the “one-shot learn-
ing” (Fei-fei et al., 2006) regime. With our pre-trained
model, just 6 Caltech-256 training images are needed
to beat the leading method using 10 times as many im-
ages. This shows the power of the ImageNet feature
extractor.
# Train                 Acc % (15/class)   Acc % (30/class)   Acc % (45/class)   Acc % (60/class)
(Sohn et al., 2011)          35.1               42.1               45.7               47.9
(Bo et al., 2013)        40.5 ± 0.4         48.0 ± 0.2         51.9 ± 0.2         55.2 ± 0.3
Non-pretr.                9.0 ± 1.4         22.5 ± 0.7         31.2 ± 0.5         38.8 ± 1.4
ImageNet-pretr.          65.7 ± 0.2         70.6 ± 0.2         72.7 ± 0.4         74.2 ± 0.3
Table 5. Caltech 256 classification accuracies.
PASCAL 2012: We used the standard training and
validation images to train a 20-way softmax on top of
the ImageNet-pretrained convnet. This is not ideal, as
PASCAL images can contain multiple objects and our
model just provides a single exclusive prediction for
each image. Table 6 shows the results on the test set.
The PASCAL and ImageNet images are quite differ-
ent in nature, the former being full scenes unlike the
latter. This may explain our mean performance being
3.2% lower than the leading (Yan et al., 2012) result,
however we do beat them on 5 classes, sometimes by
large margins.
Acc %        [A]    [B]   Ours      Acc %        [A]    [B]   Ours
Airplane    92.0   97.3   96.0      Dining tab  63.2   77.8   67.7
Bicycle     74.2   84.2   77.1      Dog         68.9   83.0   87.8
Bird        73.0   80.8   88.4      Horse       78.2   87.5   86.0
Boat        77.5   85.3   85.5      Motorbike   81.0   90.1   85.1
Bottle      54.3   60.8   55.8      Person      91.6   95.0   90.9
Bus         85.2   89.9   85.8      Potted pl   55.9   57.8   52.2
Car         81.9   86.8   78.6      Sheep       69.4   79.2   83.6
Cat         76.4   89.3   91.2      Sofa        65.4   73.4   61.1
Chair       65.2   75.4   65.0      Train       86.7   94.5   91.8
Cow         63.2   77.8   74.4      Tv          77.4   80.7   76.1
Mean        74.3   82.2   79.0      # won          0     15      5
Table 6. PASCAL 2012 classification results, comparing
our Imagenet-pretrained convnet against the leading two
methods ([A]= (Sande et al., 2012) and [B] = (Yan et al.,
2012)).
5.3. Feature Analysis
We explore how discriminative the features in each
layer of our Imagenet-pretrained model are. We do this
by varying the number of layers retained from the ImageNet model and placing either a linear SVM or softmax
classifier on top. Table 7 shows results on Caltech-
101 and Caltech-256. For both datasets, a steady im-
provement can be seen as we ascend the model, with
best results being obtained by using all layers. This
supports the premise that as the feature hierarchies
become deeper, they learn increasingly powerful fea-
tures.
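As an illustrative sketch of the experiment behind Table 7 (not the authors' code), a linear classifier can be trained on feature vectors extracted from a chosen layer; scikit-learn's LinearSVC is used here as a stand-in solver, and the arrays are random placeholders for extracted features:

import numpy as np
from sklearn.svm import LinearSVC

# Placeholder feature matrices: each row is an image, each column an activation
# taken from a chosen layer of the ImageNet-pretrained convnet (e.g. layer 5 or 7).
X_train = np.random.randn(200, 9216)
y_train = np.random.randint(0, 102, size=200)
X_test = np.random.randn(50, 9216)

clf = LinearSVC(C=1.0)        # linear classifier of similar complexity to a softmax
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)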
                Cal-101 (30/class)   Cal-256 (60/class)
SVM (1)            44.8 ± 0.7           24.6 ± 0.4
SVM (2)            66.2 ± 0.5           39.6 ± 0.3
SVM (3)            72.3 ± 0.4           46.0 ± 0.3
SVM (4)            76.6 ± 0.4           51.3 ± 0.1
SVM (5)            86.2 ± 0.8           65.6 ± 0.3
SVM (7)            85.5 ± 0.4           71.7 ± 0.2
Softmax (5)        82.9 ± 0.4           65.7 ± 0.5
Softmax (7)        85.4 ± 0.4           72.6 ± 0.1
Table 7. Analysis of the discriminative information con-
tained in each layer of feature maps within our ImageNet-
pretrained convnet. We train either a linear SVM or soft-
max on features from different layers (as indicated in brack-
ets) from the convnet. Higher layers generally produce
more discriminative features.
6. Discussion
We explored large convolutional neural network mod-
els, trained for image classification, in a number of ways.
First, we presented a novel way to visualize the ac-
tivity within the model. This reveals the features to
be far from random, uninterpretable patterns. Rather,
they show many intuitively desirable properties such as
compositionality, increasing invariance and class dis-
crimination as we ascend the layers. We also showed
how these visualizations can be used to debug problems with the model to obtain better results, for example improving on Krizhevsky et al.’s (Krizhevsky et al., 2012) impressive ImageNet 2012 result. We
then demonstrated through a series of occlusion exper-
iments that the model, while trained for classification,
is highly sensitive to local structure in the image and is
not just using broad scene context. An ablation study
on the model revealed that having a minimum depth
to the network, rather than any individual section, is
vital to the model’s performance.
Finally, we showed how the ImageNet trained model
can generalize well to other datasets. For Caltech-101
and Caltech-256, the datasets are similar enough that
we can beat the best reported results, in the latter case
by a significant margin. This result brings into ques-
tion the utility of benchmarks with small (i.e. $< 10^4$)
training sets. Our convnet model generalized less well
to the PASCAL data, perhaps suffering from dataset
bias (Torralba & Efros, 2011), although it was still
within 3.2% of the best reported result, despite no tun-
ing for the task. For example, our performance might
improve if a different loss function was used that per-
mitted multiple objects per image. This would natu-
rally enable the networks to tackle object detection as well.
Acknowledgments
The authors are very grateful for support by NSF grant
IIS-1116923, Microsoft Research and a Sloan Fellow-
ship.
References
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle,
H. Greedy layer-wise training of deep networks. In
NIPS, pp. 153–160, 2007.
Berkes, P. and Wiskott, L. On the analysis and in-
terpretation of inhomogeneous quadratic forms as
receptive fields. Neural Computation, 2006.
Bo, L., Ren, X., and Fox, D. Multipath sparse coding
using hierarchical matching pursuit. In CVPR, 2013.
Ciresan, D. C., Meier, J., and Schmidhuber, J. Multi-
column deep neural networks for image classifica-
tion. In CVPR, 2012.
Dalal, N. and Triggs, B. Histograms of oriented gra-
dients for pedestrian detection. In CVPR, 2005.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang,
N., Tzeng, E., and Darrell, T. DeCAF: A deep con-
volutional activation feature for generic visual recog-
nition. In arXiv:1310.1531, 2013.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P.
Visualizing higher-layer features of a deep network.
In Technical report, University of Montreal, 2009.
Fei-fei, L., Fergus, R., and Perona, P. One-shot learn-
ing of object categories. IEEE Trans. PAMI, 2006.
Griffin, G., Holub, A., and Perona, P. The caltech 256.
In Caltech Technical Report, 2006.
Gunji, N., Higuchi, T., Yasumoto, K., Muraoka, H.,
Ushiku, Y., Harada, T., and Kuniyoshi, Y. Classifi-
cation entry. In Imagenet Competition, 2012.
Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learn-
ing algorithm for deep belief nets. Neural Computa-
tion, 18:1527–1554, 2006.
Hinton, G. E., Srivastava, N., Krizhevsky, A.,
Sutskever, I., and Salakhutdinov, R. R. Improv-
ing neural networks by preventing co-adaptation of
feature detectors. arXiv:1207.0580, 2012.
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and Le-
Cun, Y. What is the best multi-stage architecture
for object recognition? In ICCV, 2009.
Jianchao, Y., Kai, Y., Yihong, G., and Thomas, H.
Linear spatial pyramid matching using sparse cod-
ing for image classification. In CVPR, 2009.
Krizhevsky, A., Sutskever, I., and Hinton, G.E. Im-
agenet classification with deep convolutional neural
networks. In NIPS, 2012.
Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P., and
Ng, A. Y. Tiled convolutional neural networks. In
NIPS, 2010.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D.,
Howard, R. E., Hubbard, W., and Jackel, L. D.
Backpropagation applied to handwritten zip code
recognition. Neural Comput., 1(4):541–551, 1989.
Sande, K., Uijlings, J., Snoek, C., and Smeulders, A.
Hybrid coding for selective search. In PASCAL VOC
Classification Challenge 2012, 2012.
Sohn, K., Jung, D., Lee, H., and Hero III, A. Effi-
cient learning of sparse, distributed, convolutional
feature representations for object recognition. In
ICCV, 2011.
Torralba, A. and Efros, A. A. Unbiased look at dataset
bias. In CVPR, 2011.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol,
P. A. Extracting and composing robust features
with denoising autoencoders. In ICML, pp. 1096–
1103, 2008.
Yan, S., Dong, J., Chen, Q., Song, Z., Pan, Y., Xia,
W., Huang, Z., Hua, Y., and Shen, S. Generalized
hierarchical matching for sub-category aware object
classification. In PASCAL VOC Classification Chal-
lenge 2012, 2012.
Zeiler, M., Taylor, G., and Fergus, R. Adaptive decon-
volutional networks for mid and high level feature
learning. In ICCV, 2011.
More Related Content

What's hot (20)

PPTX
Efficient Neural Network Architecture for Image Classfication
Yogendra Tamang
 
PPTX
Introduction to CNN
Shuai Zhang
 
PDF
Convolutional Neural Network Models - Deep Learning
Mohamed Loey
 
PPTX
Convolutional Neural Network (CNN)
Muhammad Haroon
 
PDF
Offline Character Recognition Using Monte Carlo Method and Neural Network
ijaia
 
PPTX
CONVOLUTIONAL NEURAL NETWORK
Md Rajib Bhuiyan
 
PPTX
Convolutional Neural Network and RNN for OCR problem.
Vishal Mishra
 
PPTX
Deep learning lecture - part 1 (basics, CNN)
SungminYou
 
PPTX
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
Jia-Bin Huang
 
PPTX
Convolutional Neural Network (CNN) - image recognition
YUNG-KUEI CHEN
 
PDF
LeNet to ResNet
Somnath Banerjee
 
PPTX
CNN and its applications by ketaki
Ketaki Patwari
 
PPT
Cnn method
AmirSajedi1
 
PPTX
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
UMBC
 
PPTX
Convolutional neural network
MojammilHusain
 
PDF
Convolutional Neural Networks : Popular Architectures
ananth
 
PPTX
Convolutional Neural Network and Its Applications
Kasun Chinthaka Piyarathna
 
PDF
ujava.org Deep Learning with Convolutional Neural Network
신동 강
 
PDF
Deep learning
Rouyun Pan
 
Efficient Neural Network Architecture for Image Classfication
Yogendra Tamang
 
Introduction to CNN
Shuai Zhang
 
Convolutional Neural Network Models - Deep Learning
Mohamed Loey
 
Convolutional Neural Network (CNN)
Muhammad Haroon
 
Offline Character Recognition Using Monte Carlo Method and Neural Network
ijaia
 
CONVOLUTIONAL NEURAL NETWORK
Md Rajib Bhuiyan
 
Convolutional Neural Network and RNN for OCR problem.
Vishal Mishra
 
Deep learning lecture - part 1 (basics, CNN)
SungminYou
 
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
Jia-Bin Huang
 
Convolutional Neural Network (CNN) - image recognition
YUNG-KUEI CHEN
 
LeNet to ResNet
Somnath Banerjee
 
CNN and its applications by ketaki
Ketaki Patwari
 
Cnn method
AmirSajedi1
 
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
UMBC
 
Convolutional neural network
MojammilHusain
 
Convolutional Neural Networks : Popular Architectures
ananth
 
Convolutional Neural Network and Its Applications
Kasun Chinthaka Piyarathna
 
ujava.org Deep Learning with Convolutional Neural Network
신동 강
 
Deep learning
Rouyun Pan
 

Similar to Visualizing and Understanding Convolutional Networks (20)

PPTX
Visualizing and understanding convolutional networks(2014)
WoochulShin10
 
PDF
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016
Universitat Politècnica de Catalunya
 
PDF
20150703.journal club
Hayaru SHOUNO
 
PPTX
convnets.pptx
MohamedAliHabib3
 
PDF
Deep convnets for global recognition (Master in Computer Vision Barcelona 2016)
Universitat Politècnica de Catalunya
 
PDF
Visualization of Deep Learning Models (D1L6 2017 UPC Deep Learning for Comput...
Universitat Politècnica de Catalunya
 
PDF
Cs231n 2017 lecture12 Visualizing and Understanding
Yanbin Kong
 
PDF
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
Universitat Politècnica de Catalunya
 
PPTX
Convolutional neural networks
Learning Courses Online
 
PDF
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
Universitat Politècnica de Catalunya
 
PDF
Deep Learning for Computer Vision: Visualization (UPC 2016)
Universitat Politècnica de Catalunya
 
PDF
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Universitat Politècnica de Catalunya
 
PPTX
conv_nets.pptx
ssuser80a05c
 
PDF
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
Willy Marroquin (WillyDevNET)
 
PDF
1409.1556.pdf
Zuhriddin1
 
PPTX
Mnist report ppt
RaghunandanJairam
 
PPTX
Cnn visualizing
哲东 郑
 
PDF
Image Classification on ImageNet (D1L3 Insight@DCU Machine Learning Workshop ...
Universitat Politècnica de Catalunya
 
PDF
imageclassification-160206090009.pdf
KammetaJoshna
 
PDF
Mnist report
RaghunandanJairam
 
Visualizing and understanding convolutional networks(2014)
WoochulShin10
 
Deep Learning for Computer Vision (1/4): Image Analytics @ laSalle 2016
Universitat Politècnica de Catalunya
 
20150703.journal club
Hayaru SHOUNO
 
convnets.pptx
MohamedAliHabib3
 
Deep convnets for global recognition (Master in Computer Vision Barcelona 2016)
Universitat Politècnica de Catalunya
 
Visualization of Deep Learning Models (D1L6 2017 UPC Deep Learning for Comput...
Universitat Politècnica de Catalunya
 
Cs231n 2017 lecture12 Visualizing and Understanding
Yanbin Kong
 
D1L5 Visualization (D1L2 Insight@DCU Machine Learning Workshop 2017)
Universitat Politècnica de Catalunya
 
Convolutional neural networks
Learning Courses Online
 
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
Universitat Politècnica de Catalunya
 
Deep Learning for Computer Vision: Visualization (UPC 2016)
Universitat Politècnica de Catalunya
 
Interpretability of Convolutional Neural Networks - Xavier Giro - UPC Barcelo...
Universitat Politècnica de Catalunya
 
conv_nets.pptx
ssuser80a05c
 
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
Willy Marroquin (WillyDevNET)
 
1409.1556.pdf
Zuhriddin1
 
Mnist report ppt
RaghunandanJairam
 
Cnn visualizing
哲东 郑
 
Image Classification on ImageNet (D1L3 Insight@DCU Machine Learning Workshop ...
Universitat Politècnica de Catalunya
 
imageclassification-160206090009.pdf
KammetaJoshna
 
Mnist report
RaghunandanJairam
 
Ad

More from Willy Marroquin (WillyDevNET) (20)

PDF
Governance in the Age of Generative AI: A 360º Approach for Resilient Pol...
Willy Marroquin (WillyDevNET)
 
PDF
Marco Ético para implementación de IA en Colombia
Willy Marroquin (WillyDevNET)
 
PDF
Microsoft AI Transformation Partner Playbook.pdf
Willy Marroquin (WillyDevNET)
 
PDF
World Economic Forum : The Global Risks Report 2024
Willy Marroquin (WillyDevNET)
 
PDF
Language Is Not All You Need: Aligning Perception with Language Models
Willy Marroquin (WillyDevNET)
 
PDF
Real Time Speech Enhancement in the Waveform Domain
Willy Marroquin (WillyDevNET)
 
PDF
Data and AI reference architecture
Willy Marroquin (WillyDevNET)
 
PDF
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Willy Marroquin (WillyDevNET)
 
PDF
An Artificial Neuron Implemented on an Actual Quantum Processor
Willy Marroquin (WillyDevNET)
 
PDF
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
Willy Marroquin (WillyDevNET)
 
PDF
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and...
Willy Marroquin (WillyDevNET)
 
PDF
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
Willy Marroquin (WillyDevNET)
 
PDF
Deep learning-approach
Willy Marroquin (WillyDevNET)
 
PDF
WEF new vision for education
Willy Marroquin (WillyDevNET)
 
PDF
El futuro del trabajo perspectivas regionales
Willy Marroquin (WillyDevNET)
 
PDF
ASIA Y EL NUEVO (DES)ORDEN MUNDIAL
Willy Marroquin (WillyDevNET)
 
PDF
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
Willy Marroquin (WillyDevNET)
 
PDF
FOR A MEANINGFUL ARTIFICIAL INTELLIGENCE TOWARDS A FRENCH AND EUROPEAN ST...
Willy Marroquin (WillyDevNET)
 
PDF
When Will AI Exceed Human Performance? Evidence from AI Experts
Willy Marroquin (WillyDevNET)
 
PDF
Microsoft AI Platform Whitepaper
Willy Marroquin (WillyDevNET)
 
Governance in the Age of Generative AI: A 360º Approach for Resilient Pol...
Willy Marroquin (WillyDevNET)
 
Marco Ético para implementación de IA en Colombia
Willy Marroquin (WillyDevNET)
 
Microsoft AI Transformation Partner Playbook.pdf
Willy Marroquin (WillyDevNET)
 
World Economic Forum : The Global Risks Report 2024
Willy Marroquin (WillyDevNET)
 
Language Is Not All You Need: Aligning Perception with Language Models
Willy Marroquin (WillyDevNET)
 
Real Time Speech Enhancement in the Waveform Domain
Willy Marroquin (WillyDevNET)
 
Data and AI reference architecture
Willy Marroquin (WillyDevNET)
 
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Willy Marroquin (WillyDevNET)
 
An Artificial Neuron Implemented on an Actual Quantum Processor
Willy Marroquin (WillyDevNET)
 
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
Willy Marroquin (WillyDevNET)
 
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and...
Willy Marroquin (WillyDevNET)
 
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
Willy Marroquin (WillyDevNET)
 
Deep learning-approach
Willy Marroquin (WillyDevNET)
 
WEF new vision for education
Willy Marroquin (WillyDevNET)
 
El futuro del trabajo perspectivas regionales
Willy Marroquin (WillyDevNET)
 
ASIA Y EL NUEVO (DES)ORDEN MUNDIAL
Willy Marroquin (WillyDevNET)
 
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
Willy Marroquin (WillyDevNET)
 
FOR A MEANINGFUL ARTIFICIAL INTELLIGENCE TOWARDS A FRENCH AND EUROPEAN ST...
Willy Marroquin (WillyDevNET)
 
When Will AI Exceed Human Performance? Evidence from AI Experts
Willy Marroquin (WillyDevNET)
 
Microsoft AI Platform Whitepaper
Willy Marroquin (WillyDevNET)
 
Ad

Recently uploaded (20)

PDF
The Builder’s Playbook - 2025 State of AI Report.pdf
jeroen339954
 
PDF
Français Patch Tuesday - Juillet
Ivanti
 
PDF
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
PDF
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
PPT
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
PDF
Chris Elwell Woburn, MA - Passionate About IT Innovation
Chris Elwell Woburn, MA
 
PDF
Wojciech Ciemski for Top Cyber News MAGAZINE. June 2025
Dr. Ludmila Morozova-Buss
 
PDF
Predicting the unpredictable: re-engineering recommendation algorithms for fr...
Speck&Tech
 
PDF
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
PDF
Blockchain Transactions Explained For Everyone
CIFDAQ
 
PDF
SWEBOK Guide and Software Services Engineering Education
Hironori Washizaki
 
PDF
Building Resilience with Digital Twins : Lessons from Korea
SANGHEE SHIN
 
PDF
Why Orbit Edge Tech is a Top Next JS Development Company in 2025
mahendraalaska08
 
PDF
Ampere Offers Energy-Efficient Future For AI And Cloud
ShapeBlue
 
PDF
Rethinking Security Operations - SOC Evolution Journey.pdf
Haris Chughtai
 
PDF
Apache CloudStack 201: Let's Design & Build an IaaS Cloud
ShapeBlue
 
PDF
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
PDF
Windsurf Meetup Ottawa 2025-07-12 - Planning Mode at Reliza.pdf
Pavel Shukhman
 
PDF
Meetup Kickoff & Welcome - Rohit Yadav, CSIUG Chairman
ShapeBlue
 
PDF
July Patch Tuesday
Ivanti
 
The Builder’s Playbook - 2025 State of AI Report.pdf
jeroen339954
 
Français Patch Tuesday - Juillet
Ivanti
 
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
Chris Elwell Woburn, MA - Passionate About IT Innovation
Chris Elwell Woburn, MA
 
Wojciech Ciemski for Top Cyber News MAGAZINE. June 2025
Dr. Ludmila Morozova-Buss
 
Predicting the unpredictable: re-engineering recommendation algorithms for fr...
Speck&Tech
 
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
Blockchain Transactions Explained For Everyone
CIFDAQ
 
SWEBOK Guide and Software Services Engineering Education
Hironori Washizaki
 
Building Resilience with Digital Twins : Lessons from Korea
SANGHEE SHIN
 
Why Orbit Edge Tech is a Top Next JS Development Company in 2025
mahendraalaska08
 
Ampere Offers Energy-Efficient Future For AI And Cloud
ShapeBlue
 
Rethinking Security Operations - SOC Evolution Journey.pdf
Haris Chughtai
 
Apache CloudStack 201: Let's Design & Build an IaaS Cloud
ShapeBlue
 
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
Windsurf Meetup Ottawa 2025-07-12 - Planning Mode at Reliza.pdf
Pavel Shukhman
 
Meetup Kickoff & Welcome - Rohit Yadav, CSIUG Chairman
ShapeBlue
 
July Patch Tuesday
Ivanti
 

Visualizing and Understanding Convolutional Networks

  • 1. Visualizing and Understanding Convolutional Networks Matthew D. Zeiler [email protected] Dept. of Computer Science, Courant Institute, New York University Rob Fergus [email protected] Dept. of Computer Science, Courant Institute, New York University Abstract Large Convolutional Network models have recently demonstrated impressive classifica- tion performance on the ImageNet bench- mark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be im- proved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of inter- mediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architec- tures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets. 1. Introduction Since their introduction by (LeCun et al., 1989) in the early 1990’s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detec- tion. In the last year, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. (Ciresan et al., 2012) demonstrate state-of-the-art performance on NORB and CIFAR-10 datasets. Most notably, (Krizhevsky et al., 2012) show record beating perfor- mance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Several factors are responsible for this renewed inter- est in convnet models: (i) the availability of much larger training sets, with millions of labeled exam- ples; (ii) powerful GPU implementations, making the training of very large models practical and (iii) bet- ter model regularization strategies, such as Dropout (Hinton et al., 2012). Despite this encouraging progress, there is still lit- tle insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization technique that reveals the in- put stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvo- lutional Network (deconvnet), as proposed by (Zeiler et al., 2011), to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification. Using these tools, we start with the architecture of (Krizhevsky et al., 2012) and explore different archi- tectures, discovering ones that outperform their results on ImageNet. We then explore the generalization abil- ity of the model to other datasets, just retraining the softmax classifier on top. 
  • 1. Visualizing and Understanding Convolutional Networks

As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by (Hinton et al., 2006) and others (Bengio et al., 2007; Vincent et al., 2008). The generalization ability of convnet features is also explored in concurrent work by (Donahue et al., 2013).

arXiv:1311.2901v3 [cs.CV] 28 Nov 2013
  • 2. Visualizing and Understanding Convolutional Networks

1.1. Related Work

Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers this is not the case, and there are limited methods for interpreting activity. (Erhan et al., 2009) find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit's activation. This requires a careful initialization and does not give any information about the unit's invariances. Motivated by the latter's shortcoming, (Le et al., 2010) (extending an idea by (Berkes & Wiskott, 2006)) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. (Donahue et al., 2013) show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.

2. Approach

We use standard fully supervised convnet models throughout the paper, as defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012). These models map a color 2D input image $x_i$, via a series of layers, to a probability vector $\hat{y}_i$ over the C different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function (relu(x) = max(x, 0)); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see (Krizhevsky et al., 2012) and (Jarrett et al., 2009). The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.

We train these models using a large set of N labeled images $\{x, y\}$, where label $y_i$ is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare $\hat{y}_i$ and $y_i$. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by backpropagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Full details of training are given in Section 3.
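To make the layer structure and training step concrete, here is a minimal sketch assuming PyTorch (the paper specifies no framework); the function names are illustrative, and `local_response_norm` stands in for the local contrast normalization described above.

```python
import torch
import torch.nn.functional as F

def convnet_layer(x, filters, bias):
    """One layer as described above: convolution with learned filters,
    relu, 3x3/stride-2 max pooling, and normalization across feature maps."""
    x = F.conv2d(x, filters, bias, stride=1, padding=filters.shape[-1] // 2)
    x = F.relu(x)                              # relu(x) = max(x, 0)
    x = F.max_pool2d(x, kernel_size=3, stride=2)
    return F.local_response_norm(x, size=5)    # stand-in for local contrast normalization

def training_step(model, optimizer, images, labels):
    """Cross-entropy loss between the softmax output and the true class,
    parameters updated by backpropagation and stochastic gradient descent."""
    loss = F.cross_entropy(model(images), labels)   # softmax + cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```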
2.1. Visualization with a Deconvnet

Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al., 2011). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features it does the opposite. In (Zeiler et al., 2011), deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.

To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1(top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features are computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.

Unpooling: In the convnet, the max pooling operation is non-invertible; however, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1(bottom) for an illustration of the procedure.

Rectification: The convnet uses relu non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.

Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.
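A minimal sketch of one convnet/deconvnet layer pair, assuming PyTorch (not the authors' code): the forward pass records the pooling switches, and the reverse pass unpools with those switches, rectifies, and applies the transposed filters via `conv_transpose2d`.

```python
import torch
import torch.nn.functional as F

def convnet_block(x, filters):
    """Forward pass of one layer, keeping the max-pooling 'switches'
    (argmax locations) that the deconvnet needs later."""
    feat = F.relu(F.conv2d(x, filters, padding=filters.shape[-1] // 2))
    pooled, switches = F.max_pool2d(feat, kernel_size=3, stride=2,
                                    return_indices=True)
    return pooled, switches, feat.shape[-2:]

def deconvnet_block(pooled, filters, switches, unpooled_size):
    """Approximate inverse: (i) unpool using the recorded switches,
    (ii) rectify, (iii) filter with transposed versions of the same filters."""
    x = F.max_unpool2d(pooled, switches, kernel_size=3, stride=2,
                       output_size=unpooled_size)
    x = F.relu(x)
    # conv_transpose2d with the same weight tensor applies the flipped filters,
    # mapping activity back toward the layer below
    return F.conv_transpose2d(x, filters, padding=filters.shape[-1] // 2)
```

To visualize a single feature map, all other maps in `pooled` would be zeroed before calling `deconvnet_block`, and the blocks chained layer by layer down to pixel space.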
  • 3. Visualizing and Understanding Convolutional Networks

Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.

Figure 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.

3. Training Details

We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by (Krizhevsky et al., 2012) for ImageNet classification. One difference is that the sparse connections used in Krizhevsky's layers 3, 4, 5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center, with and without horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of $10^{-2}$, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to $10^{-2}$ and biases are set to 0.

Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of $10^{-1}$ to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128, 128] range.
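The filter renormalization above can be sketched as follows, assuming PyTorch and that the check runs after each parameter update (both assumptions; the function name is illustrative):

```python
import torch

def renormalize_filters(conv_weight, radius=1e-1):
    """Rescale any filter whose RMS value exceeds the fixed radius back to
    that radius; filters already below the radius are left untouched."""
    with torch.no_grad():
        rms = conv_weight.pow(2).mean(dim=(1, 2, 3)).sqrt()   # one RMS value per filter
        scale = torch.where(rms > radius, radius / rms, torch.ones_like(rms))
        conv_weight.mul_(scale.view(-1, 1, 1, 1))
```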
As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).

4. Convnet Visualization

Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.

Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than the visualizations, as the latter solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.
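One way to collect the top-9 activations per feature map, sketched under the assumption of a PyTorch `feature_extractor` that returns the chosen layer's feature maps (names are illustrative):

```python
import torch

@torch.no_grad()
def top9_images(feature_extractor, images, k=9):
    """For each feature map, find the k validation images that produce the
    strongest activation (max over spatial positions); these are the images
    one would then project back through the deconvnet."""
    strengths = []
    for img in images:                                   # iterate the validation set
        fmaps = feature_extractor(img.unsqueeze(0))[0]   # (n_maps, H, W)
        strengths.append(fmaps.flatten(1).max(dim=1).values)
    strengths = torch.stack(strengths)                   # (n_images, n_maps)
    return strengths.topk(k, dim=0).indices.t()          # (n_maps, k) image indices
```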
  • 4. Visualizing and Understanding Convolutional Networks

Figure 2. Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Our reconstructions are not samples from the model: they are reconstructed patterns from the validation set that cause high activations in a given feature map. For each feature map we also show the corresponding image patches. Note: (i) the strong grouping within each feature map, (ii) greater invariance at higher layers and (iii) exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1, cols 1). Best viewed in electronic form.
  • 5. Visualizing and Understanding Convolutional Networks

The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2,C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1,C1); bird's legs (R4,C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1,C11) and dogs (R4).

Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

Feature Invariance: Fig. 5 shows 5 sample images being translated, rotated and scaled by varying degrees while looking at the changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature. Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, being quasi-linear for translation & scaling. The network output is stable to translations and scalings. In general, the output is not invariant to rotation, except for objects with rotational symmetry (e.g. entertainment center).

4.1. Architecture Selection

While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al.'s architecture (Fig. 6(b) & (d)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6(c) & (e). More importantly, it also improves the classification performance as shown in Section 5.1.

4.2. Occlusion Sensitivity

With image classification approaches, a natural question is whether the model is truly identifying the location of the object in the image, or just using the surrounding context. Fig. 7 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 7 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map.
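A sketch of the occlusion sweep, assuming PyTorch; the occluder size, stride and grey value used here are assumptions, since the text does not fix them:

```python
import torch

@torch.no_grad()
def occlusion_sensitivity(model, image, true_class, patch=32, stride=16, grey=0.5):
    """Slide a grey square over the input and record the probability of the
    true class at each occluder position; low values mark image regions the
    classifier depends on."""
    _, H, W = image.shape
    rows, cols = (H - patch) // stride + 1, (W - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    for r in range(rows):
        for c in range(cols):
            occluded = image.clone()
            occluded[:, r*stride:r*stride+patch, c*stride:c*stride+patch] = grey
            probs = model(occluded.unsqueeze(0)).softmax(dim=1)[0]
            heatmap[r, c] = probs[true_class]
    return heatmap
```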
This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.

4.3. Correspondence Analysis

Deep models differ from many existing recognition approaches in that there is no explicit mechanism for establishing correspondence between specific object parts in different images (e.g. faces have a particular spatial configuration of the eyes and nose). However, an intriguing possibility is that deep models might be implicitly computing them. To explore this, we take 5 randomly drawn dog images with frontal pose and systematically mask out the same part of the face in each image (e.g. all left eyes, see Fig. 8). For each image $i$, we then compute: $\epsilon_i^l = x_i^l - \tilde{x}_i^l$, where $x_i^l$ and $\tilde{x}_i^l$ are the feature vectors at layer $l$ for the original and occluded images respectively. We then measure the consistency of this difference vector between all related image pairs $(i, j)$: $\Delta_l = \sum_{i,j=1,\, i \neq j}^{5} \mathcal{H}(\mathrm{sign}(\epsilon_i^l), \mathrm{sign}(\epsilon_j^l))$, where $\mathcal{H}$ is Hamming distance. A lower value indicates greater consistency in the change resulting from the masking operation, hence tighter correspondence between the same object parts in different images (i.e. blocking the left eye changes the feature representation in a consistent way). In Table 1 we compare the $\Delta$ score for three parts of the face (left eye, right eye and nose) to random parts of the object, using features from layer $l = 5$ and $l = 7$. The lower score for these parts, relative to random object regions, for the layer 5 features shows the model does establish some degree of correspondence.
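The consistency measure can be sketched directly from the definition above (assuming PyTorch; averaging over pairs and normalizing the Hamming distance by the feature dimension are assumptions made so the value is comparable across layers, whereas the formula as written sums over pairs):

```python
import torch

def correspondence_score(feats_original, feats_occluded):
    """eps_i = x_i - x_tilde_i for each of the 5 images, then the Hamming
    distance between sign(eps_i) and sign(eps_j) over all pairs i != j."""
    signs = (feats_original - feats_occluded).sign()      # (5, D) sign of difference vectors
    n = signs.shape[0]
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (signs[i] != signs[j]).float().mean().item()
                pairs += 1
    return total / pairs   # lower = more consistent change across images
```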
  • 6. Visualizing and Understanding Convolutional Networks

Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2, 3, 4, 5. The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6 · 6 · 256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.
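Read as code, the Fig. 3 architecture could look like the following PyTorch sketch. The padding values are assumptions chosen so the feature-map sizes match those quoted in the caption (224 → 110 → 55 → 26 → 13 → 6), and `LocalResponseNorm` stands in for the contrast normalization.

```python
import torch.nn as nn

def zf_convnet(num_classes):
    """Sketch of the 8 layer model of Fig. 3 (padding choices are assumptions)."""
    return nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1), nn.ReLU(),   # layer 1: 110x110x96
        nn.MaxPool2d(3, stride=2, padding=1), nn.LocalResponseNorm(5),     # 55x55x96
        nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),            # layer 2: 26x26x256
        nn.MaxPool2d(3, stride=2, padding=1), nn.LocalResponseNorm(5),     # 13x13x256
        nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),          # layer 3: 13x13x384
        nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),          # layer 4: 13x13x384
        nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),          # layer 5: 13x13x256
        nn.MaxPool2d(3, stride=2),                                         # 6x6x256
        nn.Flatten(),                                                      # 9216 dimensions
        nn.Linear(6 * 6 * 256, 4096), nn.ReLU(), nn.Dropout(0.5),          # layer 6
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),                 # layer 7
        nn.Linear(4096, num_classes),                                      # C-way softmax (logits)
    )
```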
Figure 4. Evolution of a randomly chosen subset of model features through training. Each layer's features are displayed in a different block. Within each block, we show a randomly chosen subset of features at epochs [1, 2, 5, 10, 20, 30, 40, 64]. The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using our deconvnet approach. Color contrast is artificially enhanced and the figure is best viewed in electronic form.

Figure 5. Analysis of vertical translation, scale, and rotation invariance within the model (rows a-c respectively). Col 1: 5 example images undergoing the transformations. Col 2 & 3: Euclidean distance between feature vectors from the original and transformed images in layers 1 and 7 respectively. Col 4: the probability of the true label for each image, as the image is transformed.
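The per-image curves in Fig. 5 amount to the following computation, sketched here for rotation only and assuming PyTorch and torchvision; `feature_extractor` (returning the layer 1 or layer 7 feature vector) is an illustrative name.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def rotation_distances(feature_extractor, image, degrees=range(0, 360, 10)):
    """Euclidean distance between the feature vector of the original image
    and that of each rotated copy (translation and scaling are analogous)."""
    reference = feature_extractor(image.unsqueeze(0)).flatten()
    distances = []
    for d in degrees:
        rotated = TF.rotate(image, float(d))
        feat = feature_extractor(rotated.unsqueeze(0)).flatten()
        distances.append(torch.dist(reference, feat).item())
    return distances
```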
  • 7. Visualizing and Understanding Convolutional Networks

Figure 6. (a): 1st layer features without feature scale clipping. Note that one feature dominates. (b): 1st layer features from (Krizhevsky et al., 2012). (c): Our 1st layer features. The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer "dead" features. (d): Visualizations of 2nd layer features from (Krizhevsky et al., 2012). (e): Visualizations of our 2nd layer features. These are cleaner, with no aliasing artifacts that are visible in (d).

Figure 7. Three test examples (true labels: Pomeranian, Car Wheel, Afghan Hound) where we systematically cover up different portions of the scene with a gray square ((a), 1st column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) changes. (b): for each position of the gray square, we record the total activation in one layer 5 feature map (the one with the strongest response in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square), along with visualizations of this map from other images. The first row example shows the strongest feature to be the dog's face. When this is covered up, the activity in the feature map decreases (blue area in (b)). (d): a map of correct class probability, as a function of the position of the gray square. E.g. when the dog's face is obscured, the probability for "pomeranian" drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row, for most locations it is "pomeranian", but if the dog's face is obscured but not the ball, then it predicts "tennis ball". In the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The 3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), since it uses multiple feature maps.
  • 8. Visualizing and Understanding Convolutional Networks

Figure 8. Images used for correspondence experiments. Col 1: Original image. Col 2, 3, 4: Occlusion of the right eye, left eye, and nose respectively. Other columns show examples of random occlusions.

Occlusion Location | Mean Feature Sign Change, Layer 5 | Mean Feature Sign Change, Layer 7
Right Eye | 0.067 ± 0.007 | 0.069 ± 0.015
Left Eye | 0.069 ± 0.007 | 0.068 ± 0.013
Nose | 0.079 ± 0.017 | 0.069 ± 0.011
Random | 0.107 ± 0.017 | 0.073 ± 0.014

Table 1. Measure of correspondence for different object parts in 5 different dog images. The lower scores for the eyes and nose (compared to random object parts) show the model implicitly establishing some form of correspondence of parts at layer 5 in the model. At layer 7, the scores are more similar, perhaps due to upper layers trying to discriminate between the different breeds of dog.

5. Experiments

5.1. ImageNet 2012

This dataset consists of 1.3M/50k/100k training/validation/test examples, spread over 1000 categories. Table 2 shows our results on this dataset.

Using the exact architecture specified in (Krizhevsky et al., 2012), we attempt to replicate their result on the validation set. We achieve an error rate within 0.1% of their reported value on the ImageNet 2012 validation set.

Next we analyze the performance of our model with the architectural changes outlined in Section 4.1 (7x7 filters in layer 1 and stride 2 convolutions in layers 1 & 2). This model, shown in Fig. 3, significantly outperforms the architecture of (Krizhevsky et al., 2012), beating their single model result by 1.7% (test top-5). When we combine multiple models, we obtain a test error of 14.8%, the best published performance on this dataset, despite only using the 2012 training set. [Footnote 1: This performance has been surpassed in the recent Imagenet 2013 competition (http://www.image-net.org/challenges/LSVRC/2013/results.php).] We note that this error is almost half that of the top non-convnet entry in the ImageNet 2012 classification challenge, which obtained 26.2% error (Gunji et al., 2012).

Error % | Val Top-1 | Val Top-5 | Test Top-5
(Gunji et al., 2012) | - | - | 26.2
(Krizhevsky et al., 2012), 1 convnet | 40.7 | 18.2 | --
(Krizhevsky et al., 2012), 5 convnets | 38.1 | 16.4 | 16.4
(Krizhevsky et al., 2012)*, 1 convnet | 39.0 | 16.6 | --
(Krizhevsky et al., 2012)*, 7 convnets | 36.7 | 15.4 | 15.3
Our replication of (Krizhevsky et al., 2012), 1 convnet | 40.5 | 18.1 | --
1 convnet as per Fig. 3 | 38.4 | 16.5 | --
5 convnets as per Fig. 3 – (a) | 36.7 | 15.3 | 15.3
1 convnet as per Fig. 3 but with layers 3,4,5: 512,1024,512 maps – (b) | 37.5 | 16.0 | 16.1
6 convnets, (a) & (b) combined | 36.0 | 14.7 | 14.8

Table 2. ImageNet 2012 classification error rates. The * indicates models that were trained on both ImageNet 2011 and 2012 training sets.

Varying ImageNet Model Sizes: In Table 3, we first explore the architecture of (Krizhevsky et al., 2012) by adjusting the size of layers, or removing them entirely. In each case, the model is trained from scratch with the revised architecture. Removing the fully connected layers (6, 7) only gives a slight increase in error. This is surprising, given that they contain the majority of model parameters. Removing two of the middle convolutional layers also makes a relatively small difference to the error rate. However, removing both the middle convolution layers and the fully connected layers yields a model with only 4 layers whose performance is dramatically worse.
This would suggest that the overall depth of the model is important for obtaining good performance. In Table 3, we modify our model, shown in Fig. 3. Changing the size of the fully connected layers makes little difference to performance (the same holds for the model of (Krizhevsky et al., 2012)). However, increasing the size of the middle convolution layers does give a useful gain in performance. But increasing these, while also enlarging the fully connected layers, results in over-fitting.

5.2. Feature Generalization

The experiments above show the importance of the convolutional part of our ImageNet model in obtaining state-of-the-art performance. This is supported by the visualizations of Fig. 2 which show the complex invariances learned in the convolutional layers. We now explore the ability of these feature extraction layers to generalize to other datasets, namely Caltech-101 (Fei-fei et al., 2006), Caltech-256 (Griffin et al., 2006) and PASCAL VOC 2012. To do this, we keep layers 1-7 of our ImageNet-trained model fixed and train a new softmax classifier on top (for the appropriate number of classes) using the training images of the new dataset. Since the softmax contains relatively few parameters, it can be trained quickly from a relatively small number of examples, as is the case for certain datasets.
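A sketch of this transfer setup, assuming the `nn.Sequential` model sketched earlier and PyTorch; the optimizer settings are assumptions rather than values given for these experiments:

```python
import torch
import torch.nn as nn

def retrain_softmax(pretrained_model, num_classes, lr=1e-2):
    """Keep layers 1-7 fixed and train only a new C-way softmax classifier
    on the target dataset."""
    for p in pretrained_model.parameters():
        p.requires_grad = False                         # freeze the feature extractor
    in_features = pretrained_model[-1].in_features      # assumes the classifier is the last module
    pretrained_model[-1] = nn.Linear(in_features, num_classes)   # new, trainable softmax layer
    optimizer = torch.optim.SGD(pretrained_model[-1].parameters(),
                                lr=lr, momentum=0.9)
    return pretrained_model, optimizer
```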
  • 9. Visualizing and Understanding Convolutional Networks

Error % | Train Top-1 | Val Top-1 | Val Top-5
Our replication of (Krizhevsky et al., 2012), 1 convnet | 35.1 | 40.5 | 18.1
Removed layers 3,4 | 41.8 | 45.4 | 22.1
Removed layer 7 | 27.4 | 40.0 | 18.4
Removed layers 6,7 | 27.4 | 44.8 | 22.4
Removed layers 3,4,6,7 | 71.1 | 71.3 | 50.1
Adjust layers 6,7: 2048 units | 40.3 | 41.7 | 18.8
Adjust layers 6,7: 8192 units | 26.8 | 40.0 | 18.1
Our Model (as per Fig. 3) | 33.1 | 38.4 | 16.5
Adjust layers 6,7: 2048 units | 38.2 | 40.2 | 17.6
Adjust layers 6,7: 8192 units | 22.0 | 38.8 | 17.0
Adjust layers 3,4,5: 512,1024,512 maps | 18.8 | 37.5 | 16.0
Adjust layers 6,7: 8192 units and layers 3,4,5: 512,1024,512 maps | 10.0 | 38.3 | 16.9

Table 3. ImageNet 2012 classification error rates with various architectural changes to the model of (Krizhevsky et al., 2012) and our model (see Fig. 3).

The classifiers used by our model (a softmax) and other approaches (typically a linear SVM) are of similar complexity, thus the experiments compare our feature representation, learned from ImageNet, with the hand-crafted features used by other methods. It is important to note that both our feature representation and the hand-crafted features are designed using images beyond the Caltech and PASCAL training sets. For example, the hyper-parameters in HOG descriptors were determined through systematic experiments on a pedestrian dataset (Dalal & Triggs, 2005). We also try a second strategy of training a model from scratch, i.e. resetting layers 1-7 to random values and training them, as well as the softmax, on the training images of the dataset.

One complication is that some of the Caltech datasets have some images that are also in the ImageNet training data. Using normalized correlation, we identified these few "overlap" images and removed them from our Imagenet training set and then retrained our Imagenet models, so avoiding the possibility of train/test contamination. [Footnote 2: For Caltech-101, we found 44 images in common (out of 9,144 total images), with a maximum overlap of 10 for any given class. For Caltech-256, we found 243 images in common (out of 30,607 total images), with a maximum overlap of 18 for any given class.]

Caltech-101: We follow the procedure of (Fei-fei et al., 2006) and randomly select 15 or 30 images per class for training, and test on up to 50 images per class, reporting the average of the per-class accuracies in Table 4, using 5 train/test folds. Training took 17 minutes for 30 images/class. The pre-trained model beats the best reported result for 30 images/class from (Bo et al., 2013) by 2.2%. The convnet model trained from scratch however does terribly, only achieving 46.5%.

Figure 9. Caltech-256 classification performance as the number of training images per class is varied. Using only 6 training examples per class with our pre-trained feature extractor, we surpass the best reported result of (Bo et al., 2013).

# Train | Acc % (15/class) | Acc % (30/class)
(Bo et al., 2013) | - | 81.4 ± 0.33
(Jianchao et al., 2009) | 73.2 | 84.3
Non-pretrained convnet | 22.8 ± 1.5 | 46.5 ± 1.7
ImageNet-pretrained convnet | 83.8 ± 0.5 | 86.5 ± 0.5
Table 4. Caltech-101 classification accuracy for our convnet models, against two leading alternate approaches.

Caltech-256: We follow the procedure of (Griffin et al., 2006), selecting 15, 30, 45, or 60 training images per class, reporting the average of the per-class accuracies in Table 5. Our ImageNet-pretrained model beats the current state-of-the-art results obtained by Bo et al. (Bo et al., 2013) by a significant margin: 74.2% vs 55.2% for 60 training images/class. However, as with Caltech-101, the model trained from scratch does poorly. In Fig. 9, we explore the "one-shot learning" (Fei-fei et al., 2006) regime. With our pre-trained model, just 6 Caltech-256 training images are needed to beat the leading method using 10 times as many images. This shows the power of the ImageNet feature extractor.

# Train | Acc % (15/class) | Acc % (30/class) | Acc % (45/class) | Acc % (60/class)
(Sohn et al., 2011) | 35.1 | 42.1 | 45.7 | 47.9
(Bo et al., 2013) | 40.5 ± 0.4 | 48.0 ± 0.2 | 51.9 ± 0.2 | 55.2 ± 0.3
Non-pretr. | 9.0 ± 1.4 | 22.5 ± 0.7 | 31.2 ± 0.5 | 38.8 ± 1.4
ImageNet-pretr. | 65.7 ± 0.2 | 70.6 ± 0.2 | 72.7 ± 0.4 | 74.2 ± 0.3

Table 5. Caltech-256 classification accuracies.
  • 10. Visualizing and Understanding Convolutional Networks

PASCAL 2012: We used the standard training and validation images to train a 20-way softmax on top of the ImageNet-pretrained convnet. This is not ideal, as PASCAL images can contain multiple objects and our model just provides a single exclusive prediction for each image. Table 6 shows the results on the test set. The PASCAL and ImageNet images are quite different in nature, the former being full scenes unlike the latter. This may explain our mean performance being 3.2% lower than the leading (Yan et al., 2012) result; however, we do beat them on 5 classes, sometimes by large margins.

Acc % | [A] | [B] | Ours
Airplane | 92.0 | 97.3 | 96.0
Bicycle | 74.2 | 84.2 | 77.1
Bird | 73.0 | 80.8 | 88.4
Boat | 77.5 | 85.3 | 85.5
Bottle | 54.3 | 60.8 | 55.8
Bus | 85.2 | 89.9 | 85.8
Car | 81.9 | 86.8 | 78.6
Cat | 76.4 | 89.3 | 91.2
Chair | 65.2 | 75.4 | 65.0
Cow | 63.2 | 77.8 | 74.4
Dining tab | 63.2 | 77.8 | 67.7
Dog | 68.9 | 83.0 | 87.8
Horse | 78.2 | 87.5 | 86.0
Motorbike | 81.0 | 90.1 | 85.1
Person | 91.6 | 95.0 | 90.9
Potted pl | 55.9 | 57.8 | 52.2
Sheep | 69.4 | 79.2 | 83.6
Sofa | 65.4 | 73.4 | 61.1
Train | 86.7 | 94.5 | 91.8
Tv | 77.4 | 80.7 | 76.1
Mean | 74.3 | 82.2 | 79.0
# won | 0 | 15 | 5

Table 6. PASCAL 2012 classification results, comparing our Imagenet-pretrained convnet against the leading two methods ([A] = (Sande et al., 2012) and [B] = (Yan et al., 2012)).

5.3. Feature Analysis

We explore how discriminative the features in each layer of our Imagenet-pretrained model are. We do this by varying the number of layers retained from the ImageNet model and placing either a linear SVM or softmax classifier on top. Table 7 shows results on Caltech-101 and Caltech-256. For both datasets, a steady improvement can be seen as we ascend the model, with the best results being obtained by using all layers. This supports the premise that as the feature hierarchies become deeper, they learn increasingly powerful features.

Features from layer | Cal-101 (30/class) | Cal-256 (60/class)
SVM (1) | 44.8 ± 0.7 | 24.6 ± 0.4
SVM (2) | 66.2 ± 0.5 | 39.6 ± 0.3
SVM (3) | 72.3 ± 0.4 | 46.0 ± 0.3
SVM (4) | 76.6 ± 0.4 | 51.3 ± 0.1
SVM (5) | 86.2 ± 0.8 | 65.6 ± 0.3
SVM (7) | 85.5 ± 0.4 | 71.7 ± 0.2
Softmax (5) | 82.9 ± 0.4 | 65.7 ± 0.5
Softmax (7) | 85.4 ± 0.4 | 72.6 ± 0.1

Table 7. Analysis of the discriminative information contained in each layer of feature maps within our ImageNet-pretrained convnet. We train either a linear SVM or softmax on features from different layers (as indicated in brackets) from the convnet. Higher layers generally produce more discriminative features.
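A sketch of this per-layer comparison for the SVM case, assuming the `nn.Sequential` model sketched earlier plus scikit-learn; note that Table 7 reports mean per-class accuracy, whereas `LinearSVC.score` below returns overall accuracy.

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC

@torch.no_grad()
def layer_features(model, images, n_modules):
    """Flattened features from the first n_modules of the Sequential model."""
    return np.stack([model[:n_modules](img.unsqueeze(0)).flatten().numpy()
                     for img in images])

def svm_accuracy_on_layer(model, n_modules, train_imgs, train_y, test_imgs, test_y):
    """Train a linear SVM on features taken at a given depth and evaluate it."""
    clf = LinearSVC(C=1.0)
    clf.fit(layer_features(model, train_imgs, n_modules), train_y)
    return clf.score(layer_features(model, test_imgs, n_modules), test_y)
```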
6. Discussion

We explored large convolutional neural network models, trained for image classification, in a number of ways. First, we presented a novel way to visualize the activity within the model. This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers. We also showed how these visualizations can be used to debug problems with the model to obtain better results, for example improving on Krizhevsky et al.'s (Krizhevsky et al., 2012) impressive ImageNet 2012 result. We then demonstrated through a series of occlusion experiments that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context. An ablation study on the model revealed that having a minimum depth to the network, rather than any individual section, is vital to the model's performance.

Finally, we showed how the ImageNet-trained model can generalize well to other datasets. For Caltech-101 and Caltech-256, the datasets are similar enough that we can beat the best reported results, in the latter case by a significant margin. This result brings into question the utility of benchmarks with small (i.e. < 10^4) training sets. Our convnet model generalized less well to the PASCAL data, perhaps suffering from dataset bias (Torralba & Efros, 2011), although it was still within 3.2% of the best reported result, despite no tuning for the task. For example, our performance might improve if a different loss function was used that permitted multiple objects per image. This would naturally enable the networks to tackle object detection as well.

Acknowledgments

The authors are very grateful for support by NSF grant IIS-1116923, Microsoft Research and a Sloan Fellowship.

References

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In NIPS, pp. 153–160, 2007.

Berkes, P. and Wiskott, L. On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. Neural Computation, 2006.

Bo, L., Ren, X., and Fox, D. Multipath sparse coding using hierarchical matching pursuit. In CVPR, 2013.
  • 11. Visualizing and Understanding Convolutional Networks

Ciresan, D. C., Meier, U., and Schmidhuber, J. Multi-column deep neural networks for image classification. In CVPR, 2012.

Dalal, N. and Triggs, B. Histograms of oriented gradients for pedestrian detection. In CVPR, 2005.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. In arXiv:1310.1531, 2013.

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. In Technical report, University of Montreal, 2009.

Fei-fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE Trans. PAMI, 2006.

Griffin, G., Holub, A., and Perona, P. The Caltech 256. In Caltech Technical Report, 2006.

Gunji, N., Higuchi, T., Yasumoto, K., Muraoka, H., Ushiku, Y., Harada, T., and Kuniyoshi, Y. Classification entry. In Imagenet Competition, 2012.

Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.

Jianchao, Y., Kai, Y., Yihong, G., and Thomas, H. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, 1989.

Sande, K., Uijlings, J., Snoek, C., and Smeulders, A. Hybrid coding for selective search. In PASCAL VOC Classification Challenge 2012, 2012.

Sohn, K., Jung, D., Lee, H., and Hero III, A. Efficient learning of sparse, distributed, convolutional feature representations for object recognition. In ICCV, 2011.

Torralba, A. and Efros, A. A. Unbiased look at dataset bias. In CVPR, 2011.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. A. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103, 2008.

Yan, S., Dong, J., Chen, Q., Song, Z., Pan, Y., Xia, W., Huang, Z., Hua, Y., and Shen, S. Generalized hierarchical matching for sub-category aware object classification. In PASCAL VOC Classification Challenge 2012, 2012.

Zeiler, M., Taylor, G., and Fergus, R. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011.