
Supervised Dictionary Learning

Julien Mairal (INRIA-Willow project)    Francis Bach (INRIA-Willow project)

Jean Ponce (Ecole Normale Supérieure)    Guillermo Sapiro (University of Minnesota)    Andrew Zisserman (University of Oxford)

Abstract
It is now well established that sparse signal models are well suited for restoration tasks and can be effectively learned from audio, image, and video data. Recent research has been aimed at learning discriminative sparse models instead of purely reconstructive ones. This paper proposes a new step in that direction, with a novel sparse representation for signals belonging to different classes in terms of a shared dictionary and discriminative class models. The linear version of the proposed model admits a simple probabilistic interpretation, while its most general variant admits an interpretation in terms of kernels. An optimization framework for learning all the components of the proposed model is presented, along with experimental results on standard handwritten digit and texture classification tasks.

1 Introduction
Sparse and overcomplete image models were first introduced in [1] for modeling the spatial recep-
tive fields of simple cells in the human visual system. The linear decomposition of a signal using a
few atoms of a learned dictionary, instead of predefined ones–such as wavelets–has recently led to
state-of-the-art results for numerous low-level image processing tasks such as denoising [2], show-
ing that sparse models are well adapted to natural images. Unlike principal component analysis
decompositions, these models are in general overcomplete, with a number of basis elements greater
than the dimension of the data. Recent research has shown that sparsity helps to capture higher-order
correlation in data. In [3, 4], sparse decompositions are used with predefined dictionaries for face
and signal recognition. In [5], dictionaries are learned for a reconstruction task, and the correspond-
ing sparse models are used as features in an SVM. In [6], a discriminative method is introduced
for various classification tasks, learning one dictionary per class; the classification process itself is
based on the corresponding reconstruction error, and does not exploit the actual decomposition co-
efficients. In [7], a generative model for documents is learned at the same time as the parameters of
a deep network structure. In [8], multi-task learning is performed by learning features and tasks are
selected using a sparsity criterion. The framework we present in this paper extends these approaches
by learning simultaneously a single shared dictionary as well as models for different signal classes
in a mixed generative and discriminative formulation (see also [9], where a different discriminative
term is added to the classical reconstructive one). Similar joint generative/discriminative frame-
works have started to appear in probabilistic approaches to learning, e.g., [10, 11, 12, 13, 14], and
in neural networks [15], but not, to the best of our knowledge, in the sparse dictionary learning
framework. Section 2 presents a formulation for learning a dictionary tuned for a classification task, which we call supervised dictionary learning, and Section 3 its interpretation in terms of probabilistic and kernel frameworks. The optimization procedure is detailed in Section 4, and experimental results are presented in Section 5.
2 Supervised dictionary learning
We present in this section the core of the proposed model. In classical sparse coding tasks, one considers a signal x in R^n and a fixed dictionary D = [d_1, . . . , d_k] in R^{n×k} (allowing k > n, making the dictionary overcomplete). In this setting, sparse coding with an ℓ1 regularization^1 amounts to computing

R⋆(x, D) = min_{α ∈ R^k} ||x − Dα||_2^2 + λ_1 ||α||_1.    (1)
It is well known in the statistics, optimization, and compressed sensing communities that the ℓ1 penalty yields a sparse solution, with very few non-zero coefficients in α, although there is no explicit analytic link between the value of λ_1 and the effective sparsity that this model yields. Other sparsity penalties using the ℓ0 regularization^2 can be used as well. Since it uses a proper norm, the ℓ1
formulation of sparse coding is a convex problem, which makes the optimization tractable with
algorithms such as those introduced in [16, 17], and has proven in practice to be more stable than its
ℓ0 counterpart, in the sense that the resulting decompositions are less sensitive to small perturbations
of the input signal x. Note that sparse coding with an ℓ0 penalty is an NP-hard problem and is often
approximated using greedy algorithms.
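For concreteness, the sketch below shows one way to compute the decomposition of Eq. (1) with a simple ISTA-style proximal gradient loop; it is only an illustration under the stated assumptions, not the algorithm used in this paper (which relies on the methods of [16, 17]).

import numpy as np

def sparse_code_l1(x, D, lam1, n_iter=200):
    """Illustrative ISTA loop for Eq. (1): min_alpha ||x - D alpha||_2^2 + lam1 ||alpha||_1."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2                  # Lipschitz constant of the quadratic term
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * D.T @ (x - D @ alpha)              # gradient of ||x - D alpha||_2^2
        z = alpha - grad / L                             # gradient step
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)  # soft-thresholding (l1 prox)
    return alpha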
In this paper, we consider a setting where the signal may belong to any of p different classes. We first consider the case of p = 2 classes and later discuss the multiclass extension. We consider a training set of m labeled signals (x_i)_{i=1}^m in R^n, associated with binary labels (y_i ∈ {−1, +1})_{i=1}^m.
Our goal is to learn jointly a single dictionary D adapted to the classification task and a function
f which should be positive for any signal in class +1 and negative otherwise. We consider in this
paper two different models to use the sparse code α for the classification task:
(i) linear in α: f(x, α, θ) = w^T α + b, where θ = {w ∈ R^k, b ∈ R} parametrizes the model.
(ii) bilinear in x and α: f(x, α, θ) = x^T Wα + b, where θ = {W ∈ R^{n×k}, b ∈ R}. In this case, the model is bilinear and f acts on both x and its sparse code α.
The number of parameters in (ii) is greater than in (i), which allows for richer models. Note that
one can interpret W as a linear filter encoding the input signal x into a model for the coefficients α,
which has a role similar to the encoder in [18] but for a discriminative task.
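As a quick illustration (a sketch only, assuming x ∈ R^n and α ∈ R^k), the two decision functions can be written as:

import numpy as np

def f_linear(alpha, w, b):
    """Model (i): f(x, alpha, theta) = w^T alpha + b (does not use x directly)."""
    return float(w @ alpha + b)

def f_bilinear(x, alpha, W, b):
    """Model (ii): f(x, alpha, theta) = x^T W alpha + b, with W of shape (n, k)."""
    return float(x @ (W @ alpha) + b)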
A classical approach to obtain α for (i) or (ii) is to first adapt D to the data, solving
min_{D,α} Σ_{i=1}^m ||x_i − Dα_i||_2^2 + λ_1 ||α_i||_1.    (2)
Note also that since the reconstruction errors ||x_i − Dα_i||_2^2 are invariant to simultaneously scaling D by a scalar and α_i by its inverse, we need to constrain the ℓ2 norm of the columns of D. Such a constraint is classical in sparse coding [2]. This reconstructive approach (dubbed REC in this paper) provides sparse codes α_i for each signal x_i, which can be used a posteriori in a regular classifier such as logistic regression, which would require solving

min_θ Σ_{i=1}^m C(y_i f(x_i, α_i, θ)) + λ_2 ||θ||_2^2,    (3)
where C is the logistic loss function (C(x) = log(1 + e^{−x})), which enjoys properties similar to those of the hinge loss from the SVM literature while being differentiable, and λ_2 is a regularization parameter which prevents overfitting. This is the approach chosen in [5] (with SVMs). However,
our goal is to learn jointly D and the model parameters θ. To that effect, we propose the formulation
min_{D,θ,α} Σ_{i=1}^m ( C(y_i f(x_i, α_i, θ)) + λ_0 ||x_i − Dα_i||_2^2 + λ_1 ||α_i||_1 ) + λ_2 ||θ||_2^2,    (4)
where λ_0 controls the importance of the reconstruction term, and the loss for a pair (x_i, y_i) is

S⋆(x_i, D, θ, y_i) = min_α S(α, x_i, D, θ, y_i),    (5)
where S(α, x_i, D, θ, y_i) = C(y_i f(x_i, α, θ)) + λ_0 ||x_i − Dα||_2^2 + λ_1 ||α||_1.
In this setting, the classification procedure of a new signal x with an unknown label y, given a learned dictionary D and parameters θ, involves supervised sparse coding:

min_{y ∈ {−1,+1}} S⋆(x, D, θ, y).    (6)
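The sketch below evaluates the supervised cost S of Eq. (5) for the linear model and applies the decision rule of Eq. (6); the `solver` argument stands in for any routine returning arg min_α S (a hypothetical interface introduced for illustration, not the paper's actual optimizer).

import numpy as np

def logistic_loss(u):
    """C(u) = log(1 + exp(-u)), computed stably."""
    return np.logaddexp(0.0, -u)

def S_cost(alpha, x, y, D, w, b, lam0, lam1):
    """Supervised cost S(alpha, x, D, theta, y) of Eq. (5), linear model f = w^T alpha + b."""
    return (logistic_loss(y * (w @ alpha + b))
            + lam0 * np.sum((x - D @ alpha) ** 2)
            + lam1 * np.sum(np.abs(alpha)))

def classify(x, D, w, b, lam0, lam1, solver):
    """Decision rule of Eq. (6): pick the label with the lowest optimal supervised cost."""
    costs = {y: S_cost(solver(x, y, D, w, b, lam0, lam1), x, y, D, w, b, lam0, lam1)
             for y in (-1, +1)}
    return min(costs, key=costs.get)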

The learning procedure of Eq. (4) minimizes the sum of the costs for the pairs (x_i, y_i)_{i=1}^m and corresponds to a generative model. We will refer later to this model as SDL-G (supervised dictionary learning, generative).
1. The ℓ1 norm of a vector x of size n is defined as ||x||_1 = Σ_{i=1}^n |x[i]|.
2. The ℓ0 pseudo-norm of a vector x is the number of nonzero coefficients of x. Note that it is not a norm.
Figure 1: Graphical model for the proposed generative/discriminative learning framework (plate over i = 1, . . . , m; parameters D and w; latent codes α_i generating the signals x_i and labels y_i).

Note the explicit incorporation of both the reconstructive and the discriminative components into sparse coding, in addition to the classical reconstructive term (see [9] for a different classification component).
However, since the classification procedure from Eq. (6) compares the different costs S⋆(x, D, θ, y) of a given signal for each class y = −1, +1, a more discriminative approach is to not only make the costs S⋆(x_i, D, θ, y_i) small, as in (4), but also make the value of S⋆(x_i, D, θ, −y_i) greater than S⋆(x_i, D, θ, y_i), which is the purpose of the logistic loss function C. This leads to:

min_{D,θ} Σ_{i=1}^m C(S⋆(x_i, D, θ, −y_i) − S⋆(x_i, D, θ, y_i)) + λ_2 ||θ||_2^2.    (7)
As detailed below, this problem is more difficult to solve than (4), and therefore we adopt instead a mixed formulation between the minimization of the generative Eq. (4) and its discriminative version (7) (see also [13]), that is,

min_{D,θ} Σ_{i=1}^m ( µ C(S⋆(x_i, D, θ, −y_i) − S⋆(x_i, D, θ, y_i)) + (1 − µ) S⋆(x_i, D, θ, y_i) ) + λ_2 ||θ||_2^2,    (8)
where µ controls the trade-off between the reconstruction from Eq. (4) and the discrimination from
Eq. (7). This is the proposed generative/discriminative model for sparse signal representation and
classification from learned dictionary D and model θ. We will refer to this mixed model as SDL-D,
(supervised dictionary learning, discriminative). Note also that, again, we constrain the norm of the
columns of D to be less than or equal to one.
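A per-sample sketch of the mixed cost of Eq. (8), assuming S_wrong = S⋆(x_i, D, θ, −y_i) and S_correct = S⋆(x_i, D, θ, y_i) have already been computed:

import numpy as np

def mixed_loss(S_wrong, S_correct, mu):
    """Per-sample SDL-D cost of Eq. (8): mu * C(S_wrong - S_correct) + (1 - mu) * S_correct,
    with C(u) = log(1 + exp(-u)); mu = 0 recovers SDL-G, mu = 1 the purely discriminative cost."""
    return mu * np.logaddexp(0.0, -(S_wrong - S_correct)) + (1.0 - mu) * S_correct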
All of these formulations admit a straightforward multiclass extension, using softmax discriminative cost functions C_i(x_1, ..., x_p) = log(Σ_{j=1}^p e^{x_j − x_i}), which are multiclass versions of the logistic function, and learning one model θ_i per class. Other possible approaches such as one-vs-all or
one-vs-one are of course possible, and the question of choosing the best approach among these
possibilities is still open. Compared with earlier work using one dictionary per class [6], our model
has the advantage of letting multiple classes share some features, and uses the coefficients α of
the sparse representations as part of the classification procedure, thereby following the works from
[3, 4, 5], but with learned representations optimized for the classification task similar to [9, 10].
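For reference, the softmax discriminative cost can be computed as below (a sketch; the inputs are the p per-class values and the index i of the correct class).

import numpy as np

def softmax_cost(values, i):
    """C_i(x_1, ..., x_p) = log(sum_j exp(x_j - x_i)), computed with a stable log-sum-exp."""
    v = np.asarray(values, dtype=float) - float(values[i])
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))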
Before presenting the optimization procedure, we provide below two interpretations of the linear
and bilinear versions of our formulation in terms of a probabilistic graphical model and a kernel.

3 Interpreting the model


3.1 A probabilistic interpretation of the linear model
Let us first construct a graphical model which gives a probabilistic interpretation to the training and classification criteria given above when using a linear model with zero bias (no constant term) on the coefficients, that is, f(x, α, θ) = w^T α. It consists of the following components (Figure 1):
• The matrix D and the vector w are parameters of the problem, with a Gaussian prior on w, p(w) ∝ e^{−λ_2 ||w||_2^2}, and a constraint on the columns of D, that is, ||d_j||_2^2 = 1 for all j. All the d_j's are considered independent of each other.
• The coefficients α_i are latent variables with a Laplace prior, p(α_i) ∝ e^{−λ_1 ||α_i||_1}.
• The signals x_i are generated according to a Gaussian probability distribution conditioned on D and α_i, p(x_i | α_i, D) ∝ e^{−λ_0 ||x_i − Dα_i||_2^2}. All the x_i's are considered independent of each other.
• The labels y_i are generated according to a probability distribution conditioned on w and α_i, given by p(y_i = ε | α_i, w) = e^{−ε w^T α_i} / (e^{−w^T α_i} + e^{w^T α_i}). Given D and w, all the triplets (α_i, x_i, y_i) are independent.


What is commonly called "generative training" in the literature (e.g., [12, 13]) amounts to finding the maximum likelihood estimates for D and w according to the joint distribution p({x_i, y_i}_{i=1}^m, D, w), where the x_i's and the y_i's are the training signals and their labels respectively. It can easily be shown (details omitted due to space limitations) that there is an equivalence between this generative training and our formulation in Eq. (4) under MAP approximations.^3
Although joint generative modeling of x and y through a shared representation has shown great promise [10], we show in this paper that a more discriminative approach is desirable. "Discriminative training" is slightly different and amounts to maximizing p({y_i}_{i=1}^m, D, w | {x_i}_{i=1}^m) with
respect to D and w: Given some input data, one finds the best parameters that will predict the labels
of the data. The same kind of MAP approximation relates this discriminative training formulation
to the discriminative model of Eq. (7) (again, details omitted due to space limitations). The mixed
approach from Eq. (8) is a classical trade-off between generative and discriminative (e.g., [12, 13]),
where generative components are often added to discriminative frameworks to add robustness, e.g.,
to noise and occlusions (see examples of this for the model in [9]).
3.2 A kernel interpretation of the bilinear model
Our bilinear model with f(x, α, θ) = x^T Wα + b does not admit a straightforward probabilistic interpretation. On the other hand, it can easily be interpreted in terms of kernels: Given two signals x_1 and x_2, with coefficients α_1 and α_2, using the kernel K(x_1, x_2) = α_1^T α_2 x_1^T x_2 in a logistic regression classifier amounts to finding a decision function of the same form as f. It is a product of two linear kernels, one on the α's and one on the input signals x. Interestingly, Raina et al. [5] learn a dictionary adapted to reconstruction on a training set, then train an SVM a posteriori on the decomposition coefficients α. They derive and use a Fisher kernel, which can be written as K′(x_1, x_2) = α_1^T α_2 r_1^T r_2 in this setting, where the r's are the residuals of the decompositions. In
simple experiments, which are not reported in this paper, we have observed that the kernel K, where
the signals x replace the residuals r, generally yields a level of performance similar to K ′ and often
actually does better when the number of training samples is small or the data are noisy.
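In code, this kernel is simply the product of the two inner products (a minimal sketch):

import numpy as np

def bilinear_kernel(x1, a1, x2, a2):
    """K(x1, x2) = (a1^T a2)(x1^T x2): a product of a linear kernel on the codes
    and a linear kernel on the raw signals (Section 3.2)."""
    return float(a1 @ a2) * float(x1 @ x2)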

4 Optimization procedure
Classical dictionary learning techniques (e.g., [1, 5, 19]) address the problem of learning a reconstructive dictionary D in R^{n×k} well adapted to a training set, which is presented in Eq. (2). It can be seen as an optimization problem with respect to the dictionary D and the coefficients α. Although
not jointly convex in (D, α), it is convex with respect to each unknown when the other one is fixed.
This is why block coordinate descent on D and α performs reasonably well [1, 5, 19], although not
necessarily providing the global optimum. Training when µ = 0 (generative case), i.e., from Eq.
(4), enjoys similar properties and can be addressed with the same optimization procedure. Equation
(4) can be rewritten as:
min_{D,θ,α} Σ_{i=1}^m S(α_i, x_i, D, θ, y_i) + λ_2 ||θ||_2^2,   s.t. ∀ j = 1, . . . , k, ||d_j||_2 ≤ 1.    (9)
Block coordinate descent consists therefore of iterating between supervised sparse coding, where
D and θ are fixed and one optimizes with respect to the α’s and supervised dictionary update,
where the coefficients αi ’s are fixed, but D and θ are updated. Details on how to solve these two
problems are given in sections 4.1 and 4.2. The discriminative version SDL-D from Eq. (7) is more
problematic. To reach a local minimum for this difficult non-convex optimization problem, we have
chosen a continuation method, starting from the generative case and ending with the discriminative
one as in [6]. The algorithm is presented in Figure 2, and details on the hyperparameters’ settings
are given in Section 5.
4.1 Supervised sparse coding
The supervised sparse coding problem from Eq. (6) (D and θ are fixed in this step) amounts to minimizing a convex function under an ℓ1 penalty.

3. We are also investigating how to properly estimate D by marginalizing over α instead of maximizing with respect to α.
Input: n (signal dimension); (x_i, y_i)_{i=1}^m (training signals); k (size of the dictionary); λ_0, λ_1, λ_2 (parameters); 0 ≤ µ_1 ≤ µ_2 ≤ . . . ≤ µ_m ≤ 1 (increasing sequence).
Output: D ∈ R^{n×k} (dictionary); θ (parameters).
Initialization: Set D to a random Gaussian matrix with normalized columns. Set θ to zero.
Loop: For µ = µ_1, . . . , µ_m,
  Loop: Repeat until convergence (or a fixed number of iterations),
    • Supervised sparse coding: Solve, for all i = 1, . . . , m,
        α⋆_{i,−} = arg min_α S(α, x_i, D, θ, −1),
        α⋆_{i,+} = arg min_α S(α, x_i, D, θ, +1).    (10)
    • Dictionary and parameters update: Solve
        min_{D,θ} Σ_{i=1}^m ( µ C(S(α⋆_{i,−}, x_i, D, θ, −y_i) − S(α⋆_{i,+}, x_i, D, θ, y_i)) + (1 − µ) S(α⋆_{i,y_i}, x_i, D, θ, y_i) ) + λ_2 ||θ||_2^2   s.t. ∀ j, ||d_j||_2 ≤ 1.    (11)
Figure 2: SDL: Supervised dictionary learning algorithm.
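A minimal sketch of the continuation loop of Figure 2 is given below; `sparse_code` and `update` stand for the supervised sparse coding step (Eq. (10)) and the dictionary/parameter update (Eq. (11)), and are passed as callables since this is only an outline of the control flow, not the paper's implementation.

def train_sdl(X, y, D0, theta0, mus, sparse_code, update, n_inner=10):
    """Outer continuation loop of Figure 2 (sketch). X has shape (m, n); mus is the
    increasing sequence mu_1 <= ... <= 1; sparse_code(x, label, D, theta) returns an
    alpha* of Eq. (10) and update(D, theta, X, y, codes, mu) approximately solves Eq. (11)."""
    D, theta = D0, theta0
    for mu in mus:                                    # continuation from generative to discriminative
        for _ in range(n_inner):                      # "repeat until convergence" (fixed count here)
            codes = [(sparse_code(x, -1, D, theta),   # alpha*_{i,-}
                      sparse_code(x, +1, D, theta))   # alpha*_{i,+}
                     for x in X]
            D, theta = update(D, theta, X, y, codes, mu)
    return D, theta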

The fixed-point continuation method (FPC) of [17] achieves good results in terms of convergence speed for this class of problems. For our specific problem, denoting by g the convex function to minimize, this method only requires ∇g and a bound on the spectral norm of its Hessian H_g. Since both models f that we have chosen are linear in α, there exists, for each supervised sparse coding problem, a vector a in R^k and a scalar c in R such that

g(α) = C(a^T α + c) + λ_0 ||x − Dα||_2^2,
∇g(α) = ∇C(a^T α + c) a − 2λ_0 D^T (x − Dα),

and it can be shown that, if ||U||_2 denotes the spectral norm of a matrix U (which is the magnitude of its largest eigenvalue), then we can obtain the following bound: ||H_g(α)||_2 ≤ |H_C(a^T α + c)| ||a||_2^2 + 2λ_0 ||D^T D||_2.
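As an illustration of how ∇g and the Hessian bound can be used, here is a simple ISTA-style loop for the supervised sparse coding step with the linear model; this is a sketch under stated assumptions, not the fixed-point continuation method of [17] that the paper actually uses.

import numpy as np

def supervised_sparse_code(x, y, D, w, b, lam0, lam1, n_iter=200):
    """Proximal-gradient sketch of supervised sparse coding (Section 4.1, linear model).
    Here g(alpha) = C(a^T alpha + c) + lam0 ||x - D alpha||_2^2 with a = y*w, c = y*b,
    and the step size comes from the bound ||H_g||_2 <= |C''| ||a||_2^2 + 2 lam0 ||D^T D||_2."""
    a, c = y * w, y * b
    L = 0.25 * float(a @ a) + 2.0 * lam0 * np.linalg.norm(D.T @ D, 2)   # |C''(u)| <= 1/4
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        u = float(a @ alpha) + c
        grad = (-1.0 / (1.0 + np.exp(u))) * a - 2.0 * lam0 * D.T @ (x - D @ alpha)  # grad g
        z = alpha - grad / L
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)                  # l1 prox
    return alpha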

4.2 Dictionary update


The problem of updating D and θ in Eq. (11) is not convex in general (except when µ is close to 0),
but a local minimum can be obtained using projected gradient descent (as in the general literature on
dictionary learning, this local minimum has experimentally been found to be good enough in terms
of classification performance). Denoting by E(D, θ) the function we want to minimize in Eq. (11), we just need the partial derivatives of E with respect to D and the parameters θ. When considering the linear model for the α's, f(x, α, θ) = w^T α + b, with θ = {w ∈ R^k, b ∈ R}, we obtain
∂E/∂D = −2λ_0 Σ_{i=1}^m Σ_{z ∈ {−1,+1}} ω_{i,z} (x_i − Dα⋆_{i,z}) α⋆_{i,z}^T,

∂E/∂w = Σ_{i=1}^m Σ_{z ∈ {−1,+1}} ω_{i,z} z ∇C(w^T α⋆_{i,z} + b) α⋆_{i,z},    (12)

∂E/∂b = Σ_{i=1}^m Σ_{z ∈ {−1,+1}} ω_{i,z} z ∇C(w^T α⋆_{i,z} + b),

where ω_{i,z} = −µ z ∇C(S(α⋆_{i,−}, x_i, D, θ, −y_i) − S(α⋆_{i,+}, x_i, D, θ, y_i)) + (1 − µ) 1_{z=y_i}.

Partial derivatives when using our model with multiple classes or with the bilinear model f(x, α, θ) = x^T Wα + b are not presented in this paper due to space limitations.
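A sketch of one projected-gradient step using these partial derivatives follows; the gradients themselves are assumed to have been computed from Eq. (12), and the step size `lr` is an illustrative choice, not a value from the paper.

import numpy as np

def projected_gradient_step(D, w, b, grad_D, grad_w, grad_b, lr=1e-3):
    """One projected-gradient step on E(D, theta) for the update of Eq. (11)."""
    D = D - lr * grad_D
    # project each column back onto the constraint set ||d_j||_2 <= 1
    D = D / np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)
    return D, w - lr * grad_w, b - lr * grad_b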

5 Experimental validation
We compare in this section the reconstructive approach, dubbed REC, which consists of learning
a reconstructive dictionary D as in [5] and then learning the parameters θ a posteriori; SDL with
generative training (dubbed SDL-G); and SDL with discriminative learning (dubbed SDL-D). We
also compare the performance of the linear (L) and bilinear (BL) models.
          REC L    SDL-G L    SDL-D L    REC BL    k-NN, ℓ2    SVM-Gauss
MNIST     4.33     3.56       1.05       3.41      5.0         1.4
USPS      6.83     6.67       3.54       4.38      5.2         4.2

Table 1: Error rates (in percent) on the MNIST and USPS datasets for the REC, SDL-G L and
SDL-D L approaches, compared with k-nearest neighbors and an SVM with a Gaussian kernel [20].

Before presenting experimental results, let us briefly discuss the choice of the five model parameters
λ0 , λ1 , λ2 , µ and k (size of the dictionary). Tuning all of them using cross-validation is cumbersome
and unnecessary since some simple choices can be made, some of which can be made sequentially.
We first define the sparsity parameter κ = λ_1/λ_0, which dictates how sparse the decompositions are.
When the input data points have unit ℓ2 norm, choosing κ = 0.15 was empirically found to be a good
choice. For reconstructive tasks, a typical value often used in the literature (e.g., [19]) is k = 256 for
m = 100 000 signals. Nevertheless, for discriminative tasks, increasing the number of parameters is
likely to lead to overfitting, and smaller values like k = 64 or k = 32 are preferred. The scalar λ_2 is a regularization parameter that prevents the model from overfitting the input data. As in logistic regression
or support vector machines, this parameter is crucial when the number of training samples is small.
Performing cross validation with the fast method REC quickly provides a reasonable value for this
parameter, which can be used afterward for SDL-G or SDL-D.
Once κ, k and λ_2 are chosen, let us see how to find λ_0, which plays the important role of controlling the trade-off between reconstruction and discrimination. First, we perform cross-validation for a few iterations with µ = 0 to find a good value for SDL-G. Then, a scale factor making the costs S⋆ discriminative for µ > 0 can be chosen during the optimization process: Given a set of computed costs S⋆, one can compute a scale factor γ⋆ such that

γ⋆ = arg min_γ Σ_{i=1}^m C(γ(S⋆(x_i, D, θ, −y_i) − S⋆(x_i, D, θ, y_i))).

We therefore propose the following strategy, which has proven effective in our experiments: Starting from small values for λ_0 and a fixed κ, we apply the algorithm in Figure 2, and after a supervised sparse coding step, we compute the best scale factor γ⋆ and replace λ_0 and λ_1 by γ⋆λ_0 and γ⋆λ_1. Typically, applying this procedure during the first 10 iterations has proven to lead to reasonable values for these parameters. Since we are following a continuation path from µ = 0 to µ = 1, the optimal value of µ is found along the path by measuring the classification performance of the model on a validation set during the optimization.
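The one-dimensional search for γ⋆ can be done with any scalar optimizer; a simple grid search is sketched below (the grid itself is an assumption, since the paper does not specify one).

import numpy as np

def best_scale(S_wrong, S_correct, grid=None):
    """Find gamma* = argmin_gamma sum_i C(gamma * (S*(x_i,-y_i) - S*(x_i,y_i))) by grid search."""
    if grid is None:
        grid = np.logspace(-3, 3, 200)
    diffs = np.asarray(S_wrong) - np.asarray(S_correct)
    losses = [np.sum(np.logaddexp(0.0, -g * diffs)) for g in grid]
    return float(grid[int(np.argmin(losses))])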
5.1 Digits recognition
In this section, we present experiments on the popular MNIST [20] and USPS handwritten digit
datasets. MNIST is composed of 70 000 28 × 28 images, 60 000 for training, 10 000 for testing, each
of them containing one handwritten digit. USPS is composed of 7291 training images and 2007 test
images of size 16 × 16. As is often done in classification, we have chosen to learn pairwise binary
classifiers, one for each pair of digits. Although our framework extends to a multiclass formulation,
pairwise binary classifiers have resulted in slightly better performance in practice. Five-fold cross
validation is performed to find the best pair (k, κ). The tested values for k are {24, 32, 48, 64, 96},
and for κ, {0.13, 0.14, 0.15, 0.16, 0.17}. We keep the three best pairs of parameters and use them to
train three sets of pairwise classifiers. For a given image x, the test procedure consists of selecting
the class which receives the most votes from the pairwise classifiers. All the other parameters are
obtained using the procedure explained above. Classification results are presented in Table 1 using
the linear model. We see that for the linear model L, SDL-D L performs the best. REC BL offers
a larger feature space and performs better than REC L, but we have observed no gain by using
SDL-G BL or SDL-D BL instead of REC BL (these results are not reported in this table). Since the
linear model is already performing very well, one side effect of using BL instead of L is to increase
the number of free parameters and thus to cause overfitting. Note that our method is competitive
since the best error rates published on these datasets (without any modification of the training set)
are 0.60% [18] for MNIST and 2.4% [21] for USPS, using methods tailored to these tasks, whereas
ours is generic and has not been tuned for the handwritten digit classification domain.
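The one-vs-one voting used at test time can be sketched as follows; the `pairwise` dictionary interface is hypothetical, mapping a digit pair (a, b) to a trained binary classifier that returns a positive score for a.

def predict_by_voting(x, pairwise):
    """Majority vote over pairwise binary classifiers (sketch)."""
    votes = {}
    for (a, b), clf in pairwise.items():
        winner = a if clf(x) > 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)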
The purpose of our second experiment is not to measure the raw performance of our algorithm, but
to answer the question “are the obtained dictionaries D discriminative per se?”. To do so, we have
trained on the USPS dataset 10 binary classifiers, one per digit in a one vs all fashion on the training
set. For a given value of µ, we obtain 10 dictionaries D and 10 sets of parameters θ, learned by the
SDL-D L model.
To evaluate the discriminative power of the dictionaries D, we discard the learned parameters θ and
use the dictionaries as if they had been learned in a reconstructive REC model: For each dictionary,
(a) REC, MNIST    (b) SDL-D, MNIST    (right panel: error rate in percent, 0-2.5, as a function of µ, 0-1)
Figure 3: On the left, a reconstructive and a discriminative dictionary. On the right, average error rate in percent obtained by our dictionaries learned in a discriminative framework (SDL-D L) for various values of µ, when used at test time in a reconstructive framework (REC L).
m         REC L    SDL-G L    SDL-D L    REC BL    SDL-G BL    SDL-D BL    Gain
300       48.84    47.34      44.84      26.34     26.34       26.34        0%
1 500     46.8     46.3       42         22.7      22.3        22.3         2%
3 000     45.17    45.1       40.6       21.99     21.22       21.22        4%
6 000     45.71    43.68      39.77      19.77     18.75       18.61        6%
15 000    47.54    46.15      38.99      18.2      17.26       15.48       15%
30 000    47.28    45.1       38.3       18.99     16.84       14.26       25%

Table 2: Error rates (in percent) for the texture classification task using various methods and sizes m
of the training set. The last column indicates the gain between the error rates of REC BL and SDL-D BL.

we decompose each image from the training set by solving the simple sparse reconstruction problem
from Eq. (1) instead of using supervised sparse coding. This provides us with some coefficients α,
which we use as features in a linear SVM. Repeating the sparse decomposition procedure on the
test set permits us to evaluate the performance of these learned linear SVMs. We plot the average
error rate of these classifiers in Figure 3 for each value of µ. We see that using the dictionaries obtained with discriminative learning (µ > 0, SDL-D L) dramatically improves the performance of
the basic linear classifier learned a posteriori on the α’s, showing that our learned dictionaries are
discriminative per se. Figure 3 also shows a dictionary adapted to the reconstruction of the MNIST
dataset and a discriminative one, adapted to “9 vs all”.

5.2 Texture classification


In the digit recognition task, our bilinear framework did not perform better than the linear one L. We
believe that one of the main reasons is the simplicity of the task, for which a linear model is rich
enough. The purpose of our next experiment is to answer the question “When is BL worth using?”.
We have chosen to consider two texture images from the Brodatz dataset, presented in Figure 4, and
to build two classes, composed of 12 × 12 patches taken from these two textures. We have compared
the classification performance of all our methods, including BL, for a dictionary of size k = 64 and
κ = 0.15. The training set was composed of patches from the left half of each texture and the test set of patches from the right half, so that there is no overlap between the training and test sets. Error rates are reported in Table 2 for varying sizes of the training set. This experiment shows that in some cases the linear model performs very poorly whereas BL does better. Discrimination
helps especially when the size of the training set is large. Note that we did not perform any cross-
validation to optimize the parameters k and κ for this experiment. Dictionaries obtained with REC
and SDL-D BL are presented in Figure 4. Note that though they are visually quite similar, they lead
to very different performances.

6 Conclusion
We have introduced in this paper a discriminative approach to supervised dictionary learning that
effectively exploits the corresponding sparse signal decompositions in image classification tasks, and
have proposed an effective method for learning a shared dictionary and multiple (linear or bilinear)
models. Future work will be devoted to adapting the proposed framework to shift-invariant models
that are standard in image processing tasks, but not readily generalized to the sparse dictionary
learning setting. We are also investigating extensions to unsupervised and semi-supervised learning
and applications to natural image classification.
(a) Texture 1 (b) Texture 2 (c) REC (d) SDL-D BL

Figure 4: Left: test textures. Right: reconstructive and discriminative dictionaries.

Acknowledgments
This paper was supported in part by ANR under grant MGA. Guillermo Sapiro would like to thank
Fernando Rodriguez for insights into the learning of discriminatory sparsity patterns. His work is
partially supported by NSF, NGA, ONR, ARO, and DARPA.

References
[1] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 1997.
[2] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictio-
naries. IEEE Trans. IP, 54(12), 2006.
[3] K. Huang and S. Aviyente. Sparse representation for signal classification. In NIPS, 2006.
[4] J. Wright, A. Y. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. PAMI, 2008, to appear.
[5] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: transfer learning from unla-
beled data. In ICML, 2007.
[6] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Learning discriminative dictionaries for local
image analysis. In CVPR, 2008.
[7] M. Ranzato and M. Szummer. Semi-supervised learning of compact document representations with deep
networks. In ICML, 2008.
[8] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2006.
[9] F. Rodriguez and G. Sapiro. Sparse representations for image classification: Learning discriminative and
reconstructive non-parametric dictionaries. IMA Preprint 2213, 2007.
[10] D. Blei and J. McAuliffe. Supervised topic models. In NIPS, 2007.
[11] A. Holub and P. Perona. A discriminative framework for modeling object classes. In CVPR, 2005.
[12] J.A. Lasserre, C.M. Bishop, and T.P. Minka. Principled hybrids of generative and discriminative models.
In CVPR, 2006.
[13] R. Raina, Y. Shen, A. Y. Ng, and A. McCallum. Classification with hybrid generative/discriminative
models. In NIPS, 2004.
[14] R. R. Salakhutdinov and G. E. Hinton. Learning a non-linear embedding by preserving class neighbour-
hood structure. In AI and Statistics, 2007.
[15] H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines. In ICML, 2008.
[16] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Stat., 32(2), 2004.
[17] E. T. Hale, W. Yin, and Y. Zhang. A fixed-point continuation method for l1-regularized minimization with
applications to compressed sensing. CAAM Tech Report TR07-07, 2007.
[18] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an
energy-based model. In NIPS, 2006.
[19] M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for designing of overcomplete
dictionaries for sparse representations. IEEE Trans. SP, 54(11), 2006.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proc. of the IEEE, 86(11), 1998.
[21] B. Haasdonk and D. Keysers. Tangent distance kernels for support vector machines. In ICPR, 2002.
