Perturbed (accelerated) Proximal-Gradient algorithms
Gersende Fort
CNRS & Institut de Mathématiques de Toulouse
France
Joint work with Eric Moulines (École Polytechnique, France); Yves Atchadé (Univ. Michigan, USA); J.F. Aujol (Univ. Bordeaux, France) and C. Dossal (INSA Toulouse, France)
Interested in (1/3)
$(\mathrm{arg})\min_{\theta \in \mathbb{R}^p} \left( f(\theta) + g(\theta) \right)$

with
- $g : \mathbb{R}^p \to [0, \infty]$ convex, non-smooth, not identically equal to $+\infty$, and lower semi-continuous; the proximal map $\mathrm{Prox}_{\gamma g}(\tau)$ is explicit.
- $f$ smooth (Lipschitz gradient) with an intractable gradient $\nabla f$.

Algorithm: Perturbed Proximal-Gradient
$\theta_{k+1} = \mathrm{Prox}_{\gamma_{k+1} g}\big(\theta_k - \gamma_{k+1}\, \widehat{\nabla f}(\theta_k)\big)$
where $\widehat{\nabla f}(\theta_k)$ is an approximation of $\nabla f(\theta_k)$.

Questions: which conditions on $\gamma_{k+1}$ and on the perturbation $\widehat{\nabla f}(\theta_k) - \nabla f(\theta_k)$ ensure the same limiting behavior as the exact Prox-Gdt algorithm? (A sketch of the iteration follows.)
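To fix ideas, here is a minimal Python sketch of the iteration, assuming the illustrative choice $g = \lambda \|\cdot\|_1$ (whose prox is soft-thresholding) and a user-supplied noisy gradient oracle `grad_approx`; neither choice comes from the slides.

```python
import numpy as np

def prox_l1(tau, threshold):
    """Proximal map of threshold * ||.||_1, i.e. componentwise soft-thresholding."""
    return np.sign(tau) * np.maximum(np.abs(tau) - threshold, 0.0)

def perturbed_prox_gradient(theta0, grad_approx, lam, gamma, n_iter=1000):
    """theta_{k+1} = Prox_{gamma_{k+1} g}(theta_k - gamma_{k+1} * grad_hat(theta_k)).

    grad_approx : callable theta -> noisy/biased estimate of grad f(theta)
    lam         : weight of the illustrative l1 penalty g = lam * ||.||_1
    gamma       : callable k -> step size gamma_{k+1}
    """
    theta = np.array(theta0, dtype=float)
    for k in range(n_iter):
        step = gamma(k)
        theta = prox_l1(theta - step * grad_approx(theta), step * lam)
    return theta
```

A typical step-size schedule in the regimes discussed later would be, e.g., `gamma = lambda k: 1.0 / (k + 1) ** 0.6`.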
Interested in (2/3)
Furthermore, consider the case where
a) the gradient is an intractable expectation
$\nabla f(\theta) = \int_{\mathsf{X}} H(\theta, x)\, \pi_\theta(\mathrm{d}x)$
with $H$ explicit and $\pi_\theta$ a probability distribution;
b) stochastic approximation is used to avoid the curse of dimensionality;
c) i.i.d. Monte Carlo is not possible/efficient $\to$ Markov chain Monte Carlo (MCMC) sampling.

Questions: since MCMC provides a biased approximation,
$\widehat{\nabla f}(\theta_k) = \frac{1}{m_{k+1}} \sum_{j=1}^{m_{k+1}} H(\theta_k, X_{j,k}), \qquad \mathbb{E}\bigg[\frac{1}{m_{k+1}} \sum_{j=1}^{m_{k+1}} H(\theta_k, X_{j,k})\bigg] - \nabla f(\theta_k) \neq 0$
where $\{X_{1,k}, \cdots, X_{j,k}, \cdots\}$ is a Markov chain with stationary distribution $\pi_{\theta_k}$:
- which conditions on $\gamma_{k+1}$ and on the Monte Carlo batch size $m_{k+1}$?
- is it possible to have a non-vanishing bias, i.e. $m_{k+1} = m$? (A sketch of such an estimator follows.)
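A minimal sketch of the biased estimator, assuming a random-walk Metropolis kernel and user-supplied placeholders `H` and `log_pi` (none of which are fixed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def mcmc_grad_estimate(theta, H, log_pi, x_init, m, step=0.5):
    """Biased Monte Carlo estimate of grad f(theta) = E_{pi_theta}[H(theta, X)]
    via a random-walk Metropolis chain targeting pi_theta.

    H      : callable (theta, x) -> ndarray, the explicit integrand
    log_pi : callable (theta, x) -> float, log-density of pi_theta (up to a constant)
    x_init : starting state (e.g. the final state of the previous chain)
    m      : batch size m_{k+1}
    """
    x = np.array(x_init, dtype=float)
    total = np.zeros_like(H(theta, x))
    for _ in range(m):
        proposal = x + step * rng.standard_normal(x.shape)
        # Metropolis accept/reject: for any finite m the chain is not at
        # stationarity, which is exactly why the estimate is biased.
        if np.log(rng.uniform()) < log_pi(theta, proposal) - log_pi(theta, x):
            x = proposal
        total += H(theta, x)
    return total / m, x  # return the last state to warm-start the next chain
```

Returning the final state to warm-start the chain at the next iteration is one common way to implement the successive chains $\{X_{j,k}\}_j$.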
Interested in (3/3)
Perturbed Prox-Gdt + Acceleration:
$\tau_k = \theta_k + \frac{t_{k-1} - 1}{t_k} (\theta_k - \theta_{k-1}), \qquad \theta_{k+1} = \mathrm{Prox}_{\gamma_{k+1} g}\big(\tau_k - \gamma_{k+1}\, \widehat{\nabla f}(\tau_k)\big)$

Questions:
- Which sequences $\gamma_k, t_k$, among those satisfying $\gamma_{k+1} t_k (t_k - 1) \le \gamma_k t_{k-1}^2$?
- When using a stochastic approximation of the gradient: which Monte Carlo batch size $m_k$?
- Is there a gain in considering $t_k = O(k^d)$ for some $0 \le d \le 1$? (See the sketch below.)
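A minimal sketch of the accelerated variant with a constant step size, again using the illustrative $\ell_1$ prox; the classical FISTA recursion for $t_k$ is one concrete sequence satisfying the displayed condition (with equality when $\gamma_k \equiv \gamma$):

```python
import numpy as np

def prox_l1(tau, threshold):
    """Componentwise soft-thresholding, prox of threshold * ||.||_1."""
    return np.sign(tau) * np.maximum(np.abs(tau) - threshold, 0.0)

def accelerated_perturbed_prox_gradient(theta0, grad_approx, lam, gamma, n_iter=1000):
    """FISTA-type perturbed iteration with t_k = (1 + sqrt(1 + 4 t_{k-1}^2)) / 2,
    which gives gamma * t_k * (t_k - 1) = gamma * t_{k-1}^2 for constant gamma."""
    theta_prev = np.array(theta0, dtype=float)
    theta = theta_prev.copy()
    t_prev = 1.0
    for _ in range(n_iter):
        t = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))
        tau = theta + ((t_prev - 1.0) / t) * (theta - theta_prev)  # extrapolation
        theta_prev = theta
        theta = prox_l1(tau - gamma * grad_approx(tau), gamma * lam)
        t_prev = t
    return theta
```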
Motivations for MCMC approx (1/3)
Computational Statistics, Statistical Learning:
- Online learning: here the "Monte Carlo points" are the examples/observations.
- Penalized Maximum Likelihood Estimation in a parametric model:
$\mathrm{argmin}_\theta \; \underbrace{f(\theta)}_{\text{negative log-likelihood}} + \underbrace{g(\theta)}_{\text{penalty term}}$
Motivations for MCMC approx (2/3)
Example 1: Latent variable models

The log-likelihood $\ell(\theta)$ of the $n$ observations (the dependence upon the observations is omitted):
$\ell(\theta) = \log \int_{\mathsf{X}} \underbrace{p(x, \theta)}_{\text{complete likelihood}} \, \mu(\mathrm{d}x)$
Intractable integral.

Its gradient:
$\nabla \ell(\theta) = \int \partial_\theta \log p(x, \theta)\; \underbrace{\frac{p(x, \theta)}{\int p(u, \theta)\, \mu(\mathrm{d}u)}}_{\text{a posteriori distribution}} \, \mu(\mathrm{d}x)$
Intractable integral since the normalizing constant is unknown $\longrightarrow$ MCMC.
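The gradient formula is the classical Fisher identity; a one-line derivation (standard, not on the slide) makes the posterior expectation explicit:

```latex
\nabla \ell(\theta)
  = \frac{\int \partial_\theta p(x,\theta)\,\mu(\mathrm{d}x)}{\int p(u,\theta)\,\mu(\mathrm{d}u)}
  = \int \partial_\theta \log p(x,\theta)\,
         \frac{p(x,\theta)}{\int p(u,\theta)\,\mu(\mathrm{d}u)}\,\mu(\mathrm{d}x)
  = \mathbb{E}_{\pi_\theta}\!\left[\partial_\theta \log p(X,\theta)\right]
```

where $\pi_\theta$ is the a posteriori distribution of the latent variable: exactly the form $\int H(\theta, x)\, \pi_\theta(\mathrm{d}x)$ of slide "Interested in (2/3)" (up to the sign, since $f$ is the negative log-likelihood).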
Motivations for MCMC approx (3/3)
Example 2: Binary graphical model

$N$ i.i.d. $\{0,1\}^p$-valued observations from the distribution
$\pi_\theta(y_{1:p}) = \frac{1}{Z_\theta} \exp\Big( \sum_{i=1}^p \theta_i y_i + \sum_{1 \le i < j \le p} \theta_{ij} \mathbb{1}_{y_i = y_j} \Big)$

The log-likelihood of the observations $Y^1, \cdots, Y^N$:
$\ell(\theta) = \sum_{i=1}^p \theta_i \sum_{n=1}^N Y_i^n + \sum_{1 \le i < j \le p} \theta_{ij} \sum_{n=1}^N \mathbb{1}_{Y_i^n = Y_j^n} - N \log Z_\theta$

Its gradient:
$\partial_{\theta_i} \ell(\theta) = \sum_{n=1}^N Y_i^n - N \sum_{y_{1:p} \in \{0,1\}^p} y_i\, \pi_\theta(y), \qquad \partial_{\theta_{ij}} \ell(\theta) = \sum_{n=1}^N \mathbb{1}_{Y_i^n = Y_j^n} - N \sum_{y_{1:p} \in \{0,1\}^p} \mathbb{1}_{y_i = y_j}\, \pi_\theta(y)$

The expectations under $\pi_\theta$ are sums over $2^p$ configurations, hence intractable for large $p$ $\longrightarrow$ MCMC. (A sampler sketch follows.)
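A minimal sketch of one way to sample $\pi_\theta$ for these expectations, using a single-site Gibbs sweep (an illustrative choice; the slides do not fix a sampler). `theta_main` holds the $\theta_i$ and `theta_pair` the $\theta_{ij}$ as a symmetric matrix with zero diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(y, theta_main, theta_pair):
    """One systematic-scan Gibbs sweep for
    pi_theta(y) = exp(sum_i theta_i y_i + sum_{i<j} theta_ij 1{y_i=y_j}) / Z_theta.

    y          : current configuration, float array of 0.0/1.0, length p
    theta_main : the theta_i, shape (p,)
    theta_pair : the theta_ij, symmetric with zero diagonal, shape (p, p)
    """
    p = y.size
    for i in range(p):
        # log P(y_i=1 | rest) - log P(y_i=0 | rest)
        #   = theta_i + sum_{j != i} theta_ij * (1{y_j=1} - 1{y_j=0})
        logit = theta_main[i] + theta_pair[i] @ (2.0 * y - 1.0)
        y[i] = float(rng.uniform() < 1.0 / (1.0 + np.exp(-logit)))
    return y
```

Averaging $y_i$ and $\mathbb{1}_{y_i = y_j}$ over successive sweeps then approximates the two expectations appearing in the gradient.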
Results on Perturbed Prox-Gdt (1/2)
Set $\mathcal{L} = \mathrm{argmin}_\Theta (f + g)$ and $\eta_{n+1} = \widehat{\nabla f}(\theta_n) - \nabla f(\theta_n)$.

Theorem (Atchadé, F., Moulines (2015))
Assume
- $g$ convex, lower semi-continuous; $f$ convex, $C^1$ and its gradient Lipschitz with constant $L$; $\mathcal{L}$ is non-empty.
- $\sum_n \gamma_n = +\infty$ and $\gamma_n \in (0, 1/L]$.
- Convergence of the series
$\sum_n \gamma_{n+1}^2 \|\eta_{n+1}\|^2, \qquad \sum_n \gamma_{n+1} \eta_{n+1}, \qquad \sum_n \gamma_{n+1} \langle A_n, \eta_{n+1} \rangle$
where $A_n = \mathrm{Prox}_{\gamma_{n+1} g}(\theta_n - \gamma_{n+1} \nabla f(\theta_n))$.

Then there exists $\theta_\star \in \mathcal{L}$ such that $\lim_n \theta_n = \theta_\star$.

It generalizes and improves on previous results. What can be said in the non-convex case (open question) and with a non-explicit "Prox"?
Results on Perturbed Prox-Gdt (2/2)
Given non-negative weights $a_1, \cdots, a_n$, set $A_n \overset{\mathrm{def}}{=} \sum_{k=1}^n a_k$.

Theorem (Atchadé, F., Moulines)
For any $\theta_\star \in \mathrm{argmin}_\Theta (f + g)$,
$(f+g)\Big( \sum_{k=1}^n \frac{a_k}{A_n}\, \theta_k \Big) - \min(f+g) \le \frac{a_0}{2 \gamma_0 A_n} \|\theta_0 - \theta_\star\|^2 + \frac{1}{2 A_n} \sum_{k=1}^n \Big( \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \Big) \|\theta_{k-1} - \theta_\star\|^2 + \frac{1}{A_n} \sum_{k=1}^n a_k \gamma_k \|\eta_k\|^2 - \frac{1}{A_n} \sum_{k=1}^n a_k \langle A_{k-1} - \theta_\star, \eta_k \rangle$

In the case of a stochastic perturbation $\eta_k = \widehat{\nabla f}(\theta_k) - \nabla f(\theta_k)$: this yields bounds with high probability, in expectation, in $L^q$, ...
Stochastic Prox-Gdt, with (possibly) biased MC approximation

Under ergodicity conditions on the MCMC samplers, we have (with $F = f + g$)
$\Big\| F\Big( \frac{1}{n} \sum_{k=1}^n \theta_k \Big) - \min F \Big\|_{L^q} = O(u_n)$
with:
- Constant MC batch size $m_n = m$ (i.e. a non-vanishing approximation error $\to$ technical proof): $u_n = \frac{1}{\sqrt{n}}$ with $\gamma_n = \frac{\gamma}{n^a}$, $a \in [1/2, 1]$.
- Increasing MC batch size: $u_n = \frac{1}{n}$ with $\gamma_n = \gamma$, $m_n \propto n$.

The rate $1/n$ comes with a total Monte Carlo cost of $O(n^2)$ draws (see the accounting below).
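The cost accounting behind the last claim is short and worth spelling out (standard, not on the slide): with $m_k \propto k$, the total number of draws after $n$ iterations is

```latex
N = \sum_{k=1}^{n} m_k \;\propto\; \sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; O(n^2),
\qquad\text{hence}\qquad
u_n = \frac{1}{n} \asymp \frac{1}{\sqrt{N}} .
```

The constant-batch strategy gives the same budget-normalized rate: $n = N/m$ iterations yield $u_n = 1/\sqrt{n} = O(1/\sqrt{N})$.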
Nesterov-based acceleration of the Stochastic Prox-Gdt alg
Convergence. Choose $\gamma_n, m_n, t_n$ s.t.
$\gamma_n \in (0, 1/L], \qquad \gamma_{k+1} t_k (t_k - 1) \le \gamma_k t_{k-1}^2, \qquad \lim_n \gamma_n t_n^2 = +\infty, \qquad \sum_n \gamma_n t_n (1 + \gamma_n t_n) \frac{1}{m_n} < \infty$
Then there exists $\theta_\star \in \mathrm{argmin}_\Theta F$ s.t. $\lim_n \theta_n = \theta_\star$.

Rate on $F$. In addition,
$\mathbb{E}[F(\theta_{n+1}) - \min F] = O(u_n)$

$\gamma_n$ | $m_n$ | $t_n$ | $u_n$ | NbrMC
$\gamma$ | $n^3$ | $n$ | $n^{-2}$ | $n^4$
$\gamma/\sqrt{n}$ | $n^2$ | $n$ | $n^{-3/2}$ | $n^3$

In all strategies: for a MC computational cost $N$, the rate is $1/\sqrt{N}$.
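A quick check (not on the slide) that the table and the $1/\sqrt{N}$ summary agree:

```latex
\text{Row 1: } N = \sum_{k=1}^{n} k^3 = O(n^4), \quad u_n = n^{-2} \asymp N^{-1/2};
\qquad
\text{Row 2: } N = \sum_{k=1}^{n} k^2 = O(n^3), \quad u_n = n^{-3/2} \asymp N^{-1/2}.
```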
Open questions
1. Variance reduction technique. Here the variance of the MC approximation is $O(1/m_n)$. What happens when a "variance reduction" MC technique is used?
2. Averaging. Given non-negative weights $a_1, \cdots, a_n$, do $\gamma_k, t_k, m_k$ exist such that
$\sup_n \; a_n \big( (f+g)(\theta_n) - \min(f+g) \big) < \infty, \qquad (f+g)\Big( \sum_{k=1}^n \frac{a_k}{\sum_{j=1}^n a_j}\, \theta_k \Big) - \min(f+g) = O\Big( \frac{1}{\sum_{k=1}^n a_k} \Big)\,?$
3. Maximal rate. What is the maximal rate after $n$ iterations? After $N$ Monte Carlo draws?
4. (F)ISTA? What about $t_n = O(n^d)$ for some $0 < d < 1$?

A first answer: with variance reduction MC techniques, Nesterov acceleration ($d = 1$), $\gamma_k = \gamma$, $m_n = n^3$ and $a_n = n$: after $N$ MC draws, the rate is always better than $1/\sqrt{N}$.