Parallel Optimization in Machine
Learning
Fabian Pedregosa
December 19, 2017 Huawei Paris Research Center
About me
• Engineer (2010-2012), Inria Saclay
(scikit-learn kickstart).
• PhD (2012-2015), Inria Saclay.
• Postdoc (2015-2016),
Dauphine–ENS–Inria Paris.
• Postdoc (2017-present), UC Berkeley
- ETH Zurich (Marie-Curie fellowship,
European Commission)
Hacker at heart ... trapped in a
researcher’s body.
1/32
Motivation
Computer ad in 1993 vs. computer ad in 2006
What has changed?
2006 = the ad no longer mentions processor speed.
Primary feature: the number of cores.
2/32
40 years of CPU trends
• Speed of CPUs has stagnated since 2005.
• Multi-core architectures are here to stay.
Parallel algorithms are needed to take advantage of modern CPUs.
3/32
Parallel optimization
Parallel algorithms can be divided into two large categories:
synchronous and asynchronous. Image credits: (Peng et al. 2016)
Synchronous methods
✓ Easy to implement (i.e., mature software packages exist).
✓ Well understood.
✗ Limited speedup due to synchronization costs.
Asynchronous methods
✓ Faster, typically larger speedups.
✗ Not well understood; large gap between theory and practice.
✗ No mature software solutions.
4/32
Outline
Synchronous methods
• Synchronous (stochastic) gradient descent.
Asynchronous methods
• Asynchronous stochastic gradient descent (Hogwild) (Niu et al.
2011)
• Asynchronous variance-reduced stochastic methods (Leblond, P.,
and Lacoste-Julien 2017), (Pedregosa, Leblond, and
Lacoste-Julien 2017).
• Analysis of asynchronous methods.
• Codes and implementation aspects.
Leaving out many parallel synchronous methods: ADMM (Glowinski
and Marroco 1975), CoCoA (Jaggi et al. 2014), DANE (Shamir, Srebro,
and Zhang 2014), to name a few.
5/32
Outline
Most of the following is joint work with Rémi Leblond and Simon
Lacoste-Julien
Rémi Leblond Simon Lacoste–Julien
6/32
Synchronous algorithms
Optimization for machine learning
A large part of the problems in machine learning can be framed as optimization problems of the form
    minimize_x  f(x) := (1/n) ∑_{i=1}^{n} fi(x)
Gradient descent (Cauchy 1847). Descend along the steepest direction (−∇f(x)):
    x⁺ = x − γ∇f(x)
Stochastic gradient descent (SGD) (Robbins and Monro 1951). Select a random index i and descend along −∇fi(x):
    x⁺ = x − γ∇fi(x)
images source: Francis Bach
7/32
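For concreteness, here is a minimal NumPy sketch (not from the deck) of both updates on a synthetic least-squares problem with fi(x) = ½(aiᵀx − bi)²; the data, step sizes and iteration counts are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.RandomState(0)
    n, p = 1000, 20
    A = rng.randn(n, p)
    b = A @ rng.randn(p) + 0.1 * rng.randn(n)    # synthetic data for f_i(x) = 0.5*(a_i^T x - b_i)^2

    def full_gradient(x):                        # gradient of f(x) = (1/n) sum_i f_i(x)
        return A.T @ (A @ x - b) / n

    def partial_gradient(x, i):                  # gradient of a single f_i
        return (A[i] @ x - b[i]) * A[i]

    gamma = n / np.linalg.norm(A, 2) ** 2        # roughly 1/L for this quadratic
    x_gd, x_sgd = np.zeros(p), np.zeros(p)
    for t in range(1000):
        x_gd -= gamma * full_gradient(x_gd)                    # gradient descent step
        i = rng.randint(n)
        x_sgd -= gamma / (1 + t) * partial_gradient(x_sgd, i)  # SGD step, decreasing step size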
Parallel synchronous gradient descent
The computation of the gradient is distributed among k workers.
• Workers can be different computers, CPUs or GPUs.
• Popular frameworks: Spark, TensorFlow, PyTorch, Hadoop.
8/32
Parallel synchronous gradient descent
1. Choose n1, . . . , nk that sum to n.
2. Distribute the computation of ∇f(x) among k nodes:
    ∇f(x) = (1/n) ∑_{i=1}^{n} ∇fi(x)
          = (1/k) [ (1/n1) ∑_{i=1}^{n1} ∇fi(x)  (done by worker 1)  +  . . .  +  (1/nk) ∑_{i=n_{k−1}}^{nk} ∇fi(x)  (done by worker k) ]
3. A master node performs the gradient descent update
    x⁺ = x − γ∇f(x)
✓ Trivial parallelization, same analysis as gradient descent.
✗ Synchronization step at every iteration (step 3).
9/32
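For concreteness, a minimal sketch (not the author's code) of one such step for the least-squares fi used in the earlier sketch, with Python threads standing in for the k workers; averaging the k block gradients reproduces the slide's formula exactly when the blocks have equal sizes.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def block_gradient(A_block, b_block, x):
        # average gradient of the least-squares f_i over one block of samples
        return A_block.T @ (A_block @ x - b_block) / len(b_block)

    def parallel_gradient_step(A, b, x, gamma, k=4):
        blocks = np.array_split(np.arange(len(b)), k)      # blocks of sizes n_1, ..., n_k
        with ThreadPoolExecutor(max_workers=k) as pool:    # threads stand in for the k workers
            futures = [pool.submit(block_gradient, A[idx], b[idx], x) for idx in blocks]
            grads = [f.result() for f in futures]
        full_grad = np.mean(grads, axis=0)                 # master averages the k block gradients
        return x - gamma * full_grad                       # synchronous gradient descent update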
Parallel synchronous SGD
Can also be extended to stochastic gradient descent.
1. Select k samples i1, . . . , ik uniformly at random.
2. Compute ∇fit in parallel on worker t.
3. Perform the (mini-batch) stochastic gradient descent update
    x⁺ = x − γ (1/k) ∑_{t=1}^{k} ∇fit(x)
✓ Trivial parallelization, same analysis as (mini-batch) stochastic gradient descent.
✓ This is the kind of parallelization implemented in deep learning libraries (TensorFlow, PyTorch, Theano, etc.).
✗ Synchronization step at every iteration (step 3).
10/32
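A corresponding sketch of one synchronous mini-batch SGD step under the same illustrative setup: k workers each compute a single partial gradient ∇fit and the master averages them.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def sync_minibatch_sgd_step(A, b, x, gamma, k, rng):
        idx = rng.randint(len(b), size=k)                   # i_1, ..., i_k sampled uniformly at random
        partial_grad = lambda i: (A[i] @ x - b[i]) * A[i]   # gradient of a single f_i
        with ThreadPoolExecutor(max_workers=k) as pool:     # one worker per sampled index
            grads = list(pool.map(partial_grad, idx))
        return x - gamma * np.mean(grads, axis=0)           # synchronized mini-batch update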
Asynchronous algorithms
Asynchronous SGD
Synchronization is the bottleneck.
What if we just ignore it?
Hogwild (Niu et al. 2011): each core runs SGD in parallel, without synchronization, and updates the same vector of coefficients.
In theory: convergence under very strong assumptions.
In practice: it just works.
11/32
Hogwild in more detail
Each core follows the same procedure
1. Read the current iterate x̂ from shared memory.
2. Sample i ∈ {1, . . . , n} uniformly at random.
3. Compute the partial gradient ∇fi(x̂).
4. Write the SGD update to shared memory: x = x − γ∇fi(x̂).
12/32
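A minimal Hogwild-style sketch of this procedure (an illustration only, assuming the same least-squares fi as in the earlier sketch, with Python threads standing in for cores; a real implementation would use compiled code, since the Python GIL prevents true parallel execution here).

    import threading
    import numpy as np

    def hogwild_worker(A, b, x_shared, gamma, n_steps, seed):
        rng = np.random.RandomState(seed)
        for _ in range(n_steps):
            x_hat = x_shared.copy()                 # 1. read the shared iterate (no lock)
            i = rng.randint(len(b))                 # 2. sample i uniformly at random
            grad = (A[i] @ x_hat - b[i]) * A[i]     # 3. partial gradient at x_hat
            x_shared -= gamma * grad                # 4. lock-free write of the SGD update

    def hogwild_sgd(A, b, gamma=1e-3, n_threads=4, n_steps=10_000):
        x = np.zeros(A.shape[1])                    # shared vector of coefficients
        threads = [threading.Thread(target=hogwild_worker,
                                    args=(A, b, x, gamma, n_steps, seed))
                   for seed in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return x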
Hogwild is fast
Hogwild can be very fast. But it's still SGD...
• With constant step size, bounces around the optimum.
• With decreasing step size, slow convergence.
• There are better alternatives (Emilie already mentioned some)
13/32
Looking for excitement? ...
analyze asynchronous methods!
Analysis of asynchronous methods
Simple things become counter-intuitive, e.g., how do we name the iterates?
The iterates will change depending on the speed of the processors.
14/32
Naming scheme in Hogwild
Simple, intuitive and wrong
Each time a core has finished writing to shared memory, increment the iteration counter.
⇐⇒ x̂t = the (t + 1)-th successful update to shared memory.
The values of x̂t and it are not determined until the iteration has finished
=⇒ x̂t and it are not necessarily independent.
15/32
Unbiased gradient estimate
SGD-like algorithms crucially rely on the unbiased property
Ei[∇fi(x)] = ∇f(x).
For synchronous algorithms, this follows from the uniform sampling of i:
    Ei[∇fi(x)] = ∑_{i=1}^{n} Proba(selecting i) ∇fi(x)
               = ∑_{i=1}^{n} (1/n) ∇fi(x)     (uniform sampling)
               = ∇f(x)
16/32
A problematic example
This labeling scheme is incompatible with the unbiasedness assumption used in the proofs.
Illustration: a problem with two samples and two cores, f = ½(f1 + f2).
Computing ∇f1 is much more expensive than computing ∇f2.
Start at x0. Because of the random sampling there are 4 possible scenarios:
1. Core 1 selects f1, Core 2 selects f1 =⇒ x1 = x0 − γ∇f1(x0)
2. Core 1 selects f1, Core 2 selects f2 =⇒ x1 = x0 − γ∇f2(x0)
3. Core 1 selects f2, Core 2 selects f1 =⇒ x1 = x0 − γ∇f2(x0)
4. Core 1 selects f2, Core 2 selects f2 =⇒ x1 = x0 − γ∇f2(x0)
So we have
    Ei[∇fi] = ¼ ∇f1 + ¾ ∇f2 ≠ ½ ∇f1 + ½ ∇f2 !!
17/32
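A tiny enumeration of the four scenarios (illustrative; it encodes the assumption that a core which picked f2 always finishes before a core computing the slower ∇f1):

    from itertools import product
    from fractions import Fraction

    # which gradient produces the first write x1, given the two cores' picks:
    # any core that picked the fast f2 writes first, so f1 wins only if both picked f1
    first_write = {picks: (1 if picks == (1, 1) else 2)
                   for picks in product([1, 2], repeat=2)}

    prob_f1 = Fraction(sum(g == 1 for g in first_write.values()), 4)
    print(prob_f1, 1 - prob_f1)   # 1/4 and 3/4: E[grad] = 1/4*grad_f1 + 3/4*grad_f2, biased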
The Art of Naming Things
A new labeling scheme
A new way to name the iterates.
“After read” labeling (Leblond, P., and Lacoste-Julien 2017). Increment the counter each time we read the vector of coefficients from shared memory.
✓ No dependency between it and the cost of computing ∇fit.
✓ Full analysis of Hogwild and other asynchronous methods in “Improved parallel stochastic optimization analysis for incremental methods”, Leblond, P., and Lacoste-Julien (submitted).
18/32
Asynchronous SAGA
The SAGA algorithm
Setting:
    minimize_x  (1/n) ∑_{i=1}^{n} fi(x)
The SAGA algorithm (Defazio, Bach, and Lacoste-Julien 2014).
Select i ∈ {1, . . . , n} and compute (x⁺, α⁺) as
    x⁺ = x − γ(∇fi(x) − αi + ᾱ) ;  αi⁺ = ∇fi(x)
where ᾱ = (1/n) ∑_{j=1}^{n} αj is the average of the memory terms.
• Like SGD, the update is unbiased, i.e., Ei[∇fi(x) − αi + ᾱ] = ∇f(x).
• Unlike SGD, because of the memory terms α, the variance → 0.
• Unlike SGD, it converges with a fixed step size (γ = 1/(3L)).
Super easy to use in scikit-learn
19/32
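For instance, the SAGA solver can be selected in scikit-learn's logistic regression; a minimal usage sketch with synthetic placeholder data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(200, 10)
    y = (X @ rng.randn(10) > 0).astype(int)      # synthetic binary labels

    # solver='saga' selects the SAGA incremental gradient method
    clf = LogisticRegression(solver='saga', max_iter=1000).fit(X, y)
    print(clf.score(X, y))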
Sparse SAGA
Need for a sparse variant of SAGA
• Many large-scale datasets are sparse.
• For sparse datasets and generalized linear models (e.g., least squares, logistic regression, etc.), the partial gradients ∇fi are sparse too.
• Asynchronous algorithms work best when updates are sparse.
The SAGA update is inefficient for sparse data:
    x⁺ = x − γ(∇fi(x) − αi + ᾱ) ;  αi⁺ = ∇fi(x)
    (∇fi(x): sparse, αi: sparse, ᾱ: dense!)
[scikit-learn uses many tricks to make this efficient, but they cannot be used in the asynchronous version]
20/32
Sparse SAGA
A sparse variant of SAGA. It relies on:
• A diagonal matrix Pi = projection onto the support of ∇fi.
• A diagonal matrix D defined as Dj,j = n / (number of samples i for which ∇jfi is nonzero).
Sparse SAGA algorithm (Leblond, P., and Lacoste-Julien 2017)
    x⁺ = x − γ(∇fi(x) − αi + PiDᾱ) ;  αi⁺ = ∇fi(x)
• All operations are sparse; the cost per iteration is O(number of nonzeros in ∇fi).
• Same convergence properties as SAGA, but with cheaper iterations in the presence of sparsity.
• Crucial property: Ei[PiD] = I.
21/32
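A hedged NumPy sketch of one Sparse SAGA update, assuming least-squares fi(x) = ½(aiᵀx − bi)² with sparse rows ai, memory terms stored in an (n, p) array alpha with average alpha_bar, and d_diag holding the diagonal of D; all names are illustrative, not the authors' code.

    import numpy as np

    def sparse_saga_step(A, b, x, alpha, alpha_bar, d_diag, i, gamma):
        support = np.flatnonzero(A[i])                  # support of grad f_i (the projection P_i)
        grad_i = (A[i, support] @ x[support] - b[i]) * A[i, support]
        # x <- x - gamma * (grad f_i(x) - alpha_i + P_i D alpha_bar), restricted to the support
        x[support] -= gamma * (grad_i - alpha[i, support]
                               + d_diag[support] * alpha_bar[support])
        # alpha_i <- grad f_i(x), keeping alpha_bar equal to the average of the alpha_j
        alpha_bar[support] += (grad_i - alpha[i, support]) / len(alpha)
        alpha[i, support] = grad_i
        return x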
Asynchronous SAGA (ASAGA)
• Each core runs an instance of Sparse SAGA.
• All cores update the same shared vector of coefficients x and memory terms α, ᾱ.
Theory: under standard assumptions (bounded delays), ASAGA has the same convergence rate as the sequential version
=⇒ theoretical linear speedup with respect to the number of cores.
22/32
Experiments
• Improved convergence of variance-reduced methods wrt SGD.
• Significant improvement between 1 and 10 cores.
• Speedup is significant, but far from ideal.
23/32
Non-smooth problems
Composite objective
The previous methods assume that the objective function is smooth.
They cannot be applied to the Lasso, Group Lasso, box constraints, etc.
Objective: minimize a composite objective function
    minimize_x  (1/n) ∑_{i=1}^{n} fi(x) + ∥x∥1
where each fi is smooth (and ∥ · ∥1 is not). For simplicity we take the nonsmooth term to be the ℓ1 norm, but this generalizes to any convex function for which we have access to its proximal operator.
24/32
(Prox)SAGA
The ProxSAGA update is inefficient:
    x⁺ = prox_{γh}(x − γ(∇fi(x) − αi + ᾱ)) ;  αi⁺ = ∇fi(x)
    (∇fi(x): sparse, αi: sparse, ᾱ: dense!, and the prox makes the whole update dense!)
=⇒ a sparse variant is needed as a prerequisite for a practical
parallel method.
25/32
Sparse Proximal SAGA
Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017).
Extension of Sparse SAGA to composite optimization problems.
Like SAGA, it relies on an unbiased gradient estimate and a proximal step:
    vi = ∇fi(x) − αi + DPiᾱ ;  x⁺ = prox_{γφi}(x − γvi) ;  αi⁺ = ∇fi(x)
where Pi and D are as in Sparse SAGA and φi(x) := ∑_{j=1}^{d} (PiD)j,j |xj|.
φi has two key properties: i) support of φi = support of ∇fi (sparse updates), and ii) Ei[φi] = ∥x∥1 (unbiasedness).
Convergence: same linear convergence rate as SAGA, with cheaper updates in the presence of sparsity.
26/32
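A matching sketch of the Sparse Proximal SAGA step for the ℓ1-regularized case, under the same illustrative setup as the Sparse SAGA sketch above: the proximal operator of φi is a coordinate-wise soft-thresholding applied only on the support of ∇fi.

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def sparse_prox_saga_step(A, b, x, alpha, alpha_bar, d_diag, i, gamma):
        support = np.flatnonzero(A[i])                  # support of grad f_i
        grad_i = (A[i, support] @ x[support] - b[i]) * A[i, support]
        v = grad_i - alpha[i, support] + d_diag[support] * alpha_bar[support]
        # prox of phi_i = sum_j (P_i D)_jj |x_j| only touches the support coordinates
        x[support] = soft_threshold(x[support] - gamma * v, gamma * d_diag[support])
        alpha_bar[support] += (grad_i - alpha[i, support]) / len(alpha)
        alpha[i, support] = grad_i
        return x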
Proximal Asynchronous SAGA (ProxASAGA)
Each core runs Sparse Proximal SAGA asynchronously, without locks, and updates x, α and ᾱ in shared memory.
✓ All read/write operations to shared memory are inconsistent, i.e., there are no performance-destroying vector-level locks while reading/writing.
Convergence: under sparsity assumptions, ProxASAGA converges at the same rate as the sequential algorithm =⇒ theoretical linear speedup with respect to the number of cores.
27/32
Empirical results
ProxASAGA vs competing methods on 3 large-scale datasets,
ℓ1-regularized logistic regression
Dataset      n            p           density    L      ∆
KDD 2010     19,264,097   1,163,024   10⁻⁶       28.12  0.15
KDD 2012     149,639,105  54,686,452  2 × 10⁻⁷   1.25   0.85
Criteo       45,840,617   1,000,000   4 × 10⁻⁵   1.25   0.89
[Figure: objective minus optimum vs. time (in minutes) on the KDD10, KDD12 and Criteo datasets, for ProxASAGA, AsySPCD and FISTA, each with 1 and 10 cores.]
28/32
Empirical results - Speedup
Speedup = (time to 10⁻¹⁰ suboptimality on one core) / (time to the same suboptimality on k cores)
[Figure: time speedup vs. number of cores (1–20) on the KDD10, KDD12 and Criteo datasets, for ProxASAGA, AsySPCD and FISTA, against the ideal linear speedup.]
• ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture.
• As predicted by theory, there is a high correlation between the degree of sparsity and the speedup.
29/32
Perspectives
• Scale above 20 cores.
• Asynchronous optimization on the GPU.
• Acceleration.
• Software development.
30/32
Codes
Code is on GitHub: https://github.com/fabianp/ProxASAGA.
The computational code is C++ (it uses the atomic type), wrapped in Python.
A very efficient implementation of SAGA can be found in the scikit-learn and lightning
(https://github.com/scikit-learn-contrib/lightning) libraries.
31/32
References
Cauchy, Augustin (1847). “Méthode générale pour la résolution des systèmes d’équations
simultanées”. In: Comp. Rend. Sci. Paris.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient
method with support for non-strongly convex composite objectives”. In: Advances in Neural
Information Processing Systems.
Glowinski, Roland and A Marroco (1975). “Sur l’approximation, par éléments finis d’ordre un, et la
résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires”. In:
Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique.
Jaggi, Martin et al. (2014). “Communication-Efficient Distributed Dual Coordinate Ascent”. In:
Advances in Neural Information Processing Systems 27.
Leblond, Rémi, Fabian P., and Simon Lacoste-Julien (2017). “ASAGA: asynchronous parallel SAGA”. In:
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics
(AISTATS 2017).
Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”.
In: Advances in Neural Information Processing Systems.
31/32
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth
Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural
Information Processing Systems 30.
Peng, Zhimin et al. (2016). “ARock: an algorithmic framework for asynchronous parallel coordinate
updates”. In: SIAM Journal on Scientific Computing.
Robbins, Herbert and Sutton Monro (1951). “A Stochastic Approximation Method”. In: Ann. Math.
Statist.
Shamir, Ohad, Nati Srebro, and Tong Zhang (2014). “Communication-efficient distributed
optimization using an approximate Newton-type method”. In: International Conference on
Machine Learning.
32/32
Supervised Machine Learning
Data: n observations (ai, bi) ∈ Rᵖ × R
Prediction function: h(a, x) ∈ R
Motivating examples:
• Linear prediction: h(a, x) = xᵀa
• Neural networks: h(a, x) = xmᵀ σ(xm−1 σ(· · · x2ᵀ σ(x1ᵀ a)))
[Figure: a feed-forward network with an input layer (a1, . . . , a5), a hidden layer and an output layer.]
Minimize some distance (e.g., quadratic) between the prediction and the target:
    minimize_x  (1/n) ∑_{i=1}^{n} ℓ(bi, h(ai, x)) =: (1/n) ∑_{i=1}^{n} fi(x)
where popular examples of ℓ are
• Squared loss: ℓ(bi, h(ai, x)) := (bi − h(ai, x))²
• Logistic (softmax) loss: ℓ(bi, h(ai, x)) := log(1 + exp(−bi h(ai, x)))
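Written as code for concreteness (a small illustrative snippet, taking bi as a ±1 label for the logistic loss):

    import numpy as np

    def squared_loss(b, pred):
        return (b - pred) ** 2

    def logistic_loss(b, pred):
        # log(1 + exp(-b * pred)), computed stably with logaddexp
        return np.logaddexp(0.0, -b * pred)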
Sparse Proximal SAGA
For step size γ = 1/(5L) and f µ-strongly convex (µ > 0), Sparse Proximal SAGA converges geometrically in expectation. At iteration t we have
    E∥xt − x*∥² ≤ (1 − (1/5) min{1/n, 1/κ})ᵗ C0 ,
with C0 = ∥x0 − x*∥² + (1/(5L²)) ∑_{i=1}^{n} ∥αi⁰ − ∇fi(x*)∥² and κ = L/µ (the condition number).
Implications
• Same convergence rate as SAGA, but with cheaper updates.
• In the “big data regime” (n ≥ κ): rate in O(1/n).
• In the “ill-conditioned regime” (n ≤ κ): rate in O(1/κ).
• Adaptivity to strong convexity, i.e., no need to know the strong convexity parameter to obtain linear convergence.
Convergence ProxASAGA
Suppose τ ≤ 1/(10√∆). Then:
• If κ ≥ n, then with step size γ = 1/(36L), ProxASAGA converges geometrically with rate factor Ω(1/κ).
• If κ < n, then with step size γ = 1/(36nµ), ProxASAGA converges geometrically with rate factor Ω(1/n).
In both cases the convergence rate is the same as for Sparse Proximal SAGA =⇒ ProxASAGA is linearly faster, up to a constant factor. In both cases the step size does not depend on τ.
If τ ≤ 6κ, a universal step size of Θ(1/L) achieves a rate similar to that of Sparse Proximal SAGA, making it adaptive to local strong convexity (knowledge of κ not required).
ASAGA algorithm
ProxASAGA algorithm
Atomic vs non-atomic