Parallel Optimization in Machine
Learning
Fabian Pedregosa
December 19, 2017 Huawei Paris Research Center
About me
• Engineer (2010-2012), Inria Saclay
(scikit-learn kickstart).
• PhD (2012-2015), Inria Saclay.
• Postdoc (2015-2016),
Dauphine–ENS–Inria Paris.
• Postdoc (2017-present), UC Berkeley
- ETH Zurich (Marie-Curie fellowship,
European Commission)
Hacker at heart ... trapped in a
researcher’s body.
1/32
Motivation
Computer ad in 1993 vs. computer ad in 2006
What has changed?
2006 = the ad no longer mentions processor speed.
Primary feature: the number of cores.
2/32
40 years of CPU trends
• Speed of CPUs has stagnated since 2005.
• Multi-core architectures are here to stay.
Parallel algorithms are needed to take advantage of modern CPUs.
3/32
Parallel optimization
Parallel algorithms can be divided into two large categories:
synchronous and asynchronous. Image credits: (Peng et al. 2016)
Synchronous methods
✓ Easy to implement (i.e., mature software packages exist).
✓ Well understood.
✗ Limited speedup due to synchronization costs.
Asynchronous methods
✓ Faster, typically larger speedups.
✗ Not well understood; large gap between theory and practice.
✗ No mature software solutions.
4/32
Outline
Synchronous methods
• Synchronous (stochastic) gradient descent.
Asynchronous methods
• Asynchronous stochastic gradient descent (Hogwild) (Niu et al.
2011)
• Asynchronous variance-reduced stochastic methods (Leblond, P.,
and Lacoste-Julien 2017), (Pedregosa, Leblond, and
Lacoste-Julien 2017).
• Analysis of asynchronous methods.
• Codes and implementation aspects.
Leaving out many parallel synchronous methods: ADMM (Glowinski
and Marroco 1975), CoCoA (Jaggi et al. 2014), DANE (Shamir, Srebro,
and Zhang 2014), to name a few.
5/32
Outline
Most of the following is joint work with Rémi Leblond and Simon
Lacoste-Julien
Rémi Leblond Simon Lacoste–Julien
6/32
Synchronous algorithms
Optimization for machine learning
A large part of the problems in machine learning can be framed as optimization problems of the form
    minimize_x  f(x) := (1/n) ∑_{i=1}^{n} fi(x)
Gradient descent (Cauchy 1847). Descend along the steepest direction (−∇f(x)):
    x⁺ = x − γ∇f(x)
Stochastic gradient descent (SGD) (Robbins and Monro 1951). Select a random index i and descend along −∇fi(x):
    x⁺ = x − γ∇fi(x)
images source: Francis Bach
7/32
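For concreteness, here is a minimal NumPy sketch (not from the deck) of both updates on a synthetic least-squares problem with fi(x) = ½(aiᵀx − bi)²; the data, step sizes and iteration counts are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.RandomState(0)
    n, p = 1000, 20
    A = rng.randn(n, p)
    b = A @ rng.randn(p) + 0.1 * rng.randn(n)    # synthetic data for f_i(x) = 0.5*(a_i^T x - b_i)^2

    def full_gradient(x):                        # gradient of f(x) = (1/n) sum_i f_i(x)
        return A.T @ (A @ x - b) / n

    def partial_gradient(x, i):                  # gradient of a single f_i
        return (A[i] @ x - b[i]) * A[i]

    gamma = n / np.linalg.norm(A, 2) ** 2        # roughly 1/L for this quadratic
    x_gd, x_sgd = np.zeros(p), np.zeros(p)
    for t in range(1000):
        x_gd -= gamma * full_gradient(x_gd)                    # gradient descent step
        i = rng.randint(n)
        x_sgd -= gamma / (1 + t) * partial_gradient(x_sgd, i)  # SGD step, decreasing step size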
Parallel synchronous gradient descent
The computation of the gradient is distributed among k workers.
• Workers can be different computers, CPUs or GPUs.
• Popular frameworks: Spark, TensorFlow, PyTorch, Hadoop.
8/32
Parallel synchronous gradient descent
1. Choose n1, . . . , nk that sum to n.
2. Distribute the computation of ∇f(x) among k nodes:
    ∇f(x) = (1/n) ∑_{i=1}^{n} ∇fi(x)
          = (1/k) [ (1/n1) ∑_{i=1}^{n1} ∇fi(x)  (done by worker 1)  +  . . .  +  (1/nk) ∑_{i=n_{k−1}}^{nk} ∇fi(x)  (done by worker k) ]
3. A master node performs the gradient descent update
    x⁺ = x − γ∇f(x)
✓ Trivial parallelization, same analysis as gradient descent.
✗ Synchronization step at every iteration (step 3).
9/32
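For concreteness, a minimal sketch (not the author's code) of one such step for the least-squares fi used in the earlier sketch, with Python threads standing in for the k workers; averaging the k block gradients reproduces the slide's formula exactly when the blocks have equal sizes.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def block_gradient(A_block, b_block, x):
        # average gradient of the least-squares f_i over one block of samples
        return A_block.T @ (A_block @ x - b_block) / len(b_block)

    def parallel_gradient_step(A, b, x, gamma, k=4):
        blocks = np.array_split(np.arange(len(b)), k)      # blocks of sizes n_1, ..., n_k
        with ThreadPoolExecutor(max_workers=k) as pool:    # threads stand in for the k workers
            futures = [pool.submit(block_gradient, A[idx], b[idx], x) for idx in blocks]
            grads = [f.result() for f in futures]
        full_grad = np.mean(grads, axis=0)                 # master averages the k block gradients
        return x - gamma * full_grad                       # synchronous gradient descent update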
Parallel synchronous SGD
Can also be extended to stochastic gradient descent.
1. Select k samples i1, . . . , ik uniformly at random.
2. Compute ∇fit in parallel on worker t.
3. Perform the (mini-batch) stochastic gradient descent update
    x⁺ = x − γ (1/k) ∑_{t=1}^{k} ∇fit(x)
✓ Trivial parallelization, same analysis as (mini-batch) stochastic gradient descent.
✓ This is the kind of parallelization implemented in deep learning libraries (TensorFlow, PyTorch, Theano, etc.).
✗ Synchronization step at every iteration (step 3).
10/32
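A corresponding sketch of one synchronous mini-batch SGD step under the same illustrative setup: k workers each compute a single partial gradient ∇fit and the master averages them.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def sync_minibatch_sgd_step(A, b, x, gamma, k, rng):
        idx = rng.randint(len(b), size=k)                   # i_1, ..., i_k sampled uniformly at random
        partial_grad = lambda i: (A[i] @ x - b[i]) * A[i]   # gradient of a single f_i
        with ThreadPoolExecutor(max_workers=k) as pool:     # one worker per sampled index
            grads = list(pool.map(partial_grad, idx))
        return x - gamma * np.mean(grads, axis=0)           # synchronized mini-batch update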
Asynchronous algorithms
Asynchronous SGD
Synchronization is the bottleneck.
What if we just ignore it?
Hogwild (Niu et al. 2011): each core runs SGD in parallel, without synchronization, and updates the same vector of coefficients.
In theory: convergence under very strong assumptions.
In practice: it just works.
11/32
Hogwild in more detail
Each core follows the same procedure
1. Read the current iterate x̂ from shared memory.
2. Sample i ∈ {1, . . . , n} uniformly at random.
3. Compute the partial gradient ∇fi(x̂).
4. Write the SGD update to shared memory: x = x − γ∇fi(x̂).
12/32
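A minimal Hogwild-style sketch of this procedure (an illustration only, assuming the same least-squares fi as in the earlier sketch, with Python threads standing in for cores; a real implementation would use compiled code, since the Python GIL prevents true parallel execution here).

    import threading
    import numpy as np

    def hogwild_worker(A, b, x_shared, gamma, n_steps, seed):
        rng = np.random.RandomState(seed)
        for _ in range(n_steps):
            x_hat = x_shared.copy()                 # 1. read the shared iterate (no lock)
            i = rng.randint(len(b))                 # 2. sample i uniformly at random
            grad = (A[i] @ x_hat - b[i]) * A[i]     # 3. partial gradient at x_hat
            x_shared -= gamma * grad                # 4. lock-free write of the SGD update

    def hogwild_sgd(A, b, gamma=1e-3, n_threads=4, n_steps=10_000):
        x = np.zeros(A.shape[1])                    # shared vector of coefficients
        threads = [threading.Thread(target=hogwild_worker,
                                    args=(A, b, x, gamma, n_steps, seed))
                   for seed in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return x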
Hogwild is fast
Hogwild can be very fast. But it's still SGD...
• With constant step size, bounces around the optimum.
• With decreasing step size, slow convergence.
• There are better alternatives (Emilie already mentioned some)
13/32
Looking for excitement? ...
analyze asynchronous methods!
Analysis of asynchronous methods
Simple things become counter-intuitive, e.g., how do we name the iterates?
The iterates will change depending on the speed of the processors.
14/32
Naming scheme in Hogwild
Simple, intuitive and wrong
Each time a core has finished writing to shared memory, increment the iteration counter.
⇐⇒ x̂t = the (t + 1)-th successful update to shared memory.
The values of x̂t and it are not determined until the iteration has finished
=⇒ x̂t and it are not necessarily independent.
15/32
Unbiased gradient estimate
SGD-like algorithms crucially rely on the unbiased property
Ei[∇fi(x)] = ∇f(x).
For synchronous algorithms, this follows from the uniform sampling of i:
    Ei[∇fi(x)] = ∑_{i=1}^{n} Proba(selecting i) ∇fi(x)
               = ∑_{i=1}^{n} (1/n) ∇fi(x)     (uniform sampling)
               = ∇f(x)
16/32
A problematic example
This labeling scheme is incompatible with the unbiasedness assumption used in the proofs.
Illustration: a problem with two samples and two cores, f = ½(f1 + f2).
Computing ∇f1 is much more expensive than computing ∇f2.
Start at x0. Because of the random sampling there are 4 possible scenarios:
1. Core 1 selects f1, Core 2 selects f1 =⇒ x1 = x0 − γ∇f1(x0)
2. Core 1 selects f1, Core 2 selects f2 =⇒ x1 = x0 − γ∇f2(x0)
3. Core 1 selects f2, Core 2 selects f1 =⇒ x1 = x0 − γ∇f2(x0)
4. Core 1 selects f2, Core 2 selects f2 =⇒ x1 = x0 − γ∇f2(x0)
So we have
    Ei[∇fi] = ¼ ∇f1 + ¾ ∇f2 ≠ ½ ∇f1 + ½ ∇f2 !!
17/32
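A tiny enumeration of the four scenarios (illustrative; it encodes the assumption that a core which picked f2 always finishes before a core computing the slower ∇f1):

    from itertools import product
    from fractions import Fraction

    # which gradient produces the first write x1, given the two cores' picks:
    # any core that picked the fast f2 writes first, so f1 wins only if both picked f1
    first_write = {picks: (1 if picks == (1, 1) else 2)
                   for picks in product([1, 2], repeat=2)}

    prob_f1 = Fraction(sum(g == 1 for g in first_write.values()), 4)
    print(prob_f1, 1 - prob_f1)   # 1/4 and 3/4: E[grad] = 1/4*grad_f1 + 3/4*grad_f2, biased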
The Art of Naming Things
A new labeling scheme
A new way to name the iterates.
“After read” labeling (Leblond, P., and Lacoste-Julien 2017). Increment the counter each time we read the vector of coefficients from shared memory.
✓ No dependency between it and the cost of computing ∇fit.
✓ Full analysis of Hogwild and other asynchronous methods in “Improved parallel stochastic optimization analysis for incremental methods”, Leblond, P., and Lacoste-Julien (submitted).
18/32
Asynchronous SAGA
The SAGA algorithm
Setting:
    minimize_x  (1/n) ∑_{i=1}^{n} fi(x)
The SAGA algorithm (Defazio, Bach, and Lacoste-Julien 2014).
Select i ∈ {1, . . . , n} and compute (x⁺, α⁺) as
    x⁺ = x − γ(∇fi(x) − αi + ᾱ) ;  αi⁺ = ∇fi(x)
where ᾱ = (1/n) ∑_{j=1}^{n} αj is the average of the memory terms.
• Like SGD, the update is unbiased, i.e., Ei[∇fi(x) − αi + ᾱ] = ∇f(x).
• Unlike SGD, because of the memory terms α, the variance → 0.
• Unlike SGD, it converges with a fixed step size (γ = 1/(3L)).
Super easy to use in scikit-learn
19/32
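For instance, the SAGA solver can be selected in scikit-learn's logistic regression; a minimal usage sketch with synthetic placeholder data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(200, 10)
    y = (X @ rng.randn(10) > 0).astype(int)      # synthetic binary labels

    # solver='saga' selects the SAGA incremental gradient method
    clf = LogisticRegression(solver='saga', max_iter=1000).fit(X, y)
    print(clf.score(X, y))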
Sparse SAGA
Need for a sparse variant of SAGA
• Many large-scale datasets are sparse.
• For sparse datasets and generalized linear models (e.g., least squares, logistic regression, etc.), the partial gradients ∇fi are sparse too.
• Asynchronous algorithms work best when updates are sparse.
The SAGA update is inefficient for sparse data:
    x⁺ = x − γ(∇fi(x) − αi + ᾱ) ;  αi⁺ = ∇fi(x)
    (∇fi(x): sparse, αi: sparse, ᾱ: dense!)
[scikit-learn uses many tricks to make this efficient, but they cannot be used in the asynchronous version]
20/32
Sparse SAGA
A sparse variant of SAGA. It relies on:
• A diagonal matrix Pi = projection onto the support of ∇fi.
• A diagonal matrix D defined as Dj,j = n / (number of samples i for which ∇jfi is nonzero).
Sparse SAGA algorithm (Leblond, P., and Lacoste-Julien 2017)
    x⁺ = x − γ(∇fi(x) − αi + PiDᾱ) ;  αi⁺ = ∇fi(x)
• All operations are sparse; the cost per iteration is O(number of nonzeros in ∇fi).
• Same convergence properties as SAGA, but with cheaper iterations in the presence of sparsity.
• Crucial property: Ei[PiD] = I.
21/32
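A hedged NumPy sketch of one Sparse SAGA update, assuming least-squares fi(x) = ½(aiᵀx − bi)² with sparse rows ai, memory terms stored in an (n, p) array alpha with average alpha_bar, and d_diag holding the diagonal of D; all names are illustrative, not the authors' code.

    import numpy as np

    def sparse_saga_step(A, b, x, alpha, alpha_bar, d_diag, i, gamma):
        support = np.flatnonzero(A[i])                  # support of grad f_i (the projection P_i)
        grad_i = (A[i, support] @ x[support] - b[i]) * A[i, support]
        # x <- x - gamma * (grad f_i(x) - alpha_i + P_i D alpha_bar), restricted to the support
        x[support] -= gamma * (grad_i - alpha[i, support]
                               + d_diag[support] * alpha_bar[support])
        # alpha_i <- grad f_i(x), keeping alpha_bar equal to the average of the alpha_j
        alpha_bar[support] += (grad_i - alpha[i, support]) / len(alpha)
        alpha[i, support] = grad_i
        return x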
Asynchronous SAGA (ASAGA)
• Each core runs an instance of Sparse SAGA.
• All cores update the same shared vector of coefficients x and memory terms α, ᾱ.
Theory: under standard assumptions (bounded delays), ASAGA has the same convergence rate as the sequential version
=⇒ theoretical linear speedup with respect to the number of cores.
22/32
Experiments
• Improved convergence of variance-reduced methods wrt SGD.
• Significant improvement between 1 and 10 cores.
• Speedup is significant, but far from ideal.
23/32
Non-smooth problems
Composite objective
The previous methods assume that the objective function is smooth.
They cannot be applied to the Lasso, Group Lasso, box constraints, etc.
Objective: minimize a composite objective function
    minimize_x  (1/n) ∑_{i=1}^{n} fi(x) + ∥x∥1
where each fi is smooth (and ∥ · ∥1 is not). For simplicity we take the nonsmooth term to be the ℓ1 norm, but this generalizes to any convex function for which we have access to its proximal operator.
24/32
(Prox)SAGA
The ProxSAGA update is inefficient:
    x⁺ = prox_{γh}(x − γ(∇fi(x) − αi + ᾱ)) ;  αi⁺ = ∇fi(x)
    (∇fi(x): sparse, αi: sparse, ᾱ: dense!, and the prox makes the whole update dense!)
=⇒ a sparse variant is needed as a prerequisite for a practical
parallel method.
25/32
Sparse Proximal SAGA
Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017).
Extension of Sparse SAGA to composite optimization problems.
Like SAGA, it relies on an unbiased gradient estimate and a proximal step:
    vi = ∇fi(x) − αi + DPiᾱ ;  x⁺ = prox_{γφi}(x − γvi) ;  αi⁺ = ∇fi(x)
where Pi and D are as in Sparse SAGA and φi(x) := ∑_{j=1}^{d} (PiD)j,j |xj|.
φi has two key properties: i) support of φi = support of ∇fi (sparse updates), and ii) Ei[φi] = ∥x∥1 (unbiasedness).
Convergence: same linear convergence rate as SAGA, with cheaper updates in the presence of sparsity.
26/32
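A matching sketch of the Sparse Proximal SAGA step for the ℓ1-regularized case, under the same illustrative setup as the Sparse SAGA sketch above: the proximal operator of φi is a coordinate-wise soft-thresholding applied only on the support of ∇fi.

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def sparse_prox_saga_step(A, b, x, alpha, alpha_bar, d_diag, i, gamma):
        support = np.flatnonzero(A[i])                  # support of grad f_i
        grad_i = (A[i, support] @ x[support] - b[i]) * A[i, support]
        v = grad_i - alpha[i, support] + d_diag[support] * alpha_bar[support]
        # prox of phi_i = sum_j (P_i D)_jj |x_j| only touches the support coordinates
        x[support] = soft_threshold(x[support] - gamma * v, gamma * d_diag[support])
        alpha_bar[support] += (grad_i - alpha[i, support]) / len(alpha)
        alpha[i, support] = grad_i
        return x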
Proximal Asynchronous SAGA (ProxASAGA)
Each core runs Sparse Proximal SAGA asynchronously, without locks, and updates x, α and ᾱ in shared memory.
✓ All read/write operations to shared memory are inconsistent, i.e., there are no performance-destroying vector-level locks while reading/writing.
Convergence: under sparsity assumptions, ProxASAGA converges at the same rate as the sequential algorithm =⇒ theoretical linear speedup with respect to the number of cores.
27/32
Empirical results
ProxASAGA vs competing methods on 3 large-scale datasets,
ℓ1-regularized logistic regression
Dataset      n            p           density    L      ∆
KDD 2010     19,264,097   1,163,024   10⁻⁶       28.12  0.15
KDD 2012     149,639,105  54,686,452  2 × 10⁻⁷   1.25   0.85
Criteo       45,840,617   1,000,000   4 × 10⁻⁵   1.25   0.89
[Figure: objective minus optimum vs. time (in minutes) on the KDD10, KDD12 and Criteo datasets, for ProxASAGA, AsySPCD and FISTA, each with 1 and 10 cores.]
28/32
Empirical results - Speedup
Speedup = (time to 10⁻¹⁰ suboptimality on one core) / (time to the same suboptimality on k cores)
[Figure: time speedup vs. number of cores (1–20) on the KDD10, KDD12 and Criteo datasets, for ProxASAGA, AsySPCD and FISTA, against the ideal linear speedup.]
• ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture.
• As predicted by theory, there is a high correlation between the degree of sparsity and the speedup.
29/32
Perspectives
• Scale above 20 cores.
• Asynchronous optimization on the GPU.
• Acceleration.
• Software development.
30/32
Codes
Code is on GitHub: https://github.com/fabianp/ProxASAGA.
The computational code is C++ (it uses the atomic type), wrapped in Python.
A very efficient implementation of SAGA can be found in the scikit-learn and lightning
(https://github.com/scikit-learn-contrib/lightning) libraries.
31/32
References
Cauchy, Augustin (1847). “Méthode générale pour la résolution des systèmes d’équations
simultanées”. In: Comp. Rend. Sci. Paris.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient
method with support for non-strongly convex composite objectives”. In: Advances in Neural
Information Processing Systems.
Glowinski, Roland and A Marroco (1975). “Sur l’approximation, par éléments finis d’ordre un, et la
résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires”. In:
Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique.
Jaggi, Martin et al. (2014). “Communication-Efficient Distributed Dual Coordinate Ascent”. In:
Advances in Neural Information Processing Systems 27.
Leblond, Rémi, Fabian P., and Simon Lacoste-Julien (2017). “ASAGA: asynchronous parallel SAGA”. In:
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics
(AISTATS 2017).
Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”.
In: Advances in Neural Information Processing Systems.
31/32
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth
Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural
Information Processing Systems 30.
Peng, Zhimin et al. (2016). “ARock: an algorithmic framework for asynchronous parallel coordinate
updates”. In: SIAM Journal on Scientific Computing.
Robbins, Herbert and Sutton Monro (1951). “A Stochastic Approximation Method”. In: Ann. Math.
Statist.
Shamir, Ohad, Nati Srebro, and Tong Zhang (2014). “Communication-efficient distributed
optimization using an approximate Newton-type method”. In: International Conference on
Machine Learning.
32/32
Supervised Machine Learning
Data: n observations (ai, bi) ∈ Rᵖ × R
Prediction function: h(a, x) ∈ R
Motivating examples:
• Linear prediction: h(a, x) = xᵀa
• Neural networks: h(a, x) = xmᵀ σ(xm−1 σ(· · · x2ᵀ σ(x1ᵀ a)))
[Figure: a feed-forward network with an input layer (a1, . . . , a5), a hidden layer and an output layer.]
Minimize some distance (e.g., quadratic) between the prediction and the target:
    minimize_x  (1/n) ∑_{i=1}^{n} ℓ(bi, h(ai, x)) =: (1/n) ∑_{i=1}^{n} fi(x)
where popular examples of ℓ are
• Squared loss: ℓ(bi, h(ai, x)) := (bi − h(ai, x))²
• Logistic (softmax) loss: ℓ(bi, h(ai, x)) := log(1 + exp(−bi h(ai, x)))
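Written as code for concreteness (a small illustrative snippet, taking bi as a ±1 label for the logistic loss):

    import numpy as np

    def squared_loss(b, pred):
        return (b - pred) ** 2

    def logistic_loss(b, pred):
        # log(1 + exp(-b * pred)), computed stably with logaddexp
        return np.logaddexp(0.0, -b * pred)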
Sparse Proximal SAGA
For step size γ = 1/(5L) and f µ-strongly convex (µ > 0), Sparse Proximal SAGA converges geometrically in expectation. At iteration t we have
    E∥xt − x*∥² ≤ (1 − (1/5) min{1/n, 1/κ})ᵗ C0 ,
with C0 = ∥x0 − x*∥² + (1/(5L²)) ∑_{i=1}^{n} ∥αi⁰ − ∇fi(x*)∥² and κ = L/µ (the condition number).
Implications
• Same convergence rate as SAGA, but with cheaper updates.
• In the “big data regime” (n ≥ κ): rate in O(1/n).
• In the “ill-conditioned regime” (n ≤ κ): rate in O(1/κ).
• Adaptivity to strong convexity, i.e., no need to know the strong convexity parameter to obtain linear convergence.
Convergence ProxASAGA
Suppose τ ≤ 1/(10√∆). Then:
• If κ ≥ n, then with step size γ = 1/(36L), ProxASAGA converges geometrically with rate factor Ω(1/κ).
• If κ < n, then with step size γ = 1/(36nµ), ProxASAGA converges geometrically with rate factor Ω(1/n).
In both cases the convergence rate is the same as for Sparse Proximal SAGA =⇒ ProxASAGA is linearly faster, up to a constant factor. In both cases the step size does not depend on τ.
If τ ≤ 6κ, a universal step size of Θ(1/L) achieves a rate similar to that of Sparse Proximal SAGA, making it adaptive to local strong convexity (knowledge of κ not required).
ASAGA algorithm
ProxASAGA algorithm
Atomic vs non-atomic