Relaxation methods for the matrix exponential on large networks

Code: www.cs.purdue.edu/homes/dgleich/codes/nexpokit

David F. Gleich, Purdue University

Joint work with Kyle Kloster (Purdue), supported by NSF CAREER award CCF-1149756.

Previous work from the PI: models and algorithms for high-performance matrix and network computations.

Massive matrix computations: Ax = b, min ||Ax - b||, Ax = \lambda x. SC '05, WAW '07, SISC '10, WWW '10, ...

Simulation data analysis on MapReduce architectures: SIMAX '09, SISC '11, MapReduce '11, ICASSP '12.

Network alignment with matrix methods: matching triangles and maximizing edge overlap on multi-threaded and distributed architectures. ICDM '09, SC '11, TKDE '13.

Data clustering: WSDM '12, KDD '12, CIKM '13, ...

Fast & scalable network centrality, and tensor eigenvalues via a power method: maximize \sum_{ijk} T_{ijk} x_i x_j x_k subject to ||x||_2 = 1, using the SSHOPM method due to Kolda and Mayo, [x^{(next)}]_i = \rho (\sum_{jk} T_{ijk} x_j x_k + x_i), where \rho ensures the 2-norm constraint.
Everything in the world can be explained by a matrix, and we see how deep the rabbit hole goes. The talk ends, you believe -- whatever you want to.

Image from rockysprings, deviantart, CC share-alike.
Matrix exponentials

A is n x n, real. exp(A) is defined as

    \exp(A) = \sum_{k=0}^{\infty} \frac{1}{k!} A^k     (always converges)

It is the evolution operator for an ODE:

    \frac{dx}{dt} = A x(t) \iff x(t) = \exp(tA)\, x(0)

This is a special case of a function of a matrix f(A); others are f(x) = 1/x, f(x) = \sinh(x), ...
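To make the ODE connection concrete, here is a small MATLAB check (my example, not from the talk; the matrix A, x0, and t are arbitrary choices) that the matrix-exponential solution matches a numerical integration:

% Sanity check: x(t) = expm(t*A)*x0 solves dx/dt = A*x
A  = [-2 1; 1 -3];                       % any real square matrix
x0 = [1; 0];
t  = 1.5;
xexp = expm(t*A) * x0;                   % matrix-exponential solution
[~, X] = ode45(@(s, x) A*x, [0 t], x0);  % numerical ODE solve
norm(xexp - X(end, :)', inf)             % small, up to ode45's tolerance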
This talk: a column of the matrix exponential

    x = \exp(P)\, e_c

x is the solution, P is the matrix, and e_c selects the column.
Matrix computations in a red-pill: solve a problem better by exploiting its structure!
This talk: a column of the matrix exponential

    x = \exp(P)\, e_c

x is the solution (localized); P is the matrix (large, sparse, stochastic); e_c selects the column.
Localized solutions

    x = \exp(P)\, e_c with nnz(x) = length(x) = 513,969

[Figure: plot(x) over all 513,969 entries, next to the approximation error as a function of the number of nonzeros retained; the error axis runs from 10^0 down to 10^-15 and the nonzeros axis from 10^0 to 10^6.]
Our mission: find the solution with work roughly proportional to the localization, not the matrix.
Our algorithm

www.cs.purdue.edu/homes/dgleich/codes/nexpokit

[Figure: error versus nonzeros for our algorithm; the error axis runs from 10^0 down to 10^-15 and the nonzeros axis from 10^0 to 10^6.]
Outline
1.  Motivation and setup
2.  Converting x = exp(P) ec into a linear system
3.  Relaxation methods for linear systems from large networks
4.  Error analysis
5.  Experiments
Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later. Cleve Moler and Charles Van Loan, SIAM Review, Vol. 45, No. 1, pp. 3–49. © 2003 Society for Industrial and Applied Mathematics.
Matrix exponentials on large networks

    \exp(A) = \sum_{k=0}^{\infty} \frac{1}{k!} A^k

If A is the adjacency matrix, then A^k counts the number of length-k paths between node pairs. Large entries denote important nodes or edges. Used for link prediction and centrality. [Estrada 2000, Farahat et al. 2002, 2006]

    \exp(P) = \sum_{k=0}^{\infty} \frac{1}{k!} P^k

If P is a transition matrix, then P^k gives the probability of a length-k walk between node pairs. Used for link prediction, kernels, and clustering or community detection. [Kondor & Lafferty 2002, Kunegis & Lommatzsch 2009, Chung 2007]
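A tiny illustration of the path-counting interpretation (my example, not from the talk): on the 4-node path graph 1-2-3-4, [A^2]_{13} = 1 because 1-2-3 is the unique length-2 path.

% Path counting on the path graph 1-2-3-4
A = diag(ones(3,1), 1); A = A + A';  % adjacency matrix
A2 = A^2;                            % A2(i,j) = number of length-2 paths from i to j
A2(1,3)                              % = 1: the single path 1-2-3
expm(A)                              % weighs all path lengths by 1/k!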
Another useful matrix exponential

P column stochastic, e.g. P = A^T D^{-1}, where A is the adjacency matrix and D is the diagonal matrix of degrees. If A is symmetric,

    \exp(P^T) = \exp(D^{-1} A) = D^{-1} \exp(A D^{-1})\, D = D^{-1} \exp(P)\, D
Another useful matrix exponential: the heat kernel of a graph

Again P = A^T D^{-1} with A the adjacency matrix. With L = I - D^{-1/2} A D^{-1/2} the normalized Laplacian, exp(-L) x(0) solves the heat equation dx(t)/dt = -L x(t) at t = 1. If A is symmetric,

    \exp(-L) = \exp(D^{-1/2} A D^{-1/2} - I)
             = \frac{1}{e} \exp(D^{-1/2} A D^{-1/2})
             = \frac{1}{e} D^{-1/2} \exp(A D^{-1})\, D^{1/2}
             = \frac{1}{e} D^{-1/2} \exp(P)\, D^{1/2}
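A quick numeric check of both similarity identities (a sketch I added; any small symmetric adjacency matrix with no isolated nodes works):

% Verify exp(P') = D^{-1} exp(P) D and exp(-L) = (1/e) D^{-1/2} exp(P) D^{1/2}
n = 20;
B = rand(n) < 0.2; A = double(B | B'); A(1:n+1:end) = 0;  % random symmetric adjacency
C = diag(ones(n-1,1), 1); A = max(A, C + C');             % overlay a path: no isolated nodes
d = sum(A, 2);
P = A' * diag(1 ./ d);                                    % column stochastic: P = A^T D^{-1}
L = eye(n) - diag(d.^-0.5) * A * diag(d.^-0.5);           % normalized Laplacian
norm(expm(P') - diag(1 ./ d) * expm(P) * diag(d), 1)                  % ~ 1e-14
norm(expm(-L) - exp(-1) * diag(d.^-0.5) * expm(P) * diag(d.^0.5), 1)  % ~ 1e-14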
Matrix exponentials on large networks

Is a single column interesting? Yes!

    \exp(P)\, e_c = \sum_{k=0}^{\infty} \frac{1}{k!} P^k e_c

It gives link-prediction scores for node c, or a community relative to node c. But modern networks are large ~ O(10^9) nodes, sparse ~ O(10^11) edges, and constantly changing ... and so we'd like speed over accuracy.
Newman's netscience collaboration network: 379 vertices, 1828 non-zeros. The vector e_c has a single one at the chosen node, and x = exp(P) e_c is "zero" on most nodes.
The issue with existing methods

We want good results in less than one matvec. Our graphs have small diameter and fast fill-in.

Krylov methods [Sidje 1998, ExpoKit]:

    \exp(P)\, e_c \approx \rho\, V \exp(H)\, e_1

A few matvecs; quick loss of sparsity due to orthogonality.

Direct expansion:

    \exp(P)\, e_c \approx \sum_{k=0}^{N} \frac{1}{k!} P^k e_c

A few matvecs; quick loss of sparsity due to fill-in.
Outline
1.  Motivation and setup ✓
2.  Converting x = exp(P) ec into a linear system
3.  Relaxation methods for linear systems from large networks
4.  Error analysis
5.  Experiments
Our underlying method

Direct expansion:

    x = \exp(P)\, e_c \approx \sum_{k=0}^{N} \frac{1}{k!} P^k e_c = x_N

A few matvecs; quick loss of sparsity due to fill-in. This method is stable for stochastic P: no cancellation, no unbounded norms, etc.

Lemma.

    \| x - x_N \|_1 \le \frac{1}{N!\, N}
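The lemma is a standard Taylor-tail estimate; here is a short derivation I filled in, using that \| P^k e_c \|_1 = 1 for column-stochastic P:

    \| x - x_N \|_1 \le \sum_{k=N+1}^{\infty} \frac{1}{k!} \| P^k e_c \|_1
                    = \sum_{k=N+1}^{\infty} \frac{1}{k!}
                    \le \frac{1}{(N+1)!} \sum_{j=0}^{\infty} \frac{1}{(N+2)^j}
                    = \frac{1}{(N+1)!} \cdot \frac{N+2}{N+1}
                    \le \frac{1}{N!\, N}

since N(N+2) \le (N+1)^2.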
Our underlying method as a linear system

Direct expansion:  x = \exp(P)\, e_c \approx \sum_{k=0}^{N} \frac{1}{k!} P^k e_c = x_N

    \begin{bmatrix}
    I    &      &        &      &   \\
    -P/1 & I    &        &      &   \\
         & -P/2 & \ddots &      &   \\
         &      & \ddots & I    &   \\
         &      &        & -P/N & I
    \end{bmatrix}
    \begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_N \end{bmatrix}
    =
    \begin{bmatrix} e_c \\ 0 \\ \vdots \\ 0 \end{bmatrix},
    \qquad
    x_N = \sum_{i=0}^{N} v_i

Compactly:  (I_{N+1} \otimes I - S_N \otimes P)\, v = e_1 \otimes e_c

Lemma. We approximate x_N well if we approximate v well.
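A sanity check I added: build the system explicitly with kron and confirm that summing the v_i reproduces the truncated Taylor series (here S_N is the (N+1) x (N+1) matrix with 1/1, ..., 1/N on the subdiagonal, matching the display above):

% Build (I - S_N \otimes P) v = e_1 \otimes e_c and recover x_N
n = 10; N = 5; c = 1;
B = rand(n) < 0.3; A = double(B | B'); A(1:n+1:end) = 0;
Cp = diag(ones(n-1,1), 1); A = max(A, Cp + Cp');  % keep every degree positive
P = A * diag(1 ./ sum(A,1)');                     % column stochastic (A symmetric)
S = diag(1 ./ (1:N), -1);                         % 1/k on the subdiagonal
M = eye((N+1)*n) - kron(S, P);                    % the block bidiagonal matrix
b = zeros((N+1)*n, 1); b(c) = 1;                  % e_1 \otimes e_c
v = M \ b;
xN = sum(reshape(v, n, N+1), 2);                  % x_N = sum_i v_i
xT = zeros(n,1); term = zeros(n,1); term(c) = 1;  % truncated Taylor series
for k = 0:N, xT = xT + term; term = P*term/(k+1); end
norm(xN - xT, 1)                                  % ~ 1e-16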
Our mission (2): approximately solve Ax = b when A, b are sparse and x is localized.
Outline
1.  Motivation and setup ✓
2.  Converting x = exp(P) ec into a linear system ✓
3.  Relaxation methods for linear systems from large networks
4.  Error analysis
5.  Experiments
Coordinate descent, Gauss-Southwell, Gauss-Seidel, relaxation & "push" methods

Be greedy. Don't look at the whole system; look at equations that are violated and try to fix them.
Coordinate descent, Gauss-Southwell, Gauss-Seidel, relaxation & "push" methods

Algebraically, to solve Ax = b:

    r^{(k)} = b - A x^{(k)}
    x^{(k+1)} = x^{(k)} + e_j e_j^T r^{(k)}
    r^{(k+1)} = r^{(k)} - r_j^{(k)} A e_j

Procedurally (the slide's pseudocode, made runnable; it assumes diag(A) = I so the residual update zeroes r(j), and a tolerance test replaces "while (1)"):

function x = relax_solve(A, b, tol)
x = zeros(size(A,1), 1);        % use a sparse/dictionary x in practice
r = b;                          % residual r = b - A*x
while norm(r, 1) > tol
    [~, j] = max(abs(r));       % pick j where r(j) != 0 (here: the largest entry)
    z = r(j);
    x(j) = x(j) + z;            % relax coordinate j
    I = find(A(:, j));          % for i where A(i,j) != 0 ...
    r(I) = r(I) - z * A(I, j);  % ... update r; zeroes r(j) since A(j,j) = 1
end
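A usage sketch (mine): on a system scaled so diag(A) = I with small off-diagonal mass, the loop drives the residual below the tolerance.

E = 0.3 * sprandn(100, 100, 0.05) / sqrt(100); E(1:101:end) = 0;
A = eye(100) + E;                 % near-identity system, diag(A) = I
b = zeros(100, 1); b(1) = 1;      % sparse right-hand side
x = relax_solve(A, b, 1e-8);
norm(b - A*x, 1)                  % <= 1e-8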
It's called the "push" method because of PageRank

Algebraically, to solve (I - \alpha P) x = v:

    r^{(k)} = v - (I - \alpha P) x^{(k)}
    x^{(k+1)} = x^{(k)} + e_j e_j^T r^{(k)}

and the update "r^{(k+1)} = r^{(k)} - r_j^{(k)} A e_j" becomes

    r_i^{(k+1)} = 0                                      if i = j
    r_i^{(k+1)} = r_i^{(k)} + \alpha P_{i,j} r_j^{(k)}   if P_{i,j} \ne 0
    r_i^{(k+1)} = r_i^{(k)}                              otherwise

Procedurally (the slide's PageRankPush(links,v,alpha) pseudocode, made runnable with the graph as an adjacency matrix, A(j,i) = 1 if j links to i, and a tolerance test in place of "while (1)"):

function x = pagerank_push(A, v, alpha, tol)
n = size(A, 1);
deg = full(sum(A, 2));        % out-degrees, assumed positive
x = zeros(n, 1);
r = v;                        % residual of (I - alpha*P) x = v at x = 0
while norm(r, 1) > tol
    [~, j] = max(r);          % pick j where r(j) != 0 (residuals stay nonnegative)
    z = r(j);
    x(j) = x(j) + z;          % "push" the residual mass into the solution
    r(j) = 0;
    z = alpha * z / deg(j);
    I = find(A(j, :));        % for i where j links to i
    r(I) = r(I) + z;          % spread alpha*z over the out-neighbors
end
It's called the "push" method because of PageRank. Demo.
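The live demo isn't captured in the transcript; as a minimal stand-in (my construction; any small directed graph works):

% Push-based PageRank on a ring with chord edges
n = 50;
A = circshift(eye(n), 1, 2) + circshift(eye(n), 7, 2);  % j -> j+1 and j -> j+7
v = zeros(n, 1); v(1) = 1;                              % seed at node 1
x = pagerank_push(A, v, 0.85, 1e-8);
[~, ids] = sort(x, 'descend'); ids(1:5)                 % top nodes cluster near the seed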
Justification of terminology

This method is frequently "rediscovered" (three times for PageRank!). Let Ax = b with diag(A) = I.
It's Gauss-Seidel if j is chosen cyclically.
It's Gauss-Southwell if j is the largest entry in the residual.
It's coordinate descent if A is symmetric, positive definite.
It's a relaxation step for any A.

Works great for other problems too! [Bonchi, Gleich, et al. J. Internet Math. 2012]
Back to the exponential

    \begin{bmatrix}
    I    &      &        &      &   \\
    -P/1 & I    &        &      &   \\
         & -P/2 & \ddots &      &   \\
         &      & \ddots & I    &   \\
         &      &        & -P/N & I
    \end{bmatrix}
    \begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_N \end{bmatrix}
    =
    \begin{bmatrix} e_c \\ 0 \\ \vdots \\ 0 \end{bmatrix},
    \qquad
    x_N = \sum_{i=0}^{N} v_i

    (I_{N+1} \otimes I - S_N \otimes P)\, v = e_1 \otimes e_c

Solve this system via the same method.
Optimization 1: build the system implicitly.
Optimization 2: don't store the v_i, just store the sum x_N.
Code (inefficient, but working) for Gauss-Southwell to solve it:

function x = nexpm(P,c,tol)
n = size(P,1); N = 11; sumr = 1;
r = zeros(n,N+1); r(c,1) = 1;            % the residual, one block per Taylor term
x = zeros(n,1);                          % the solution
while sumr >= tol                        % use a max-iteration test too
    [ml,q] = max(r(:));                  % use a heap in practice for the max
    i = mod(q-1,n)+1; k = ceil(q/n);     % residual index -> node i, block k
    r(q) = 0; x(i) = x(i) + ml;          % zero the residual, add to solution
    sumr = sumr - ml;
    [nset,~,vals] = find(P(:,i));        % look up the neighbors of node i
    ml = ml/k;
    for j = 1:numel(nset)                % for all neighbors
        if k == N                        % last block: add to the solution
            x(nset(j)) = x(nset(j)) + vals(j)*ml;
        else                             % otherwise add to the next residual block
            r(nset(j),k+1) = r(nset(j),k+1) + vals(j)*ml;
            sumr = sumr + vals(j)*ml;
        end
    end
end

Todo: use a dictionary for x, r and use a heap or queue for the residual.
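A usage sketch (mine): compare nexpm against a dense expm column on a small graph.

n = 200;
B = rand(n) < 0.05; A = double(B | B'); A(1:n+1:end) = 0;
Cp = diag(ones(n-1,1), 1); A = max(A, Cp + Cp');  % keep every degree positive
P = sparse(A * diag(1 ./ sum(A,1)'));             % column stochastic
xhat = nexpm(P, 1, 1e-6);
E = expm(full(P));
norm(xhat - E(:,1), 1)                            % small, on the order of tol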
Outline
1.  Motivation and setup ✓
2.  Converting x = exp(P) ec into a linear system ✓
3.  Relaxation methods for linear systems from large networks ✓
4.  Error analysis
5.  Experiments
Error analysis for Gauss-Southwell

For (I_{N+1} \otimes I - S_N \otimes P)\, v = e_1 \otimes e_c:

Theorem. Assume P is column-stochastic and v^{(0)} = 0.

(Nonnegativity, the "easy" part) The iterates and residuals are nonnegative: v^{(l)} \ge 0 and r^{(l)} \ge 0.

(Convergence, the "annoying" part) The residual goes to 0:

    \| r^{(l)} \|_1 \le \prod_{k=1}^{l} \left( 1 - \frac{1}{2dk} \right) \le l^{-1/(2d)}

where d is the largest degree.
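The final inequality is the usual log-of-product step; filling it in (my addition), with 1 - t \le e^{-t} and \sum_{k=1}^{l} 1/k \ge \ln l:

    \prod_{k=1}^{l} \left( 1 - \frac{1}{2dk} \right)
      \le \exp\left( -\sum_{k=1}^{l} \frac{1}{2dk} \right)
      \le \exp\left( -\frac{\ln l}{2d} \right)
      = l^{-1/(2d)}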
Proof sketch

Gauss-Southwell picks the largest residual
⇒ bound the update by the average number of nonzeros in the residual (sloppy)
⇒ algebraic convergence with a slow rate, but each update is REALLY fast: O(d_max log n).

If d is log log n, then our method runs in sub-linear time (but so does just about anything).
Overall error analysis

Theorem. After ℓ steps of Gauss-Southwell,

    \| x - x_N^{(\ell)} \|_1 \le \frac{1}{N!\, N} + \frac{1}{e}\, \ell^{-1/(2d)}

Components: truncation to N terms; residual to error; approximate solve.
More recent error analysis

Theorem (Gleich and Kloster, 2013, arXiv:1310.3423). Consider solving personalized PageRank using the Gauss-Southwell relaxation method in a graph with a Zipf law in the degrees with exponent p = 1 and max-degree d. Then the work involved in getting a solution with 1-norm error ε is

    \mathrm{work} = O\!\left( \log(\tfrac{1}{\varepsilon})\, \left(\tfrac{1}{\varepsilon}\right)^{3/2} d^2 (\log d)^2 \right)
Outline
1.  Motivation and setup ✓
2.  Converting x = exp(P) ec into a linear system ✓
3.  Relaxation methods for linear systems from large networks ✓
4.  Error analysis ✓
5.  Experiments
Our implementations

A C++ mex implementation with a heap to implement Gauss-Southwell, and a C++ mex implementation with a queue to store all residual entries ≥ 1/(tol nN). At completion, the residual norm ≤ tol. We use the queue except for the runtime comparison.
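The mex codes aren't reproduced here; a rough MATLAB sketch of the queue idea (my reading: process residual entries in FIFO order and only ever enqueue entries above a threshold of tol/(n*N), so the mass that is never processed totals roughly tol):

function x = nexpm_queue(P, c, tol)
n = size(P,1); N = 11; thresh = tol/(n*N);   % skipped mass <= ~tol in total
r = zeros(n,N+1); r(c,1) = 1;
x = zeros(n,1);
Q = c; head = 1;                             % FIFO of linear indices into r
while head <= numel(Q)
    q = Q(head); head = head + 1;
    ml = r(q); r(q) = 0;
    if ml == 0, continue; end                % drained by an earlier pass
    i = mod(q-1,n)+1; k = ceil(q/n);
    x(i) = x(i) + ml;
    [nset,~,vals] = find(P(:,i)); ml = ml/k;
    for j = 1:numel(nset)
        if k == N
            x(nset(j)) = x(nset(j)) + vals(j)*ml;
        else
            idx = k*n + nset(j);             % linear index of r(nset(j), k+1)
            if r(idx) < thresh && r(idx) + vals(j)*ml >= thresh
                Q(end+1) = idx;              % first crossing of the threshold: enqueue
            end
            r(idx) = r(idx) + vals(j)*ml;
        end
    end
end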
Accuracy vs. tolerance

For the pgp social graph (pgp-cc, 10k vertices), we study the precision in finding the 100 largest nodes as we vary the residual tolerance. This set of 100 does not include the node's immediate neighbors. (Boxplot over 50 trials.)

[Figure: precision at 100 versus log10 of the residual tolerance, from -2 to -7.]
Accuracy vs. work

For the dblp collaboration graph (dblp-cc, 225k vertices), we study the precision in finding the 100 largest nodes as we vary the work. This set of 100 does not include the node's immediate neighbors. (One column, but representative.)

[Figure: precision at 10, 25, 100, and 1000 versus effective matrix-vector products (10^-2 to 10^0), for tol = 10^-4 and tol = 10^-5.]
Runtime

Flickr social network: 500k nodes, 5M edges.

[Figure: runtime in seconds versus |E| + |V| (10^3 to 10^6) for TSGS, TSGSQ, EXPV, MEXPV, and TAYLOR.]
Outline
1.  Motivation and setup ✓
2.  Converting x = exp(P) ec into a linear system ✓
3.  Coordinate descent methods for linear systems from large networks ✓
4.  Error analysis ✓
5.  Experiments ✓
References and ongoing work

Kloster and Gleich, Workshop on Algorithms for the Web-graph, 2013. Also see the journal version on arXiv.
www.cs.purdue.edu/homes/dgleich/codes/nexpokit

•  Error analysis using the queue (almost done ...)
•  Better linear systems for faster convergence
•  Asynchronous coordinate descent methods
•  Scaling up to billion-node graphs (done ...)

www.cs.purdue.edu/homes/dgleich
Supported by NSF CAREER 1149756-CCF.