My invited talk at the 23rd International Symposium of Mathematical Programming (ISMP, 2018)

Mathematics Of Neural Networks
Anirbit
AMS
Johns Hopkins University

Outline
1 Introduction
2 An overview of our results about neural nets
What functions does a deep net represent?
Why can the deep net do dictionary learning?
3 Open questions

Introduction
This overview is based on the following 4 papers of ours,
ICML 2018 Workshop On Non-Convex Optimization (Not yet public)
“Convergence guarantees for RMSProp and ADAM in non-convex optimization
and their comparison to Nesterov acceleration on autoencoders”
https://eccc.weizmann.ac.il/report/2017/190/
“Lower bounds over Boolean inputs for deep neural networks with
ReLU gates”
https://arxiv.org/abs/1708.03735 (ISIT 2018)
“Sparse Coding and Autoencoders”
https://eccc.weizmann.ac.il/report/2017/098/ (ICLR 2018)
“Understanding Deep Neural Networks with Rectified Linear Units”

The collaborators!
These are works with Amitabh Basu (AMS, JHU)
and different subsets of,
Akshay Rangamani (ECE, JHU)
Soham De (CS, UMD)
Enayat Ullah (CS, JHU)
Tejaswini Ganapathy (Salesforce, San Francisco Bay Area)
Ashish Arora, Trac D. Tran (ECE, JHU)
Raman Arora, Poorya Mianjy (CS, JHU)
Sang (Peter) Chin (CS, BU)

What is a neural network?
The following diagram (imagine it as a directed acyclic graph where all
edges are pointing to the right) represents an instance of a “neural
network”.
Since there are no “weights” assigned to the edges of the above graph,
one should think of this as representing a certain class (set) of
$\mathbb{R}^4 \to \mathbb{R}^3$ functions which can be computed by the above
“architecture” for a *fixed* choice of “activation functions” (like
$\mathrm{ReLU}(x) = \max\{0, x\}$) at each of the blue nodes. The yellow
nodes are where the input vector comes in and the orange nodes are where
the output vector comes out.
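
To make the idea of "one architecture, many functions" concrete, here is a minimal NumPy sketch. The hidden widths (5 and 5) and the random weights are assumptions chosen only for illustration, since the diagram itself fixes only the 4 inputs and 3 outputs. Each choice of weights picks out one function from the class.

```python
import numpy as np

def relu(z):
    # ReLU(x) = max{0, x}, applied coordinate-wise
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Evaluate a feed-forward ReLU net; the final layer is affine (no gate)."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    return weights[-1] @ a + biases[-1]

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]  # hypothetical layer widths: 4 inputs, 3 outputs
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

y = forward(rng.standard_normal(4), weights, biases)
print(y.shape)
```

Re-sampling `weights` and `biases` gives a different member of the same function class.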

An overview of our results about neural nets
Formalizing the questions about neural nets
(1) Exact trainability of the nets
Theorem (Ours)
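Empirical risk minimization on 1-DNN with a convex loss, like
$$\min_{w_i, a_i, b_i, b} \frac{1}{S} \sum_{i=1}^{S} \left\| y_i - \sum_{p=1}^{\text{width}} a_p \max\{0, \langle w_p, x_i \rangle + b_p\} \right\|_2^2,$$
can be done in time $2^{\text{width}} \, S^{n \times \text{width}} \, \mathrm{poly}(n, S, \text{width})$.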
This is the *only* algorithm we are aware of which gets exact
global minima of the empirical risk of some net in time
polynomial in any of the parameters.
The possibility of a similar result for deeper networks or
ameliorating the dependency on width remains wildly
open!
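
As a small illustration of the objective in the theorem (not of the exact-minimization algorithm itself), the following sketch only evaluates the empirical risk of a 1-DNN with scalar outputs. The function name `empirical_risk` and the toy data are assumptions made for this example.

```python
import numpy as np

def empirical_risk(X, y, W, a, b):
    """(1/S) * sum_i (y_i - sum_p a_p * max{0, <w_p, x_i> + b_p})^2."""
    hidden = np.maximum(0.0, X @ W.T + b)  # shape (S, width)
    preds = hidden @ a                     # shape (S,)
    return float(np.mean((y - preds) ** 2))

# Toy data: S = 3 samples in n = 2 dimensions, width = 2
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0]])    # rows are the w_p
a = np.array([1.0, 1.0])
b = np.zeros(2)
y = np.array([1.0, 1.0, 0.0])

print(empirical_risk(X, y, W, a, b))
```

Here the chosen weights fit the toy labels exactly, so the risk is zero.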

(2) Structure discovery by the nets
Real-life data can be modeled as observations of some structured
distribution. One view of the success of neural nets is that they
can often be set up in such a way that they give a function to
optimize over which reveals this hidden structure at its
optima/critical points. In one classic scenario called “sparse
coding” we will show proofs that the net’s loss function has
certain nice properties which possibly help towards revealing
the hidden data generation model (the “dictionary”).

(3) The deep-net functions.
One of the themes that we have looked into a lot is to try to find
good descriptions of the functions that nets can compute.
Let us start with this last kind of question!

An overview of our results about neural nets: What functions does a deep net represent?
The questions about the function space
“The Big Question!”
Can one find a complete characterization of the neural functions
parametrized by architecture? No Clue!
Theorem (Ours)
A function $f : \mathbb{R}^n \to \mathbb{R}$ is continuous piecewise linear iff it is
representable by a ReLU deep net. Further, a ReLU deep net of
depth at most $1 + \lceil \log_2(n + 1) \rceil$ suffices to represent $f$. For
$n = 1$ there is also a sharp width lower bound.

A very small part of “The Big Question”
A simple (but somewhat surprising!) fact is the following,
Theorem (Ours)
1-DNN $\subsetneq$ 2-DNN and the following $\mathbb{R}^2 \to \mathbb{R}$ function
$(x_1, x_2) \mapsto \max\{0, x_1, x_2\}$ is in the gap.
Proof.
That 1-DNN $\subset$ 2-DNN is obvious. Now observe that any $\mathbb{R}^2 \to \mathbb{R}$
1-DNN function is non-differentiable on a union of lines (one line
along each ReLU gate’s argument) but the given function is
non-differentiable on a union of 3 half-lines. Hence proved!
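
The proof shows that the function is not in 1-DNN. One way to see that it is in 2-DNN is the identity below, a sketch that uses $x_1 = \max\{0, x_1\} - \max\{0, -x_1\}$ and $\max\{x_1, x_2\} = x_1 + \max\{0, x_2 - x_1\}$:

```latex
\max\{0, x_1, x_2\}
  = \max\bigl\{0,\; \max\{0, x_1\} - \max\{0, -x_1\} + \max\{0, x_2 - x_1\}\bigr\}
```

This uses three ReLU gates in the first layer and one in the second.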

A small part of “The Big Question” which is already unclear!
The family of 2-DNN functions is parameterized as follows by
(dimension compatible) choices of matrices $W_1, W_2$, vectors
$a, b_1, b_2$ and a number $b_3$,
$$f_{\text{2-DNN}}(x) = b_3 + \langle a, \max\{0, b_2 + W_2 \max\{0, b_1 + W_1 x\}\} \rangle$$
Can the $\mathbb{R}^4 \to \mathbb{R}$ function given as $x \mapsto \max\{0, x_1, x_2, x_3, x_4\}$ be
written in the above form?
(While it’s easy to see that $\max\{0, x_1, x_2, \ldots, x_{2^k}\} \in$ (k+1)-DNN)

Depth separation for R → R nets
Can one show neural functions at every depth such that lower depths
will necessarily require a much larger size to represent them?
Theorem (We generalize a result by Matus Telgarsky (UIUC))
$\forall k \in \mathbb{N}$, there exists a continuum of $\mathbb{R} \to \mathbb{R}$ neural net functions
of depth $1 + k^2$ (and size $k^3$) which needs size $\Omega(k^{k+1})$ for depths
$\le 1 + k$.
Here the basic intuition is that if one starts with a small depth
function which is oscillating then *without* blowing up the width too
much higher depths can be set up to recursively increase the number
of oscillations. And then such functions get very hard for the smaller
depths to even approximate in $\ell_1$ norm unless they blow up in size.
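
The oscillation-doubling intuition can be checked numerically. The sketch below uses the standard tent map on $[0, 1]$, written with two ReLU gates, as an assumed stand-in for the "small depth oscillating function"; it is not claimed to be the exact family used in the theorem. Composing it with itself adds one layer of width 2 each time and doubles the number of peaks.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    """Tent map on [0, 1] from two ReLU gates: 0 -> 0, 1/2 -> 1, 1 -> 0."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_peaks(depth, grid=2**12 + 1):
    """Count strict local maxima of the depth-fold composition on a dyadic grid."""
    y = np.linspace(0.0, 1.0, grid)
    for _ in range(depth):
        y = tent(y)
    interior = y[1:-1]
    return int(np.sum((interior > y[:-2]) & (interior > y[2:])))

peaks = [count_peaks(k) for k in range(1, 6)]
print(peaks)
```

Depth $k$ with only $2k$ gates already gives $2^{k-1}$ peaks, whereas a net with a single layer of ReLU gates on $\mathbb{R}$ has at most one breakpoint per gate.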

Separations for Boolean functions with one layer of gates
For real valued functions on the Boolean hypercube, is ReLU stronger
than the LTF activation? The best gap we know of is the following,

Theorem (Ours)
There is at least an $\Omega(n)$ gap between Sum-of-ReLU and
Sum-of-LTF
Proof.
This follows by looking at this function on the hypercube $\{0, 1\}^n$,
given as $f(x) = \sum_{i=1}^{n} 2^{i-1} x_i$. This has $2^n$ level sets on the discrete
cube and hence needs that many polyhedral cells to be produced by
the hyperplanes of the Sum-of-LTF circuit whereas being a linear
function it can be implemented by just 2 ReLU gates!
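
Both halves of this argument are easy to check for a small $n$. The sketch below (with $n = 4$ chosen arbitrarily) counts the distinct values of $f$ on the cube and verifies that two ReLU gates reproduce it, using $s = \max\{0, s\} - \max\{0, -s\}$.

```python
import numpy as np
from itertools import product

def relu(z):
    return np.maximum(0.0, z)

n = 4                                 # small illustrative dimension
w = 2.0 ** np.arange(n)               # weights 2^{i-1} for i = 1..n

def f_linear(x):
    return float(w @ x)

def f_two_relu(x):
    s = w @ x
    return float(relu(s) - relu(-s))  # a sum of 2 ReLU gates

cube = [np.array(bits, dtype=float) for bits in product([0, 1], repeat=n)]
values = {f_linear(x) for x in cube}  # one value per level set
agree = all(f_two_relu(x) == f_linear(x) for x in cube)
print(len(values), agree)
```

Every point of the cube gets its own value because $f$ reads off the binary expansion of an integer in $\{0, \ldots, 2^n - 1\}$.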

Now that we are done with the preliminaries, we move on to
the results which seem to need significantly more effort.

The *ideal* depth separation!
Can one show neural functions at every depth such that they
will necessarily require $\Omega(e^{\text{dimension}})$ size to be represented by
circuits of even one depth less? This is a major open question
and over real inputs this is currently known only between 2-DNN and
1-DNN from the works of Eldan-Shamir and Amit Daniely.
We go beyond small depth lower bounds in the following restricted sense,

Theorem (Ours)
There exist small depth 2 Boolean functions such that LTF-of-(ReLU)$^{d-1}$
circuits require size
$$\Omega\left( (d-1) \, \frac{2^{\frac{(\text{dimension})^{1/8}}{d-1}}}{\left((\text{dimension}) \, W\right)^{\frac{1}{d-1}}} \right)$$
when the bottom most layer weight vectors are such that their coordinates
are integers of size at most $W$ and that these weight vectors induce the
same ordering on the set $\{-1, 1\}^{\text{dimension}}$ when ranked by value of
the inner product with them.
(Note that all other weights are left completely free!)
This is achieved by showing that under the above restriction the
“sign-rank” is quadratically (in dimension) bounded for the functions
computed by such circuits, thought of as a matrix of dimension
$2^{\frac{\text{dimension}}{2}} \times 2^{\frac{\text{dimension}}{2}}$. (And we recall that small depth small size
functions are known which have exponentially large sign-rank.)

Separations for Boolean functions
Despite the result by Eldan-Shamir and Amit Daniely the curiosity
still remains as to how much more powerful LTF-of-ReLU-of-ReLU is than
LTF-of-ReLU for Boolean functions.
Theorem (Ours)
For any $\delta \in (0, \frac{1}{2})$, there exists $N(\delta) \in \mathbb{N}$ such that for all $n \ge N(\delta)$
and $\epsilon > \sqrt{\frac{2 \log^{\frac{2}{2-\delta}}(n)}{n}}$, any LTF-of-ReLU circuit on $n$ bits that
matches the Andreev function on $n$ bits for at least a $\frac{1}{2} + \epsilon$
fraction of the inputs has size $\Omega(\epsilon^{2(1-\delta)} n^{1-\delta})$.
This is proven by the “method of random restrictions” and in particular a very
recent version of it by Daniel Kane (UCSD) and Ryan Williams (MIT) based on
the Littlewood-Offord theorem.

An overview of our results about neural nets: Why can the deep net do dictionary learning?
What makes the deep net landscape special?
A fundamental challenge with deep nets is to explain why they are able
to solve so many diverse kinds of real-life learning problems. It
is a serious mathematical challenge to understand how the
deep net “sees” these as optimization questions.

For a net, say $N$, and a distribution $D$, let’s call its “landscape” ($L$)
corresponding to a “loss function ($\ell$)” (typically the squared-loss) as,
$$L(D, N) = \mathbb{E}_{(x, y) \in D}[\ell(y, N(x))]$$
Why is this L so often somehow a nice function to optimize on to solve a
question which a priori had nothing to do with nets?

Sparse coding
We isolate one special optimization question where we can attempt to
offer some mathematical explanation for this phenomenon.
“Sparse Coding” is a classic learning challenge where given access
to vectors $y = A^* x^*$ and some distributional (sparsity) guarantees
about $x^*$ we try to infer $A^*$. Breakthrough work by Spielman, Wang
and Wright (2012): This is sometimes provably doable in poly-time!
In this work we attempt to progress towards giving some rigorous
explanation for the observation that nets seem to solve sparse coding!

Sparse coding
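The defining equation of our autoencoder computing $\tilde{y} \in \mathbb{R}^n$
from $y \in \mathbb{R}^n$
The generative model: Sparse $x^* \in \mathbb{R}^h$ and $y = A^* x^* \in \mathbb{R}^n$
and $h \gg n$
$$h = \mathrm{ReLU}(W y - \epsilon) = \max\{0, W y - \epsilon\} \in \mathbb{R}^h$$
$$\tilde{y} = W^T h \in \mathbb{R}^n$$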
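
The two defining equations translate directly into a few lines of NumPy. In the sketch below the sizes `n`, `h`, `k`, the bias value `eps`, the Gaussian dictionary with unit-norm columns and the interval $[1, 2]$ for the non-zero coordinates are all assumptions made for illustration; only the encoder/decoder structure with tied weights comes from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, k = 20, 50, 3   # hypothetical: observed dim, code dim, sparsity
eps = 0.1             # hypothetical bias

# Generative model: y = A* x* with a sparse, non-negative x*
A_star = rng.standard_normal((n, h))
A_star /= np.linalg.norm(A_star, axis=0)      # unit-norm columns (assumed)
x_star = np.zeros(h)
support = rng.choice(h, size=k, replace=False)
x_star[support] = rng.uniform(1.0, 2.0, size=k)
y = A_star @ x_star

def autoencode(W, y, eps):
    hidden = np.maximum(0.0, W @ y - eps)     # h = ReLU(W y - eps)
    return W.T @ hidden                       # y_tilde = W^T h

W = A_star.T                                  # evaluate at W^T = A*
y_tilde = autoencode(W, y, eps)
loss = float(np.sum((y - y_tilde) ** 2))      # squared reconstruction error
print(y_tilde.shape, loss)
```

Averaging `loss` over many samples of `x_star` gives a Monte Carlo estimate of the landscape at a given `W`.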

The power of autoencoders is surprisingly easy to demonstrate!
Software : TensorFlow (with a complicated iterative technique
called “RMSProp” which we shall explain in the next slide!)

6000 training examples and 1000 testing examples for each digit
$n = 784$ and the number of ReLU gates was 10000 for the 1-DNN
and 5000 and 784 for the 2-DNN.

What exactly do algorithms like ADAM and RMSProp do?
Algorithm ADAM on a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$
function ADAM($x_1, \beta_1, \beta_2, \alpha, \xi$)
    Initialize: $m_0 = 0$, $v_0 = 0$
    for $t = 1, 2, \ldots$ do
        $g_t = \nabla f(x_t)$
        $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
        $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
        $V_t = \mathrm{diag}(v_t)$
        $x_{t+1} = x_t - \alpha_t \left( V_t^{\frac{1}{2}} + \mathrm{diag}(\xi 1_d) \right)^{-1} m_t$
    end for
end function
These “adaptive gradient” algorithms like ADAM (or RMSProp =
ADAM at β1 = 0) which seem to work the best on autoencoder
neural nets are currently very poorly understood!
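
The update rule above can be written in a few lines of NumPy. This is a minimal sketch that follows the algorithm box as stated, so there is no bias correction, and it makes two simplifying assumptions: a constant step size `alpha` in place of the schedule $\alpha_t$, and default parameter values that are commonly used but are not taken from the talk. The quadratic test function is also just an illustration.

```python
import numpy as np

def adam(grad, x1, beta1=0.9, beta2=0.999, alpha=0.01, xi=1e-8, steps=2000):
    """ADAM as in the algorithm box, with a constant step size (assumption)."""
    x = np.asarray(x1, dtype=float).copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)                            # g_t = grad f(x_t)
        m = beta1 * m + (1 - beta1) * g        # first-moment average
        v = beta2 * v + (1 - beta2) * g ** 2   # coordinate-wise second moment
        # (V_t^{1/2} + xi * I)^{-1} m_t is a coordinate-wise division
        x = x - alpha * m / (np.sqrt(v) + xi)
    return x

# Illustrative objective f(x) = 0.5 * ||x||^2, whose gradient is x
x_start = np.array([1.0, -2.0])
x_final = adam(lambda x: x, x_start)
print(np.linalg.norm(x_final))
```

Calling `adam(..., beta1=0.0)` gives the RMSProp special case mentioned above.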

Our experimental conclusions and proofs about ADAM
We have shown controlled experiments to suggest that for
large enough autoencoders standard methods possibly cannot
surpass ADAM’s ability to reduce training as well as test
losses, particularly when its parameters are set as $\beta_1 \sim 0.99$, in
both full-batch as well as mini-batch settings.
[Theorem] There exists a sequence of step-size choices and
ranges of values of $\xi$ and $\beta_1$ for which ADAM provably
converges to criticality with no convexity assumptions.
(The proof technique here might be of independent interest!)
Now let’s try to gain some mathematical control on the neural
net landscape - at least in the depth 2 case where RMSProp
and ADAM have almost similar performance.

Why can deep nets do sparse coding?
After laborious algebra (over months!) we can offer the following insight,

Theorem (Ours)
If the source sparse vectors $x^* \in \mathbb{R}^h$ are such that their non-zero
coordinates are sampled from an interval in $\mathbb{R}^+$ and they have a support of size
at most $h^p$ with $p < \frac{1}{2}$ and $A^* \in \mathbb{R}^{n \times h}$ is incoherent enough then a
constant $\epsilon$ can be chosen such that the autoencoder landscape,
$$\mathbb{E}_{y = A^* x^*}\left[ \left\| y - W^T \mathrm{ReLU}(W y - \epsilon) \right\|_2^2 \right]$$
is such that it is asymptotically (in $h$) critical in a neighbourhood of $A^*$.
Such criticality around the right answer is clearly a plausible reason why
gradient descent might find the right answer! Experiments in fact
suggest that asymptotically in $h$, $A^*$ might even be a global minimum
- but as of now we have no clue how to prove such a thing!

Open questions
Explain ADAM! Why is ADAM so good at minimizing the
generalization error on autoencoders? (and many other nets!)

Even for the specific case of sparse coding how to analyze all the
critical points of the landscape or even just (dis?)prove that the right
answer is a global minimum?

We have shown an example of a manifold of “high complexity” neural
functions. But in the space of deep net functions how dense are such
complex functions?

Can one exactly characterize the set of functions parameterized by
the architecture?

How to (dis?)prove the existence of dimension exponential gaps
between consecutive depths? (This isn’t clear even when restricted to
Boolean inputs and with unrestricted weights!)

Can the max of $2^k + 1$ numbers be taken using $k$ layers of ReLU gates?
(A negative answer immediately shows that with depth the deep net
function class strictly increases!)

Are there Boolean functions which have smaller representations using
ReLU gates than LTF gates? (A peculiarly puzzling question!)
