Mathematics Of Neural Networks
Anirbit
AMS
Johns Hopkins University
( AMS Johns Hopkins University ) 1 / 25
Outline
1 Introduction
2 An overview of our results about neural nets
What functions does a deep net represent?
Why can the deep net do dictionary learning?
3 Open questions
Introduction
This overview is based on the following 4 papers of ours,
“Convergence guarantees for RMSProp and ADAM in non-convex optimization and their comparison to Nesterov acceleration on autoencoders”
ICML 2018 Workshop On Non-Convex Optimization (not yet public)
“Lower bounds over Boolean inputs for deep neural networks with ReLU gates”
https://eccc.weizmann.ac.il/report/2017/190/
“Sparse Coding and Autoencoders”
https://arxiv.org/abs/1708.03735 (ISIT 2018)
“Understanding Deep Neural Networks with Rectified Linear Units”
https://eccc.weizmann.ac.il/report/2017/098/ (ICLR 2018)
The collaborators!
These are works with Amitabh Basu (AMS, JHU)
and different subsets of,
Akshay Rangamani (ECE, JHU)
Soham De (CS, UMD)
Enayat Ullah (CS, JHU)
Tejaswini Ganapathy (Salesforce, San Francisco Bay Area)
Ashish Arora, Trac D. Tran (ECE, JHU)
Raman Arora, Poorya Mianjy (CS, JHU)
Sang (Peter) Chin (CS, BU)
What is a neural network?
The following diagram (imagine it as a directed acyclic graph where all
edges point to the right) represents an instance of a “neural
network”.
Since there are no “weights” assigned to the edges of the above graph,
one should think of it as representing a class (set) of $\mathbb{R}^4 \to \mathbb{R}^3$
functions which can be computed by the above “architecture” for a
*fixed* choice of “activation functions” (like $\mathrm{ReLU}(x) = \max\{0, x\}$) at
each of the blue nodes. The yellow nodes are where the input vector
comes in and the orange nodes are where the output vector comes out.
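To make the "architecture versus function" distinction concrete, here is a minimal sketch in Python (NumPy). Fixing the weights picks one function out of the class. The hidden widths `[5, 5]` and the random weights are assumptions for illustration; only the input dimension 4 and output dimension 3 come from the slide.

```python
import numpy as np

def relu(z):
    # ReLU(x) = max{0, x}, applied coordinate-wise.
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Evaluate a feed-forward ReLU net; the last layer is affine (no activation)."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    return weights[-1] @ a + biases[-1]

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]  # hypothetical hidden widths; 4 inputs and 3 outputs as in the text
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.standard_normal(4)
y_out = forward(x, weights, biases)  # one particular R^4 -> R^3 function, evaluated at x
```

Changing `weights` and `biases` while keeping `sizes` fixed sweeps out the whole function class of this architecture.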
An overview of our results about neural nets
Formalizing the questions about neural nets
(1) Exact trainability of the nets
Theorem (Ours)
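Empirical risk minimization on 1-DNN with a convex loss, like
$$\min_{w_p, a_p, b_p} \frac{1}{S} \sum_{i=1}^{S} \Big\| y_i - \sum_{p=1}^{\mathrm{width}} a_p \max\{0, \langle w_p, x_i \rangle + b_p\} \Big\|_2^2,$$
can be done in time $2^{\mathrm{width}} \, S^{n \times \mathrm{width}} \, \mathrm{poly}(n, S, \mathrm{width})$.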
This is the *only* algorithm we are aware of which gets exact
global minima of the empirical risk of some net in time
polynomial in any of the parameters.
The possibility of a similar result for deeper networks or
ameliorating the dependency on width remains wildly
open!
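As a small illustration of the objective in the theorem (not of the exact training algorithm, which is not reproduced here), the sketch below evaluates the squared-loss empirical risk of a 1-DNN. The array shapes and the synthetic data are assumptions made for the example.

```python
import numpy as np

def one_dnn(X, W, b, a):
    """1-DNN output sum_p a_p * max{0, <w_p, x> + b_p} for every row x of X.

    X: (S, n) data, W: (width, n) gate weights, b: (width,) biases, a: (width,) output weights.
    """
    return np.maximum(0.0, X @ W.T + b) @ a

def empirical_risk(X, y, W, b, a):
    # (1/S) * sum_i (y_i - net(x_i))^2
    return np.mean((y - one_dnn(X, W, b, a)) ** 2)

rng = np.random.default_rng(1)
S, n, width = 20, 4, 3  # hypothetical sizes
X = rng.standard_normal((S, n))
W = rng.standard_normal((width, n))
b = rng.standard_normal(width)
a = rng.standard_normal(width)
y = one_dnn(X, W, b, a)  # labels generated by the net itself

risk_at_truth = empirical_risk(X, y, W, b, a)  # zero by construction
```

The theorem concerns finding the exact global minimum of `empirical_risk` over `(W, b, a)`; this snippet only shows what is being minimized.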
(2) Structure discovery by the nets
Real-life data can be modeled as observations from some structured
distribution. One view of the success of neural nets is that they
can often be set up so that the function they give us to optimize
reveals this hidden structure at its optima/critical points. In one
classic scenario called “sparse coding” we will show proofs that the
net’s loss function has certain nice properties which possibly help
reveal the hidden data generation model (the “dictionary”).
(3) The deep-net functions.
One of the themes that we have looked into a lot is to try to find
good descriptions of the functions that nets can compute.
Let us start with this last kind of question!
An overview of our results about neural nets What functions does a deep net represent?
The questions about the function space
“The Big Question!”
Can one find a complete characterization of the neural functions
parametrized by architecture? No Clue!
Theorem (Ours)
A function $f : \mathbb{R}^n \to \mathbb{R}$ is continuous piecewise linear iff it is
representable by a ReLU deep net. Further, a ReLU deep net of
depth at most $1 + \lceil \log_2(n + 1) \rceil$ suffices to represent $f$. For
$n = 1$ there is also a sharp width lower bound.
A very small part of “The Big Question”
A simple (but somewhat surprising!) fact is the following,
Theorem (Ours)
1-DNN $\subsetneq$ 2-DNN, and the following $\mathbb{R}^2 \to \mathbb{R}$ function
$(x_1, x_2) \mapsto \max\{0, x_1, x_2\}$ is in the gap.
Proof.
That 1-DNN $\subset$ 2-DNN is obvious. Now observe that any $\mathbb{R}^2 \to \mathbb{R}$
1-DNN function is non-differentiable on a union of lines (one line
along each ReLU gate’s argument) but the given function is
non-differentiable on a union of 3 half-lines. Hence proved!
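The membership half of the claim (that the function lies in 2-DNN) can be checked by hand. The sketch below writes $\max\{0, x_1, x_2\}$ with two hidden layers of ReLU gates, using the identities $\max\{x_1, x_2\} = x_2 + \mathrm{ReLU}(x_1 - x_2)$ and $x_2 = \mathrm{ReLU}(x_2) - \mathrm{ReLU}(-x_2)$. The particular weight matrices are one possible choice, not necessarily the one used in the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# First hidden layer: (ReLU(x1 - x2), ReLU(x2), ReLU(-x2)).
W1 = np.array([[1.0, -1.0],
               [0.0, 1.0],
               [0.0, -1.0]])
# Second hidden layer: ReLU(ReLU(x1 - x2) + ReLU(x2) - ReLU(-x2)) = ReLU(max{x1, x2}).
W2 = np.array([[1.0, 1.0, -1.0]])

def max0_two_layers(x):
    h1 = relu(W1 @ x)
    h2 = relu(W2 @ h1)
    return h2[0]  # output layer is the identity on the single second-layer gate

rng = np.random.default_rng(0)
points = rng.standard_normal((1000, 2))
net_values = np.array([max0_two_layers(p) for p in points])
true_values = np.maximum(0.0, points.max(axis=1))
```

The non-membership half (that no single hidden layer suffices) is the half-lines argument in the proof and is not something a numerical check can establish.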
A small part of “The Big Question” which is already unclear!
The family of 2-DNN functions is parameterized as follows by
(dimension compatible) choices of matrices $W_1, W_2$, vectors
$a, b_1, b_2$ and a number $b_3$,
$$f_{\text{2-DNN}}(x) = b_3 + \big\langle a, \max\{0, b_2 + W_2 \max\{0, b_1 + W_1 x\}\} \big\rangle$$
Can the $\mathbb{R}^4 \to \mathbb{R}$ function given as $x \mapsto \max\{0, x_1, x_2, x_3, x_4\}$ be
written in the above form?
(While it's easy to see that $\max\{0, x_1, x_2, \ldots, x_{2^k}\} \in$ (k+1)-DNN)
Depth separation for R → R nets
Can one show neural functions at every depth such that lower depths
will necessarily require a much larger size to represent them?
Theorem (We generalize a result by Matus Telgarsky (UIUC))
$\forall k \in \mathbb{N}$, there exists a continuum of $\mathbb{R} \to \mathbb{R}$ neural net functions
of depth $1 + k^2$ (and size $k^3$) which need size $\Omega(k^{k+1})$ for depths
$\le 1 + k$.
Here the basic intuition is that if one starts with a small depth function
which is oscillating then, *without* blowing up the width too
much, higher depths can be set up to recursively increase the number
of oscillations. Such functions then get very hard for the smaller
depths to even approximate in $\ell_1$ norm unless they blow up in size.
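The oscillation-doubling intuition can be seen with the standard tent map, which is a swapped-in textbook example in the spirit of Telgarsky's construction and not the exact function family of the theorem. Each layer below uses two ReLU gates; composing `depth` layers multiplies the number of linear pieces on $[0, 1]$ by two per layer. The grid resolution `grid_bits` is an assumption of this sketch.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # One hidden layer with two ReLU gates; on [0, 1] this equals 2x for x <= 1/2
    # and 2 - 2x for x >= 1/2.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def compose(x, depth):
    # Stack `depth` copies of the tent layer.
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(depth, grid_bits=12):
    # Dyadic grid, so every breakpoint of the composed map (for depth <= grid_bits)
    # is itself a grid point and each grid cell lies inside one linear piece.
    x = np.linspace(0.0, 1.0, 2 ** grid_bits + 1)
    slopes = np.sign(np.diff(compose(x, depth)))
    return 1 + int(np.sum(slopes[1:] != slopes[:-1]))

pieces = [count_linear_pieces(k) for k in range(1, 7)]
```

A one-hidden-layer net on $\mathbb{R}$ with $w$ ReLU gates has at most $w + 1$ linear pieces, which is why matching the deep sawtooth with a shallow net forces the size up.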
Separations for Boolean functions with one layer of gates
For real valued functions on the Boolean hypercube, is ReLU stronger
than the LTF activation? The best gap we know of is the following,
Theorem (Ours)
There is at least an $\Omega(n)$ gap between Sum-of-ReLU and
Sum-of-LTF.
Proof.
This follows by looking at the function on the hypercube $\{0, 1\}^n$
given as $f(x) = \sum_{i=1}^{n} 2^{i-1} x_i$. This has $2^n$ level sets on the discrete
cube and hence needs that many polyhedral cells to be produced by
the hyperplanes of the Sum-of-LTF circuit, whereas being a linear
function it can be implemented by just 2 ReLU gates!
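Both facts used in the proof are easy to check numerically for a small dimension. The sketch below (with a hypothetical $n = 4$) enumerates the hypercube, counts the distinct values of $f$, and confirms that two ReLU gates reproduce the linear function through $z = \mathrm{ReLU}(z) - \mathrm{ReLU}(-z)$.

```python
import itertools
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

n = 4  # small hypothetical dimension so the cube can be enumerated
coeffs = 2.0 ** np.arange(n)  # 2^{i-1} for i = 1, ..., n
cube = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

f_values = cube @ coeffs  # f(x) = sum_i 2^{i-1} x_i on every vertex
num_level_sets = len(np.unique(f_values))

# Sum of two ReLU gates computing the same linear function.
two_relu_values = relu(cube @ coeffs) - relu(-(cube @ coeffs))
```

Since $f$ reads the input as a binary number, every vertex of the cube gets its own value, which is where the $2^n$ count comes from.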
Now that we are done with the preliminaries, we move on to
the results which seem to need significantly more effort.
The *ideal* depth separation!
Can one show neural functions at every depth such that they
will necessarily require $\Omega(e^{\text{dimension}})$ size to be represented by
circuits of even one depth less? This is a major open question,
and over real inputs this is currently known only between 2-DNN and
1-DNN from the works of Eldan-Shamir and Amit Daniely.
We go beyond small depth lower bounds in the following restricted sense,
Theorem (Ours)
There exist small depth-2 Boolean functions such that LTF-of-(ReLU)$^{d-1}$
circuits require size
$$\Omega\left( (d-1) \, \frac{2^{\frac{(\text{dimension})^{1/8}}{d-1}}}{\big((\text{dimension})\, W\big)^{\frac{1}{d-1}}} \right)$$
when the bottom most
layer weight vectors are such that their coordinates are integers of size at
most $W$ and these weight vectors induce the same ordering on the set
$\{-1, 1\}^{\text{dimension}}$ when ranked by the value of the inner product with them.
(Note that all other weights are left completely free!)
This is achieved by showing that under the above restriction the
“sign-rank” is quadratically (in dimension) bounded for the functions
computed by such circuits, thought of as a matrix of dimension
$2^{\frac{\text{dimension}}{2}} \times 2^{\frac{\text{dimension}}{2}}$. (And we recall that small depth small size
functions are known which have exponentially large sign-rank.)
Separations for Boolean functions
Despite the results by Eldan-Shamir and Amit Daniely, the curiosity
still remains as to how much more powerful LTF-of-ReLU-of-ReLU is than
LTF-of-ReLU for Boolean functions.
Theorem (Ours)
For any $\delta \in (0, \frac{1}{2})$, there exists $N(\delta) \in \mathbb{N}$ such that for all $n \ge N(\delta)$
and $\epsilon > \sqrt{\frac{2 \log^{\frac{2}{2-\delta}}(n)}{n}}$, any LTF-of-ReLU circuit on $n$ bits that
matches the Andreev function on $n$ bits for at least a $\frac{1}{2} + \epsilon$
fraction of the inputs has size $\Omega(\epsilon^{2(1-\delta)} n^{1-\delta})$.
This is proven by the “method of random restrictions” and in particular a very
recent version of it by Daniel Kane (UCSD) and Ryan Williams (MIT) based on
the Littlewood-Offord theorem.
An overview of our results about neural nets Why can the deep net do dictionary learning?
What makes the deep net landscape special?
A fundamental challenge with deep nets is to explain why they are
able to solve so many diverse kinds of real-life learning problems. It
is a serious mathematical challenge to understand how the
deep net “sees” these as optimization questions.
For a net $N$ and a distribution $\mathcal{D}$, let's call its “landscape” ($\mathcal{L}$)
corresponding to a “loss function ($\ell$)” (typically the squared loss) as,
$$\mathcal{L}(\mathcal{D}, N) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(y, N(x))]$$
Why is this $\mathcal{L}$ so often a nice function to optimize in order to solve a
question which a priori had nothing to do with nets?
Sparse coding
We isolate one special optimization question where we can attempt to
offer some mathematical explanation for this phenomenon.
“Sparse Coding” is a classic learning challenge where, given access
to vectors $y = A^* x^*$ and some distributional (sparsity) guarantees
about $x^*$, we try to infer $A^*$. Breakthrough work by Spielman, Wang
and Wright (2012): this is sometimes provably doable in poly-time!
In this work we attempt to progress towards giving some rigorous
explanation for the observation that nets seem to solve sparse coding!
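For readers who want to experiment, the sparse coding setup can be simulated as below. This is a minimal sketch under assumptions that are not from the slides: the dictionary is Gaussian with unit-norm columns, every code has exactly `k` non-zero entries, and those entries are uniform on a positive interval `[low, high]`. The helper name `sample_sparse_codes` is hypothetical.

```python
import numpy as np

def sample_sparse_codes(num, h, k, low, high, rng):
    """Return (num, h) codes; each row has exactly k non-zero entries in [low, high]."""
    X = np.zeros((num, h))
    for row in X:  # each `row` is a view, so writes land in X
        support = rng.choice(h, size=k, replace=False)
        row[support] = rng.uniform(low, high, size=k)
    return X

rng = np.random.default_rng(0)
n, h, k = 50, 100, 3  # hypothetical sizes: observed dimension, code dimension, sparsity

A_star = rng.standard_normal((n, h))
A_star /= np.linalg.norm(A_star, axis=0)  # normalize the dictionary columns

X_star = sample_sparse_codes(200, h, k, 1.0, 2.0, rng)
Y = X_star @ A_star.T  # each row is one observation y = A* x*
```

The learning problem is then to recover `A_star` (up to the usual symmetries) from `Y` alone.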
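The defining equation of our autoencoder computing $\tilde{y} \in \mathbb{R}^n$
from $y \in \mathbb{R}^n$
The generative model: sparse $x^* \in \mathbb{R}^h$ and $y = A^* x^* \in \mathbb{R}^n$ and $h \gg n$
$$h = \mathrm{ReLU}(W y - \epsilon) = \max\{0, W y - \epsilon\} \in \mathbb{R}^h$$
$$\tilde{y} = W^T h \in \mathbb{R}^n$$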
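The two defining equations translate directly into code. In the sketch below the hidden vector is called `hidden` to avoid the clash with the dimension $h$, the bias is treated as a scalar `eps` (an assumption of this sketch), and the squared reconstruction loss is the per-sample quantity whose expectation gives the landscape.

```python
import numpy as np

def autoencoder(y, W, eps):
    """Weight-tied ReLU autoencoder: hidden = ReLU(W y - eps), output = W^T hidden.

    y: (n,) input, W: (h, n) weights, eps: scalar bias (assumed scalar in this sketch).
    """
    hidden = np.maximum(0.0, W @ y - eps)
    return W.T @ hidden

def reconstruction_loss(y, W, eps):
    # Squared l2 distance between the input and its reconstruction.
    return np.sum((y - autoencoder(y, W, eps)) ** 2)

y_example = np.array([1.0, 2.0, 3.0])
W_example = np.eye(3)  # toy weights; in the sparse coding setting W would have h >> n rows
y_tilde = autoencoder(y_example, W_example, 0.0)
loss_with_bias = reconstruction_loss(y_example, W_example, 0.5)
```

With identity weights and zero bias a non-negative input is reproduced exactly; a positive bias shrinks every active coordinate by `eps`.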
The power of autoencoders is surprisingly easy to demonstrate!
Software: TensorFlow (with a complicated iterative technique
called “RMSProp” which we shall explain in the next slide!)
6000 training examples and 1000 testing examples for each digit.
$n = 784$ and the number of ReLU gates was 10000 for the 1-DNN
and 5000 and 784 for the 2-DNN.
What exactly do algorithms like ADAM and RMSProp do?
Algorithm ADAM on a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$
function ADAM($x_1, \beta_1, \beta_2, \alpha, \xi$)
    Initialize: $m_0 = 0$, $v_0 = 0$
    for $t = 1, 2, \ldots$ do
        $g_t = \nabla f(x_t)$
        $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
        $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
        $V_t = \mathrm{diag}(v_t)$
        $x_{t+1} = x_t - \alpha_t \big(V_t^{1/2} + \mathrm{diag}(\xi \mathbf{1}_d)\big)^{-1} m_t$
    end for
end function
These “adaptive gradient” algorithms like ADAM (or RMSProp =
ADAM at β1 = 0) which seem to work the best on autoencoder
neural nets are currently very poorly understood!
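The pseudocode above can be transcribed almost line by line. The sketch below assumes a constant step size `alpha` in place of the schedule $\alpha_t$, a fixed number of iterations `num_steps`, and a user-supplied gradient oracle `grad`; like the slide's version it has no bias-correction step. Since $V_t$ is diagonal, applying $(V_t^{1/2} + \mathrm{diag}(\xi \mathbf{1}_d))^{-1}$ is a coordinate-wise division.

```python
import numpy as np

def adam(grad, x1, beta1, beta2, alpha, xi, num_steps):
    """ADAM as in the pseudocode above, with a constant step size (an assumption)."""
    x = np.asarray(x1, dtype=float).copy()
    m = np.zeros_like(x)  # m_0 = 0
    v = np.zeros_like(x)  # v_0 = 0
    for _ in range(num_steps):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        # (V^{1/2} + xi * I)^{-1} m, computed coordinate-wise.
        x = x - alpha * m / (np.sqrt(v) + xi)
    return x

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = adam(lambda x: x, x1=[3.0, -2.0], beta1=0.9, beta2=0.999,
               alpha=0.01, xi=1e-8, num_steps=2000)
```

Calling the same function with `beta1=0.0` gives the RMSProp special case mentioned above.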
Our experimental conclusions and proofs about ADAM
We have shown controlled experiments to suggest that for
large enough autoencoders standard methods possibly cannot
surpass ADAM’s ability to reduce training as well as test
losses, particularly when its parameter is set as $\beta_1 \sim 0.99$, in
both full-batch and mini-batch settings.
[Theorem] There exists a sequence of step-size choices and
ranges of values of $\xi$ and $\beta_1$ for which ADAM provably
converges to criticality with no convexity assumptions.
(The proof technique here might be of independent interest!)
Now let's try to gain some mathematical control on the neural
net landscape - at least in the depth 2 case where RMSProp
and ADAM have almost similar performance.
Why can deep nets do sparse coding?
After laborious algebra (over months!) we can offer the following insight,
Theorem (Ours)
If the source sparse vectors $x^* \in \mathbb{R}^h$ are such that their non-zero
coordinates are sampled from an interval in $\mathbb{R}^+$ and they have a support of size
at most $h^p$ with $p < \frac{1}{2}$, and $A^* \in \mathbb{R}^{n \times h}$ is incoherent enough, then a
constant $\epsilon$ can be chosen such that the autoencoder landscape,
$$\mathbb{E}_{y = A^* x^*}\big[ \| y - W^T \mathrm{ReLU}(W y - \epsilon) \|_2^2 \big]$$
is asymptotically (in $h$) critical in a neighbourhood of $A^*$.
Such criticality around the right answer is clearly a plausible reason why
gradient descent might find the right answer! Experiments in fact
suggest that asymptotically in $h$, $A^*$ might even be a global minimum
- but as of now we have no clue how to prove such a thing!
Open questions
Explain ADAM! Why is ADAM so good at minimizing the
generalization error on autoencoders? (and many other nets!)
Even for the specific case of sparse coding, how does one analyze all the
critical points of the landscape or even just (dis?)prove that the right
answer is a global minimum?
We have shown an example of a manifold of “high complexity” neural
functions. But in the space of deep net functions how dense are such
complex functions?
Can one exactly characterize the set of functions parameterized by
the architecture?
How does one (dis?)prove the existence of dimension exponential gaps
between consecutive depths? (This isn’t clear even when restricted to
Boolean inputs and with unrestricted weights!)
Can the max of $2^k + 1$ numbers be taken using $k$ layers of ReLU gates?
(A negative answer immediately shows that with depth the deep net
function class strictly increases!)
Are there Boolean functions which have smaller representations using
ReLU gates than LTF gates? (A peculiarly puzzling question!)
Similar to My invited talk at the 23rd International Symposium of Mathematical Programming (ISMP, 2018) (20)

PPT
tutorial.ppt
PPTX
Neural networks and deep learning
PDF
20141003.journal club
PPTX
Java and Deep Learning
PPTX
Java and Deep Learning (Introduction)
PPTX
[PR12] Inception and Xception - Jaejun Yoo
PDF
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
PDF
DEEPLEARNING recurrent neural networs.pdf
PDF
Talk at MIT, Maths on deep neural networks
PPS
Neural Networks Ver1
PPT
ADFUNN
PDF
nlp dl 1.pdf
PPTX
Machine Learning - Neural Networks - Perceptron
PPTX
Machine Learning - Introduction to Neural Networks
PDF
Deep Learning for Computer Vision: Deep Networks (UPC 2016)
PPTX
CNN for modeling sentence
PPTX
INTRODUCTION TO NEURAL NETWORKS FINAL YEAR
PDF
A Survey of Deep Learning Algorithms for Malware Detection
PPT
ai7.ppt
PDF
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
tutorial.ppt
Neural networks and deep learning
20141003.journal club
Java and Deep Learning
Java and Deep Learning (Introduction)
[PR12] Inception and Xception - Jaejun Yoo
Understanding Deep Learning & Parameter Tuning with MXnet, H2o Package in R
DEEPLEARNING recurrent neural networs.pdf
Talk at MIT, Maths on deep neural networks
Neural Networks Ver1
ADFUNN
nlp dl 1.pdf
Machine Learning - Neural Networks - Perceptron
Machine Learning - Introduction to Neural Networks
Deep Learning for Computer Vision: Deep Networks (UPC 2016)
CNN for modeling sentence
INTRODUCTION TO NEURAL NETWORKS FINAL YEAR
A Survey of Deep Learning Algorithms for Malware Detection
ai7.ppt
X-TREPAN: A MULTI CLASS REGRESSION AND ADAPTED EXTRACTION OF COMPREHENSIBLE D...
Ad

Recently uploaded (20)

PDF
Telemedicine: Transforming Healthcare Delivery in Remote Areas (www.kiu.ac.ug)
PDF
The Physiology Of The Red Blood Cells pdf
PDF
CHEM - GOC general organic chemistry.ppt
PDF
From Molecular Interactions to Solubility in Deep Eutectic Solvents: Explorin...
PDF
Sujay Rao Mandavilli IJISRT25AUG764 context based approaches to population ma...
PPTX
Cells and Organs of the Immune System (Unit-2) - Majesh Sir.pptx
PPTX
Toxicity Studies in Drug Development Ensuring Safety, Efficacy, and Global Co...
PPTX
ELISA(Enzyme linked immunosorbent assay)
PDF
Chemistry and Changes 8th Grade Science .pdf
PPTX
diabetes and its complications nephropathy neuropathy
PDF
ECG Practice from Passmedicine for MRCP Part 2 2024.pdf
PDF
Social preventive and pharmacy. Pdf
PPTX
Introduction to Immunology (Unit-1).pptx
PDF
2019UpdateAHAASAAISGuidelineSlideDeckrevisedADL12919.pdf
PDF
Sumer, Akkad and the mythology of the Toradja Sa'dan.pdf
PPTX
Basic principles of chromatography techniques
PPTX
CELL DIVISION Biology meiosis and mitosis
PDF
Sustainable Biology- Scopes, Principles of sustainiability, Sustainable Resou...
PPT
Chapter 6 Introductory course Biology Camp
PDF
The Future of Telehealth: Engineering New Platforms for Care (www.kiu.ac.ug)
Telemedicine: Transforming Healthcare Delivery in Remote Areas (www.kiu.ac.ug)
The Physiology Of The Red Blood Cells pdf
CHEM - GOC general organic chemistry.ppt
From Molecular Interactions to Solubility in Deep Eutectic Solvents: Explorin...
Sujay Rao Mandavilli IJISRT25AUG764 context based approaches to population ma...
Cells and Organs of the Immune System (Unit-2) - Majesh Sir.pptx
Toxicity Studies in Drug Development Ensuring Safety, Efficacy, and Global Co...
ELISA(Enzyme linked immunosorbent assay)
Chemistry and Changes 8th Grade Science .pdf
diabetes and its complications nephropathy neuropathy
ECG Practice from Passmedicine for MRCP Part 2 2024.pdf
Social preventive and pharmacy. Pdf
Introduction to Immunology (Unit-1).pptx
2019UpdateAHAASAAISGuidelineSlideDeckrevisedADL12919.pdf
Sumer, Akkad and the mythology of the Toradja Sa'dan.pdf
Basic principles of chromatography techniques
CELL DIVISION Biology meiosis and mitosis
Sustainable Biology- Scopes, Principles of sustainiability, Sustainable Resou...
Chapter 6 Introductory course Biology Camp
The Future of Telehealth: Engineering New Platforms for Care (www.kiu.ac.ug)

My invited talk at the 23rd International Symposium of Mathematical Programming (ISMP, 2018)

  • 1. Mathematics Of Neural Networks Anirbit AMS Johns Hopkins University ( AMS Johns Hopkins University ) 1 / 25
  • 2. Outline
      1. Introduction
      2. An overview of our results about neural nets
         What functions does a deep net represent?
         Why can the deep net do dictionary learning?
      3. Open questions
  • 4. Introduction. This overview is based on the following 4 papers of ours:
      “Convergence guarantees for RMSProp and ADAM in non-convex optimization and their comparison to Nesterov acceleration on autoencoders”, ICML 2018 Workshop On Non-Convex Optimization (not yet public)
      “Lower bounds over Boolean inputs for deep neural networks with ReLU gates”, https://eccc.weizmann.ac.il/report/2017/190/
      “Sparse Coding and Autoencoders”, https://arxiv.org/abs/1708.03735 (ISIT 2018)
      “Understanding Deep Neural Networks with Rectified Linear Units”, https://eccc.weizmann.ac.il/report/2017/098/ (ICLR 2018)
  • 5. The collaborators! These are works with Amitabh Basu (AMS, JHU) and different subsets of:
      Akshay Rangamani (ECE, JHU)
      Soham De (CS, UMD)
      Enayat Ullah (CS, JHU)
      Tejaswini Ganapathy (Salesforce, San Francisco Bay Area)
      Ashish Arora, Trac D. Tran (ECE, JHU)
      Raman Arora, Poorya Mianjy (CS, JHU)
      Sang (Peter) Chin (CS, BU)
  • 6. What is a neural network? The following diagram (imagine it as a directed acyclic graph where all edges point to the right) represents an instance of a “neural network”. Since there are no “weights” assigned to the edges of the above graph, one should think of this as representing a certain class (set) of $\mathbb{R}^4 \to \mathbb{R}^3$ functions which can be computed by the above “architecture” for a *fixed* choice of “activation functions” (like $\text{ReLU}(x) = \max\{0, x\}$) at each of the blue nodes. The yellow nodes are where the input vector comes in and the orange nodes are where the output vector comes out.
  • 8. An overview of our results about neural nets. Formalizing the questions about neural nets: (1) Exact trainability of the nets. Theorem (Ours): Empirical risk minimization on a 1-DNN with a convex loss, like $\min_{w_i, a_i, b_i, b} \frac{1}{S} \sum_{i=1}^{S} \left\| y_i - \sum_{p=1}^{\text{width}} a_p \max\{0, \langle w_p, x_i \rangle + b_p\} \right\|_2^2$, can be done in time $2^{\text{width}} \, S^{n \times \text{width}} \, \text{poly}(n, S, \text{width})$. This is the *only* algorithm we are aware of which gets exact global minima of the empirical risk of some net in time polynomial in any of the parameters. The possibility of a similar result for deeper networks, or of ameliorating the dependency on width, remains wide open!
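To make the objective in the theorem concrete, here is a minimal Python sketch that only evaluates the squared-loss empirical risk of a 1-DNN for given parameters. It is not the exact training algorithm of the theorem, and the function names and toy data are assumptions chosen for illustration.

```python
def relu(z):
    return max(0.0, z)


def one_dnn(x, W, b, a):
    """Evaluate sum_p a_p * max{0, <w_p, x> + b_p} for a single input x."""
    return sum(
        a_p * relu(sum(w * xi for w, xi in zip(w_p, x)) + b_p)
        for w_p, b_p, a_p in zip(W, b, a)
    )


def empirical_risk(X, y, W, b, a):
    """(1/S) * sum_i (y_i - f(x_i))^2 over the S training points."""
    S = len(X)
    return sum((y_i - one_dnn(x_i, W, b, a)) ** 2 for x_i, y_i in zip(X, y)) / S


# Toy data (hypothetical): y = |x| on the real line, which a width-2 net fits exactly
# because |x| = max{0, x} + max{0, -x}.
X = [[-2.0], [-1.0], [0.0], [1.0], [3.0]]
y = [2.0, 1.0, 0.0, 1.0, 3.0]
W = [[1.0], [-1.0]]
b = [0.0, 0.0]
print(empirical_risk(X, y, W, b, a=[1.0, 1.0]))  # exact fit
print(empirical_risk(X, y, W, b, a=[1.0, 0.0]))  # second gate switched off
```

Here `width` is the number of rows of `W`, `n` is the length of each input and `S` is the number of data points, matching the symbols in the running-time bound.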
  • 9. Formalizing the questions about neural nets: (2) Structure discovery by the nets. Real-life data can be modeled as observations of some structured distribution. One view of the success of neural nets is that they can often be set up so that the function they give us to optimize reveals this hidden structure at its optima/critical points. In one classic scenario called “sparse coding” we will show proofs that the net’s loss function has certain nice properties which possibly help reveal the hidden data generation model (the “dictionary”).
  • 11. Formalizing the questions about neural nets: (3) The deep-net functions. One of the themes that we have looked into a lot is finding good descriptions of the functions that nets can compute. Let us start with this last kind of question!
  • 14. What functions does a deep net represent? The questions about the function space: “The Big Question!” Can one find a complete characterization of the neural functions parametrized by architecture? No clue! Theorem (Ours): A function $f : \mathbb{R}^n \to \mathbb{R}$ is continuous piecewise linear iff it is representable by a ReLU deep net. Further, a ReLU deep net of depth at most $1 + \lceil \log_2(n + 1) \rceil$ is required to represent $f$. For $n = 1$ there is also a sharp width lower bound.
  • 16. The questions about the function space: a very small part of “The Big Question”. A simple (but somewhat surprising!) fact is the following. Theorem (Ours): $\text{1-DNN} \subsetneq \text{2-DNN}$, and the following $\mathbb{R}^2 \to \mathbb{R}$ function, $(x_1, x_2) \mapsto \max\{0, x_1, x_2\}$, is in the gap. Proof: That $\text{1-DNN} \subset \text{2-DNN}$ is obvious. Now observe that any $\mathbb{R}^2 \to \mathbb{R}$ 1-DNN function is non-differentiable on a union of lines (one line along each ReLU gate’s argument) but the given function is non-differentiable on a union of 3 half-lines. Hence proved!
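The proof above shows why the function is not in 1-DNN; the other half, that it is in 2-DNN, can be checked with an explicit construction. The sketch below is one possible choice of gates (an assumption for illustration, not necessarily the construction used in the paper), using $x_1 = \max\{0, x_1\} - \max\{0, -x_1\}$ and $\max\{x_1, x_2\} = x_1 + \max\{0, x_2 - x_1\}$.

```python
def relu(z):
    return max(0, z)


def max0_two_layer(x1, x2):
    """Compute max{0, x1, x2} with two hidden layers of ReLU gates."""
    # First hidden layer: three ReLU gates acting on affine functions of the input.
    h1 = relu(x1)
    h2 = relu(-x1)
    h3 = relu(x2 - x1)
    # h1 - h2 recovers x1, so h1 - h2 + h3 equals max{x1, x2}.
    # Second hidden layer: a single ReLU gate takes the max with 0.
    return relu(h1 - h2 + h3)


print(max0_two_layer(-2, 5), max0_two_layer(-2, -5), max0_two_layer(3, 1))
```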
  • 18. The questions about the function space: a small part of “The Big Question” which is already unclear! The family of 2-DNN functions is parameterized as follows by (dimension compatible) choices of matrices $W_1, W_2$, vectors $a, b_1, b_2$ and a number $b_3$: $f_{\text{2-DNN}}(x) = b_3 + \langle a, \max\{0, b_2 + W_2 \max\{0, b_1 + W_1 x\}\} \rangle$. Can the $\mathbb{R}^4 \to \mathbb{R}$ function given as $x \mapsto \max\{0, x_1, x_2, x_3, x_4\}$ be written in the above form? (While it’s easy to see that $\max\{0, x_1, x_2, \ldots, x_{2^k}\} \in (k+1)\text{-DNN}$.)
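The parenthetical claim about $(k+1)$-DNN can be seen from a pairwise “tournament”: each hidden layer halves the number of candidates, and one last ReLU gate takes the max with 0. The following is a sketch of that standard argument; the function names are hypothetical and the layer count is tracked only for illustration.

```python
def relu(z):
    return max(0, z)


def pairwise_max_layer(values):
    """One hidden ReLU layer: max{a, b} = relu(a) - relu(-a) + relu(b - a)."""
    return [
        relu(a) - relu(-a) + relu(b - a)
        for a, b in zip(values[0::2], values[1::2])
    ]


def max_with_zero(xs):
    """Return (max{0, x_1, ..., x_{2^k}}, number of hidden ReLU layers used)."""
    n = len(xs)
    if n == 0 or n & (n - 1):
        raise ValueError("expected 2^k inputs")
    values, layers = list(xs), 0
    while len(values) > 1:
        values = pairwise_max_layer(values)  # 2^j candidates -> 2^(j-1)
        layers += 1
    return relu(values[0]), layers + 1  # final gate: max with 0


print(max_with_zero([3, -1, 7, 2, -5, 0, 4, 6]))  # 8 = 2^3 inputs, so k + 1 = 4 layers
```

Whether fewer layers can ever suffice is exactly the question the slide raises, so this sketch gives only the upper bound on depth.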
  • 20. The questions about the function space: depth separation for $\mathbb{R} \to \mathbb{R}$ nets. Can one show neural functions at every depth such that lower depths will necessarily require a much larger size to represent them? Theorem (we generalize a result by Matus Telgarsky (UIUC)): $\forall k \in \mathbb{N}$, there exists a continuum of $\mathbb{R} \to \mathbb{R}$ neural net functions of depth $1 + k^2$ (and size $k^3$) which need size $\Omega(k^{k+1})$ for depths $\le 1 + k$. Here the basic intuition is that if one starts with a small-depth function which is oscillating, then *without* blowing up the width too much, higher depths can be set up to recursively increase the number of oscillations. Such functions then get very hard for the smaller depths to even approximate in $\ell_1$ norm unless they blow up in size.
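The oscillation-doubling intuition can be illustrated with the familiar tent map, written with two ReLU gates and composed with itself. This is a sketch in the spirit of Telgarsky's construction, not the specific family of functions from the theorem; the names `tent`, `compose_tent` and `count_peaks` are assumptions. Exact rational arithmetic is used so the peaks can be counted without rounding issues.

```python
from fractions import Fraction


def relu(z):
    return max(Fraction(0), z)


def tent(x):
    """Width-2 ReLU layer: equals 2x on [0, 1/2] and 2 - 2x on [1/2, 1]."""
    return 2 * relu(x) - 4 * relu(x - Fraction(1, 2))


def compose_tent(x, k):
    """Apply the tent map k times, i.e. stack k hidden layers of width 2."""
    for _ in range(k):
        x = tent(x)
    return x


def count_peaks(k):
    """Count grid points i / 2^k in [0, 1] where the k-fold composition equals 1."""
    n = 2 ** k
    values = [compose_tent(Fraction(i, n), k) for i in range(n + 1)]
    return sum(1 for v in values if v == 1)


for k in range(1, 6):
    print(k, count_peaks(k))  # the number of peaks doubles with every extra layer
```

Each extra layer adds only two gates but doubles the number of oscillations, which is the mechanism that makes such functions expensive for shallower nets.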
  • 24. The questions about the function space: separations for Boolean functions with one layer of gates. For real-valued functions on the Boolean hypercube, is ReLU stronger than the LTF activation? The best gap we know of is the following. Theorem (Ours): There is at least an $\Omega(n)$ gap between Sum-of-ReLU and Sum-of-LTF. Proof: This follows by looking at this function on the hypercube $\{0, 1\}^n$, given as $f(x) = \sum_{i=1}^{n} 2^{i-1} x_i$. This has $2^n$ level sets on the discrete cube and hence needs that many polyhedral cells to be produced by the hyperplanes of the Sum-of-LTF circuit, whereas, being a linear function, it can be implemented by just 2 ReLU gates!
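Both facts used in the proof are easy to check by brute force for small $n$. The sketch below (with hypothetical function names) enumerates the hypercube to count the level sets of $f$ and verifies the two-gate identity $z = \max\{0, z\} - \max\{0, -z\}$ applied to the linear function.

```python
from itertools import product


def relu(z):
    return max(0, z)


def f(x):
    """f(x) = sum_{i=1}^{n} 2^(i-1) x_i, i.e. x read as a binary number."""
    return sum(2 ** i * xi for i, xi in enumerate(x))


def f_two_relu(x):
    """The same function written as a sum of 2 ReLU gates."""
    z = f(x)
    return relu(z) - relu(-z)


def count_level_sets(n):
    """Number of distinct values f takes on {0, 1}^n."""
    return len({f(x) for x in product((0, 1), repeat=n)})


print(count_level_sets(4))  # every point of the cube is its own level set
print(all(f_two_relu(x) == f(x) for x in product((0, 1), repeat=4)))
```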
  • 25. Now that we are done with the preliminaries, we move on to the results which seem to need significantly more effort.
  • 27. The questions about the function space: the *ideal* depth separation! Can one show neural functions at every depth such that it will necessarily require $\Omega(e^{\text{dimension}})$ size to represent them by circuits of even one depth less? This is a major open question and over real inputs this is currently known only between 2-DNN and 1-DNN from the works of Eldan-Shamir and Amit Daniely. We go beyond small depth lower bounds in the following restricted sense.
  • 29. The questions about the function space. Theorem (Ours): There exist small depth-2 Boolean functions such that LTF-of-(ReLU)$^{d-1}$ circuits require size $\Omega\left( (d-1) \frac{2^{\frac{(\text{dimension})^{1/8}}{d-1}}}{((\text{dimension}) W)^{\frac{1}{d-1}}} \right)$ when the bottom-most layer weight vectors are such that their coordinates are integers of size at most $W$ and these weight vectors induce the same ordering on the set $\{-1, 1\}^{\text{dimension}}$ when ranked by value of the inner product with them. (Note that all other weights are left completely free!) This is achieved by showing that under the above restriction the “sign-rank” is quadratically (in dimension) bounded for the functions computed by such circuits, thought of as matrices of dimension $2^{\frac{\text{dimension}}{2}} \times 2^{\frac{\text{dimension}}{2}}$. (And we recall that small-depth small-size functions are known which have exponentially large sign-rank.)
  • 31. The questions about the function space: separations for Boolean functions. Despite the results by Eldan-Shamir and Amit Daniely, the curiosity still remains as to how much more powerful LTF-of-ReLU-of-ReLU is than LTF-of-ReLU for Boolean functions. Theorem (Ours): For any $\delta \in (0, \frac{1}{2})$, there exists $N(\delta) \in \mathbb{N}$ such that for all $n \ge N(\delta)$ and $\epsilon > \sqrt{\frac{2 \log^{\frac{2}{2-\delta}}(n)}{n}}$, any LTF-of-ReLU circuit on $n$ bits that matches the Andreev function on $n$ bits for at least a $\frac{1}{2} + \epsilon$ fraction of the inputs has size $\Omega(\epsilon^{2(1-\delta)} n^{1-\delta})$. This is proven by the “method of random restrictions”, and in particular a very recent version of it by Daniel Kane (UCSD) and Ryan Williams (MIT) based on the Littlewood-Offord theorem.
  • 34. Why can the deep net do dictionary learning? What makes the deep net landscape special? A fundamental challenge with deep nets is to explain why they are able to solve so many diverse kinds of real-life learning problems. It is a serious mathematical challenge to understand how the deep net “sees” these as optimization questions. For a net, say $N$, and a distribution $D$, let’s call its “landscape” ($L$) corresponding to a “loss function” $\ell$ (typically the squared loss) the function $L(D, N) = \mathbb{E}_{(x, y) \in D}[\ell(y, N(x))]$. Why is this $L$ so often somehow a nice function to optimize on to solve a question which a priori had nothing to do with nets?
  • 36. Sparse coding. We isolate one special optimization question where we can attempt to offer some mathematical explanation for this phenomenon. “Sparse coding” is a classic learning challenge where, given access to vectors $y = A^* x^*$ and some distributional (sparsity) guarantees about $x^*$, we try to infer $A^*$. Breakthrough work by Spielman, Wang and Wright (2012): this is sometimes provably doable in poly-time! In this work we attempt to progress towards giving some rigorous explanation for the observation that nets seem to solve sparse coding!
  • 37. Sparse coding. The defining equation of our autoencoder computing $\tilde{y} \in \mathbb{R}^n$ from $y \in \mathbb{R}^n$. The generative model: sparse $x^* \in \mathbb{R}^h$ and $y = A^* x^* \in \mathbb{R}^n$ with $h \gg n$. The autoencoder: $h = \text{ReLU}(W y - \epsilon) = \max\{0, W y - \epsilon\} \in \mathbb{R}^h$ and $\tilde{y} = W^T h \in \mathbb{R}^n$.
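The autoencoder on this slide is just a thresholded encoding followed by a tied-weight decoding, which the following plain-Python sketch spells out. The helper names and the toy dictionary are assumptions: the example uses an identity $A^*$ with as many atoms as input dimensions, purely to keep the arithmetic checkable by hand, whereas the talk is concerned with overcomplete dictionaries.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def autoencoder(y, W, eps):
    """Return (hidden code, reconstruction) for hidden = ReLU(W y - eps), y_tilde = W^T hidden."""
    hidden = [max(0.0, z - eps) for z in matvec(W, y)]
    y_tilde = matvec(transpose(W), hidden)
    return hidden, y_tilde


def squared_loss(y, y_tilde):
    return sum((a - b) ** 2 for a, b in zip(y, y_tilde))


# Toy generative model (hypothetical): identity dictionary and a 1-sparse, non-negative code.
A_star = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x_star = [0.0, 2.0, 0.0]
y = matvec(A_star, x_star)

W = transpose(A_star)  # evaluate the autoencoder at W = (A*)^T
hidden, y_tilde = autoencoder(y, W, eps=0.5)
print(hidden, y_tilde, squared_loss(y, y_tilde))
```

In this toy case the threshold zeroes out the inactive coordinates and shrinks the active one by `eps`, so the reconstruction error comes only from the bias.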
  • 40. The power of autoencoders is surprisingly easy to demonstrate! Software: TensorFlow (with a complicated iterative technique called “RMSProp” which we shall explain in the next slide!). 6000 training examples and 1000 testing examples for each digit. $n = 784$, and the number of ReLU gates was 10000 for the 1-DNN, and 5000 and 784 for the 2-DNN.
  • 42. What exactly do algorithms like ADAM and RMSProp do? Algorithm: ADAM on a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$
      function ADAM($x_1, \beta_1, \beta_2, \alpha, \xi$)
        Initialize: $m_0 = 0$, $v_0 = 0$
        for $t = 1, 2, \ldots$ do
          $g_t = \nabla f(x_t)$
          $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
          $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
          $V_t = \text{diag}(v_t)$
          $x_{t+1} = x_t - \alpha_t \left( V_t^{1/2} + \text{diag}(\xi 1_d) \right)^{-1} m_t$
        end for
      end function
      These “adaptive gradient” algorithms like ADAM (or RMSProp = ADAM at $\beta_1 = 0$), which seem to work the best on autoencoder neural nets, are currently very poorly understood!
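The pseudocode above translates almost line by line into Python. The sketch below follows the update exactly as written on the slide (so there is no bias correction) and makes two assumptions that the slide leaves open: the step size $\alpha_t$ is taken to be the constant $\alpha$, and the loop is stopped after a fixed number of steps. The test function `f` is a hypothetical convex quadratic chosen only to have something to run on.

```python
import math


def adam(grad, x1, beta1, beta2, alpha, xi, steps):
    """Run the ADAM iteration from the slide for a fixed number of steps."""
    x = list(x1)
    m = [0.0] * len(x)  # m_0 = 0
    v = [0.0] * len(x)  # v_0 = 0
    for _ in range(steps):
        g = grad(x)
        m = [beta1 * mj + (1 - beta1) * gj for mj, gj in zip(m, g)]
        v = [beta2 * vj + (1 - beta2) * gj * gj for vj, gj in zip(v, g)]
        # (V_t^{1/2} + diag(xi 1_d))^{-1} m_t is a coordinate-wise division.
        # Assumption: constant step size alpha_t = alpha.
        x = [xj - alpha * mj / (math.sqrt(vj) + xi) for xj, mj, vj in zip(x, m, v)]
    return x


def f(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


def grad_f(x):
    return [2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)]


x_final = adam(grad_f, [0.0, 0.0], beta1=0.9, beta2=0.999, alpha=0.01, xi=1e-8, steps=2000)
print(x_final, f(x_final))
```

Calling the same function with `beta1=0` gives the RMSProp iteration mentioned on the slide, since the momentum buffer then equals the current gradient.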
  • 45. What exactly do algorithms like ADAM and RMSProp do? Our experimental conclusions and proofs about ADAM. We have shown controlled experiments to suggest that for large enough autoencoders standard methods possibly cannot surpass ADAM’s ability to reduce training as well as test losses, particularly when its parameters are set as $\beta_1 \sim 0.99$, for both full-batch and mini-batch settings. [Theorem] There exists a sequence of step-size choices and ranges of values of $\xi$ and $\beta_1$ for which ADAM provably converges to criticality with no convexity assumptions. (The proof technique here might be of independent interest!) Now let’s try to gain some mathematical control on the neural net landscape, at least in the depth 2 case where RMSProp and ADAM have almost similar performance.
  • 48. Why can deep nets do sparse coding? After laborious algebra (over months!) we can offer the following insight. Theorem (Ours): If the source sparse vectors $x^* \in \mathbb{R}^h$ are such that their non-zero coordinates are sampled from an interval in $\mathbb{R}^+$ and they have a support of size at most $h^p$ with $p < \frac{1}{2}$, and $A^* \in \mathbb{R}^{n \times h}$ is incoherent enough, then a constant $\epsilon$ can be chosen such that the autoencoder landscape $\mathbb{E}_{y = A^* x^*}\left[ \| y - W^T \text{ReLU}(W y - \epsilon) \|_2^2 \right]$ is asymptotically (in $h$) critical in a neighbourhood of $A^*$. Such criticality around the right answer is clearly a plausible reason why gradient descent might find the right answer! Experiments in fact suggest that asymptotically in $h$, $A^*$ might even be a global minimum, but as of now we have no clue how to prove such a thing!
  • 56. Open questions
      - Explain ADAM! Why is ADAM so good at minimizing the generalization error on autoencoders (and on many other nets)?
      - Even for the specific case of sparse coding, how can one analyze all the critical points of the landscape, or even just prove or disprove that the right answer is a global minimum?
      - We have shown an example of a manifold of “high complexity” neural functions. But how dense are such complex functions in the space of deep net functions?
      - Can one exactly characterize the set of functions parameterized by the architecture?
      - How can one prove or disprove the existence of dimension-exponential gaps between consecutive depths? (This isn’t clear even when restricted to Boolean inputs and with unrestricted weights!)
      - Can the max of 2^k + 1 numbers be taken using k layers of ReLU gates? (A negative answer would immediately show that the deep net function class strictly increases with depth!)
      - Are there Boolean functions which have smaller representations using ReLU gates than using LTF gates? (A peculiarly puzzling question!)
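For context on the max question above, here is a minimal sketch of the easy direction: a max of 2^k numbers can be computed with k hidden ReLU layers by a pairwise tournament, using the identity max(a, b) = (a + b)/2 + |a - b|/2 with both terms written through ReLU gates. The function names (`relu`, `max2`, `relu_max`) are illustrative and do not come from the slides. The sketch says nothing about whether one extra input forces an extra layer, which is the open part.

```python
def relu(x):
    """ReLU gate: max{0, x}."""
    return max(0.0, x)


def max2(a, b):
    """Max of two numbers from one hidden layer of four ReLU gates.

    a + b   = relu(a + b) - relu(-a - b)
    |a - b| = relu(a - b) + relu(b - a)
    max(a, b) = ((a + b) + |a - b|) / 2
    """
    return 0.5 * (relu(a + b) - relu(-a - b) + relu(a - b) + relu(b - a))


def relu_max(values):
    """Max of 2^k numbers via k rounds of pairwise max2.

    Returns (maximum, number_of_rounds). Each round is one hidden ReLU
    layer; the linear output of a round folds into the affine map feeding
    the next round, so k rounds correspond to k hidden layers.
    """
    layer = [float(v) for v in values]
    n = len(layer)
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("this sketch assumes the input count is a power of two")
    rounds = 0
    while len(layer) > 1:
        layer = [max2(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
        rounds += 1
    return layer[0], rounds
```

For example, `relu_max([3, -1, 7, 2])` runs two rounds for four inputs, matching k = 2.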