Theoretical Deep Learning
Xiaohu Zhu
Cofounder & Chief Scientist
Why?
Reason 1
To understand things better and more deeply
Reason 2
To devise more efficient algorithms
Reason 3
To connect with other solid theories and methods
Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima, and what characterizes a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of training samples?
Representation
The killer application of DL
Compositional functions
Shallow nets: # of parameters grows exponentially with the input dimension
Deep nets: # of units grows only linearly with the dimension when the target function is compositional (sketched below)
Deep learning loses this advantage, and can perform worse, on non-compositional functions
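To make the compositionality claim concrete, here is a standard illustrative example in the spirit of Poggio et al. (reference 3). The binary-tree function and the unit-count rates below are a sketch of their result for smooth bivariate constituent functions (Lipschitz case shown); they are not a new calculation:

```latex
% A compositional function with binary-tree structure on d = 8 inputs,
% built from constituent functions g_{ij} of two variables each:
f(x_1,\dots,x_8) =
  g_{31}\Bigl(
    g_{21}\bigl(g_{11}(x_1,x_2),\, g_{12}(x_3,x_4)\bigr),\;
    g_{22}\bigl(g_{13}(x_5,x_6),\, g_{14}(x_7,x_8)\bigr)
  \Bigr)

% Units needed to approximate f to accuracy \varepsilon
% (Poggio et al. 2017; exponents improve further with smoothness m):
N_{\text{shallow}} = O\!\left(\varepsilon^{-d}\right)
\qquad\text{vs.}\qquad
N_{\text{deep}} = O\!\left((d-1)\,\varepsilon^{-2}\right)
```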
Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima, and what characterizes a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of training samples?
Optimization 1
Linear equations: # of unknowns > # of equations ⇒ more than one solution
Neural nets in practice: # of parameters (~millions) ≫ # of samples (~60,000 for CIFAR-10) ⇒ overparameterization
Bézout's theorem: # of solutions > # of atoms in the universe ⇒ degenerate: each solution lies in an infinite solution set (a worked count follows below)
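To see why the Bézout count is astronomical, here is a back-of-the-envelope sketch. The equation count (one interpolation constraint per sample) and the degree are illustrative assumptions, not numbers from a specific network:

```latex
% Bezout's theorem (generic case): a system of k polynomial equations
% of degrees d_1, \dots, d_k has at most \prod_i d_i isolated solutions.
\#\{\text{solutions}\} \;\le\; \prod_{i=1}^{k} d_i

% Illustration: k = 60{,}000 interpolation equations, each of degree 2
% in the weights (e.g., a net with quadratic activations):
2^{60{,}000} \approx 10^{18{,}062} \;\gg\; 10^{80}
\approx \#\{\text{atoms in the observable universe}\}

% And with more unknowns than equations the solution set is not even
% finite: generically it is a manifold of dimension
% (\#\text{unknowns} - \#\text{equations}).
```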
Optimization 2
Overparameterization: neural nets have infinitely many global optima, which form flat plateau valleys in the loss landscape
SGD stays in these degenerate valleys with high probability
Good news: optimization is easy; global optima exist, are plentiful, and are easy for optimization algorithms to find (toy sketch below)
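A minimal numpy sketch of the "many global optima, easy to find" claim, using the simplest overparameterized model (a linear one, as a stand-in for a network): five equations, fifty unknowns. Every gradient-descent run drives the training loss to zero, but different initializations land at different points of the same degenerate solution set. All sizes and names here are hypothetical:

```python
import numpy as np

# Overparameterized toy problem: 5 equations, 50 unknowns.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))   # 5 samples, 50 parameters
y = rng.normal(size=5)

def gd_fit(seed, steps=5000, lr=0.01):
    """Plain gradient descent on squared loss from a random init."""
    w = np.random.default_rng(seed).normal(size=50)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

solutions = [gd_fit(seed) for seed in range(3)]
for w in solutions:
    print("train loss:", np.mean((X @ w - y) ** 2))   # ~0 for every run
# Different runs reach different zero-loss points in the same valley:
print("distance between two solutions:",
      np.linalg.norm(solutions[0] - solutions[1]))    # clearly nonzero
```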
Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima, and what characterizes a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of training samples?
Generalization 1
Overparameterization: good for optimization, classically expected to be bad for generalization
Deep learning: tasks and their surrogate loss functions are reasonably well matched
Srebro's work: CROSS ENTROPY wins, i.e., overfitting the test loss ⇏ overfitting the classification error
Dynamical-systems view (differential equations): near a global minimum, a deep net behaves like a linear network (linearization sketched below)
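The "behaves like a linear network near a global minimum" point is, at bottom, a linearization of the training dynamics. The sketch below is the generic gradient-flow Taylor expansion, written in the spirit of references 5-7 rather than quoted from them:

```latex
% Gradient flow: \dot{w} = -\nabla L(w).
% Near a global minimum w^* (where \nabla L(w^*) = 0), expand to first order:
\delta w := w - w^*, \qquad H := \nabla^2 L(w^*), \qquad
\dot{\delta w} \approx -H\,\delta w

% In the eigenbasis of H, with eigenpairs (\lambda_i, v_i):
\delta w(t)\cdot v_i = e^{-\lambda_i t}\,\bigl(\delta w(0)\cdot v_i\bigr)

% Degenerate (flat-valley) directions are exactly the null space of H:
% \lambda_i = 0 means no dynamics along v_i, which is why SGD can wander
% inside the valley without leaving the set of global minima.
```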
Generalization 2
Srebro's work: CROSS ENTROPY wins, i.e., overfitting the test loss ⇏ overfitting the classification error
Cross entropy belongs to the family of exponential-type losses
Does this asymmetry imply a special implicit-regularization property? (experiment below)
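A small numpy experiment in the spirit of this line of work (e.g., Soudry, Hoffer & Srebro, 2018, on gradient descent with logistic/cross-entropy loss on separable data): the 0-1 error hits zero early, yet the loss keeps decreasing as the weight norm grows, so movement in the loss does not translate into movement in the classification error. The data, seed, and hyperparameters are hypothetical:

```python
import numpy as np

# Tiny linearly separable problem, labels y in {-1, +1}.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2.0, 1.0, size=(20, 2)),
               rng.normal(-2.0, 1.0, size=(20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

w = np.zeros(2)
for step in range(1, 20001):
    margins = y * (X @ w)
    # Cross-entropy (logistic) loss and its gradient in w.
    loss = np.mean(np.log1p(np.exp(-margins)))
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.1 * grad
    if step % 5000 == 0:
        err = np.mean(np.sign(X @ w) != y)
        print(f"step {step}: loss {loss:.4f}, 0-1 error {err:.2f}, "
              f"||w|| = {np.linalg.norm(w):.2f}")
# The error reaches (near) zero almost immediately, while the loss keeps
# shrinking only by inflating ||w||; w / ||w|| drifts toward the max-margin
# direction, the candidate "special property" of exponential-type losses.
```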
Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima, and what characterizes a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of training samples?
What's More?
Plateau optima ⇒ better generalization?
Overfitting? Look out!
Do we need priors?
Is brain research useful for DL?
References
1. Cucker, F., & Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1), 1-49.
2. Neyshabur, B., Tomioka, R., Salakhutdinov, R., & Srebro, N. (2017). Geometry of optimization and implicit regularization in deep learning. arXiv
preprint arXiv:1705.03071.
3. Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of
dimensionality: A review. International Journal of Automation and Computing, 14(5), 503-519.
4. Liao, Q., & Poggio, T. (2017). Theory of Deep Learning II: Landscape of the Empirical Risk in Deep Learning. arXiv preprint arXiv:1703.09833.
5. Zhang, C., Liao, Q., Rakhlin, A., Miranda, B., Golowich, N., & Poggio, T. (2018). Theory of Deep Learning IIb: Optimization Properties of SGD.
arXiv preprint arXiv:1801.02254.
6. Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., ... & Mhaskar, H. (2017). Theory of Deep Learning III: Explaining the Non-Overfitting Puzzle. arXiv preprint arXiv:1801.00173.
7. Zhang, C., Liao, Q., Rakhlin, A., Sridharan, K., Miranda, B., Golowich, N., & Poggio, T. (2017). Theory of Deep Learning III: Generalization Properties of SGD. Center for Brains, Minds and Machines (CBMM).
8. Dinh, L., Pascanu, R., Bengio, S., & Bengio, Y. (2017). Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933.
9. Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural computation, 8(7), 1341-1390.
Thanks
