Theoretical Deep Learning
Xiaohu Zhu
Cofounder & Chief Scientist
Why?
Reason 1
To understand things better and more deeply
Reason 2
To devise more efficient algorithms
Reason 3
To connect with other solid theories and methods
Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima, and what characterizes a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of training samples?
Representation
The killer application of DL
Compositional functions
Shallow nets: # of parameters grows exponentially with the input dimension
Deep nets: # of units grows only linearly with the dimension when the target function is compositional (sketched below)
Deep learning loses this advantage, and can perform worse, on non-compositional functions
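To make the compositionality claim concrete, here is a standard illustrative example in the spirit of Poggio et al. (reference 3). The binary-tree function and the unit-count rates below are a sketch of their result for smooth bivariate constituent functions (Lipschitz case shown); they are not a new calculation:

```latex
% A compositional function with binary-tree structure on d = 8 inputs,
% built from constituent functions g_{ij} of two variables each:
f(x_1,\dots,x_8) =
  g_{31}\Bigl(
    g_{21}\bigl(g_{11}(x_1,x_2),\, g_{12}(x_3,x_4)\bigr),\;
    g_{22}\bigl(g_{13}(x_5,x_6),\, g_{14}(x_7,x_8)\bigr)
  \Bigr)

% Units needed to approximate f to accuracy \varepsilon
% (Poggio et al. 2017; exponents improve further with smoothness m):
N_{\text{shallow}} = O\!\left(\varepsilon^{-d}\right)
\qquad\text{vs.}\qquad
N_{\text{deep}} = O\!\left((d-1)\,\varepsilon^{-2}\right)
```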
Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima, and what characterizes a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of training samples?
Optimization 1
Linear equations: # of unknowns > # of equations ⇒ more than one solution
Neural nets in practice: # of parameters (~millions) ≫ # of samples (~60,000 for CIFAR-10) ⇒ overparameterization
Bézout's theorem: # of solutions > # of atoms in the universe ⇒ degenerate: each solution lies in an infinite solution set (a worked count follows below)
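To see why the Bézout count is astronomical, here is a back-of-the-envelope sketch. The equation count (one interpolation constraint per sample) and the degree are illustrative assumptions, not numbers from a specific network:

```latex
% Bezout's theorem (generic case): a system of k polynomial equations
% of degrees d_1, \dots, d_k has at most \prod_i d_i isolated solutions.
\#\{\text{solutions}\} \;\le\; \prod_{i=1}^{k} d_i

% Illustration: k = 60{,}000 interpolation equations, each of degree 2
% in the weights (e.g., a net with quadratic activations):
2^{60{,}000} \approx 10^{18{,}062} \;\gg\; 10^{80}
\approx \#\{\text{atoms in the observable universe}\}

% And with more unknowns than equations the solution set is not even
% finite: generically it is a manifold of dimension
% (\#\text{unknowns} - \#\text{equations}).
```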
Optimization 2
Overparameterization: neural nets have infinitely many global optima, which form flat plateau valleys in the loss landscape
SGD stays in these degenerate valleys with high probability
Good news: optimization is easy; global optima exist, are plentiful, and are easy for optimization algorithms to find (toy sketch below)
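A minimal numpy sketch of the "many global optima, easy to find" claim, using the simplest overparameterized model (a linear one, as a stand-in for a network): five equations, fifty unknowns. Every gradient-descent run drives the training loss to zero, but different initializations land at different points of the same degenerate solution set. All sizes and names here are hypothetical:

```python
import numpy as np

# Overparameterized toy problem: 5 equations, 50 unknowns.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))   # 5 samples, 50 parameters
y = rng.normal(size=5)

def gd_fit(seed, steps=5000, lr=0.01):
    """Plain gradient descent on squared loss from a random init."""
    w = np.random.default_rng(seed).normal(size=50)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

solutions = [gd_fit(seed) for seed in range(3)]
for w in solutions:
    print("train loss:", np.mean((X @ w - y) ** 2))   # ~0 for every run
# Different runs reach different zero-loss points in the same valley:
print("distance between two solutions:",
      np.linalg.norm(solutions[0] - solutions[1]))    # clearly nonzero
```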
Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima, and what characterizes a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of training samples?
Generalization 1
Overparameterization: good for optimization, classically expected to be bad for generalization
Deep learning: tasks and their surrogate loss functions are reasonably well matched
Srebro's work: CROSS ENTROPY wins, i.e., overfitting the test loss ⇏ overfitting the classification error
Dynamical-systems view (differential equations): near a global minimum, a deep net behaves like a linear network (linearization sketched below)
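The "behaves like a linear network near a global minimum" point is, at bottom, a linearization of the training dynamics. The sketch below is the generic gradient-flow Taylor expansion, written in the spirit of references 5-7 rather than quoted from them:

```latex
% Gradient flow: \dot{w} = -\nabla L(w).
% Near a global minimum w^* (where \nabla L(w^*) = 0), expand to first order:
\delta w := w - w^*, \qquad H := \nabla^2 L(w^*), \qquad
\dot{\delta w} \approx -H\,\delta w

% In the eigenbasis of H, with eigenpairs (\lambda_i, v_i):
\delta w(t)\cdot v_i = e^{-\lambda_i t}\,\bigl(\delta w(0)\cdot v_i\bigr)

% Degenerate (flat-valley) directions are exactly the null space of H:
% \lambda_i = 0 means no dynamics along v_i, which is why SGD can wander
% inside the valley without leaving the set of global minima.
```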
Generalization 2
Srebro's work: CROSS ENTROPY wins, i.e., overfitting the test loss ⇏ overfitting the classification error
Cross entropy belongs to the family of exponential-type losses
Does this asymmetry imply a special implicit-regularization property? (experiment below)
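A small numpy experiment in the spirit of this line of work (e.g., Soudry, Hoffer & Srebro, 2018, on gradient descent with logistic/cross-entropy loss on separable data): the 0-1 error hits zero early, yet the loss keeps decreasing as the weight norm grows, so movement in the loss does not translate into movement in the classification error. The data, seed, and hyperparameters are hypothetical:

```python
import numpy as np

# Tiny linearly separable problem, labels y in {-1, +1}.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2.0, 1.0, size=(20, 2)),
               rng.normal(-2.0, 1.0, size=(20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

w = np.zeros(2)
for step in range(1, 20001):
    margins = y * (X @ w)
    # Cross-entropy (logistic) loss and its gradient in w.
    loss = np.mean(np.log1p(np.exp(-margins)))
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.1 * grad
    if step % 5000 == 0:
        err = np.mean(np.sign(X @ w) != y)
        print(f"step {step}: loss {loss:.4f}, 0-1 error {err:.2f}, "
              f"||w|| = {np.linalg.norm(w):.2f}")
# The error reaches (near) zero almost immediately, while the loss keeps
# shrinking only by inflating ||w||; w / ||w|| drifts toward the max-margin
# direction, the candidate "special property" of exponential-type losses.
```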
Representation
Why are deeper nets better than shallower nets?
Optimization
Why can SGD find much better local optima, and what characterizes a better optimum?
Generalization
Why do networks still generalize well when the number of parameters exceeds the number of training samples?
What's More?
Plateau optima ⇒ better generalization?
Overfitting? Look out!
Do we need priors?
Is brain research useful for DL?
References
1. Cucker, F., & Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1), 1-49.
2. Neyshabur, B., Tomioka, R., Salakhutdinov, R., & Srebro, N. (2017). Geometry of optimization and implicit regularization in deep learning. arXiv
preprint arXiv:1705.03071.
3. Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of
dimensionality: A review. International Journal of Automation and Computing, 14(5), 503-519.
4. Liao, Q., & Poggio, T. (2017). Theory of Deep Learning II: Landscape of the Empirical Risk in Deep Learning. arXiv preprint arXiv:1703.09833.
5. Zhang, C., Liao, Q., Rakhlin, A., Miranda, B., Golowich, N., & Poggio, T. (2018). Theory of Deep Learning IIb: Optimization Properties of SGD.
arXiv preprint arXiv:1801.02254.
6. Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., ... & Mhaskar, H. (2017). Theory of Deep Learning III: Explaining the Non-Overfitting Puzzle. arXiv preprint arXiv:1801.00173.
7. Zhang, C., Liao, Q., Rakhlin, A., Sridharan, K., Miranda, B., Golowich, N., & Poggio, T. (2017). Theory of Deep Learning III: Generalization Properties of SGD. Center for Brains, Minds and Machines (CBMM).
8. Dinh, L., Pascanu, R., Bengio, S., & Bengio, Y. (2017). Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933.
9. Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural computation, 8(7), 1341-1390.
Thanks
