Minimax statistical learning with Wasserstein distances
by Jaeho Lee and Maxim Raginsky
January 26, 2019
Presenter: Kenta Oono @ NeurIPS 2018 Reading Club
Kenta Oono (@delta2323)
Profile
• 2011.3: MSc. (Mathematics)
• 2011.4-2014.10: Preferred Infrastructure (PFI)
• 2014.10-current: Preferred Networks (PFN)
• 2018.4-current: Ph.D. student @ U.Tokyo
Interests
• Mathematics
• Bioinformatics
• Theory of Deep Learning
Summary
What this paper does:
• Develops a distributionally robust risk minimization problem.
• Derives the excess-risk rate O(n^{−1/2}), the same as in the non-robust case.
• Applies the framework to domain adaptation.
Why I chose this paper:
• Spotlight talk
• Wanted to learn statistical learning theory
• Especially the minimax optimality of deep learning, though the paper turned out not to be about that.
• Wanted to learn about Wasserstein distances
Problem Setting (Expected Risk)
Given
• Z: sample space
• P: (unknown) distribution over Z
• Dataset: D = (z_1, . . . , z_n) ∼ P i.i.d.
For a hypothesis f : Z → R, we evaluate its expected risk by
• Expected Risk: R(P, f ) = EZ∼P[f (Z)]
• Hypothesis space: F ⊂ {Z → R}
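A minimal sketch of this setup (not from the paper; the distribution and hypothesis are illustrative choices): the expected risk R(P, f) is approximated by the sample average R(P_n, f) over an i.i.d. draw from P.

```python
# Monte Carlo estimate of the expected risk R(P, f) = E_{Z~P}[f(Z)].
# Illustrative choices: P = N(0, 1) and f(z) = z^2, so R(P, f) = 1.
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    return z ** 2                    # an example hypothesis f : Z -> R

n = 10_000
sample = rng.standard_normal(n)      # D = (z_1, ..., z_n) ~ P i.i.d.
risk_estimate = f(sample).mean()     # R(P_n, f), close to R(P, f) = 1 for large n
print(risk_estimate)
```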
Problem Setting (Estimator)
Goal:
• Devise an algorithm A : D → ˆf = ˆf (D)
• We treat D as a random variable, so ˆf is random, too.
• If A is a randomized algorithm (e.g. SGD), the randomness of ˆf (D) also comes from A.
• Evaluate the excess risk: R(P, ˆf) − inf_{f∈F} R(P, f)
Typical forms of theorems:
• E_{A,D}[R(P, ˆf) − inf_{f∈F} R(P, f)] = O(g(n))
• R(P, ˆf) − inf_{f∈F} R(P, f) = O(g(n, δ)) with probability 1 − δ over the choice of D (and A)
Problem Setting (ERM Estimator)
Since we cannot compute the expected risk R, we compute the empirical risk instead:
ˆR_D(f) = (1/n) Σ_{i=1}^n f(z_i) = R(P_n, f)    (P_n: empirical distribution)
The ERM (Empirical Risk Minimization) estimator for the hypothesis space F is
ˆf = ˆf(D) ∈ argmin_{f∈F} R(P_n, f)
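A minimal sketch of ERM and the resulting excess risk, under illustrative assumptions (a one-parameter squared-loss class and Gaussian data; not the paper's setting):

```python
# ERM over a finite hypothesis class F = {f_c : c in grid}, f_c(z) = (z - c)^2,
# with data z_i ~ P = N(1, 1). The population minimizer over the grid is c = 1.
import numpy as np

rng = np.random.default_rng(0)
n = 500
data = rng.normal(loc=1.0, scale=1.0, size=n)    # D = (z_1, ..., z_n)

grid = np.linspace(-2.0, 4.0, 61)                # parameterizes F

def empirical_risk(c):
    return np.mean((data - c) ** 2)              # R(P_n, f_c)

def expected_risk(c):
    return 1.0 + (1.0 - c) ** 2                  # R(P, f_c) = Var(Z) + (E[Z] - c)^2

c_hat = min(grid, key=empirical_risk)            # ERM: f_hat in argmin_{f in F} R(P_n, f)
excess = expected_risk(c_hat) - min(expected_risk(c) for c in grid)
print(c_hat, excess)                             # excess risk ~ O(n^{-1/2}) in expectation
```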
Relation
(Figure shown in the original slides.)
Assumptions
(Assumptions shown as images in the original slides.)
Ref. Lee and Raginsky (2018)
Example
Supervised learning
• Z = X × Y, X = R^D: input space, Y = R: label space
• ℓ : Y × Y → R: loss function
• H ⊂ {X → Y}: set of models
• F = {f_h(x, y) = ℓ(h(x), y) | h ∈ H}
Regression
• X = R^D, Y = R, ℓ(y, y′) = (y − y′)²
• H = (functions realized by neural networks with a fixed architecture)
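A small sketch of this reduction (the concrete h and ℓ are illustrative stand-ins, not the paper's): a model h and a loss ℓ induce the element f_h of F acting on z = (x, y).

```python
# Wrapping a model h : X -> Y and a loss ell into f_h(x, y) = ell(h(x), y).
def ell(y_pred, y):
    return (y_pred - y) ** 2         # squared loss, as in the regression example

def make_f(h):
    """Turn a model h into the induced hypothesis f_h(z) = ell(h(x), y)."""
    def f_h(x, y):
        return ell(h(x), y)
    return f_h

h = lambda x: 2.0 * x + 1.0          # stand-in for a trained network
f_h = make_f(h)
print(f_h(1.0, 3.5))                 # ell(h(1.0), 3.5) = (3.0 - 3.5)^2 = 0.25
```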
Classical Result
Typically, we have
R(P, ˆf) − inf_{f∈F} R(P, f) = O_P((complexity of F) / √n)
The model complexity measures how "large" F is, intuitively.
Covering number
Definition (Covering Number)
For F ⊂ F_0 := {f : [−1, 1]^D → R} and ε > 0, the (external) covering number of F is
N(F, ε) := inf{ N ∈ ℕ : ∃ f_1, . . . , f_N ∈ F_0 s.t. ∀ f ∈ F, ∃ n ∈ [N] s.t. ‖f − f_n‖_∞ ≤ ε }
• Intuition: the minimum number of balls (of radius ε) needed to cover the space F.
• Entropy integral: C(F) := ∫_0^∞ √(log N(F, u)) du
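To make the definition concrete, here is a toy sketch (all choices illustrative) that upper-bounds a covering number by greedy covering, assuming each f is represented by its values on a finite grid so that the sup norm becomes a max over grid points:

```python
# Greedy epsilon-cover of F = {x -> c x : |c| <= 1} on [-1, 1], giving an
# upper bound on N(F, eps); the definition takes an infimum over all covers
# (with centers allowed from the larger class F_0, so F-centers still count).
import numpy as np

xs = np.linspace(-1.0, 1.0, 101)                          # discretized domain
F = np.array([c * xs for c in np.linspace(-1.0, 1.0, 201)])

def greedy_cover_size(F, eps):
    uncovered = list(range(len(F)))
    count = 0
    while uncovered:
        center = F[uncovered[0]]                          # pick any uncovered f
        uncovered = [i for i in uncovered
                     if np.max(np.abs(F[i] - center)) > eps]  # drop its eps-ball
        count += 1
    return count

print(greedy_cover_size(F, 0.1))    # ~19 here; scales like 1/eps for this class
```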
Distributionally Robust Framework
Minimize the worst-case risk over distributions close to the true distribution P:
minimize R(P, f)
↓
minimize R_{ρ,p}(P, f) := sup_{Q ∈ A_{ρ,p}(P)} R(Q, f)
where the ambiguity set is a p-Wasserstein ball:
A_{ρ,p}(P) = {Q : W_p(P, Q) ≤ ρ}
Applications
• Adversarial attacks: ρ = noise level
• Domain adaptation: ρ = discrepancy between training and test distributions
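The supremum over the Wasserstein ball is handled in the paper via strong duality (the form R_ρ(P, f) = inf_{λ≥0} E[ψ_{λ,f}(Z)] quoted on the last slide). Below is a rough numerical sketch of that dual for p = 1 in one dimension, with P replaced by P_n; the sample, the grid of transport targets, and f are all illustrative assumptions.

```python
# Worst-case risk sup_{W_1(Q, P_n) <= rho} R(Q, f) via the dual
#   inf_{lam >= 0} { lam * rho + E_{P_n}[ sup_{z'} ( f(z') - lam * |z - z'| ) ] },
# with the sup and inf taken over finite grids.
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(200)                  # sample from P
zs = np.linspace(-5.0, 5.0, 401)                 # candidate transport targets z'

def f(z):
    return np.tanh(z)                            # a bounded, 1-Lipschitz hypothesis

def dual_objective(lam, rho):
    inner = f(zs)[None, :] - lam * np.abs(data[:, None] - zs[None, :])
    return lam * rho + inner.max(axis=1).mean()

rho = 0.5
lams = np.linspace(0.0, 5.0, 201)
robust_risk = min(dual_objective(lam, rho) for lam in lams)
print(robust_risk)       # >= the plain empirical risk f(data).mean(), by construction
```

The robust ERM estimator on the next slide would then minimize this quantity over f ∈ F.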
Estimator
Correspondingly, we change the estimator to
ˆf ∈ argmin_{f∈F} R_{ρ,p}(P_n, f)
We want to evaluate the robust excess risk
R_{ρ,p}(P, ˆf) − inf_{f∈F} R_{ρ,p}(P, f)
Main Theorems
Same excess-risk rate as in the non-robust setting.
(Theorem statements shown as images in the original slides.)
Ref. Lee and Raginsky (2018)
Strategy
From the authors' slides.
Ref: https://nips.cc/media/Slides/nips/2018/517cd(05-09-45)-05-10-20-12649-Minimax_Statist.pdf
Key Lemmas
(Lemma statements shown as images in the original slides.)
Ref. Lee and Raginsky (2018)
Why are these lemmas important?
(Complexity of ΨΛ,F ) ≈ (Complexity of F) × (Complexity of Λ)
Impression
• The duality form of the risk (R_ρ(P, f) = inf_{λ≥0} E[ψ_{λ,f}(Z)]) may be useful in its own right.
• Assumption 4 is mysterious (an incredibly local property of F).
• Is there special structure in the p = 1 Wasserstein distance?
