Minimax statistical learning with Wasserstein distances
by Jaeho Lee and Maxim Raginsky
January 26, 2019
Presenter: Kenta Oono @ NeurIPS 2018 Reading Club
Kenta Oono (@delta2323)
Profile
• 2011.3: MSc. (Mathematics)
• 2011.4-2014.10: Preferred Infrastructure (PFI)
• 2014.10-current: Preferred Networks (PFN)
• 2018.4-current: Ph.D. student @ U.Tokyo
Interests
• Mathematics
• Bioinformatics
• Theory of Deep Learning
Summary
What this paper does:
• Develops a distributionally robust risk minimization problem.
• Derives the excess-risk rate O(n^{−1/2}), the same as in the non-robust case.
• Applies the framework to domain adaptation.
Why I chose this paper:
• Spotlight talk
• Wanted to learn statistical learning theory
• Especially the minimax optimality of deep learning, though the paper turned out not to be about that.
• Wanted to learn about Wasserstein distances
Problem Setting (Expected Risk)
Given
• Z: sample space
• P: (unknown) distribution over Z
• Dataset: D = (z_1, . . . , z_n) ∼ P i.i.d.
For a hypothesis f : Z → R, we evaluate its expected risk by
• Expected Risk: R(P, f ) = EZ∼P[f (Z)]
• Hypothesis space: F ⊂ {Z → R}
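A minimal sketch of this setup (not from the paper; the distribution and hypothesis are illustrative choices): the expected risk R(P, f) is approximated by the sample average R(P_n, f) over an i.i.d. draw from P.

```python
# Monte Carlo estimate of the expected risk R(P, f) = E_{Z~P}[f(Z)].
# Illustrative choices: P = N(0, 1) and f(z) = z^2, so R(P, f) = 1.
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    return z ** 2                    # an example hypothesis f : Z -> R

n = 10_000
sample = rng.standard_normal(n)      # D = (z_1, ..., z_n) ~ P i.i.d.
risk_estimate = f(sample).mean()     # R(P_n, f), close to R(P, f) = 1 for large n
print(risk_estimate)
```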
Problem Setting (Estimator)
Goal:
• Devise an algorithm A : D → ˆf = ˆf (D)
• We treat D as a random variable, so ˆf is random, too.
• If A is a randomized algorithm (e.g. SGD), the randomness of ˆf (D) also comes from A.
• Evaluate the excess risk: R(P, ˆf) − inf_{f∈F} R(P, f)
Typical forms of theorems:
• E_{A,D}[R(P, ˆf) − inf_{f∈F} R(P, f)] = O(g(n))
• R(P, ˆf) − inf_{f∈F} R(P, f) = O(g(n, δ)) with probability 1 − δ over the choice of D (and A)
Problem Setting (ERM Estimator)
Since we cannot compute the expected risk R, we compute the empirical risk instead:
ˆR_D(f) = (1/n) Σ_{i=1}^n f(z_i) = R(P_n, f)    (P_n: empirical distribution)
The ERM (Empirical Risk Minimization) estimator for the hypothesis space F is
ˆf = ˆf(D) ∈ argmin_{f∈F} R(P_n, f)
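A minimal sketch of ERM and the resulting excess risk, under illustrative assumptions (a one-parameter squared-loss class and Gaussian data; not the paper's setting):

```python
# ERM over a finite hypothesis class F = {f_c : c in grid}, f_c(z) = (z - c)^2,
# with data z_i ~ P = N(1, 1). The population minimizer over the grid is c = 1.
import numpy as np

rng = np.random.default_rng(0)
n = 500
data = rng.normal(loc=1.0, scale=1.0, size=n)    # D = (z_1, ..., z_n)

grid = np.linspace(-2.0, 4.0, 61)                # parameterizes F

def empirical_risk(c):
    return np.mean((data - c) ** 2)              # R(P_n, f_c)

def expected_risk(c):
    return 1.0 + (1.0 - c) ** 2                  # R(P, f_c) = Var(Z) + (E[Z] - c)^2

c_hat = min(grid, key=empirical_risk)            # ERM: f_hat in argmin_{f in F} R(P_n, f)
excess = expected_risk(c_hat) - min(expected_risk(c) for c in grid)
print(c_hat, excess)                             # excess risk ~ O(n^{-1/2}) in expectation
```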
Relation
(Figure shown in the original slides.)
Assumptions
(Assumptions shown as images in the original slides.)
Ref. Lee and Raginsky (2018)
Example
Supervised learning
• Z = X × Y, X = R^D: input space, Y = R: label space
• ℓ : Y × Y → R: loss function
• H ⊂ {X → Y}: set of models
• F = {f_h(x, y) = ℓ(h(x), y) | h ∈ H}
Regression
• X = R^D, Y = R, ℓ(y, y′) = (y − y′)²
• H = (functions realized by neural networks with a fixed architecture)
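A small sketch of this reduction (the concrete h and ℓ are illustrative stand-ins, not the paper's): a model h and a loss ℓ induce the element f_h of F acting on z = (x, y).

```python
# Wrapping a model h : X -> Y and a loss ell into f_h(x, y) = ell(h(x), y).
def ell(y_pred, y):
    return (y_pred - y) ** 2         # squared loss, as in the regression example

def make_f(h):
    """Turn a model h into the induced hypothesis f_h(z) = ell(h(x), y)."""
    def f_h(x, y):
        return ell(h(x), y)
    return f_h

h = lambda x: 2.0 * x + 1.0          # stand-in for a trained network
f_h = make_f(h)
print(f_h(1.0, 3.5))                 # ell(h(1.0), 3.5) = (3.0 - 3.5)^2 = 0.25
```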
Classical Result
Typically, we have
R(P, ˆf) − inf_{f∈F} R(P, f) = O_P((complexity of F) / √n)
The model complexity measures how "large" F is, intuitively.
Covering number
Definition (Covering Number)
For F ⊂ F_0 := {f : [−1, 1]^D → R} and ε > 0, the (external) covering number of F is
N(F, ε) := inf{ N ∈ ℕ : ∃ f_1, . . . , f_N ∈ F_0 s.t. ∀ f ∈ F, ∃ n ∈ [N] s.t. ‖f − f_n‖_∞ ≤ ε }
• Intuition: the minimum number of balls (of radius ε) needed to cover the space F.
• Entropy integral: C(F) := ∫_0^∞ √(log N(F, u)) du
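To make the definition concrete, here is a toy sketch (all choices illustrative) that upper-bounds a covering number by greedy covering, assuming each f is represented by its values on a finite grid so that the sup norm becomes a max over grid points:

```python
# Greedy epsilon-cover of F = {x -> c x : |c| <= 1} on [-1, 1], giving an
# upper bound on N(F, eps); the definition takes an infimum over all covers
# (with centers allowed from the larger class F_0, so F-centers still count).
import numpy as np

xs = np.linspace(-1.0, 1.0, 101)                          # discretized domain
F = np.array([c * xs for c in np.linspace(-1.0, 1.0, 201)])

def greedy_cover_size(F, eps):
    uncovered = list(range(len(F)))
    count = 0
    while uncovered:
        center = F[uncovered[0]]                          # pick any uncovered f
        uncovered = [i for i in uncovered
                     if np.max(np.abs(F[i] - center)) > eps]  # drop its eps-ball
        count += 1
    return count

print(greedy_cover_size(F, 0.1))    # ~19 here; scales like 1/eps for this class
```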
Distributionally Robust Framework
Minimize the worst-case risk over distributions close to the true distribution P:
minimize R(P, f)
↓
minimize R_{ρ,p}(P, f) := sup_{Q ∈ A_{ρ,p}(P)} R(Q, f)
where the ambiguity set is a p-Wasserstein ball:
A_{ρ,p}(P) = {Q : W_p(P, Q) ≤ ρ}
Applications
• Adversarial attacks: ρ = noise level
• Domain adaptation: ρ = discrepancy between training and test distributions
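The supremum over the Wasserstein ball is handled in the paper via strong duality (the form R_ρ(P, f) = inf_{λ≥0} E[ψ_{λ,f}(Z)] quoted on the last slide). Below is a rough numerical sketch of that dual for p = 1 in one dimension, with P replaced by P_n; the sample, the grid of transport targets, and f are all illustrative assumptions.

```python
# Worst-case risk sup_{W_1(Q, P_n) <= rho} R(Q, f) via the dual
#   inf_{lam >= 0} { lam * rho + E_{P_n}[ sup_{z'} ( f(z') - lam * |z - z'| ) ] },
# with the sup and inf taken over finite grids.
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(200)                  # sample from P
zs = np.linspace(-5.0, 5.0, 401)                 # candidate transport targets z'

def f(z):
    return np.tanh(z)                            # a bounded, 1-Lipschitz hypothesis

def dual_objective(lam, rho):
    inner = f(zs)[None, :] - lam * np.abs(data[:, None] - zs[None, :])
    return lam * rho + inner.max(axis=1).mean()

rho = 0.5
lams = np.linspace(0.0, 5.0, 201)
robust_risk = min(dual_objective(lam, rho) for lam in lams)
print(robust_risk)       # >= the plain empirical risk f(data).mean(), by construction
```

The robust ERM estimator on the next slide would then minimize this quantity over f ∈ F.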
Estimator
Correspondingly, we change the estimator to
ˆf ∈ argmin_{f∈F} R_{ρ,p}(P_n, f)
We want to evaluate the robust excess risk
R_{ρ,p}(P, ˆf) − inf_{f∈F} R_{ρ,p}(P, f)
Main Theorems
Same excess-risk rate as in the non-robust setting.
(Theorem statements shown as images in the original slides.)
Ref. Lee and Raginsky (2018)
Strategy
From the authors' slides.
Ref: https://nips.cc/media/Slides/nips/2018/517cd(05-09-45)-05-10-20-12649-Minimax_Statist.pdf
Key Lemmas
(Lemma statements shown as images in the original slides.)
Ref. Lee and Raginsky (2018)
Why are these lemmas important?
(Complexity of ΨΛ,F ) ≈ (Complexity of F) × (Complexity of Λ)
Impression
• The duality form of the risk (R_ρ(P, f) = inf_{λ≥0} E[ψ_{λ,f}(Z)]) may be useful in its own right.
• Assumption 4 is mysterious (an incredibly local property of F).
• Is there special structure in the p = 1 Wasserstein distance?
