ABC for model choice
1 simulation-based methods in Econometrics
2 Genetics of ABC
3 Approximate Bayesian computation
4 ABC for model choice
5 ABC model choice via random forests
6 ABC estimation via random forests
7 [some] asymptotics of ABC
Bayesian model choice
Several models M1, M2, . . . are considered simultaneously for a
dataset y and the model index M is part of the inference.
Use of a prior distribution π(M = m), plus a prior distribution π_m(θ_m) on the
parameter conditional on the value m of the model index
Goal is to derive the posterior distribution of M, a challenging
computational target when models are complex.
Generic ABC for model choice
Algorithm 4 Likelihood-free model choice sampler (ABC-MC)
for t = 1 to T do
  repeat
    Generate m from the prior π(M = m)
    Generate θ_m from the prior π_m(θ_m)
    Generate z from the model f_m(z|θ_m)
  until ρ{η(z), η(y)} < ε
  Set m^(t) = m and θ^(t) = θ_m
end for
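As a concrete illustration, here is a minimal Python sketch of the sampler above, with a uniform prior over models; the names `priors`, `simulators` and `rho` are illustrative placeholders, and each simulator is assumed to return the summary η(z) directly:

```python
import random

def abc_mc(y_summary, priors, simulators, rho, eps, T):
    """Likelihood-free model choice sampler (ABC-MC sketch).

    priors: list of functions, priors[m]() draws theta from pi_m
    simulators: list of functions, simulators[m](theta) returns eta(z), z ~ f_m(.|theta)
    rho: distance between summary statistics; eps: tolerance
    Returns the accepted (m, theta) pairs, one per iteration.
    """
    M = len(priors)
    out = []
    for _ in range(T):
        while True:
            m = random.randrange(M)       # m ~ pi(M = m), uniform here
            theta = priors[m]()           # theta_m ~ pi_m
            z = simulators[m](theta)      # eta(z) with z ~ f_m(.|theta_m)
            if rho(z, y_summary) < eps:   # accept when within tolerance
                break
        out.append((m, theta))
    return out
```

Each iteration keeps simulating until a draw lands within tolerance ε of the observed summary, so the accepted model frequencies approximate π(M = m|y).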
ABC estimates
Posterior probability π(M = m|y) approximated by the frequency
of acceptances from model m
(1/T) ∑_{t=1}^T I_{m^(t) = m} .
Issues with implementation:
• should tolerances be the same for all models?
• should summary statistics vary across models (incl. their
dimension)?
• should the distance measure ρ vary as well?
Extension to a weighted polychotomous logistic regression estimate
of π(M = m|y), with non-parametric kernel weights
[Cornuet et al., DIYABC, 2009]
The Great ABC controversy
On-going controversy in phylogeographic genetics about the validity
of using ABC for testing
Against: Templeton, 2008,
2009, 2010a, 2010b, 2010c
argues that nested hypotheses
cannot have higher probabilities
than nesting hypotheses (!)
Replies: Fagundes et al., 2008,
Beaumont et al., 2010, Berger et
al., 2010, Csilléry et al., 2010
point out that the criticisms are
addressed at [Bayesian]
model-based inference and have
nothing to do with ABC...
Gibbs random fields
Gibbs distribution
The rv y = (y_1, . . . , y_n) is a Gibbs random field associated with
the graph G if
f(y) = (1/Z) exp{ −∑_{c∈C} V_c(y_c) } ,
where Z is the normalising constant, C is the set of cliques of G
and V_c is an arbitrary function, also called the potential; the
sufficient statistic U(y) = ∑_{c∈C} V_c(y_c) is the energy function
Note: Z is usually unavailable in closed form
Potts model
Potts model
V_c(y) is of the form
V_c(y) = θ S(y) = θ ∑_{l∼i} δ_{y_l = y_i}
where l∼i denotes a neighbourhood structure
In most realistic settings, the summation
Z_θ = ∑_{x∈X} exp{θ^T S(x)}
involves too many terms to be manageable and numerical
approximations cannot always be trusted
[Cucala, Marin, CPR & Titterington, 2009]
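To make the combinatorial blow-up concrete, brute-force evaluation of Z_θ is only possible for a handful of sites, since the sum ranges over q^n configurations. A hypothetical sketch (the graph is passed as an edge list, all names illustrative):

```python
import itertools
import math

def potts_logZ(theta, n_sites, q, neighbours):
    """Brute-force log normalising constant of a q-state Potts model.

    neighbours: list of (i, l) index pairs encoding the l~i relation.
    Enumerates all q**n_sites configurations -- feasible only for tiny n.
    """
    Z = 0.0
    for x in itertools.product(range(q), repeat=n_sites):
        S = sum(1 for (i, l) in neighbours if x[i] == x[l])  # S(x)
        Z += math.exp(theta * S)
    return math.log(Z)
```

On a 10×10 binary grid the sum would already have 2^100 terms, which is why ABC sidesteps computing Z_θ altogether.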
Bayesian Model Choice
Comparing a model with potential S0 taking values in Rp0 versus a
model with potential S1 taking values in Rp1 can be done through
the Bayes factor corresponding to the priors π0 and π1 on each
parameter space
B_{m0/m1}(x) = [ ∫ exp{θ_0^T S_0(x)}/Z_{θ_0,0} π_0(dθ_0) ] / [ ∫ exp{θ_1^T S_1(x)}/Z_{θ_1,1} π_1(dθ_1) ]
Use of Jeffreys’ scale to select most appropriate model
Neighbourhood relations
Choice to be made between M neighbourhood relations
i ∼_m i′ (0 ≤ m ≤ M − 1)
with
S_m(x) = ∑_{i ∼_m i′} I_{x_i = x_{i′}}
driven by the posterior probabilities of the models.
Model index
Formalisation via a model index M that appears as a new
parameter with prior distribution π(M = m) and
π(θ|M = m) = πm(θm)
Computational target:
P(M = m|x) ∝ ∫_{Θ_m} f_m(x|θ_m) π_m(θ_m) dθ_m × π(M = m) ,
Sufficient statistics
By definition, if S(x) is a sufficient statistic for the joint parameters
(M, θ_0, . . . , θ_{M−1}), then
P(M = m|x) = P(M = m|S(x)) .
If each model m has its own sufficient statistic S_m(·), then
S(·) = (S_0(·), . . . , S_{M−1}(·)) is also sufficient.
Sufficient statistics in Gibbs random fields
For Gibbs random fields,
x|M = m ∼ f_m(x|θ_m) = f¹_m(x|S(x)) f²_m(S(x)|θ_m)
                     = (1/n(S(x))) f²_m(S(x)|θ_m)
where
n(S(x)) = card{ x̃ ∈ X : S(x̃) = S(x) }
Note: S(x) is therefore also sufficient for the joint parameters
[Specific to Gibbs random fields!]
ABC model choice algorithm
ABC-MC
• Generate m∗ from the prior π(M = m).
• Generate θ∗_{m∗} from the prior π_{m∗}(·).
• Generate x∗ from the model f_{m∗}(·|θ∗_{m∗}).
• Compute the distance ρ(S(x⁰), S(x∗)).
• Accept (θ∗_{m∗}, m∗) if ρ(S(x⁰), S(x∗)) < ε.
Note: when ε = 0 the algorithm is exact
ABC approximation to the Bayes factor
Frequency ratio:
B̂F_{m0/m1}(x⁰) = [ P̂(M = m0|x⁰) / P̂(M = m1|x⁰) ] × [ π(M = m1) / π(M = m0) ]
               = [ ♯{m^(t) = m0} / ♯{m^(t) = m1} ] × [ π(M = m1) / π(M = m0) ] ,
replaced with
B̂F_{m0/m1}(x⁰) = [ (1 + ♯{m^(t) = m0}) / (1 + ♯{m^(t) = m1}) ] × [ π(M = m1) / π(M = m0) ]
to avoid indeterminacy (also a Bayes estimate).
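The regularised frequency estimate is immediate to code; this sketch (names illustrative) takes the accepted model indices from an ABC-MC run:

```python
def abc_bayes_factor(model_draws, m0, m1, prior_prob):
    """ABC frequency estimate of BF_{m0/m1} with the +1 regularisation.

    model_draws: accepted model indices m^(t) from an ABC-MC run
    prior_prob: dict mapping model index to pi(M = m)
    """
    n0 = sum(1 for m in model_draws if m == m0)
    n1 = sum(1 for m in model_draws if m == m1)
    # +1 in numerator and denominator avoids 0/0 and division by zero
    return (1 + n0) / (1 + n1) * prior_prob[m1] / prior_prob[m0]
```

With no accepted draw at all and equal priors, the estimate defaults to 1, which is the indeterminacy the +1 correction is designed to avoid.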
Toy example
iid Bernoulli model versus two-state first-order Markov chain, i.e.
f_0(x|θ_0) = exp{ θ_0 ∑_{i=1}^n I_{x_i=1} } / {1 + exp(θ_0)}^n ,
versus
f_1(x|θ_1) = (1/2) exp{ θ_1 ∑_{i=2}^n I_{x_i = x_{i−1}} } / {1 + exp(θ_1)}^{n−1} ,
with priors θ_0 ∼ U(−5, 5) and θ_1 ∼ U(0, 6) (inspired by “phase
transition” boundaries).
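Simulating from the two competing models is straightforward; in this sketch (function names illustrative) the success/stay probability exp(θ)/(1 + exp(θ)) follows from the forms of f_0 and f_1 above:

```python
import math
import random

def simulate_bernoulli(theta0, n, rng):
    """iid model f_0: P(x_i = 1) = exp(theta0)/(1 + exp(theta0))."""
    p = math.exp(theta0) / (1 + math.exp(theta0))
    return [1 if rng.random() < p else 0 for _ in range(n)]

def simulate_markov(theta1, n, rng):
    """Two-state chain f_1: x_1 uniform (the 1/2 factor),
    then P(x_i = x_{i-1}) = exp(theta1)/(1 + exp(theta1))."""
    p_stay = math.exp(theta1) / (1 + math.exp(theta1))
    x = [rng.randrange(2)]
    for _ in range(n - 1):
        x.append(x[-1] if rng.random() < p_stay else 1 - x[-1])
    return x
```

The natural summary statistics here are ∑ I{x_i = 1} for f_0 and ∑ I{x_i = x_{i−1}} for f_1, which is exactly the cross-model sufficiency issue discussed below.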
Toy example (2)
[Two scatterplots.] (left) Comparison of the true BF_{m0/m1}(x⁰) with the ABC
approximation B̂F_{m0/m1}(x⁰) (in logs) over 2,000 simulations and 4×10⁶
proposals from the prior. (right) Same when using a tolerance ε corresponding
to the 1% quantile on the distances.
Back to sufficiency
‘Sufficient statistics for individual models are unlikely to
be very informative for the model probability.’
[Scott Sisson, Jan. 31, 2011, X.’Og]
If η_1(x) is a sufficient statistic for model m = 1 and parameter θ_1, and
η_2(x) is a sufficient statistic for model m = 2 and parameter θ_2, then
(η_1(x), η_2(x)) is not always sufficient for (m, θ_m)
Note: potential loss of information at the testing level
Limiting behaviour of B12 (T → ∞)
ABC approximation
B̂_{12}(y) = [ ∑_{t=1}^T I_{m_t=1} I_{ρ{η(z_t),η(y)}≤ε} ] / [ ∑_{t=1}^T I_{m_t=2} I_{ρ{η(z_t),η(y)}≤ε} ] ,
where the (m_t, z_t)’s are simulated from the (joint) prior
As T goes to infinity, limit
B^ε_{12}(y) = [ ∫ I_{ρ{η(z),η(y)}≤ε} π_1(θ_1) f_1(z|θ_1) dz dθ_1 ] / [ ∫ I_{ρ{η(z),η(y)}≤ε} π_2(θ_2) f_2(z|θ_2) dz dθ_2 ]
            = [ ∫ I_{ρ{η,η(y)}≤ε} π_1(θ_1) f^η_1(η|θ_1) dη dθ_1 ] / [ ∫ I_{ρ{η,η(y)}≤ε} π_2(θ_2) f^η_2(η|θ_2) dη dθ_2 ] ,
where f^η_1(η|θ_1) and f^η_2(η|θ_2) are the distributions of η(z)
Limiting behaviour of B12 (ε → 0)
When ε goes to zero,
B^η_{12}(y) = [ ∫ π_1(θ_1) f^η_1(η(y)|θ_1) dθ_1 ] / [ ∫ π_2(θ_2) f^η_2(η(y)|θ_2) dθ_2 ] ,
Note: a Bayes factor based on the sole observation of η(y)
Limiting behaviour of B12 (under sufficiency)
If η(y) is a sufficient statistic for both models,
f_i(y|θ_i) = g_i(y) f^η_i(η(y)|θ_i)
Thus
B_{12}(y) = [ ∫_{Θ_1} π(θ_1) g_1(y) f^η_1(η(y)|θ_1) dθ_1 ] / [ ∫_{Θ_2} π(θ_2) g_2(y) f^η_2(η(y)|θ_2) dθ_2 ]
          = [ g_1(y) ∫ π_1(θ_1) f^η_1(η(y)|θ_1) dθ_1 ] / [ g_2(y) ∫ π_2(θ_2) f^η_2(η(y)|θ_2) dθ_2 ]
          = [ g_1(y) / g_2(y) ] B^η_{12}(y) .
[Didelot, Everitt, Johansen & Lawson, 2011]
Note: no discrepancy only under cross-model sufficiency
Poisson/geometric example
Sample
x = (x_1, . . . , x_n)
from either a Poisson P(λ) or from a geometric G(p). Then
S = ∑_{i=1}^n x_i = η(x)
is a sufficient statistic for either model but not simultaneously.
Discrepancy ratio
g_1(x)/g_2(x) = [ S! n^{−S} / ∏_i x_i! ] / [ 1 / C(n+S−1, S) ]
with C(n+S−1, S) the binomial coefficient.
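The discrepancy ratio is best evaluated in logs for numerical stability; the decomposition g_1(x) = S! n^{−S} / ∏ x_i! and g_2(x) = 1/C(n+S−1, S) follows from S being Poisson(nλ) under the first model and negative binomial under the second. A sketch, assuming this reconstruction of the ratio:

```python
import math

def discrepancy_ratio(x):
    """g_1(x)/g_2(x) for the Poisson vs geometric pair, computed in logs."""
    n, S = len(x), sum(x)
    # log g_1(x) = log S! - S log n - sum_i log x_i!
    log_g1 = (math.lgamma(S + 1) - S * math.log(n)
              - sum(math.lgamma(xi + 1) for xi in x))
    # log g_2(x) = -log C(n+S-1, S)
    log_g2 = -(math.lgamma(n + S) - math.lgamma(S + 1) - math.lgamma(n))
    return math.exp(log_g1 - log_g2)
```

For instance x = (1, 1) gives g_1 = 2!·2^{−2} = 1/2 and g_2 = 1/C(3, 2) = 1/3, hence a ratio of 3/2.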
Poisson/geometric discrepancy
Range of B_{12}(x) versus B^η_{12}(x): the values produced have
nothing in common.
Formal recovery
Creating an encompassing exponential family
f(x|θ_1, θ_2, α_1, α_2) ∝ exp{ θ_1^T η_1(x) + θ_2^T η_2(x) + α_1 t_1(x) + α_2 t_2(x) }
leads to a sufficient statistic (η_1(x), η_2(x), t_1(x), t_2(x))
[Didelot, Everitt, Johansen & Lawson, 2011]
In the Poisson/geometric case, if ∏_i x_i! is added to S, there is no
discrepancy
Only applies in genuine sufficiency settings...
Note: inability to evaluate the loss brought by the summary statistics
Meaning of the ABC-Bayes factor
‘This is also why focus on model discrimination typically
(...) proceeds by (...) accepting that the Bayes Factor
that one obtains is only derived from the summary
statistics and may in no way correspond to that of the
full model.’
[Scott Sisson, Jan. 31, 2011, X.’Og]
In the Poisson/geometric case, if E[y_i] = θ_0 > 0,
lim_{n→∞} B^η_{12}(y) = [ (θ_0 + 1)² / θ_0 ] e^{−θ_0}
MA(q) divergence
[Four barplots.] Evolution [against ε] of the ABC Bayes factor, in terms of
frequencies of visits to models MA(1) (left) and MA(2) (right) when ε is equal
to the 10, 1, 0.1, 0.01% quantiles on insufficient autocovariance distances.
Sample of 50 points from an MA(2) model with θ_1 = 0.6, θ_2 = 0.2. True Bayes
factor equal to 17.71.
MA(q) divergence
[Four barplots.] Evolution [against ε] of the ABC Bayes factor, in terms of
frequencies of visits to models MA(1) (left) and MA(2) (right) when ε is equal
to the 10, 1, 0.1, 0.01% quantiles on insufficient autocovariance distances.
Sample of 50 points from an MA(1) model with θ_1 = 0.6. True Bayes factor B21
equal to .004.
Further comments
‘There should be the possibility that for the same model,
but different (non-minimal) [summary] statistics (so
different η’s: η_1 and η_1∗) the ratio of evidences may no
longer be equal to one.’
[Michael Stumpf, Jan. 28, 2011, ’Og]
Using different summary statistics [on different models] may reveal the
loss of information brought by each set, but agreement between them does
not guarantee trustworthy approximations.
A stylised problem
Central question to the validation of ABC for model choice:
When is a Bayes factor based on an insufficient statistic T(y)
consistent?
Note/warning: the conclusion drawn on T(y) through B^T_{12}(y) necessarily
differs from the conclusion drawn on y through B_{12}(y)
[Marin, Pillai, X, & Rousseau, JRSS B, 2013]
A benchmark toy example
Comparison suggested by a referee of the PNAS paper [thanks!]:
[X, Cornuet, Marin, & Pillai, Aug. 2011]
Model M1: y ∼ N(θ_1, 1), opposed to model M2: y ∼ L(θ_2, 1/√2),
the Laplace distribution with mean θ_2 and scale parameter 1/√2
(variance one).
Four possible statistics
1 sample mean ȳ (sufficient for M1 but not M2);
2 sample median med(y) (insufficient);
3 sample variance var(y) (ancillary);
4 median absolute deviation mad(y) = med(|y − med(y)|);
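The four candidate statistics can be computed with the standard library alone; a minimal sketch (the comments restate the sufficiency/ancillarity labels from the list above):

```python
import statistics

def benchmark_summaries(y):
    """The four candidate statistics for the Gauss vs Laplace benchmark."""
    med = statistics.median(y)
    return {
        "mean": statistics.fmean(y),          # sufficient for M1 but not M2
        "median": med,                        # insufficient
        "variance": statistics.pvariance(y),  # ancillary
        "mad": statistics.median(abs(v - med) for v in y),  # median absolute deviation
    }
```

The point of the benchmark is that ABC model choice based on (mean, median, variance) behaves very differently from ABC based on statistics, like mad, that actually discriminate between the two models.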
[Two boxplot panels, labelled Gauss and Laplace, n = 100, comparing the
resulting ABC approximations under two collections of summary statistics.]
Framework
Starting from the observed sample
y = (y_1, . . . , y_n) ,
not necessarily iid, with true distribution y ∼ P^n.
Summary statistics
T(y) = T^n = (T_1(y), T_2(y), · · · , T_d(y)) ∈ R^d
with true distribution T^n ∼ G^n.
Framework
Comparison of
– under M1, y ∼ F_{1,n}(·|θ_1) where θ_1 ∈ Θ_1 ⊂ R^{p_1}
– under M2, y ∼ F_{2,n}(·|θ_2) where θ_2 ∈ Θ_2 ⊂ R^{p_2}
turned into
– under M1, T(y) ∼ G_{1,n}(·|θ_1), and θ_1|T(y) ∼ π_1(·|T^n)
– under M2, T(y) ∼ G_{2,n}(·|θ_2), and θ_2|T(y) ∼ π_2(·|T^n)
Assumptions
A collection of asymptotic “standard” assumptions:
[A1] is a standard central limit theorem under the true model with asymptotic mean µ_0
[A2] controls the large deviations of the estimator T^n from the model mean µ(θ)
[A3] is the standard prior mass condition found in Bayesian asymptotics (d_i effective dimension of the parameter)
[A4] restricts the behaviour of the model density against the true density
[Think CLT!]
Asymptotic marginals
Asymptotically, under [A1]–[A4]
m_i(t) = ∫_{Θ_i} g_i(t|θ_i) π_i(θ_i) dθ_i
is such that
(i) if inf{|µ_i(θ_i) − µ_0|; θ_i ∈ Θ_i} = 0,
C_l v_n^{d−d_i} ≤ m_i(T^n) ≤ C_u v_n^{d−d_i}
and
(ii) if inf{|µ_i(θ_i) − µ_0|; θ_i ∈ Θ_i} > 0,
m_i(T^n) = o_{P^n}[ v_n^{d−τ_i} + v_n^{d−α_i} ] .
Between-model consistency
A consequence of the above is that the asymptotic behaviour of the Bayes
factor is driven by the asymptotic mean value µ(θ) of T^n under both
models. And only by this mean value!
Indeed, if
inf{|µ_0 − µ_2(θ_2)|; θ_2 ∈ Θ_2} = inf{|µ_0 − µ_1(θ_1)|; θ_1 ∈ Θ_1} = 0
then
C_l v_n^{−(d_1−d_2)} ≤ m_1(T^n) / m_2(T^n) ≤ C_u v_n^{−(d_1−d_2)} ,
where C_l, C_u = O_{P^n}(1), irrespective of the true model.
Note: the ratio only depends on the difference d_1 − d_2: no consistency
Else, if
inf{|µ_0 − µ_2(θ_2)|; θ_2 ∈ Θ_2} > inf{|µ_0 − µ_1(θ_1)|; θ_1 ∈ Θ_1} = 0
then
m_1(T^n) / m_2(T^n) ≥ C_u min{ v_n^{−(d_1−α_2)}, v_n^{−(d_1−τ_2)} }
Checking for adequate statistics
Run a practical check of the relevance (or non-relevance) of T^n: test the
null hypothesis that both models are compatible with the statistic T^n,
H0 : inf{|µ_2(θ_2) − µ_0|; θ_2 ∈ Θ_2} = 0
against
H1 : inf{|µ_2(θ_2) − µ_0|; θ_2 ∈ Θ_2} > 0
The testing procedure provides estimates of the mean of T^n under each
model and checks for equality.
Checking in practice
• Under each model M_i, generate an ABC sample θ_{i,l}, l = 1, · · · , L
• For each θ_{i,l}, generate y_{i,l} ∼ F_{i,n}(·|θ_{i,l}), derive T^n(y_{i,l}) and
compute
µ̂_i = (1/L) ∑_{l=1}^L T^n(y_{i,l}) , i = 1, 2 .
• Conditionally on T^n(y),
√L { µ̂_i − E^π[µ_i(θ_i)|T^n(y)] } ⇝ N(0, V_i) ,
• Test for a common mean
H0 : µ̂_1 ∼ N(µ_0, V_1) , µ̂_2 ∼ N(µ_0, V_2)
against the alternative of different means
H1 : µ̂_i ∼ N(µ_i, V_i), with µ_1 ≠ µ_2 .
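In the one-dimensional case the common-mean test reduces to a standard two-sample z-type statistic; a hypothetical sketch of the check (the multivariate version in the slides compares the means through a normalised χ² statistic instead):

```python
import math
import statistics

def mean_check(t1, t2):
    """Two-sample z-type check that the T^n means agree across models.

    t1, t2: simulated summary values T^n(y_{i,l}) under models 1 and 2
    (one-dimensional case). Returns the standardised mean difference;
    a large |z| suggests the models are not both compatible with T^n.
    """
    m1, m2 = statistics.fmean(t1), statistics.fmean(t2)
    v1 = statistics.pvariance(t1) / len(t1)  # estimated variance of mu_hat_1
    v2 = statistics.pvariance(t2) / len(t2)  # estimated variance of mu_hat_2
    return (m1 - m2) / math.sqrt(v1 + v2)
```

Comparing |z| to a normal quantile then accepts or rejects H0, i.e. decides whether T^n can discriminate between the two models at all.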
Toy example: Laplace versus Gauss
[Boxplots of the normalised χ² statistic, without and with mad, under the
Gauss and Laplace models.]
ABC short course: model choice chapter

  • 1. ABC for model choice 1 simulation-based methods in Econometrics 2 Genetics of ABC 3 Approximate Bayesian computation 4 ABC for model choice 5 ABC model choice via random forests 6 ABC estimation via random forests 7 [some] asymptotics of ABC
  • 2. Bayesian model choice Several models M1, M2, . . . are considered simultaneously for a dataset y and the model index M is part of the inference. Use of a prior distribution. π(M = m), plus a prior distribution on the parameter conditional on the value m of the model index, πm(θm) Goal is to derive the posterior distribution of M, challenging computational target when models are complex.
  • 3. Generic ABC for model choice Algorithm 4 Likelihood-free model choice sampler (ABC-MC) for t = 1 to T do repeat Generate m from the prior π(M = m) Generate θm from the prior πm(θm) Generate z from the model fm(z|θm) until ρ{η(z), η(y)} < Set m(t) = m and θ(t) = θm end for
  • 4. ABC estimates Posterior probability π(M = m|y) approximated by the frequency of acceptances from model m 1 T T t=1 Im(t)=m . Issues with implementation: • should tolerances be the same for all models? • should summary statistics vary across models (incl. their dimension)? • should the distance measure ρ vary as well?
  • 5. ABC estimates Posterior probability π(M = m|y) approximated by the frequency of acceptances from model m 1 T T t=1 Im(t)=m . Extension to a weighted polychotomous logistic regression estimate of π(M = m|y), with non-parametric kernel weights [Cornuet et al., DIYABC, 2009]
  • 6. The Great ABC controversy On-going controvery in phylogeographic genetics about the validity of using ABC for testing Against: Templeton, 2008, 2009, 2010a, 2010b, 2010c argues that nested hypotheses cannot have higher probabilities than nesting hypotheses (!)
  • 7. The Great ABC controversy On-going controvery in phylogeographic genetics about the validity of using ABC for testing Against: Templeton, 2008, 2009, 2010a, 2010b, 2010c argues that nested hypotheses cannot have higher probabilities than nesting hypotheses (!) Replies: Fagundes et al., 2008, Beaumont et al., 2010, Berger et al., 2010, Csill`ery et al., 2010 point out that the criticisms are addressed at [Bayesian] model-based inference and have nothing to do with ABC...
  • 8. Gibbs random fields Gibbs distribution The rv y = (y1, . . . , yn) is a Gibbs random field associated with the graph G if f (y) = 1 Z exp − c∈C Vc(yc) , where Z is the normalising constant, C is the set of cliques of G and Vc is any function also called potential sufficient statistic U(y) = c∈C Vc(yc) is the energy function
  • 9. Gibbs random fields Gibbs distribution The rv y = (y1, . . . , yn) is a Gibbs random field associated with the graph G if f (y) = 1 Z exp − c∈C Vc(yc) , where Z is the normalising constant, C is the set of cliques of G and Vc is any function also called potential sufficient statistic U(y) = c∈C Vc(yc) is the energy function c Z is usually unavailable in closed form
  • 10. Potts model Potts model Vc(y) is of the form Vc(y) = θS(y) = θ Σ_{l∼i} δ_{yl = yi} where l∼i denotes a neighbourhood structure
  • 11. Potts model Potts model Vc(y) is of the form Vc(y) = θS(y) = θ Σ_{l∼i} δ_{yl = yi} where l∼i denotes a neighbourhood structure In most realistic settings, the summation Zθ = Σ_{x∈X} exp{θᵀS(x)} involves too many terms to be manageable and numerical approximations cannot always be trusted [Cucala, Marin, CPR & Titterington, 2009]
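The Potts summary statistic S(x) counts equal-valued neighbouring pairs; a minimal sketch on a regular 2D grid with the usual 4-neighbour relation (an illustrative choice of neighbourhood structure, not prescribed by the slides):

```python
def potts_stat(x):
    """Potts summary statistic S(x) = Σ_{l∼i} δ_{x_l = x_i} on a 2D grid
    (list of rows), with a 4-neighbour (up/down/left/right) structure,
    each neighbouring pair counted once."""
    rows, cols = len(x), len(x[0])
    s = 0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows and x[i][j] == x[i + 1][j]:
                s += 1  # vertical neighbouring pair
            if j + 1 < cols and x[i][j] == x[i][j + 1]:
                s += 1  # horizontal neighbouring pair
    return s

# tiny two-colour configuration
x = [[0, 0, 1],
     [0, 1, 1],
     [0, 0, 0]]
```

Evaluating Zθ would require summing exp{θS(x)} over all |X| colourings, which is exactly the intractable part.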
  • 12. Bayesian Model Choice Comparing a model with potential S0 taking values in Rp0 versus a model with potential S1 taking values in Rp1 can be done through the Bayes factor corresponding to the priors π0 and π1 on each parameter space Bm0/m1(x) = ∫ exp{θ0ᵀS0(x)}/Z_{θ0,0} π0(dθ0) / ∫ exp{θ1ᵀS1(x)}/Z_{θ1,1} π1(dθ1)
  • 13. Bayesian Model Choice Comparing a model with potential S0 taking values in Rp0 versus a model with potential S1 taking values in Rp1 can be done through the Bayes factor corresponding to the priors π0 and π1 on each parameter space Bm0/m1(x) = ∫ exp{θ0ᵀS0(x)}/Z_{θ0,0} π0(dθ0) / ∫ exp{θ1ᵀS1(x)}/Z_{θ1,1} π1(dθ1) Use of Jeffreys' scale to select the most appropriate model
  • 14. Neighbourhood relations Choice to be made between M neighbourhood relations i ∼m i′ (0 ≤ m ≤ M − 1) with Sm(x) = Σ_{i′ ∼m i} I{xi = xi′} driven by the posterior probabilities of the models.
  • 15. Model index Formalisation via a model index M that appears as a new parameter with prior distribution π(M = m) and π(θ|M = m) = πm(θm)
  • 16. Model index Formalisation via a model index M that appears as a new parameter with prior distribution π(M = m) and π(θ|M = m) = πm(θm) Computational target: P(M = m|x) ∝ π(M = m) ∫_{Θm} fm(x|θm) πm(θm) dθm ,
  • 17. Sufficient statistics By definition, if S(x) is a sufficient statistic for the joint parameters (M, θ0, . . . , θM−1), then P(M = m|x) = P(M = m|S(x)) .
  • 18. Sufficient statistics By definition, if S(x) is a sufficient statistic for the joint parameters (M, θ0, . . . , θM−1), then P(M = m|x) = P(M = m|S(x)) . For each model m, its own sufficient statistic Sm(·), and S(·) = (S0(·), . . . , SM−1(·)) is also sufficient.
  • 19. Sufficient statistics in Gibbs random fields For Gibbs random fields, x|M = m ∼ fm(x|θm) = f¹m(x|S(x)) f²m(S(x)|θm) = (1/n(S(x))) f²m(S(x)|θm) where n(S(x)) = #{x̃ ∈ X : S(x̃) = S(x)} S(x) is therefore also sufficient for the joint parameters [Specific to Gibbs random fields!]
  • 20. ABC model choice Algorithm ABC-MC • Generate m∗ from the prior π(M = m). • Generate θ∗m∗ from the prior πm∗(·). • Generate x∗ from the model fm∗(·|θ∗m∗). • Compute the distance ρ(S(x0), S(x∗)). • Accept (θ∗m∗, m∗) if ρ(S(x0), S(x∗)) < ε. Note When ε = 0 the algorithm is exact
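The ABC-MC loop above can be sketched in a few lines. The concrete pair of models, priors (λ ∼ Exp(1), p ∼ U(0,1)) and summary statistic η(x) = Σᵢxᵢ below are illustrative assumptions chosen so the sketch is self-contained (they anticipate the Poisson/geometric example of the later slides), not the algorithm's prescribed inputs:

```python
import math
import random

def sample_poisson(lam, rng):
    """Poisson draw by sequential inversion (adequate for small lam)."""
    u, k, p = rng.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def sample_geometric(p, rng):
    """Number of failures before the first success, support {0, 1, 2, ...}."""
    return int(math.log(1 - rng.random()) / math.log(1 - p))

def abc_mc(x_obs, T=5000, eps=1.0, seed=0):
    """ABC-MC sketch for a two-model choice: m = 0 is Poisson P(lambda)
    with lambda ~ Exp(1), m = 1 is geometric G(p) with p ~ U(0, 1];
    summary statistic eta(x) = sum(x), distance rho = |.|.
    Returns the accepted model indices m(t)."""
    rng = random.Random(seed)
    n, s_obs = len(x_obs), sum(x_obs)
    accepted = []
    for _ in range(T):
        m = rng.randrange(2)                 # prior pi(M = m) = 1/2
        if m == 0:
            lam = rng.expovariate(1.0)
            z = [sample_poisson(lam, rng) for _ in range(n)]
        else:
            p = 1 - rng.random()             # p in (0, 1]
            z = [sample_geometric(p, rng) for _ in range(n)] if p < 1 else [0] * n
        if abs(sum(z) - s_obs) <= eps:       # rho{eta(z), eta(y)} <= eps
            accepted.append(m)
    return accepted
```

The frequency of each index among the accepted draws is the ABC estimate of π(M = m|y) from the preceding slides.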
  • 21. ABC approximation to the Bayes factor Frequency ratio: BFm0/m1(x⁰) = P̂(M = m0|x⁰)/P̂(M = m1|x⁰) × π(M = m1)/π(M = m0) = #{m(i∗) = m0}/#{m(i∗) = m1} × π(M = m1)/π(M = m0) ,
  • 22. ABC approximation to the Bayes factor Frequency ratio: BFm0/m1(x⁰) = P̂(M = m0|x⁰)/P̂(M = m1|x⁰) × π(M = m1)/π(M = m0) = #{m(i∗) = m0}/#{m(i∗) = m1} × π(M = m1)/π(M = m0) , replaced with BFm0/m1(x⁰) = (1 + #{m(i∗) = m0})/(1 + #{m(i∗) = m1}) × π(M = m1)/π(M = m0) to avoid indeterminacy (also a Bayes estimate).
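The corrected frequency ratio is a one-liner over the accepted model indices; equal model priors are an illustrative default here, not part of the slide's formula:

```python
def abc_bayes_factor(accepted, prior0=0.5, prior1=0.5):
    """Corrected ABC estimate of BF_{m0/m1}:
    (1 + #{m = m0}) / (1 + #{m = m1}) * pi(M = m1) / pi(M = m0),
    which stays finite even when one model is never accepted."""
    n0 = sum(1 for m in accepted if m == 0)
    n1 = len(accepted) - n0
    return (1 + n0) / (1 + n1) * (prior1 / prior0)
```

With no acceptances at all the estimate is the prior odds, which is the Bayes-estimate reading of the correction.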
  • 23. Toy example iid Bernoulli model versus two-state first-order Markov chain, i.e. f0(x|θ0) = exp(θ0 Σ_{i=1}^n I{xi = 1}) / {1 + exp(θ0)}^n , versus f1(x|θ1) = (1/2) exp(θ1 Σ_{i=2}^n I{xi = xi−1}) / {1 + exp(θ1)}^{n−1} , with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by “phase transition” boundaries).
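Both toy likelihoods are exponential families in a single count statistic, so they evaluate directly; a short sketch of the two densities as written on the slide:

```python
import math

def f0(x, theta0):
    """iid Bernoulli likelihood: exp(theta0 * #{x_i = 1}) / (1 + e^theta0)^n."""
    return math.exp(theta0 * sum(x)) / (1 + math.exp(theta0)) ** len(x)

def f1(x, theta1):
    """Two-state first-order Markov chain likelihood, uniform initial state:
    (1/2) exp(theta1 * #{x_i = x_{i-1}, i >= 2}) / (1 + e^theta1)^(n - 1)."""
    n = len(x)
    s = sum(x[i] == x[i - 1] for i in range(1, n))
    return 0.5 * math.exp(theta1 * s) / (1 + math.exp(theta1)) ** (n - 1)
```

At θ0 = θ1 = 0 both reduce to the uniform distribution on {0,1}ⁿ (up to the 1/2 initial-state factor), a quick sanity check on the normalising constants.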
  • 24. Toy example (2) [Figure: (left) Comparison of the true BFm0/m1(x⁰) with its ABC approximation (in logs) over 2,000 simulations and 4·10⁶ proposals from the prior. (right) Same when using a tolerance ε corresponding to the 1% quantile on the distances.]
  • 25. Back to sufficiency ‘Sufficient statistics for individual models are unlikely to be very informative for the model probability.’ [Scott Sisson, Jan. 31, 2011, X.’Og]
  • 26. Back to sufficiency ‘Sufficient statistics for individual models are unlikely to be very informative for the model probability.’ [Scott Sisson, Jan. 31, 2011, X.’Og] If η1(x) is a sufficient statistic for model m = 1 and parameter θ1 and η2(x) is a sufficient statistic for model m = 2 and parameter θ2, (η1(x), η2(x)) is not always sufficient for (m, θm)
  • 27. Back to sufficiency ‘Sufficient statistics for individual models are unlikely to be very informative for the model probability.’ [Scott Sisson, Jan. 31, 2011, X.’Og] If η1(x) is a sufficient statistic for model m = 1 and parameter θ1 and η2(x) is a sufficient statistic for model m = 2 and parameter θ2, (η1(x), η2(x)) is not always sufficient for (m, θm) Potential loss of information at the testing level
  • 28. Limiting behaviour of B12 (T → ∞) ABC approximation B12(y) = Σ_{t=1}^T I{mt = 1} I{ρ{η(zt), η(y)} ≤ ε} / Σ_{t=1}^T I{mt = 2} I{ρ{η(zt), η(y)} ≤ ε} , where the (mt, zt)’s are simulated from the (joint) prior
  • 29. Limiting behaviour of B12 (T → ∞) ABC approximation B12(y) = Σ_{t=1}^T I{mt = 1} I{ρ{η(zt), η(y)} ≤ ε} / Σ_{t=1}^T I{mt = 2} I{ρ{η(zt), η(y)} ≤ ε} , where the (mt, zt)’s are simulated from the (joint) prior As T goes to infinity, the limit is B12(y) = ∫ I{ρ{η(z), η(y)} ≤ ε} π1(θ1) f1(z|θ1) dz dθ1 / ∫ I{ρ{η(z), η(y)} ≤ ε} π2(θ2) f2(z|θ2) dz dθ2 = ∫ I{ρ{η, η(y)} ≤ ε} π1(θ1) f1^η(η|θ1) dη dθ1 / ∫ I{ρ{η, η(y)} ≤ ε} π2(θ2) f2^η(η|θ2) dη dθ2 , where f1^η(η|θ1) and f2^η(η|θ2) are the distributions of η(z)
  • 30. Limiting behaviour of B12 (ε → 0) When ε goes to zero, B12^η(y) = ∫ π1(θ1) f1^η(η(y)|θ1) dθ1 / ∫ π2(θ2) f2^η(η(y)|θ2) dθ2 ,
  • 31. Limiting behaviour of B12 (ε → 0) When ε goes to zero, B12^η(y) = ∫ π1(θ1) f1^η(η(y)|θ1) dθ1 / ∫ π2(θ2) f2^η(η(y)|θ2) dθ2 , the Bayes factor based on the sole observation of η(y)
  • 32. Limiting behaviour of B12 (under sufficiency) If η(y) is a sufficient statistic for both models, fi(y|θi) = gi(y) fi^η(η(y)|θi) Thus B12(y) = ∫_{Θ1} π(θ1) g1(y) f1^η(η(y)|θ1) dθ1 / ∫_{Θ2} π(θ2) g2(y) f2^η(η(y)|θ2) dθ2 = g1(y) ∫ π1(θ1) f1^η(η(y)|θ1) dθ1 / g2(y) ∫ π2(θ2) f2^η(η(y)|θ2) dθ2 = {g1(y)/g2(y)} B12^η(y) . [Didelot, Everitt, Johansen & Lawson, 2011]
  • 33. Limiting behaviour of B12 (under sufficiency) If η(y) is a sufficient statistic for both models, fi(y|θi) = gi(y) fi^η(η(y)|θi) Thus B12(y) = ∫_{Θ1} π(θ1) g1(y) f1^η(η(y)|θ1) dθ1 / ∫_{Θ2} π(θ2) g2(y) f2^η(η(y)|θ2) dθ2 = g1(y) ∫ π1(θ1) f1^η(η(y)|θ1) dθ1 / g2(y) ∫ π2(θ2) f2^η(η(y)|θ2) dθ2 = {g1(y)/g2(y)} B12^η(y) . [Didelot, Everitt, Johansen & Lawson, 2011] No discrepancy only under cross-model sufficiency
  • 34. Poisson/geometric example Sample x = (x1, . . . , xn) from either a Poisson P(λ) or from a geometric G(p) Then S = Σ_{i=1}^n xi = η(x) is a sufficient statistic for either model but not simultaneously Discrepancy ratio g1(x)/g2(x) = {S! n^{−S} / Π_i xi!} / {1 / C(n+S−1, S)}
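The discrepancy ratio is the ratio of the two conditional densities of x given S, multinomial under the Poisson model versus uniform over compositions under the geometric model; it evaluates directly:

```python
from math import comb, factorial, prod

def discrepancy_ratio(x):
    """g1(x) / g2(x) = [S! n^{-S} / prod_i x_i!] / [1 / C(n+S-1, S)]:
    the term by which B12(x) and the eta-based Bayes factor B12^eta(x)
    differ in the Poisson/geometric example."""
    n, S = len(x), sum(x)
    g1 = factorial(S) / (n ** S * prod(factorial(xi) for xi in x))
    g2 = 1 / comb(n + S - 1, S)
    return g1 / g2
```

The ratio varies with the configuration x and not only with S, which is exactly why the η-based Bayes factor loses information here.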
  • 35. Poisson/geometric discrepancy [Figure: Range of B12(x) versus B12^η(x): the values produced have nothing in common.]
  • 36. Formal recovery Creating an encompassing exponential family f(x|θ1, θ2, α1, α2) ∝ exp{θ1ᵀη1(x) + θ2ᵀη2(x) + α1t1(x) + α2t2(x)} leads to a sufficient statistic (η1(x), η2(x), t1(x), t2(x)) [Didelot, Everitt, Johansen & Lawson, 2011]
  • 37. Formal recovery Creating an encompassing exponential family f(x|θ1, θ2, α1, α2) ∝ exp{θ1ᵀη1(x) + θ2ᵀη2(x) + α1t1(x) + α2t2(x)} leads to a sufficient statistic (η1(x), η2(x), t1(x), t2(x)) [Didelot, Everitt, Johansen & Lawson, 2011] In the Poisson/geometric case, if Π_i xi! is added to S, there is no discrepancy
  • 38. Formal recovery Creating an encompassing exponential family f(x|θ1, θ2, α1, α2) ∝ exp{θ1ᵀη1(x) + θ2ᵀη2(x) + α1t1(x) + α2t2(x)} leads to a sufficient statistic (η1(x), η2(x), t1(x), t2(x)) [Didelot, Everitt, Johansen & Lawson, 2011] Only applies in genuine sufficiency settings... Inability to evaluate the loss brought by summary statistics
  • 39. Meaning of the ABC-Bayes factor ‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’ [Scott Sisson, Jan. 31, 2011, X.’Og]
  • 40. Meaning of the ABC-Bayes factor ‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’ [Scott Sisson, Jan. 31, 2011, X.’Og] In the Poisson/geometric case, if E[yi] = θ0 > 0, lim_{n→∞} B12^η(y) = (θ0 + 1)² θ0 e^{−θ0}
  • 41. MA(q) divergence [Figure: Evolution [against ε] of the ABC Bayes factor, in terms of frequencies of visits to models MA(1) (left) and MA(2) (right), when ε equals the 10, 1, .1, .01% quantiles on insufficient autocovariance distances. Sample of 50 points from a MA(2) with θ1 = 0.6, θ2 = 0.2. True Bayes factor equal to 17.71.]
  • 42. MA(q) divergence [Figure: Evolution [against ε] of the ABC Bayes factor, in terms of frequencies of visits to models MA(1) (left) and MA(2) (right), when ε equals the 10, 1, .1, .01% quantiles on insufficient autocovariance distances. Sample of 50 points from a MA(1) model with θ1 = 0.6. True Bayes factor B21 equal to .004.]
  • 43. Further comments ‘There should be the possibility that for the same model, but different (non-minimal) [summary] statistics (so different η’s: η1 and η1∗) the ratio of evidences may no longer be equal to one.’ [Michael Stumpf, Jan. 28, 2011, ’Og] Using different summary statistics [on different models] may indicate the loss of information brought by each set but agreement does not lead to trustworthy approximations.
  • 44. A stylised problem Central question to the validation of ABC for model choice: When is a Bayes factor based on an insufficient statistic T(y) consistent?
  • 45. A stylised problem Central question to the validation of ABC for model choice: When is a Bayes factor based on an insufficient statistic T(y) consistent? Note/warning: the inference drawn on T(y) through B12^T(y) necessarily differs from the inference drawn on y through B12(y) [Marin, Pillai, X, & Rousseau, JRSS B, 2013]
  • 46. A benchmark toy example Comparison suggested by a referee of the PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). Four possible statistics: 1. sample mean ȳ (sufficient for M1 if not M2); 2. sample median med(y) (insufficient); 3. sample variance var(y) (ancillary); 4. median absolute deviation mad(y) = med(|y − med(y)|);
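The four candidate summaries compute in one pass with the standard library; the dictionary layout is just a convenience for this sketch:

```python
import statistics

def benchmark_stats(y):
    """The four candidate summaries of the Normal-vs-Laplace benchmark:
    sample mean, sample median, sample variance, and the median absolute
    deviation mad(y) = med(|y - med(y)|)."""
    med = statistics.median(y)
    return {
        "mean": statistics.fmean(y),     # sufficient for M1, not M2
        "median": med,                   # insufficient
        "var": statistics.variance(y),   # ancillary under both models
        "mad": statistics.median([abs(v - med) for v in y]),
    }
```

Only mad discriminates well between the two models in the slides' experiment, which is the point of the benchmark.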
  • 47. A benchmark toy example Comparison suggested by a referee of the PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). [Figure: boxplots of the ABC posterior probabilities of the Gauss and Laplace models, n = 100.]
  • 48. Framework Starting from y = (y1, . . . , yn), the observed sample, not necessarily iid, with true distribution y ∼ Pn Summary statistics T(y) = Tn = (T1(y), T2(y), · · · , Td(y)) ∈ Rd with true distribution Tn ∼ Gn.
  • 49. Framework Comparison of – under M1, y ∼ F1,n(·|θ1) where θ1 ∈ Θ1 ⊂ Rp1 – under M2, y ∼ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ Rp2 turned into – under M1, T(y) ∼ G1,n(·|θ1), and θ1|T(y) ∼ π1(·|Tn) – under M2, T(y) ∼ G2,n(·|θ2), and θ2|T(y) ∼ π2(·|Tn)
  • 50. Assumptions A collection of asymptotic “standard” assumptions: [A1] is a standard central limit theorem under the true model with asymptotic mean µ0 [A2] controls the large deviations of the estimator Tn from the model mean µ(θ) [A3] is the standard prior mass condition found in Bayesian asymptotics (di effective dimension of the parameter) [A4] restricts the behaviour of the model density against the true density [Think CLT!]
  • 51. Asymptotic marginals Asymptotically, under [A1]–[A4] mi(t) = ∫_{Θi} gi(t|θi) πi(θi) dθi is such that (i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, Cl vn^{d−di} ≤ mi(Tn) ≤ Cu vn^{d−di} and (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0 mi(Tn) = oPn(vn^{d−τi} + vn^{d−αi}).
  • 52. Between-model consistency Consequence of above is that asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value µ(θ) of Tn under both models. And only by this mean value!
  • 53. Between-model consistency Consequence of above is that asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value µ(θ) of Tn under both models. And only by this mean value! Indeed, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then Cl vn^{−(d1−d2)} ≤ m1(Tn)/m2(Tn) ≤ Cu vn^{−(d1−d2)} , where Cl, Cu = OPn(1), irrespective of the true model. Only depends on the difference d1 − d2: no consistency
  • 54. Between-model consistency Consequence of above is that asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value µ(θ) of Tn under both models. And only by this mean value! Else, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then m1(Tn)/m2(Tn) ≥ Cu min(vn^{−(d1−α2)}, vn^{−(d1−τ2)})
  • 55. Checking for adequate statistics Run a practical check of the relevance (or non-relevance) of Tn: null hypothesis that both models are compatible with the statistic Tn, H0 : inf{|µ2(θ2) − µ0|; θ2 ∈ Θ2} = 0 against H1 : inf{|µ2(θ2) − µ0|; θ2 ∈ Θ2} > 0 The testing procedure provides estimates of the mean of Tn under each model and checks for equality
  • 56. Checking in practice • Under each model Mi, generate an ABC sample θi,l, l = 1, · · · , L • For each θi,l, generate yi,l ∼ Fi,n(·|θi,l), derive Tn(yi,l) and compute µ̂i = (1/L) Σ_{l=1}^L Tn(yi,l), i = 1, 2 . • Conditionally on Tn(y), √L {µ̂i − Eπ[µi(θi)|Tn(y)]} ⇝ N(0, Vi), • Test for a common mean H0 : µ̂1 ∼ N(µ0, V1) , µ̂2 ∼ N(µ0, V2) against the alternative of different means H1 : µ̂i ∼ N(µi, Vi), with µ1 ≠ µ2 .
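For a scalar statistic Tn, the final common-mean test reduces to comparing two approximately normal estimates; the χ²(1) form below is a standard reduction for d = 1 (an assumption of this sketch, not the slides' general multivariate procedure):

```python
import math

def common_mean_test(mu1, v1, mu2, v2):
    """Scalar common-mean check: under H0 the two ABC-based estimates
    mu1 ~ N(mu0, V1) and mu2 ~ N(mu0, V2) are independent with equal means,
    so t = (mu1 - mu2)^2 / (V1 + V2) is approximately chi^2 with 1 df.
    Returns (t, tail probability), using P(chi2_1 > t) = erfc(sqrt(t / 2))."""
    t = (mu1 - mu2) ** 2 / (v1 + v2)
    return t, math.erfc(math.sqrt(t / 2))
```

A small tail probability flags that at least one model cannot reproduce the observed statistic, i.e. that Tn is relevant for discriminating between them.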
  • 57. Toy example: Laplace versus Gauss [Figure: boxplots of the normalised χ² statistic without and with mad, under the Gauss and Laplace models.]