Monte Carlo in Montréal 2017

ABC: from convergence guarantees to automated
implementation
Christian P. Robert
Université Paris-Dauphine PSL, Paris & University of Warwick,
Coventry
Joint work with A. Estoup, J.M. Marin, P. Pudlo, L. Raynal, & M. Ribatet

Outline
motivating example
Approximate Bayesian computation
ABC for model choice
ABC model choice via random forests
ABC estimation via random forests
[some] asymptotics of ABC

A motivating if pedestrian example
paired and orphan socks
A drawer contains an unknown number of socks, some of which
can be paired and some of which are orphans (single). One takes
at random 11 socks without replacement from this drawer: no pair
can be found among those. What can we infer about the total
number of socks in the drawer?
sounds like an impossible task
one observation x = 11 and two unknowns, nsocks and npairs
writing the likelihood is a challenge [exercise]

Feller’s shoes
A closet contains n pairs of shoes. If 2r shoes are chosen
at random (with 2r < n), what is the probability that
there will be (a) no complete pair, (b) exactly one
complete pair, (c) exactly two complete pairs among
them?
[Feller, 1970, Chapter II, Exercise 26]

Feller’s shoes
Resolution as
$$p_j = \binom{n}{j}\, 2^{2r-2j} \binom{n-j}{2r-2j} \Big/ \binom{2n}{2r}$$
being the probability of obtaining j pairs among those 2r shoes, or, for
an odd number t of shoes,
$$p_j = 2^{t-2j} \binom{n}{j} \binom{n-j}{t-2j} \Big/ \binom{2n}{t}$$

Feller’s shoes
If one draws 11 socks out of m socks made of f orphans and g
pairs, with f + 2g = m, the number k of socks from the orphan group
is hypergeometric H(11, m, f) and the probability to observe 11
orphan socks in total is
$$\sum_{k=0}^{11} \frac{\binom{f}{k}\binom{2g}{11-k}}{\binom{m}{11}} \times \frac{2^{11-k}\binom{g}{11-k}}{\binom{2g}{11-k}}$$
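The two closed forms above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `feller_pairs_prob` and `prob_no_pair` are illustrative and not part of the slides, and the sock formula is coded after cancelling the common factor $\binom{2g}{11-k}$.

```python
from math import comb


def feller_pairs_prob(n, t, j):
    """P(exactly j complete pairs) when t shoes are drawn from n pairs."""
    if j > n or t - 2 * j < 0 or t > 2 * n:
        return 0.0
    # choose the j complete pairs, then t - 2j further pairs contributing
    # a single shoe each (left or right: 2 choices per pair)
    return comb(n, j) * 2 ** (t - 2 * j) * comb(n - j, t - 2 * j) / comb(2 * n, t)


def prob_no_pair(f, g, t=11):
    """P(no pair among t socks) with f orphans and g pairs (m = f + 2g)."""
    m = f + 2 * g
    if m < t:
        return 0.0
    total = 0
    for k in range(t + 1):  # k socks come from the orphan group
        total += comb(f, k) * 2 ** (t - k) * comb(g, t - k)
    return total / comb(m, t)


# probabilities over the number of pairs sum to one (n = 10 pairs, t = 7 shoes)
total_mass = sum(feller_pairs_prob(10, 7, j) for j in range(4))
```

With no orphans (f = 0) the sock probability reduces to Feller's $p_0$, which gives a quick consistency check between the two formulas.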

A prioris on socks
Given parameters n_socks and n_pairs, set of socks
$$S = \{s_1, s_1, \ldots, s_{n_{pairs}}, s_{n_{pairs}}, s_{n_{pairs}+1}, \ldots, s_{n_{socks}}\}$$
and 11 socks picked at random from S give X unique socks.
Rasmus’ reasoning
If you are a family of 3-4 persons then a guesstimate would be that
you have something like 15 pairs of socks in store. It is also
possible that you have much more than 30 socks. So as a prior for
n_socks I’m going to use a negative binomial with mean 30 and
standard deviation 15.
On the proportion of paired socks, 2 n_pairs/n_socks, I’m going to put
a Beta prior distribution that puts most of the probability over the
range 0.75 to 1.0,
[Rasmus Bååth’s Research Blog, Oct 20th, 2014]

Simulating the experiment
Given a prior distribution on n_socks and n_pairs,
$$n_{socks} \sim \mathcal{N}eg(30, 15) \qquad n_{pairs} \mid n_{socks} \sim \frac{n_{socks}}{2}\, \mathcal{B}e(15, 2)$$
possible to
1. generate new values of n_socks and n_pairs,
2. generate a new observation of X, number of unique socks out of 11,
3. accept the pair (n_socks, n_pairs) if the realisation of X is
equal to 11
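The three steps above can be sketched in Python with the standard library alone. This is an illustration under stated assumptions, not the original implementation: the negative binomial with mean 30 and standard deviation 15 is drawn as a gamma-Poisson mixture, n_pairs is obtained by rounding (n_socks/2) times the Beta draw, and the helper names (`poisson`, `draw_prior`, `unique_socks`, `abc_socks`) are made up for this example.

```python
import math
import random


def poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for moderate rates)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def draw_prior(rng, mean=30.0, sd=15.0):
    """Step 1: n_socks ~ negative binomial (mean 30, sd 15), n_pairs via Be(15, 2)."""
    size = mean ** 2 / (sd ** 2 - mean)  # variance = mean + mean^2 / size
    n_socks = poisson(rng.gammavariate(size, mean / size), rng)
    prop_pairs = rng.betavariate(15, 2)
    n_pairs = round((n_socks // 2) * prop_pairs)  # rounding is an assumption
    return n_socks, n_pairs


def unique_socks(n_socks, n_pairs, n_picked, rng):
    """Step 2: pick socks without replacement and count distinct labels."""
    n_odd = n_socks - 2 * n_pairs
    drawer = [i for i in range(n_pairs) for _ in range(2)]
    drawer += list(range(n_pairs, n_pairs + n_odd))
    return len(set(rng.sample(drawer, n_picked)))


def abc_socks(n_sims=20000, n_picked=11, seed=1):
    """Step 3: keep the prior draws whose simulated X equals n_picked."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_sims):
        n_socks, n_pairs = draw_prior(rng)
        if n_socks < n_picked:
            continue  # cannot draw 11 socks from a smaller drawer
        if unique_socks(n_socks, n_pairs, n_picked, rng) == n_picked:
            accepted.append((n_socks, n_pairs))
    return accepted


accepted = abc_socks(n_sims=5000)
```

The accepted pairs form a sample from the conditional distribution of (n_socks, n_pairs) given X = 11, so summaries such as the median of the accepted n_socks values can be read off directly.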

Meaning
[Figure: density of ns, horizontal axis from 0 to 60, vertical axis from 0.00 to 0.06]
The outcome of this simulation method returns a distribution on
the pair (n_socks, n_pairs) that is the conditional distribution of the
pair given the observation X = 11
Proof: Generations from π(n_socks, n_pairs) are accepted with probability
$$P\{X = 11 \mid (n_{socks}, n_{pairs})\}$$
Hence accepted values are distributed from
$$\pi(n_{socks}, n_{pairs}) \times P\{X = 11 \mid (n_{socks}, n_{pairs})\} \propto \pi(n_{socks}, n_{pairs} \mid X = 11)$$

motivating example
ABC basics
Automated summary selection

Intractable likelihoods
Cases when the likelihood function
f(y|θ) is unavailable and when the
completion step
$$f(y \mid \theta) = \int f(y, z \mid \theta)\, dz$$
is impossible or too costly because of
the dimension of z
© MCMC cannot be implemented

The ABC method
Bayesian setting: target is π(θ)f(x|θ)
When likelihood f(x|θ) not in closed form, likelihood-free rejection
technique:
ABC algorithm
For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly
simulating
$$\theta' \sim \pi(\theta), \quad z \sim f(z \mid \theta'),$$
until the auxiliary variable z is equal to the observed value, z = y.
[Tavaré et al., 1997]

A as A...pproximative
When y is a continuous random variable, equality z = y is
replaced with a tolerance condition,
$$\rho(y, z) \le \varepsilon$$
where ρ is a distance
Output distributed from
$$\pi(\theta)\, P_\theta\{\rho(y, z) < \varepsilon\} \propto \pi(\theta \mid \rho(y, z) < \varepsilon)$$
[Pritchard et al., 1999]

ABC algorithm
Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
  repeat
    generate θ′ from the prior distribution π(·)
    generate z from the likelihood f(·|θ′)
  until ρ{η(z), η(y)} ≤ ε
  set θ_i = θ′
end for
where η(y) defines a (not necessarily sufficient) statistic
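Algorithm 1 translates almost line by line into code. The sketch below is a generic illustration in standard-library Python; the function `abc_rejection` and the toy Normal-mean model (prior N(0, 2²), unit variance, sample mean as summary, tolerance 0.1) are assumptions chosen for the example and do not come from the slides.

```python
import random
import statistics


def abc_rejection(y_obs, prior_sampler, simulator, summary, distance,
                  eps, n_accept, rng):
    """Likelihood-free rejection sampler: returns n_accept accepted values."""
    s_obs = summary(y_obs)
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sampler(rng)        # generate theta' from the prior
        z = simulator(theta, rng)         # generate z from f(.|theta')
        if distance(summary(z), s_obs) <= eps:
            accepted.append(theta)        # keep theta' when z is close to y
    return accepted


# toy illustration (hypothetical model): Normal mean with known unit variance
rng = random.Random(42)
y_obs = [rng.gauss(1.0, 1.0) for _ in range(20)]
sample = abc_rejection(
    y_obs,
    prior_sampler=lambda r: r.gauss(0.0, 2.0),
    simulator=lambda theta, r: [r.gauss(theta, 1.0) for _ in range(len(y_obs))],
    summary=statistics.fmean,
    distance=lambda a, b: abs(a - b),
    eps=0.1,
    n_accept=200,
    rng=rng,
)
```

Since the sample mean is sufficient in this toy model, the accepted values should concentrate around the observed mean as the tolerance shrinks, at the price of a lower acceptance rate.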

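Output
The likelihood-free algorithm samples from the marginal in z of:
$$\pi_\varepsilon(\theta, z \mid y) = \frac{\pi(\theta) f(z \mid \theta)\, \mathbb{I}_{A_{\varepsilon,y}}(z)}{\int_{A_{\varepsilon,y} \times \Theta} \pi(\theta) f(z \mid \theta)\, dz\, d\theta},$$
where $A_{\varepsilon,y} = \{z \in \mathcal{D} \mid \rho(\eta(z), \eta(y)) < \varepsilon\}$.
The idea behind ABC is that the summary statistics coupled with a
small tolerance should provide a good approximation of the
posterior distribution:
$$\pi_\varepsilon(\theta \mid y) = \int \pi_\varepsilon(\theta, z \mid y)\, dz \approx \pi(\theta \mid \eta(y)).$$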

Dogger Bank re-enactment
Battle of Dogger Bank on Jan 24, 1915, between British and
German fleets: how likely was the British victory?
ABC simulation of posterior distribution
[MacKay, Price, and Wood, 2016]

ABC advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density
of x’s within the vicinity of y...
[Marjoram et al., 2003; Bortot et al., 2007; Beaumont et al., 2009]
...or by viewing the problem as a conditional density estimation
and by developing techniques to allow for larger ε
[Beaumont et al., 2002; Blum & François, 2009]
...or even by including ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]

ABC consistency
Recent studies on large sample properties of ABC posterior
distributions and ABC posterior means
[Li & Fearnhead, 2016; Frazier et al., 2016]
Under regularity conditions on summary statistics,
incl. convergence at speed $d_T$, characterisation of rate of posterior
concentration as a function of tolerance convergence
less stringent condition on tolerance decrease than for
asymptotic normality of posterior;
asymptotic normality of posterior mean does not require
asymptotic normality of posterior itself
Cases for limiting ABC distributions
1. $d_T \varepsilon_T \to +\infty$;
2. $d_T \varepsilon_T \to c$;
3. $d_T \varepsilon_T \to 0$
and limiting ABC mean convergent for $\varepsilon_T^2 = o(1/d_T)$
[Frazier et al., 2016]

Noisily exact ABC
Idea: Modify the data from the start
$$\tilde y = y_0 + \varepsilon \zeta_1$$
with the same scale ε as ABC
[see Fearnhead-Prangle]
run ABC on ỹ
Then ABC produces an exact simulation from π(θ|ỹ)
[Dean et al., 2011; Fearnhead and Prangle, 2012]

Consistent noisy ABC
Degrading the data improves the estimation performances:
Noisy ABC-MLE is asymptotically (in n) consistent
under further assumptions, the noisy ABC-MLE is
asymptotically normal
increase in variance of order $\varepsilon^{-2}$
likely degradation in precision or computing time due to the
lack of summary statistic [curse of dimensionality]

Semi-automatic ABC
Fearnhead and Prangle (2012) study ABC and the selection of the
summary statistic
ABC then considered from a purely inferential viewpoint and
calibrated for estimation purposes
Use of a randomised (or ‘noisy’) version of the summary statistics
$$\tilde\eta(y) = \eta(y) + \tau\varepsilon$$
Derivation of a well-calibrated version of ABC, i.e. an algorithm
that gives proper predictions for the distribution associated with
this randomised summary statistic [calibration constraint: ABC
approximation with same posterior mean as the true randomised
posterior]
Optimality of the posterior expectation E[θ|y] of the parameter of
interest as summary statistics η(y)!

Fully automatic ABC
Implementation of ABC still requires input of collection of
summaries
Towards automation
statistical projection techniques (LDA, PCA, NP-GLS, &tc.)
variable selection
machine learning approaches
bypassing summaries altogether

ABC with Wasserstein distance
Use as distance between simulated and observed samples the
Wasserstein distance:
$$W_p(y_{1:n}, z_{1:n})^p = \inf_{\sigma \in \mathcal{S}_n} \frac{1}{n} \sum_{i=1}^n \rho(y_i, z_{\sigma(i)})^p, \qquad (1)$$
covers well- and mis-specified cases
only depends on data space distance ρ(·, ·)
covers iid and dynamic models (curve matching)
computationally feasible (linear in dimension, cubic in sample size)
Hilbert curve approximation
[Bernton et al., 2017]
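Definition (1) can be illustrated in the univariate case, where (for ρ(a, b) = |a − b| and p ≥ 1) the optimal permutation is known to match order statistics, so sorting replaces the general assignment problem. The sketch below is an assumption-laden illustration in standard-library Python, not the authors' code; `wasserstein_1d` and `wasserstein_bruteforce` are hypothetical names, and the brute-force version is only there to check the definition on tiny samples.

```python
import itertools
import random


def wasserstein_1d(y, z, p=1):
    """W_p between two equal-size univariate samples, by matching sorted values."""
    if len(y) != len(z):
        raise ValueError("samples must have the same size")
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(y), sorted(z))) / len(y)
    return cost ** (1 / p)


def wasserstein_bruteforce(y, z, p=1):
    """Same quantity from definition (1): infimum over all permutations."""
    n = len(y)
    best = min(
        sum(abs(y[i] - z[sigma[i]]) ** p for i in range(n))
        for sigma in itertools.permutations(range(n))
    )
    return (best / n) ** (1 / p)


rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(6)]
z = [rng.gauss(0.5, 1.0) for _ in range(6)]
w_sorted = wasserstein_1d(y, z, p=2)
w_brute = wasserstein_bruteforce(y, z, p=2)
```

In higher dimensions the sorting shortcut no longer applies and an assignment solver (or the Hilbert curve approximation mentioned above) is needed.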

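Consistent inference with Wasserstein distance
As ε → 0 [and n fixed]
If either
1. $f^{(n)}_\theta$ is n-exchangeable and $D(y_{1:n}, z_{1:n}) = 0$ if and only if
$z_{1:n} = y_{\sigma(1:n)}$ for some $\sigma \in \mathcal{S}_n$, or
2. $D(y_{1:n}, z_{1:n}) = 0$ if and only if $z_{1:n} = y_{1:n}$,
then, at $y_{1:n}$ fixed, ABC posterior converges strongly to posterior
as ε → 0.

As n → ∞ [at ε fixed]
WABC distribution with a fixed ε does not converge in n to a
Dirac mass

As $\varepsilon_n \to 0$ and n → ∞
Under range of assumptions, if $f_n(\varepsilon_n) \to 0$, and
$P(W(\hat\mu_n, \mu_\star) \le \varepsilon_n) \to 1$, then WABC posterior with threshold
$\varepsilon_n + \varepsilon_\star$ satisfies
$$\pi_{\varepsilon_n + \varepsilon_\star}\big(\{\theta \in \mathcal{H} : W(\mu_\star, \mu_\theta) > \varepsilon_\star + 4\varepsilon_n/3 + f_n^{-1}(\varepsilon_n^L/R)\} \mid y_{1:n}\big) \overset{P}{\le} \delta$$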

A bivariate Gaussian illustration
100 observations from bivariate Normal with variance 1 and
covariance 0.55
Compare WABC with ABC based on raw Euclidean distance and
Euclidean distance between sample means on $10^6$ model
simulations.

motivating example

Bayesian model choice
Several models M1, M2, . . . are considered simultaneously for a
dataset y and the model index M is part of the inference.
Use of a prior distribution π(M = m), plus a prior distribution on
the parameter conditional on the value m of the model index,
$\pi_m(\theta_m)$
Goal is to derive the posterior distribution of M, challenging
computational target when models are complex.

Generic ABC for model choice
Algorithm 2 Likelihood-free model choice sampler (ABC-MC)
for t = 1 to T do
  repeat
    Generate m from the prior π(M = m)
    Generate θ_m from the prior π_m(θ_m)
    Generate z from the model f_m(z|θ_m)
  until ρ{η(z), η(y)} < ε
  Set m^(t) = m and θ^(t) = θ_m
end for
[Cornuet et al., DIYABC, 2009]

ABC estimates
Posterior probability π(M = m|y) approximated by the frequency
of acceptances from model m
$$\frac{1}{T} \sum_{t=1}^T \mathbb{I}_{m^{(t)} = m}.$$
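A small sketch of Algorithm 2 and of the acceptance-frequency estimate, in standard-library Python. Everything specific here is an assumption made for illustration: a uniform prior on the model index, an N(0, 2²) prior on the location under both models, a Normal versus Laplace comparison with unit variance, the pair (median, mad) as summary, and the names `summaries` and `abc_model_choice`.

```python
import math
import random
import statistics


def summaries(y):
    """Summary statistics: sample median and median absolute deviation."""
    med = statistics.median(y)
    mad = statistics.median(abs(v - med) for v in y)
    return (med, mad)


def simulate_gauss(theta, n, rng):
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def simulate_laplace(theta, n, rng):
    scale = 1 / math.sqrt(2)  # Laplace scale giving variance one
    # the difference of two Exp(1) draws is a standard Laplace draw
    return [theta + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
            for _ in range(n)]


def abc_model_choice(y_obs, simulators, eps, n_accept, rng):
    """ABC-MC: returns the frequency of acceptances from each model."""
    s_obs = summaries(y_obs)
    counts = [0] * len(simulators)
    accepted = 0
    while accepted < n_accept:
        m = rng.randrange(len(simulators))   # model index from a uniform prior
        theta = rng.gauss(0.0, 2.0)          # location from its prior
        z = simulators[m](theta, len(y_obs), rng)
        if math.dist(summaries(z), s_obs) < eps:
            counts[m] += 1
            accepted += 1
    return [c / n_accept for c in counts]


rng = random.Random(7)
y_obs = simulate_laplace(0.0, 50, rng)
post = abc_model_choice(y_obs, [simulate_gauss, simulate_laplace],
                        eps=0.2, n_accept=200, rng=rng)
```

The returned frequencies approximate the posterior probabilities of the models given the summaries, not given the full data, which is the point the following slides examine.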

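Limiting behaviour of B12 (under sufficiency)
If η(y) sufficient statistic for both models,
$$f_i(y \mid \theta_i) = g_i(y)\, f_i^\eta(\eta(y) \mid \theta_i)$$
Thus
$$B_{12}(y) = \frac{\int_{\Theta_1} \pi_1(\theta_1)\, g_1(y)\, f_1^\eta(\eta(y) \mid \theta_1)\, d\theta_1}{\int_{\Theta_2} \pi_2(\theta_2)\, g_2(y)\, f_2^\eta(\eta(y) \mid \theta_2)\, d\theta_2}
= \frac{g_1(y) \int \pi_1(\theta_1)\, f_1^\eta(\eta(y) \mid \theta_1)\, d\theta_1}{g_2(y) \int \pi_2(\theta_2)\, f_2^\eta(\eta(y) \mid \theta_2)\, d\theta_2}
= \frac{g_1(y)}{g_2(y)}\, B^\eta_{12}(y).$$
[Didelot, Everitt, Johansen & Lawson, 2011]
© No discrepancy only when cross-model sufficiency
© Inability to evaluate loss brought by summary statistics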

A stylised problem
Central question to the validation of ABC for model choice:
When is a Bayes factor based on an insufficient statistic
T(y) consistent?
Note/warning: conclusion drawn on T(y) through $B^T_{12}(y)$ necessarily differs
from conclusion drawn on y through $B_{12}(y)$
[Marin, Pillai, X, & Rousseau, JRSS B, 2013]

A benchmark if toy example
Comparison suggested by referee of PNAS paper [thanks!]:
[X, Cornuet, Marin, & Pillai, Aug. 2011]
Model M1: y ∼ N(θ1, 1) opposed
to model M2: y ∼ L(θ2, 1/√2), Laplace distribution with mean θ2
and scale parameter 1/√2 (variance one).
Four possible statistics
1. sample mean ȳ (sufficient for M1 if not M2);
2. sample median med(y) (insufficient);
3. sample variance var(y) (ancillary);
4. median absolute deviation mad(y) = med(|y − med(y)|);

A benchmark if toy example
[Figure: two panels of boxplots for the Gauss and Laplace models, n = 100; vertical axes from 0.0 to 0.7 and from 0.0 to 1.0]

Framework
Starting from sample
$$y = (y_1, \ldots, y_n)$$
the observed sample, not necessarily iid with true distribution
$y \sim \mathbb{P}^n$
Summary statistics
$$T(y) = T^n = (T_1(y), T_2(y), \cdots, T_d(y)) \in \mathbb{R}^d$$
with true distribution $T^n \sim G_n$.

Assumptions
A collection of asymptotic “standard” assumptions:
[A1] is a standard central limit theorem under the true model with
asymptotic mean $\mu_0$
[A2] controls the large deviations of the estimator $T^n$ from the
model mean µ(θ)
[A3] is the standard prior mass condition found in Bayesian
asymptotics ($d_i$ effective dimension of the parameter)
[A4] restricts the behaviour of the model density against the true
density
[Think CLT!]

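Asymptotic marginals
Asymptotically, under [A1]–[A4]
$$m_i(t) = \int_{\Theta_i} g_i(t \mid \theta_i)\, \pi_i(\theta_i)\, d\theta_i$$
is such that
(i) if $\inf\{|\mu_i(\theta_i) - \mu_0|;\ \theta_i \in \Theta_i\} = 0$,
$$C_l\, v_n^{d - d_i} \le m_i(T^n) \le C_u\, v_n^{d - d_i}$$
and
(ii) if $\inf\{|\mu_i(\theta_i) - \mu_0|;\ \theta_i \in \Theta_i\} > 0$
$$m_i(T^n) = o_{P^n}\big[v_n^{d - \tau_i} + v_n^{d - \alpha_i}\big].$$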

Between-model consistency
Consequence of above is that asymptotic behaviour of the Bayes
factor is driven by the asymptotic mean value µ(θ) of $T^n$ under
both models. And only by this mean value!

Indeed, if
$$\inf\{|\mu_0 - \mu_2(\theta_2)|;\ \theta_2 \in \Theta_2\} = \inf\{|\mu_0 - \mu_1(\theta_1)|;\ \theta_1 \in \Theta_1\} = 0$$
then
$$C_l\, v_n^{-(d_1 - d_2)} \le \frac{m_1(T^n)}{m_2(T^n)} \le C_u\, v_n^{-(d_1 - d_2)},$$
where $C_l, C_u = O_{P^n}(1)$, irrespective of the true model.
© Only depends on the difference $d_1 - d_2$: no consistency

Else, if
$$\inf\{|\mu_0 - \mu_2(\theta_2)|;\ \theta_2 \in \Theta_2\} > \inf\{|\mu_0 - \mu_1(\theta_1)|;\ \theta_1 \in \Theta_1\} = 0$$
then
$$\frac{m_1(T^n)}{m_2(T^n)} \ge C_u \min\Big(v_n^{-(d_1 - \alpha_2)},\ v_n^{-(d_1 - \tau_2)}\Big)$$

Checking for adequate statistics
Run a practical check of the relevance (or non-relevance) of $T^n$
null hypothesis that both models are compatible with the statistic
$T^n$
$$H_0 : \inf\{|\mu_2(\theta_2) - \mu_0|;\ \theta_2 \in \Theta_2\} = 0$$
against
$$H_1 : \inf\{|\mu_2(\theta_2) - \mu_0|;\ \theta_2 \in \Theta_2\} > 0$$
testing procedure provides estimates of mean of $T^n$ under each
model and checks for equality

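Checking in practice
Under each model $M_i$, generate ABC sample $\theta_{i,l}$, l = 1, ..., L
For each $\theta_{i,l}$, generate $y_{i,l} \sim F_{i,n}(\cdot \mid \theta_{i,l})$, derive $T^n(y_{i,l})$ and
compute
$$\hat\mu_i = \frac{1}{L} \sum_{l=1}^L T^n(y_{i,l}), \quad i = 1, 2.$$
Conditionally on $T^n(y)$,
$$\sqrt{L}\, \{\hat\mu_i - E^\pi[\mu_i(\theta_i) \mid T^n(y)]\} \rightsquigarrow \mathcal{N}(0, V_i),$$
Test for a common mean
$$H_0 : \hat\mu_1 \sim \mathcal{N}(\mu_0, V_1), \quad \hat\mu_2 \sim \mathcal{N}(\mu_0, V_2)$$
against the alternative of different means
$$H_1 : \hat\mu_i \sim \mathcal{N}(\mu_i, V_i), \text{ with } \mu_1 \ne \mu_2.$$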

Toy example: Laplace versus Gauss
[Figure: boxplots of the normalised χ2 without and with mad, for the Gauss and Laplace models; vertical axis from 0 to 40]

motivating example
Random forests
ABC with random forests

Leaning towards machine learning
Main notions:
ABC-MC seen as learning about which model is most
appropriate from a huge (reference) table
exploiting a large number of summary statistics not an issue
for machine learning methods intended to estimate eﬃcient
combinations
abandoning (temporarily?) the idea of estimating posterior
probabilities of the models, poorly approximated by machine
learning methods, and replacing those by posterior predictive
expected loss
[Cornuet et al., 2016]

Random forests
Technique that stemmed from Leo Breiman’s bagging (or
bootstrap aggregating) machine learning algorithm for both
classification and regression
[Breiman, 1996]
Improved classification performances by averaging over
classification schemes of randomly generated training sets, creating
a “forest” of (CART) decision trees, inspired by Amit and Geman
(1997) ensemble learning
[Breiman, 2001]

random forests as non-parametric regression
CART means Classiﬁcation and Regression Trees
For regression purposes, i.e., to predict y as f (x), similar binary
trees in random forests
1. at each tree node, split data into two daughter nodes
2. split variable and bound chosen to minimise heterogeneity
criterion
3. stop splitting when enough homogeneity in current branch
4. predicted values at terminal nodes (or leaves) are average
response variable y for all observations in ﬁnal leaf

Growing the forest
Breiman’s solution for inducing random features in the trees of the
forest:
bootstrap resampling of the dataset and
random subset-ing [of size √t] of the covariates driving the
classification at every node of each tree
Covariate $x_\tau$ that drives the node separation
$$x_\tau \gtrless c_\tau$$
and the separation bound $c_\tau$ chosen by minimising entropy or Gini
index

Idea: Starting with
possibly large collection of summary statistics $(s_{1i}, \ldots, s_{pi})$
(from scientific theory input to available statistical software,
to machine-learning alternatives)
ABC reference table involving model index, parameter values
and summary statistics for the associated simulated
pseudo-data
run R randomForest to infer M from $(s_{1i}, \ldots, s_{pi})$
at each step O(√p) indices sampled at random and most
discriminating statistic selected, by minimising entropy or Gini loss
Average of the trees is resulting summary statistics, highly
non-linear predictor of the model index

Outcome of ABC-RF
Random forest predicts a (MAP) model index, from the observed
dataset: The predictor provided by the forest is “suﬃcient” to
select the most likely model but not to derive associated posterior
probability
exploit entire forest by computing how many trees lead to
picking each of the models under comparison but variability
too high to be trusted
frequency of trees associated with majority model is no proper
substitute to the true posterior probability
And usual ABC-MC approximation equally highly variable and
hard to assess

Posterior predictive expected losses
We suggest replacing unstable approximation of
$$P(M = m \mid x_o)$$
with $x_o$ observed sample and m model index, by
average of the selection errors across all models given the data $x_o$,
$$P(\hat M(X) \ne M \mid x_o)$$
where pair (M, X) generated from the predictive
$$\int f(x \mid \theta)\, \pi(\theta, M \mid x_o)\, d\theta$$
and $\hat M(x)$ denotes the random forest model (MAP) predictor

Posterior predictive expected losses
Arguments:
Bayesian estimate of the posterior error
integrates error over most likely part of the parameter space
gives an averaged error rather than the posterior probability of
the null hypothesis
easily computed: Given ABC subsample of parameters from
reference table, simulate pseudo-samples associated with
those and derive error frequency

Comments
real-data implementation for population genetics with high
performances
unlimited aggregation of arbitrary summary statistics
recovery of discriminant statistics when available
automated implementation with reduced calibration
self-evaluation by posterior predictive error
soon to be included within DIYABC
[Pudlo et al., 2016]

motivating example
Random forests
the ODOF principle

Two basic issues with ABC
ABC compares numerous simulated datasets to the observed one
Two major difficulties:
to decrease approximation error (or tolerance ε) and hence
ensure reliability of ABC, total number of simulations very
large;
calibration of ABC (tolerance, distance, summary statistics,
post-processing, &tc) critical and hard to automatise

classiﬁcation of summaries by random forests
Given a large collection of summary statistics, rather than selecting
a subset and excluding the others, estimate each parameter of
interest by a machine learning tool like random forests
RF can handle thousands of predictors
ignore useless components
fast estimation method with good local properties
automatised method with few calibration steps
substitute to Fearnhead and Prangle (2012) preliminary
estimation of ^θ(yobs)
includes a natural (classiﬁcation) distance measure that avoids
choice of both distance and tolerance
[Marin et al., 2016]

ABC parameter estimation (ODOF)
One dimension = one forest (ODOF) methodology
parametric statistical model:
$$\{f(y; \theta) : y \in \mathcal{Y},\ \theta \in \Theta\}, \quad \mathcal{Y} \subseteq \mathbb{R}^n, \quad \Theta \subseteq \mathbb{R}^p$$
with intractable density f(·; θ)
plus prior distribution π(θ)
Inference on quantity of interest
$$\psi(\theta) \in \mathbb{R}$$
(posterior means, variances,
quantiles or covariances)

common reference table
Given $\eta : \mathcal{Y} \to \mathbb{R}^k$ a collection of summary statistics
produce reference table (RT) used as learning dataset for
multiple random forests
meaning, for 1 ≤ t ≤ N
1. simulate $\theta^{(t)} \sim \pi(\theta)$
2. simulate $\tilde y_t = (\tilde y_{1,t}, \ldots, \tilde y_{n,t}) \sim f(y; \theta^{(t)})$
3. compute $\eta(\tilde y_t) = \{\eta_1(\tilde y_t), \ldots, \eta_k(\tilde y_t)\}$

ABC posterior expectations
Recall that $\theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d$
ODOF principle:
For each $\theta_j$, construct a separate RF regression with predictor
variables equal to summary statistics $\eta(y) = \{\eta_1(y), \ldots, \eta_k(y)\}$
If $L_b(\eta(y^*))$ denotes leaf index of b-th tree associated with $\eta(y^*)$
(leaf reached through path of binary choices in tree b), with $|L_b|$
response variables
$$\hat E(\theta_j \mid \eta(y^*)) = \frac{1}{B} \sum_{b=1}^B \frac{1}{|L_b(\eta(y^*))|} \sum_{t : \eta(y_t) \in L_b(\eta(y^*))} \theta_j^{(t)}$$
is our ABC estimate

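ABC posterior quantile estimate
Random forests also available for quantile regression
[Meinshausen, 2006, JMLR]
Since
$$\hat E(\theta_j \mid \eta(y^*)) = \sum_{t=1}^N w_t(\eta(y^*))\, \theta_j^{(t)}$$
with
$$w_t(\eta(y^*)) = \frac{1}{B} \sum_{b=1}^B \frac{\mathbb{I}_{L_b(\eta(y^*))}(\eta(y_t))}{|L_b(\eta(y^*))|}$$
natural estimate of the cdf of $\theta_j$ is
$$\hat F(u \mid \eta(y^*)) = \sum_{t=1}^N w_t(\eta(y^*))\, \mathbb{I}\{\theta_j^{(t)} \le u\}.$$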

ABC posterior quantiles + credible intervals given by $\hat F^{-1}$
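Once the forest weights are available, the mean, cdf and quantile estimates are plain weighted sums over the reference table. The sketch below shows that last step only, in standard-library Python; the weights used in the usage lines are made-up numbers standing in for the forest weights $w_t(\eta(y^*))$, and the function names are hypothetical.

```python
def weighted_mean(theta, weights):
    """Estimate of E(theta_j | eta(y*)) as a weighted sum over the table."""
    return sum(w * t for t, w in zip(theta, weights))


def weighted_cdf(theta, weights, u):
    """Estimate of F(u | eta(y*)): total weight of the values not exceeding u."""
    return sum(w for t, w in zip(theta, weights) if t <= u)


def weighted_quantile(theta, weights, q):
    """Generalised inverse of the weighted cdf at level q."""
    cumulated = 0.0
    pairs = sorted(zip(theta, weights))
    for value, w in pairs:
        cumulated += w
        if cumulated >= q:
            return value
    return pairs[-1][0]  # guards against rounding when q is close to 1


# illustrative values: four simulated parameters with weights summing to one
theta = [1.0, 2.0, 3.0, 4.0]
weights = [0.1, 0.2, 0.3, 0.4]
post_mean = weighted_mean(theta, weights)
post_median = weighted_quantile(theta, weights, 0.5)
interval = (weighted_quantile(theta, weights, 0.05),
            weighted_quantile(theta, weights, 0.95))
```

A credible interval is then the pair of weighted quantiles at the chosen levels, as in the last line of the example.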

ABC variances
While approximation of $\mathrm{Var}(\theta_j \mid \eta(y^*))$ available based on $\hat F$,
choice of an alternative, if more involved, version:
In a given tree b in a random forest, existence of out-of-bag (oob)
entries, i.e., not sampled in associated bootstrap subsample
Use of oob simulations to produce estimate $\tilde\theta_j^{(t)}$ of $E\{\theta_j \mid \eta(y_t)\}$,
Apply weights $\omega_t(\eta(y^*))$ to oob residuals:
$$\widehat{\mathrm{Var}}(\theta_j \mid \eta(y^*)) = \sum_{t=1}^N \omega_t(\eta(y^*))\, \big(\theta_j^{(t)} - \tilde\theta_j^{(t)}\big)^2$$

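ABC covariances
For estimating $\mathrm{Cov}(\theta_j, \theta_\ell \mid \eta(y^*))$, construction of a specific
random forest
product of oob errors for $\theta_j$ and $\theta_\ell$
$$\big(\theta_j^{(t)} - \tilde\theta_j^{(t)}\big)\big(\theta_\ell^{(t)} - \tilde\theta_\ell^{(t)}\big)$$
with again predictor variables the summary statistics
$\eta(y) = \{\eta_1(y), \ldots, \eta_k(y)\}$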

Human populations example
50,000 SNP markers genotyped in four Human populations:
Yoruba (Africa), Han (East Asia), British (Europe[??]) and
American individuals of African Ancestry; 30 individuals per
population.
Comparison of 6 scenarios of evolution which diﬀer from each
other by one ancient plus one recent historical events:
A) a single out-of-Africa colonisation event giving an ancestral
out-of-Africa versus two independent out-of-Africa
colonisation events;
B) the possibility of a recent genetic admixture of Americans of
African origin with their African ancestors and individuals of
European or East Asia origins.

summaries
use of 112 summary statistics provided by DIYABC for SNP
markers complemented by the five LDA axes as additional statistics
Prior error rates (%), by size M of the training set
Classification method               M = 10,000   M = 20,000   M = 50,000
Linear Discriminant Analysis              9.91         9.97        10.03
Rejection ABC, DIYABC summaries          23.18        20.55        17.76
Rejection ABC, LDA summaries              6.29         5.76         5.70
Local logistic reg. on LDA                6.85         6.42         6.07
RF, DIYABC summaries                      8.84         7.32         6.34
RF, DIYABC and LDA summaries              5.01         4.66         4.18

outcome
ABC-RF picks scenario 2 as forecasted scenario on the Human
dataset
not obvious from LDA projections (where scenario 2 corresponds to
blue)

comments
Considering previous population genetics studies in the ﬁeld,
unsurprising that
single out-of-Africa colonization event giving an ancestral
out-of-Africa population
secondarily split into one European and one East Asian
population lineage
recent genetic admixture of Americans of African origin with
their African ancestors and European
estimate of the posterior probability of scenario 2 equal to 0.998,
corresponding to a high level of conﬁdence [?] in choosing scenario
2

further comments
For scenario 2, parameters of interest
ra admixture rate between Europeans and Africans,
t3 out-of-Africa time,
NA eﬀective size of the ancestral population.
Reference table containing 2e5 points from which 300 simulations
were excluded to evaluate accuracy of diﬀerent methodologies

estimates
               RF        rejection   local linear reg.   ridge reg.   neural nets
coverage 95%   96.6      97.6        92.3                93.3         85
q. range 95%   4276.12   7241.66     3594.01             3813.93      2675.63
coverage 90%   92.6      94          85.3                86.3         76.3
range 90%      3644.28   6422.49     2897.32             3101.17      2146.01
parameter Na coverages and quantile ranges

[not so famous] last words
ABC RF methods mostly insensitive both to strong correlations
between the summary statistics and to the presence of noisy
variables.
involves fewer simulations and no calibration
Next steps: adaptive schemes, deep learning, inclusion in DIYABC

motivating example
asymptotic setup
consistency of ABC posteriors
asymptotic posterior shape
asymptotic behaviour of EABC [θ]

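asymptotic setup
asymptotic: $y = y^{(n)} \sim \mathbb{P}^n_\theta$ and $\varepsilon = \varepsilon_n$, n → +∞
parametric: $\theta \in \mathbb{R}^k$, k fixed
concentration of summary statistics $\eta(z^n)$:
$$\exists\, b : \theta \to b(\theta), \quad \eta(z^n) - b(\theta) = o_{P_\theta}(1), \quad \forall \theta$$
Objects of interest:
posterior concentration and asymptotic shape of $\pi_\varepsilon(\cdot \mid \eta(y^{(n)}))$
(normality?)
convergence of the posterior mean $\hat\theta = E_{ABC}[\theta \mid \eta(y^{(n)})]$
asymptotic acceptance rate
[Frazier et al., 2016]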

ABC algorithm Bayesian consistent at $\theta_0$ if for any δ > 0,
$$\Pi\big(\|\theta - \theta_0\| > \delta \mid \|\eta(y) - \eta(z)\| \le \varepsilon\big) \to 0$$
as n → +∞, ε → 0
Bayesian consistency implies that sets containing $\theta_0$ have posterior
probability tending to one as n → +∞, with implication being the
existence of a specific rate of concentration
Concentration around true value and Bayesian consistency
impose less stringent conditions on the convergence speed of
tolerance $\varepsilon_n$ to zero, when compared with asymptotic
normality of ABC posterior
asymptotic normality of ABC posterior mean does not require
asymptotic normality of ABC posterior

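Concentration of summary η(z): there exists b(θ) such that
$$\eta(z) - b(\theta) = o_{P_\theta}(1)$$
Consistency:
$$\Pi_{\varepsilon_n}\big(\|\theta - \theta_0\| \le \delta \mid \eta(y)\big) = 1 + o_p(1)$$
Convergence rate: there exists $\delta_n = o(1)$ such that
$$\Pi_{\varepsilon_n}\big(\|\theta - \theta_0\| \le \delta_n \mid \eta(y)\big) = 1 + o_p(1)$$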

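Rate of convergence
$\Pi(\cdot \mid \|\eta(y) - \eta(z)\| \le \varepsilon)$ concentrates at rate $\lambda_n \to 0$ if
$$\limsup_{\varepsilon \to 0}\ \limsup_{n \to +\infty}\ \Pi\big(\|\theta - \theta_0\| > \lambda_n M \mid \|\eta(y) - \eta(z)\| \le \varepsilon\big) \to 0$$
in $P_0$-probability when M goes to infinity.
Posterior rate of concentration related to rate at which information
accumulates about true parameter vector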

Related results
existing studies on the large sample properties of ABC, in which
the asymptotic properties of point estimators derived from ABC
have been the primary focus
[Creel et al., 2015; Jasra, 2015; Li & Fearnhead, 2015]

Convergence when $\varepsilon_n \gtrsim \sigma_n$
Under (main) assumptions
(A1) $\exists\, \sigma_n \to 0$
$$P_\theta\big(\sigma_n^{-1} \|\eta(z) - b(\theta)\| > u\big) \le c(\theta)\, h(u), \quad \lim_{u \to +\infty} h(u) = 0$$
(A2)
$$\Pi\big(\|b(\theta) - b(\theta_0)\| \le u\big) \asymp u^D, \quad u \approx 0$$
posterior consistency
posterior concentration rate $\lambda_n$ that depends on the deviation
control of $d_2\{\eta(z), b(\theta)\}$
posterior concentration rate for b(θ) bounded from below by $O(\varepsilon_n)$
then
$$\Pi_{\varepsilon_n}\big(\|b(\theta) - b(\theta_0)\| \lesssim \varepsilon_n + \sigma_n h^{-1}(\varepsilon_n^D) \mid \eta(y)\big) = 1 + o_{p_0}(1)$$
If also $\|\theta - \theta_0\| \le L \|b(\theta) - b(\theta_0)\|^\alpha$, locally and θ → b(θ) 1-1
$$\Pi_{\varepsilon_n}\big(\|\theta - \theta_0\| \lesssim \underbrace{\varepsilon_n^\alpha + \sigma_n^\alpha (h^{-1}(\varepsilon_n^D))^\alpha}_{\delta_n} \mid \eta(y)\big) = 1 + o_{p_0}(1)$$

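Comments
(A1): if $P_\theta\big(\sigma_n^{-1} \|\eta(z) - b(\theta)\| > u\big) \le c(\theta)\, h(u)$, two cases
1. Polynomial tail: $h(u) \lesssim u^{-\kappa}$, then $\delta_n = \varepsilon_n + \sigma_n \varepsilon_n^{-D/\kappa}$
2. Exponential tail: $h(u) \lesssim e^{-cu}$, then $\delta_n = \varepsilon_n + \sigma_n \log(1/\varepsilon_n)$
E.g., $\eta(y) = n^{-1} \sum_i g(y_i)$ with moments on g (case 1) or
Laplace transform (case 2)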

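Comments
(A2): $\Pi\big(\|b(\theta) - b(\theta_0)\| \le u\big) \asymp u^D$: If Π regular enough then
D = dim(θ)
no need to approximate the density f(η(y)|θ).
Same results hold when $\varepsilon_n = o(\sigma_n)$ if (A1) replaced with
$$\inf_{|x| \le M} P_\theta\big(\|\sigma_n^{-1}(\eta(z) - b(\theta)) - x\| \le u\big) \gtrsim u^D, \quad u \approx 0$$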

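proof
Simple enough proof: assume $\sigma_n \le \delta \varepsilon_n$ and
$$\|\eta(y) - b(\theta_0)\| \lesssim \sigma_n, \quad \|\eta(y) - \eta(z)\| \le \varepsilon_n$$
Hence
$$\|b(\theta) - b(\theta_0)\| > \delta_n \ \Rightarrow\ \|\eta(z) - b(\theta)\| > \delta_n - \varepsilon_n - \sigma_n := t_n$$
Also, if $\|b(\theta) - b(\theta_0)\| \le \varepsilon_n/3$
$$\|\eta(y) - \eta(z)\| \le \|\eta(z) - b(\theta)\| + \underbrace{\sigma_n}_{\le\, \varepsilon_n/3} + \varepsilon_n/3$$
and
$$\Pi_{\varepsilon_n}\big(\|b(\theta) - b(\theta_0)\| > \delta_n \mid y\big) \le \frac{\int_{\|b(\theta) - b(\theta_0)\| > \delta_n} P_\theta\big(\|\eta(z) - b(\theta)\| > t_n\big)\, d\Pi(\theta)}{\int_{\|b(\theta) - b(\theta_0)\| \le \varepsilon_n/3} P_\theta\big(\|\eta(z) - b(\theta)\| \le \varepsilon_n/3\big)\, d\Pi(\theta)} \lesssim \varepsilon_n^{-D}\, h(t_n \sigma_n^{-1}) \int_\Theta c(\theta)\, d\Pi(\theta)$$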

Assumptions
Applicable to broad range of data structures
[A1] ensures that η(z) concentrates on b(θ), unescapable
[A2] controls degree of prior mass in a neighbourhood of θ0,
standard in Bayesian asymptotics
[A2] If Π absolutely continuous with prior density p bounded,
above and below, near θ0, then D = dim(θ) = kθ
[A3] identiﬁcation condition critical for getting posterior
concentration around θ0, b being injective depending on true
structural model and particular choice of η.

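Summary statistic and (in)consistency
Consider the moving average MA(2) model
$$y_t = e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2}, \quad e_t \sim_{i.i.d.} \mathcal{N}(0, 1)$$
and
$$-2 \le \theta_1 \le 2, \quad \theta_1 + \theta_2 \ge -1, \quad \theta_1 - \theta_2 \le 1.$$
summary statistics equal to sample autocovariances
$$\eta_j(y) = T^{-1} \sum_{t=1+j}^T y_t y_{t-j}, \quad j = 0, 1$$
with
$$\eta_0(y) \xrightarrow{P} E[y_t^2] = 1 + (\theta_{01})^2 + (\theta_{02})^2 \quad \text{and} \quad \eta_1(y) \xrightarrow{P} E[y_t y_{t-1}] = \theta_{01}(1 + \theta_{02})$$
For ABC target $p_\varepsilon(\theta \mid \eta(y))$ to be degenerate at $\theta_0$
$$0 = b(\theta_0) - b(\theta) = \begin{pmatrix} 1 + (\theta_{01})^2 + (\theta_{02})^2 \\ \theta_{01}(1 + \theta_{02}) \end{pmatrix} - \begin{pmatrix} 1 + (\theta_1)^2 + (\theta_2)^2 \\ \theta_1(1 + \theta_2) \end{pmatrix}$$
must have unique solution $\theta = \theta_0$
Take $\theta_{01} = .6$, $\theta_{02} = .2$: equation has two solutions
$\theta_1 = .6, \theta_2 = .2$ and $\theta_1 \approx .5453, \theta_2 \approx .3204$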
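The lack of identification can be verified by direct arithmetic on the binding function b(θ), and the convergence of the two sample autocovariances can be seen on a long simulated series. The sketch below uses standard-library Python; the function names (`binding`, `simulate_ma2`, `autocov`), the series length and the seed are illustrative choices, not taken from the slides.

```python
import random


def binding(theta1, theta2):
    """Limits b(theta) of the two sample autocovariances for an MA(2) model."""
    return (1 + theta1 ** 2 + theta2 ** 2, theta1 * (1 + theta2))


def simulate_ma2(theta1, theta2, T, rng):
    """Simulate y_t = e_t + theta1 e_{t-1} + theta2 e_{t-2} with N(0,1) noise."""
    e = [rng.gauss(0.0, 1.0) for _ in range(T + 2)]
    return [e[t + 2] + theta1 * e[t + 1] + theta2 * e[t] for t in range(T)]


def autocov(y, j):
    """Sample autocovariance eta_j(y) = T^{-1} sum_{t > j} y_t y_{t-j}."""
    T = len(y)
    return sum(y[t] * y[t - j] for t in range(j, T)) / T


b_true = binding(0.6, 0.2)          # (1.4, 0.72)
b_other = binding(0.5453, 0.3204)   # matches b_true up to rounding

y = simulate_ma2(0.6, 0.2, 20000, random.Random(0))
eta0, eta1 = autocov(y, 0), autocov(y, 1)
```

Both parameter values produce the same limiting summaries, so an ABC posterior built on these two autocovariances cannot separate them, however small the tolerance.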

Concentration for the MA(2) model
True value θ0 = (0.6, 0.2)
Summaries ﬁrst three autocorrelations
Tolerance proportional to $\varepsilon_T = 1/T^{0.4}$
Rejection of normality of these posteriors

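Asymptotic shape of posterior distribution
Shape of
$$\Pi\big(\cdot \mid \|\eta(y) - \eta(z)\| \le \varepsilon_n\big)$$
for several connections between $\varepsilon_n$ and rate at which $\eta(y_n)$ satisfy
CLT
Three different regimes:
1. $\sigma_n = o(\varepsilon_n)$ → Uniform limit
2. $\sigma_n \asymp \varepsilon_n$ → perturbed Gaussian limit
3. $\sigma_n \gg \varepsilon_n$ → Gaussian limit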

scaling matrices
Introduction of sequence of (k, k) p.d. matrices $\Sigma_n(\theta)$ such that
for all θ near $\theta_0$
$$c_1 \|D_n\|_* \le \|\Sigma_n(\theta)\|_* \le c_2 \|D_n\|_*, \quad D_n = \mathrm{diag}(d_n(1), \cdots, d_n(k)),$$
with $0 < c_1, c_2 < +\infty$, $d_n(j) \to +\infty$ for all j’s
Possibly different convergence rates for components of η(z)
Reordering components so that
$$d_n(1) \le \cdots \le d_n(k)$$
with assumption that
$$\liminf_n d_n(j)\, \varepsilon_n = \limsup_n d_n(j)\, \varepsilon_n$$

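New assumptions
(B1) Concentration of summary η: $\Sigma_n(\theta) \in \mathbb{R}^{k_1 \times k_1}$ is o(1)
$$\Sigma_n(\theta)^{-1} \{\eta(z) - b(\theta)\} \Rightarrow \mathcal{N}_{k_1}(0, \mathrm{Id}), \quad (\Sigma_n(\theta) \Sigma_n(\theta_0)^{-1})_n = C_o$$
(B2) b(θ) is $C^1$ and
$$\|\theta - \theta_0\| \lesssim \|b(\theta) - b(\theta_0)\|$$
(B3) Dominated convergence and
$$\lim_n \frac{P_\theta\big(\Sigma_n(\theta)^{-1}\{\eta(z) - b(\theta)\} \in u + B(0, u_n)\big)}{\prod_j u_n(j)} = \varphi(u)$$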

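main result
Set $\Sigma_n(\theta) = \sigma_n D(\theta)$ for $\theta \approx \theta_0$ and
$Z^o = \Sigma_n(\theta_0)^{-1}(\eta(y) - b(\theta_0))$, then under (B1) and (B2)
when $\varepsilon_n \sigma_n^{-1} \to +\infty$
$$\Pi_{\varepsilon_n}\big[\varepsilon_n^{-1}(\theta - \theta_0) \in A \mid y\big] \Rightarrow U_{B_0}(A), \quad B_0 = \{x \in \mathbb{R}^k;\ \|\nabla b(\theta_0)^T x\| \le 1\}$$
when $\varepsilon_n \sigma_n^{-1} \to c$
$$\Pi_{\varepsilon_n}\big[\Sigma_n(\theta_0)^{-1}(\theta - \theta_0) - Z^o \in A \mid y\big] \Rightarrow Q_c(A), \quad Q_c \ne \mathcal{N}$$
when $\varepsilon_n \sigma_n^{-1} \to 0$ and (B3) holds, set
$$V_n = [\nabla b(\theta_0)]^T\, \Sigma_n(\theta_0)\, \nabla b(\theta_0)$$
then
$$\Pi_{\varepsilon_n}\big[V_n^{-1}(\theta - \theta_0) - \tilde Z^o \in A \mid y\big] \Rightarrow \Phi(A),$$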

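intuition (?!)
Set $x(\theta) = \sigma_n^{-1}(\theta - \theta_0) - Z^o$ (k = 1)
$$\pi_n := \Pi_{\varepsilon_n}\big[\varepsilon_n^{-1}(\theta - \theta_0) \in A \mid y\big]
= \frac{\int_{|\theta - \theta_0| \le u_n} \mathbb{I}_{x(\theta) \in A}\, P_\theta\big(\|\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)\| \le \sigma_n^{-1} \varepsilon_n\big)\, p(\theta)\, d\theta}{\int_{|\theta - \theta_0| \le u_n} P_\theta\big(\|\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)\| \le \sigma_n^{-1} \varepsilon_n\big)\, p(\theta)\, d\theta} + o_p(1)$$
If $\varepsilon_n/\sigma_n \gg 1$:
$$P_\theta\big(\|\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)\| \le \sigma_n^{-1} \varepsilon_n\big) = 1 + o(1), \quad \text{iff } \|x\| \le \sigma_n^{-1} \varepsilon_n + o(1)$$
If $\varepsilon_n/\sigma_n = o(1)$
$$P_\theta\big(\|\sigma_n^{-1}(\eta(z) - b(\theta)) + x\| \le \sigma_n^{-1} \varepsilon_n\big) = \varphi(x)\, \sigma_n (1 + o(1))$$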

more comments
Surprising: $U(-\varepsilon_n, \varepsilon_n)$ limit when $\varepsilon_n \gg \sigma_n$ but not that
surprising since $\varepsilon_n = o(1)$ means concentration around $\theta_0$
and $\sigma_n = o(\varepsilon_n)$ implies that $b(\theta) - b(\theta_0) \approx \eta(z) - \eta(y)$
again, no need to control approximation of f(η(y)|θ) by a
Gaussian density: merely a control of the distribution
generalisation to the case where eigenvalues of $\Sigma_n$ are
$d_{n,1} \ne \cdots \ne d_{n,k}$
behaviour of $E_{ABC}(\theta \mid y)$ consistent with Li & Fearnhead
(2016)

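even more comments
If (also) p(θ) is Hölder β
$$E_{ABC}(\theta \mid y) - \theta_0 = \underbrace{\sigma_n \frac{Z^o}{b'(\theta_0)}}_{\text{score for } f(\eta(y) \mid \theta)} + \underbrace{\sum_{j=1}^{\lfloor \beta/2 \rfloor} \varepsilon_n^{2j} H_j(\theta_0, p, b)}_{\text{bias from threshold approx}} + o(\sigma_n) + O(\varepsilon_n^{\beta+1})$$
with
if $\varepsilon_n^2 = o(\sigma_n)$: Efficiency
$$E_{ABC}(\theta \mid y) - \theta_0 = \sigma_n \frac{Z^o}{b'(\theta_0)} + o(\sigma_n)$$
the $H_j(\theta_0, p, b)$’s are deterministic
we gain nothing by getting a first crude $\hat\theta(y) = E_{ABC}(\theta \mid y)$
for some η(y) and then rerun ABC with $\hat\theta(y)$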

Illustration in the MA(2) setting
Sample sizes of T = 500, 1000
Asymptotic normality rejected for $\varepsilon_T = 1/T^{0.4}$ and for $\theta_1$,
T = 500 and $\varepsilon_T = 1/T^{0.55}$

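asymptotic behaviour of EABC [θ]
When p = dim(η(y)) = d = dim(θ) and $\varepsilon_n = o(n^{-3/10})$
$$E_{ABC}[d_T(\theta - \theta_0) \mid y^o] \Rightarrow \mathcal{N}\big(0, \{(\nabla b_o)^T \Sigma^{-1} \nabla b_o\}^{-1}\big)$$
[Li & Fearnhead (2016)]
In fact, if $\varepsilon_n^{\beta+1} \sqrt{n} = o(1)$, with β Hölder-smoothness of π
$$E_{ABC}[(\theta - \theta_0) \mid y^o] = \frac{(\nabla b_o)^{-1} Z^o}{\sqrt{n}} + \sum_{j=1}^k h_j(\theta_0)\, \varepsilon_n^{2j} + o_p(1), \quad 2k = \lfloor \beta \rfloor$$
Iterating for fixed p mildly interesting: if
$$\tilde\eta(y) = E_{ABC}[\theta \mid y^o]$$
then
$$E_{ABC}[\theta \mid \tilde\eta(y)] = \theta_0 + \frac{(\nabla b_o)^{-1} Z^o}{\sqrt{n}} + \frac{\pi'(\theta_0)}{\pi(\theta_0)}\, \varepsilon_n^2 + o()$$
[Fearnhead & Prangle, 2012]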

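more asymptotic behaviour of EABC [θ]
Li and Fearnhead (2016, 2017) consider that
$$E_{ABC}[d_T(\theta - \theta_0) \mid y^o]$$
not optimal when p > d
If $\sqrt{n}\, \varepsilon_n^2 = o(1)$ and $\varepsilon_n \sqrt{n} \ne o(1)$
$$\sqrt{n}\, [E_{ABC}(\theta) - \theta_0] = P_{\nabla b_o} Z^o + o_p(1)$$
$$Z^o = \sqrt{n}\, (\eta(y) - b_o)$$
$$P_{\nabla b_o} Z^o = \big((\nabla b_o)^T \nabla b_o\big)^{-1} (\nabla b_o)^T Z^o$$
and
$$V_{as}(P_{\nabla b_o} Z^o) \ge \big\{(\nabla b_o)^T\, V_{as}(Z^o)^{-1}\, (\nabla b_o)\big\}^{-1}$$
If $\varepsilon_n \sqrt{n} = o(1)$
$$\sqrt{n}\, [E_{ABC}(\theta) - \theta_0] = \big\{(\nabla b_o)^T \Sigma^{-1} \nabla b_o\big\}^{-1} (\nabla b_o)^T \Sigma^{-1} Z^o + o_p(1)$$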

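impact of the dimension of η
dimension of η(.) does not impact above result, but impacts
acceptance probability
if $\varepsilon_n = o(\sigma_n)$, $k_1 = \dim(\eta(y))$, $k = \dim(\theta)$ & $k_1 \ge k$
$$\alpha_n := \Pr\big(\|y - z\| \le \varepsilon_n\big) \asymp \varepsilon_n^{k_1} \sigma_n^{-k_1 + k}$$
if $\varepsilon_n \gtrsim \sigma_n$
$$\alpha_n := \Pr\big(\|y - z\| \le \varepsilon_n\big) \asymp \varepsilon_n^{k}$$
If we choose $\alpha_n$
$\alpha_n = o(\sigma_n^k)$ leads to $\varepsilon_n = \sigma_n (\alpha_n \sigma_n^{-k})^{1/k_1} = o(\sigma_n)$
$\alpha_n \gtrsim \sigma_n$ leads to $\varepsilon_n \asymp \alpha_n^{1/k}$.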

Illustration in the MA(2) setting
Sample sizes of T = 500, 1000
Asymptotic normality accepted for all graphs

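Practical implications
In practice, tolerance determined by quantile (nearest neighbours):
Select all $\theta_i$ associated with the α = δ/N smallest distances
$d_2\{\eta(z_i), \eta(y)\}$ for some δ
Then (i) if $\varepsilon_T \asymp v_T^{-1}$ or $\varepsilon_T = o(v_T^{-1})$, acceptance rate associated
with the threshold $\varepsilon_T$ is
$$\alpha_T = \mathrm{pr}\big(\|\eta(z) - \eta(y)\| \le \varepsilon_T\big) \asymp (v_T \varepsilon_T)^{k_\eta} \times v_T^{-k_\theta} \lesssim v_T^{-k_\theta}$$
(ii) if $\varepsilon_T \gg v_T^{-1}$,
$$\alpha_T = \mathrm{pr}\big(\|\eta(z) - \eta(y)\| \le \varepsilon_T\big) \asymp \varepsilon_T^{k_\theta} \gg v_T^{-k_\theta}$$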

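Monte Carlo error
Link the choice of $\varepsilon_T$ to Monte Carlo error associated with $N_T$
draws in Algorithm
Conditions (on $\varepsilon_T$) under which
$$\hat\alpha_T = \alpha_T \{1 + o_p(1)\}$$
where $\hat\alpha_T = \sum_{i=1}^{N_T} \mathbb{1}\big[d\{\eta(y), \eta(z)\} \le \varepsilon_T\big] / N_T$ proportion of
accepted draws from $N_T$ simulated draws of θ
Either
(i) $\varepsilon_T = o(v_T^{-1})$ and $(v_T \varepsilon_T)^{-k_\eta}\, \varepsilon_T^{-k_\theta} \le M N_T$
or
(ii) $\varepsilon_T \gtrsim v_T^{-1}$ and $\varepsilon_T^{-k_\theta} \le M N_T$
for M large enough;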

conclusion on ABC consistency
asymptotic description of ABC: different regimes depending
on $\varepsilon_n$ & $\sigma_n$
no point in choosing $\varepsilon_n$ arbitrarily small: just $\varepsilon_n = o(\sigma_n)$
no asymptotic gain in iterative ABC
results under weak conditions by not studying g(η(z)|θ)
