WSC 2011, advanced tutorial on simulation in Statistics

Simulation methods in Statistics (on recent advances)

Simulation methods in Statistics
(on recent advances)

Christian P. Robert

Universit´ Paris-Dauphine, IuF, & CRESt
e
http://www.ceremade.dauphine.fr/~xian

WSC 2011, Phoenix, December 12, 2011


Outline

1 Motivation and leading example

2 Monte Carlo Integration

3 The Metropolis-Hastings Algorithm

4 Approximate Bayesian computation

Motivation and leading example


Latent variables
Inferential methods




Latent variables

Latent structures make life harder!

Even simple statistical models may lead to computational
complications, as in latent variable models

f(x|θ) = f (x, x |θ) dx

Latent variables



f(x|θ) = f (x, x |θ) dx

If (x, x ) observed, ﬁne!

Latent variables



f(x|θ) = f (x, x |θ) dx

If (x, x ) observed, ﬁne!
If only x observed, trouble!
[mixtures, HMMs, state-space models, &tc]

Latent variables

Mixture models
Models of mixtures of distributions:

X ∼ fj with probability pj ,

for j = 1, 2, . . . , k, with overall density

X ∼ p1 f1 (x) + · · · + pk fk (x) .

Latent variables

Mixture models



X ∼ p1 f1 (x) + · · · + pk fk (x) .

For a sample of independent random variables (X1 , · · · , Xn ),
sample density
n
{p1 f1 (xi ) + · · · + pk fk (xi )} .
i=1

Latent variables

Mixture models



X ∼ p1 f1 (x) + · · · + pk fk (x) .

For a sample of independent random variables (X1 , · · · , Xn ),
sample density
n
{p1 f1 (xi ) + · · · + pk fk (xi )} .
i=1

Expanding this product involves kn elementary terms: prohibitive
to compute in large samples.

Latent variables

Mixture likelihood
3
2
µ2

1
0
−1

−1 0 1 2 3

µ1

Case of the 0.3N (µ1 , 1) + 0.7N (µ2 , 1) likelihood

Inferential methods

Maximum likelihood methods

goto Bayes

For an iid sample X1 , . . . , Xn from a population with density
f(x|θ1 , . . . , θk ), the likelihood function is

L(x|θ) = L(x1 , . . . , xn |θ1 , . . . , θk )
n
= f(xi |θ1 , . . . , θk ).
i=1

Inferential methods


goto Bayes


L(x|θ) = L(x1 , . . . , xn |θ1 , . . . , θk )
n
= f(xi |θ1 , . . . , θk ).
i=1

◦ Maximum likelihood has global justiﬁcations from asymptotics

Inferential methods


goto Bayes


L(x|θ) = L(x1 , . . . , xn |θ1 , . . . , θk )
n
= f(xi |θ1 , . . . , θk ).
i=1

◦ Maximum likelihood has global justiﬁcations from asymptotics
◦ Computational diﬃculty depends on structure, eg latent
variables

Inferential methods

Maximum likelihood methods (2)

Example (Mixtures)
For a mixture of two normal distributions,

pN(µ, τ2 ) + (1 − p)N(θ, σ2 ) ,
likelihood proportional to

n
xi − µ xi − θ
pτ−1 ϕ + (1 − p) σ−1 ϕ
τ σ
i=1

can be expanded into 2n terms.

Inferential methods


Standard maximization techniques often fail to ﬁnd the global
maximum because of multimodality or undesirable behavior
(usually at the frontier of the domain) of the likelihood function.

Example
In the special case

f(x|µ, σ) = (1 − ) exp{(−1/2)x2 } + exp{(−1/2σ2 )(x − µ)2 }
σ
with > 0 known,

Inferential methods


Standard maximization techniques often fail to ﬁnd the global
maximum because of multimodality or undesirable behavior
(usually at the frontier of the domain) of the likelihood function.

Example
In the special case

f(x|µ, σ) = (1 − ) exp{(−1/2)x2 } + exp{(−1/2σ2 )(x − µ)2 }
σ
with > 0 known, whatever n, the likelihood is unbounded:

lim L(x1 , . . . , xn |µ = x1 , σ) = ∞
σ→0

Inferential methods

The Bayesian Perspective

In the Bayesian paradigm, the information brought by the data x,
realization of
X ∼ f(x|θ),

Inferential methods

The Bayesian Perspective

In the Bayesian paradigm, the information brought by the data x,
realization of
X ∼ f(x|θ),
is combined with prior information speciﬁed by prior distribution
with density
π(θ)

Inferential methods

Central tool...

Summary in a probability distribution, π(θ|x), called the posterior
distribution

Inferential methods

Central tool...

distribution
Derived from the joint distribution f(x|θ)π(θ), according to

f(x|θ)π(θ)
π(θ|x) = ,
f(x|θ)π(θ)dθ

[Bayes Theorem]

Inferential methods

Central tool...

distribution
Derived from the joint distribution f(x|θ)π(θ), according to

f(x|θ)π(θ)
π(θ|x) = ,
f(x|θ)π(θ)dθ

[Bayes Theorem]
where
Z(x) = f(x|θ)π(θ)dθ

is the marginal density of X also called the (Bayesian) evidence

Inferential methods

Central tool...central to Bayesian inference
Posterior deﬁned up to a constant as

π(θ|x) ∝ f(x|θ) π(θ)

Operates conditional upon the observations

Inferential methods



Integrate simultaneously prior information and information
brought by x

Inferential methods



brought by x
Avoids averaging over the unobserved values of x

Inferential methods



brought by x
Coherent updating of the information available on θ,
independent of the order in which i.i.d. observations are
collected

Inferential methods



brought by x
Coherent updating of the information available on θ,
independent of the order in which i.i.d. observations are
collected
Provides a complete inferential scope and a unique motor of
inference

Inferential methods

Examples of Bayes computational problems

1 complex parameter space, as e.g. constrained parameter sets
like those resulting from imposing stationarity constraints in
time series

Inferential methods


time series
2 complex sampling model with an intractable likelihood, as
e.g. in some graphical models;

Inferential methods


time series
3 use of a huge dataset;

Inferential methods


time series
4 complex prior distribution (which may be the posterior
distribution associated with an earlier sample);

Inferential methods


time series
4 complex prior distribution (which may be the posterior
distribution associated with an earlier sample);
5 involved inferential procedure as for instance, Bayes factors

P(θ ∈ Θ0 | x) π(θ ∈ Θ0 )
Bπ (x) = .
01
P(θ ∈ Θ1 | x) π(θ ∈ Θ1 )

Inferential methods

Mixtures again
Observations from

x1 , . . . , xn ∼ f(x|θ) = pϕ(x; µ1 , σ1 ) + (1 − p)ϕ(x; µ2 , σ2 )

Inferential methods

Mixtures again
Observations from


Prior

µi |σi ∼ N (ξi , σ2 /ni ),
i σ2 ∼ I G (νi /2, s2 /2),
i i p ∼ Be(α, β)

Inferential methods

Mixtures again
Observations from


Prior

µi |σi ∼ N (ξi , σ2 /ni ),
i σ2 ∼ I G (νi /2, s2 /2),
i i p ∼ Be(α, β)

Posterior
n
π(θ|x1 , . . . , xn ) ∝ pϕ(xj ; µ1 , σ1 ) + (1 − p)ϕ(xj ; µ2 , σ2 ) π(θ)
j=1
n
= ω(kt )π(θ|(kt ))
=0 (kt )

[O(2n )]

Inferential methods

Mixtures again [2]

For a given permutation (kt ), conditional posterior distribution

σ2
π(θ|(kt )) = N ξ1 (kt ), 1
n1 +
×I G ((ν1 + )/2, s1 (kt )/2)
σ22
×N ξ2 (kt ),
n2 + n −
×I G ((ν2 + n − )/2, s2 (kt )/2)
×Be(α + , β + n − )

Inferential methods

Mixtures again [3]

where
1 2
¯
x1 (kt ) = t=1 xkt , ˆ
s1 (kt ) = ¯
t=1 (xkt − x1 (kt )) ,
1 n n 2
¯
x2 (kt ) = n− t= +1 xkt , ˆ
s2 (kt ) = ¯
t= +1 (xkt − x2 (kt ))

and
¯
n1 ξ1 + x1 (kt ) n2 ξ2 + (n − )¯2 (kt )
x
ξ1 (kt ) = , ξ2 (kt ) = ,
n1 + n2 + n −
n1
s1 (kt ) = s2 + s2 (kt ) +
1 ˆ1 (ξ1 − x1 (kt ))2 ,
¯
n1 +
n2 (n − )
s2 (kt ) = s2 + s2 (kt ) +
2 ˆ2 (ξ2 − x2 (kt ))2 ,
¯
n2 + n −

posterior updates of the hyperparameters

Inferential methods

Mixtures again [4]

Bayes estimator of θ:
n
π
δ (x1 , . . . , xn ) = ω(kt )Eπ [θ|x, (kt )]
=0 (kt )

c Too costly: 2n terms

Inferential methods

Mixtures again [4]

Bayes estimator of θ:
n
π
δ (x1 , . . . , xn ) = ω(kt )Eπ [θ|x, (kt )]
=0 (kt )

c Too costly: 2n terms
Unfortunate as the decomposition is meaningfull for clustering
purposes

Monte Carlo Integration

Monte Carlo integration


Importance Sampling
Bayesian importance sampling





Theme:
Generic problem of evaluating the integral

I = Ef [h(X)] = h(x) f(x) dx
X

where X is uni- or multidimensional, f is a closed form, partly
closed form, or implicit density, and h is a function


Monte Carlo integration (2)

Monte Carlo solution
First use a sample (X1 , . . . , Xm ) from the density f to approximate
the integral I by the empirical average
m
1
hm = h(xj )
m
j=1


Monte Carlo integration (2)

Monte Carlo solution
First use a sample (X1 , . . . , Xm ) from the density f to approximate
the integral I by the empirical average
m
1
hm = h(xj )
m
j=1

which converges
hm −→ Ef [h(X)]
by the Strong Law of Large Numbers


Monte Carlo precision

Estimate the variance with
m
1
vm = [h(xj ) − hm ]2 ,
m−1
j=1

and for m large,

hm − Ef [h(X)]
√ ∼ N (0, 1).
vm

Note: This can lead to the construction of a convergence test and
of conﬁdence bounds on the approximation of Ef [h(X)].


Example (Cauchy prior/normal sample)
For estimating a normal mean, a robust prior is a Cauchy prior

X ∼ N (θ, 1), θ ∼ C(0, 1).

Under squared error loss, posterior mean
∞
θ 2
2
e−(x−θ) /2 dθ
−∞ 1+θ
δπ (x) = ∞
1 2
e−(x−θ) /2 dθ
−∞ 1 + θ2


Example (Cauchy prior/normal sample (2))
Form of δπ suggests simulating iid variables

θ1 , · · · , θm ∼ N (x, 1)

and calculating
m m
ˆm θi 1
δπ (x) = .
1 + θ2
i 1 + θ2
i
i=1 i=1

The Law of Large Numbers implies

δπ (x) −→ δπ (x) as m −→ ∞.
ˆm


10.6
10.4
10.2
10.0
9.8
9.6

0 200 400 600 800 1000

iterations

Range of estimators δπ for 100 runs and x = 10
m

Importance Sampling

Importance sampling

Paradox
Simulation from f (the true density) is not necessarily optimal

Importance Sampling

Importance sampling

Paradox
Simulation from f (the true density) is not necessarily optimal

Alternative to direct sampling from f is importance sampling,
based on the alternative representation

f(x)
Ef [h(X)] = h(x) g(x) dx .
X g(x)

which allows us to use other distributions than f

Importance Sampling

Importance sampling algorithm

Evaluation of

Ef [h(X)] = h(x) f(x) dx
X

by
1 Generate a sample X1 , . . . , Xn from a distribution g
2 Use the approximation
m
1 f(Xj )
h(Xj )
m g(Xj )
j=1

Importance Sampling

Implementation details

◦ Instrumental distribution g chosen from distributions easy to
simulate
◦ The same sample (generated from g) can be used repeatedly,
not only for diﬀerent functions h, but also for diﬀerent
densities f
◦ Dependent proposals can be used, as seen later Pop’MC

Importance Sampling

Finite vs. inﬁnite variance

Although g can be any density, some choices are better than
others:
◦ Finite variance only when

f(X) f2 (X)
Ef h2 (X) = h2 (x) dx < ∞ .
g(X) X g(X)

Importance Sampling


others:

f(X) f2 (X)
Ef h2 (X) = h2 (x) dx < ∞ .
g(X) X g(X)

◦ Instrumental distributions with tails lighter than those of f
(that is, with sup f/g = ∞) not appropriate.
◦ If sup f/g = ∞, the weights f(xj )/g(xj ) vary widely, giving
too much importance to a few values xj .

Importance Sampling


others:

f(X) f2 (X)
Ef h2 (X) = h2 (x) dx < ∞ .
g(X) X g(X)

◦ Instrumental distributions with tails lighter than those of f
(that is, with sup f/g = ∞) not appropriate.
◦ If sup f/g = ∞, the weights f(xj )/g(xj ) vary widely, giving
too much importance to a few values xj .
◦ If sup f/g = M < ∞, ﬁnite variance for L2 functions

Importance Sampling

Selfnormalised importance sampling

For ratio estimator
n n
δn
h = ωi h(xi ) ωi
i=1 i=1

with Xi ∼ g(y) and Wi such that

E[Wi |Xi = x] = κf(x)/g(x)

Importance Sampling

Selfnormalised variance

then
1
var(δn ) ≈
h var(Sn ) − 2Eπ [h] cov(Sn , Sn ) + Eπ [h]2 var(Sn ) .
h h 1 1
n2 κ2
for
n n
Sn =
h Wi h(Xi ) , Sn =
1 Wi
i=1 i=1

Rough approximation
1
varδn ≈
h varπ (h(X)) {1 + varg (W)}
n


Bayes factor approximation

When approximating the Bayes factor

f0 (x|θ0 )π0 (θ0 )dθ0
Θ0
B01 =
f1 (x|θ1 )π1 (θ1 )dθ1
Θ1

use of importance functions 0 and 1 and
n0
n−1
0
i i
i=1 f0 (x|θ0 )π0 (θ0 )/
i
0 (θ0 )
B01 = n1
n−1
1
i i
i=1 f1 (x|θ1 )π1 (θ1 )/
i
1 (θ1 )


Diabetes in Pima Indian women

Example (R benchmark)
“A population of women who were at least 21 years old, of Pima
Indian heritage and living near Phoenix (AZ), was tested for
diabetes according to WHO criteria. The data were collected by
the US National Institute of Diabetes and Digestive and Kidney
Diseases.”
200 Pima Indian women with observed variables
plasma glucose concentration in oral glucose tolerance test
diastolic blood pressure
diabetes pedigree function
presence/absence of diabetes


Probit modelling on Pima Indian women

Probability of diabetes function of above variables

P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) ,


Probit modelling on Pima Indian women

Probability of diabetes function of above variables

P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) ,
Test of H0 : β3 = 0 for 200 observations of Pima.tr based on a
g-prior modelling:

β ∼ N3 (0, n XT X)−1


Importance sampling for the Pima Indian dataset

Use of the importance function inspired from the MLE estimate
distribution
β ∼ N(β, Σ)
ˆ ˆ


Importance sampling for the Pima Indian dataset

Use of the importance function inspired from the MLE estimate
distribution
β ∼ N(β, Σ)
ˆ ˆ

R Importance sampling code
model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],
sigma=2*model1$cov.unscaled))) / mean(exp(probitlpost(is2,y,X2)-
dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))


Diabetes in Pima Indian women
Comparison of the variation of the Bayes factor approximations
based on 100 replicas for 20, 000 simulations from the prior and
the above MLE importance sampler
5
4
3
2

Basic Monte Carlo Importance sampling


Illustration for the Pima Indian dataset

Use of the MLE induced conditional of β3 given (β1 , β2 ) as a
pseudo-posterior and mixture of both MLE approximations on β3
in bridge sampling estimate


Illustration for the Pima Indian dataset

Use of the MLE induced conditional of β3 given (β1 , β2 ) as a
pseudo-posterior and mixture of both MLE approximations on β3
in bridge sampling estimate
R bridge sampling code
cova=model2$cov.unscaled
expecta=model2$coeff[,1]
covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]

probit1=hmprobit(Niter,y,X1)
probit2=hmprobit(Niter,y,X2)
pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))
probit1p=cbind(probit1,pseudo)

bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+dnorm(probit2[,3],meanw(probit2[,1:2]),
sqrt(covw),log=T))/ (dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],
cova[3,3])))/ mean(exp(probitlpost(probit1p,y,X2))/(dmvnorm(probit1p,expecta,cova)+
dnorm(pseudo,expecta[3],cova[3,3])))


Diabetes in Pima Indian women (cont’d)
based on 100 × 20, 000 simulations from the prior (MC), the above
bridge sampler and the above importance sampler


The original harmonic mean estimator

When θki ∼ πk (θ|x),
T
1 1
T L(θkt |x)
t=1

is an unbiased estimator of 1/mk (x)
[Newton & Raftery, 1994]


The original harmonic mean estimator

When θki ∼ πk (θ|x),
T
1 1
T L(θkt |x)
t=1

is an unbiased estimator of 1/mk (x)
[Newton & Raftery, 1994]

Highly dangerous: Most often leads to an inﬁnite variance!!!


“The Worst Monte Carlo Method Ever”

“The good news is that the Law of Large Numbers guarantees that
this estimator is consistent ie, it will very likely be very close to the
correct answer if you use a suﬃciently large number of points from
the posterior distribution.


“The Worst Monte Carlo Method Ever”

“The good news is that the Law of Large Numbers guarantees that
this estimator is consistent ie, it will very likely be very close to the
correct answer if you use a suﬃciently large number of points from
the posterior distribution.
The bad news is that the number of points required for this
estimator to get close to the right answer will often be greater
than the number of atoms in the observable universe. The even
worse news is that it’s easy for people to not realize this, and to
na¨ıvely accept estimates that are nowhere close to the correct
value of the marginal likelihood.”
[Radford Neal’s blog, Aug. 23, 2008]


Approximating Zk from a posterior sample

Use of the [harmonic mean] identity

ϕ(θk ) ϕ(θk ) πk (θk )Lk (θk ) 1
Eπk x = dθk =
πk (θk )Lk (θk ) πk (θk )Lk (θk ) Zk Zk

no matter what the proposal ϕ(·) is.
[Gelfand & Dey, 1994; Bartolucci et al., 2006]


Approximating Zk from a posterior sample

Use of the [harmonic mean] identity

ϕ(θk ) ϕ(θk ) πk (θk )Lk (θk ) 1
Eπk x = dθk =
πk (θk )Lk (θk ) πk (θk )Lk (θk ) Zk Zk

no matter what the proposal ϕ(·) is.
[Gelfand & Dey, 1994; Bartolucci et al., 2006]
Direct exploitation of the MCMC output


Comparison with regular importance sampling

Harmonic mean: Constraint opposed to usual importance sampling
constraints: ϕ(θ) must have lighter (rather than fatter) tails than
πk (θk )Lk (θk ) for the approximation
T (t)
1 ϕ(θk )
Z1k = 1 (t) (t)
T πk (θk )Lk (θk )
t=1

to enjoy ﬁnite variance


Comparison with regular importance sampling (cont’d)

Compare Z1k with a standard importance sampling approximation
T (t) (t)
1 πk (θk )Lk (θk )
Z2k = (t)
T ϕ(θk )
t=1

(t)
where the θk ’s are generated from the density ϕ(·) (with fatter
tails like t’s)


HPD indicator as ϕ
Use the convex hull of MCMC simulations corresponding to the
10% HPD region (easily derived!) and ϕ as indicator:
10
ϕ(θ) = Id(θ,θ(t) )
T
t∈HPD


based on 100 replicas for 20, 000 simulations for a simulation from
the above harmonic mean sampler and importance samplers
3.102 3.104 3.106 3.108 3.110 3.112 3.114 3.116

Harmonic mean Importance sampling


Chib’s representation

Direct application of Bayes’ theorem: given x ∼ fk (x|θk ) and
θk ∼ πk (θk ),
fk (x|θk ) πk (θk )
mk (x) =
πk (θk |x)
[Bayes Theorem]


Case of latent variables

For missing variable z as in mixture models, natural Rao-Blackwell
estimate
T
∗ 1 (t)
πk (θk |x) = πk (θ∗ |x, zk ) ,
k
T
t=1
(t)
where the zk ’s are Gibbs sampled latent variables


Case of the probit model

For the completion by z,
1
ˆ
π(θ|x) = π(θ|x, z(t) )
T t

is a simple average of normal densities


Case of the probit model

For the completion by z,
1
ˆ
π(θ|x) = π(θ|x, z(t) )
T t

is a simple average of normal densities
R Bridge sampling code
gibbs1=gibbsprobit(Niter,y,X1)
gibbs2=gibbsprobit(Niter,y,X2)
bfchi=mean(exp(dmvlnorm(t(t(gibbs2$mu)-model2$coeff[,1]),mean=rep(0,3),
sigma=gibbs2$Sigma2)-probitlpost(model2$coeff[,1],y,X2)))/
mean(exp(dmvlnorm(t(t(gibbs1$mu)-model1$coeff[,1]),mean=rep(0,2),
sigma=gibbs1$Sigma2)-probitlpost(model1$coeff[,1],y,X1)))


based on 100 replicas for 20, 000 simulations for a simulation from
the above Chib’s and importance samplers

The Metropolis-Hastings Algorithm




Monte Carlo Methods based on Markov Chains
The Metropolis–Hastings algorithm
The random walk Metropolis-Hastings algorithm
Adaptive MCMC



Running Monte Carlo via Markov Chains

Epiphany! It is not necessary to use a sample from the distribution
f to approximate the integral

I= h(x)f(x)dx ,


Running Monte Carlo via Markov Chains

Epiphany! It is not necessary to use a sample from the distribution
f to approximate the integral

I= h(x)f(x)dx ,

Principle: Obtain X1 , . . . , Xn ∼ f (approx) without directly
simulating from f, using an ergodic Markov chain with stationary
distribution f
[Metropolis et al., 1953]


Running Monte Carlo via Markov Chains (2)

Idea
For an arbitrary starting value x(0) , an ergodic chain (X(t) ) is
generated using a transition kernel with stationary distribution f



Idea

Insures the convergence in distribution of (X(t) ) to a random
variable from f.
For a “large enough” T0 , X(T0 ) can be considered as
distributed from f
Produce a dependent sample X(T0 ) , X(T0 +1) , . . ., which is
generated from f, suﬃcient for most approximation purposes.



Idea

Insures the convergence in distribution of (X(t) ) to a random
variable from f.
For a “large enough” T0 , X(T0 ) can be considered as
distributed from f
Produce a dependent sample X(T0 ) , X(T0 +1) , . . ., which is
generated from f, suﬃcient for most approximation purposes.
Problem: How can one build a Markov chain with a given
stationary distribution?



Basics
The algorithm uses the objective (target) density

f

and a conditional density
q(y|x)
called the instrumental (or proposal) distribution


The MH algorithm

Algorithm (Metropolis–Hastings)
Given x(t) ,
1. Generate Yt ∼ q(y|x(t) ).
2. Take

Yt with prob. ρ(x(t) , Yt ),
X(t+1) =
x(t) with prob. 1 − ρ(x(t) , Yt ),

where
f(y) q(x|y)
ρ(x, y) = min ,1 .
f(x) q(y|x)


Features

Independent of normalizing constants for both f and q(·|x)
(ie, those constants independent of x)
Never move to values with f(y) = 0
The chain (x(t) )t may take the same value several times in a
row, even though f is a density wrt Lebesgue measure
The sequence (yt )t is usually not a Markov chain


Convergence properties

The M-H Markov chain is reversible, with invariant/stationary
density f since it satisﬁes the detailed balance condition
f(y) K(y, x) = f(x) K(x, y)


Convergence properties

The M-H Markov chain is reversible, with invariant/stationary
density f since it satisﬁes the detailed balance condition
f(y) K(y, x) = f(x) K(x, y)
If
q(y|x) > 0 for every (x, y),
the chain is Harris recurrent and
T
1
lim h(X(t) ) = h(x)df(x) a.e. f.
T →∞ T
t=1

lim Kn (x, ·)µ(dx) − f =0
n→∞
TV
for every initial distribution µ


Random walk Metropolis–Hastings

Use of a local perturbation as proposal

Yt = X(t) + εt ,

where εt ∼ g, independent of X(t) .
The instrumental density is now of the form g(y − x) and the
Markov chain is a random walk if we take g to be symmetric
g(x) = g(−x)


Algorithm (Random walk Metropolis)
Given x(t)
1 Generate Yt ∼ g(y − x(t) )
2 Take

Y f(Yt )
(t+1) t with prob. min 1, ,
X = f(x(t) )

x(t) otherwise.


RW-MH on mixture posterior distribution
3
2
µ2

1
0

X
−1

−1 0 1 2 3

µ1

Random walk MCMC output for .7N(µ1 , 1) + .3N(µ2 , 1)


Acceptance rate

A high acceptance rate is not indication of eﬃciency since the
random walk may be moving “too slowly” on the target surface


Acceptance rate


If x(t) and yt are “too close”, i.e. f(x(t) ) f(yt ), yt is accepted
with probability
f(yt )
min ,1 1.
f(x(t) )
and acceptance rate high


Acceptance rate


If average acceptance rate low, the proposed values f(yt ) tend to
be small wrt f(x(t) ), i.e. the random walk [not the algorithm!]
moves quickly on the target surface often reaching its boundaries


Rule of thumb

In small dimensions, aim at an average acceptance rate of 50%. In
large dimensions, at an average acceptance rate of 25%.
[Gelman,Gilks and Roberts, 1995]


Noisy AR(1)

Target distribution of x given x1 , x2 and y is

−1 τ2
exp (x − ϕx1 )2 + (x2 − ϕx)2 + (y − x2 )2 .
2τ2 σ2

For a Gaussian random walk with scale ω small enough, the
random walk never jumps to the other mode. But if the scale ω is
suﬃciently large, the Markov chain explores both modes and give a
satisfactory approximation of the target distribution.


Noisy AR(2)

Markov chain based on a random walk with scale ω = .1.


Noisy AR(3)

Markov chain based on a random walk with scale ω = .5.

Adaptive MCMC

No free lunch!!

MCMC algorithm trained on-line usually invalid:

Adaptive MCMC

No free lunch!!

using the whole past of the “chain” implies that this is not a
Markov chain any longer!

Adaptive MCMC

No free lunch!!

using the whole past of the “chain” implies that this is not a
Markov chain any longer!
This means standard Markov chain (ergodic) theory does not apply
[Meyn & Tweedie, 1994]

Adaptive MCMC

Example (Poly t distribution)
t T(3, θ, 1) sample (x1 , . . . , xn ) with ﬂat prior π(θ) = 1
Fit a normal proposal from empirical mean and empirical variance
of the chain so far,
t t
1 1
µt = θ(i)
and σ2
t = (θ(i) − µt )2 ,
t t
i=1 i=1

Adaptive MCMC

Example (Poly t distribution)
t T(3, θ, 1) sample (x1 , . . . , xn ) with ﬂat prior π(θ) = 1
Fit a normal proposal from empirical mean and empirical variance
of the chain so far,
t t
1 1
µt = θ(i)
and σ2
t = (θ(i) − µt )2 ,
t t
i=1 i=1

Metropolis–Hastings algorithm with acceptance probability

n −(ν+1)/2
ν + (xj − θ(t) )2 exp −(µt − θ(t) )2 /2σ2t
,
ν + (xj − ξ)2 exp −(µt − ξ)2 /2σ2 t
j=2

where ξ ∼ N(µt , σ2 ).
t

Adaptive MCMC

Invalid scheme

invariant distribution not invariant any longer
when range of initial values too small, the θ(i) ’s cannot
converge to the target distribution and concentrates on too
small a support.
long-range dependence on past values modiﬁes the
distribution of the sequence.
using past simulations to create a non-parametric
approximation to the target distribution does not work either

Adaptive MCMC

0.2

3
0.0

2
−0.2
x

1
−0.4

0
0 1000 2000 3000 4000 5000 −1.5 −1.0 −0.5 0.0 0.5

Iterations θ
−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

0.6
0.4
x

0.2
0.0
0 1000 2000 3000 4000 5000 −2 −1 0 1 2

Iterations θ

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
2
1
0
x

−1
−2

0 1000 2000 3000 4000 5000 −2 −1 0 1 2 3

Iterations θ

Adaptive scheme for a sample of 10 xj ∼ T3 and initial
variances of (top) 0.1, (middle) 0.5, and (bottom) 2.5.

Adaptive MCMC

1.0
1.5

0.8
1.0

0.6
0.5
x

0.4
0.0

0.2
−0.5

0.0
−1.0

0 10000 30000 50000 −1.5 −0.5 0.5 1.0 1.5

Iterations θ

Sample produced by 50, 000 iterations of a nonparametric
adaptive MCMC scheme and comparison of its distribution
with the target distribution.

Adaptive MCMC

Simply forget about it!

Warning:
One should not constantly adapt the proposal on past
performances

Either adaptation ceases after a period of burnin...
or the adaptive scheme must be theoretically assessed on its own
right.
[Haario & Saaksman, 1999;Andrieu & Robert, 2001]

Adaptive MCMC

Diminishing adaptation
Adaptivity of cyberparameter γt has to be gradually tuned down
to recover ergodicity
[Roberts & Rosenthal, 2007]

Adaptive MCMC

Suﬃcient conditions:
1 total variation distance between two consecutive kernels must
uniformly decrease to zero
[diminishing adaptation]

lim sup Kγt (x, ·) − Kγt+1 (x, ·) TV =0
t→∞ x

2 times to stationary remains bounded for any ﬁxed γt
[containment]

Adaptive MCMC


t→∞ x

[containment]
Works for random walk proposal that relies on the
empirical variance of the sample modulo a ridge-like stabilizing factor
[Haario, Sacksman & Tamminen, 1999]

Adaptive MCMC


t→∞ x

[containment]
Tune the scale in each direction toward an optimal acceptance rate
of 0.44.
[Roberts & Rosenthal,2006]

Adaptive MCMC


t→∞ x

[containment]
Packages amcmc and grapham

Approximate Bayesian computation





ABC basics
Alphabet soup
Calibration of ABC

ABC basics

Untractable likelihoods

There are cases when the likelihood function f(y|θ) is unavailable
and when the completion step

f(y|θ) = f(y, z|θ) dz
Z

is impossible or too costly because of the dimension of z
c MCMC cannot be implemented!
[Robert & Casella, 2004]

ABC basics

Illustrations

Example
Stochastic volatility model: for Highest weight trajectories

t = 1, . . . , T ,

0.4
0.2
yt = exp(zt ) t, zt = a+bzt−1 +σηt ,

0.0
−0.2
T very large makes it diﬃcult to
−0.4
include z within the simulated 0 200 400

t
600 800 1000

parameters

ABC basics

Illustrations

Example
Potts model: if y takes values on a grid Y of size kn and

f(y|θ) ∝ exp θ Iyl =yi
l∼i

where l∼i denotes a neighbourhood relation, n moderately large
prohibits the computation of the normalising constant

ABC basics

Illustrations

Example
Inference on CMB: in cosmology, study of the Cosmic Microwave
Background via likelihoods immensely slow to computate (e.g
WMAP, Plank), because of numerically costly spectral transforms
[Data is a Fortran program]
[Kilbinger et al., 2010, MNRAS]

ABC basics

Illustrations

Example
Coalescence tree: in population
genetics, reconstitution of a common
ancestor from a sample of genes via
a phylogenetic tree that is close to
impossible to integrate out
[100 processor days with 4
parameters]
[Cornuet et al., 2009, Bioinformatics]

ABC basics

The ABC method

Bayesian setting: target is π(θ)f(x|θ)

ABC basics

The ABC method

When likelihood f(x|θ) not in closed form, likelihood-free rejection
technique:

ABC basics

The ABC method

When likelihood f(x|θ) not in closed form, likelihood-free rejection
technique:
ABC algorithm
For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly
simulating
θ ∼ π(θ) , z ∼ f(z|θ ) ,
until the auxiliary variable z is equal to the observed value, z = y.

[Tavar´ et al., 1997]
e

ABC basics

Why does it work?!

The proof is trivial:

f(θi ) ∝ π(θi )f(z|θi )Iy (z)
z∈D
∝ π(θi )f(y|θi )
= π(θi |y) .

[Accept–Reject 101]

ABC basics

Earlier occurrence

‘Bayesian statistics and Monte Carlo methods are ideally
suited to the task of passing many models over one
dataset’
[Don Rubin, Annals of Statistics, 1984]

Note Rubin (1984) does not promote this algorithm for
likelihood-free simulation but frequentist intuition on posterior
distributions: parameters from posteriors are more likely to be
those that could have generated the data.

ABC basics

A as approximative

When y is a continuous random variable, equality z = y is replaced
with a tolerance condition,

ρ(y, z)

where ρ is a distance

ABC basics

A as approximative

When y is a continuous random variable, equality z = y is replaced
with a tolerance condition,

ρ(y, z)

where ρ is a distance
Output distributed from

π(θ) Pθ {ρ(y, z) < } ∝ π(θ|ρ(y, z) < )

ABC basics

ABC algorithm

Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
repeat
generate θ from the prior distribution π(·)
generate z from the likelihood f(·|θ )
until ρ{η(z), η(y)}
set θi = θ
end for

where η(y) deﬁnes a (maybe in-suﬃcient) statistic

ABC basics

Output

The likelihood-free algorithm samples from the marginal in z of:

π(θ)f(z|θ)IA ,y (z)
π (θ, z|y) = ,
A ,y ×Θ π(θ)f(z|θ)dzdθ

where A ,y = {z ∈ D|ρ(η(z), η(y)) < }.

ABC basics

Output

The likelihood-free algorithm samples from the marginal in z of:

π(θ)f(z|θ)IA ,y (z)
π (θ, z|y) = ,
A ,y ×Θ π(θ)f(z|θ)dzdθ

where A ,y = {z ∈ D|ρ(η(z), η(y)) < }.
The idea behind ABC is that the summary statistics coupled with a
small tolerance should provide a good approximation of the
posterior distribution:

π (θ|y) = π (θ, z|y)dz ≈ π(θ|y) .

ABC basics

Pima Indian benchmark

80
100

1.0
80

60

0.8
60

0.6
Density

Density

Density
40
40

0.4
20
20

0.2
0.0
0

0

−0.005 0.010 0.020 0.030 −0.05 −0.03 −0.01 −1.0 0.0 1.0 2.0

Figure: Comparison between density estimates of the marginals on β1
(left), β2 (center) and β3 (right) from ABC rejection samples (red) and
MCMC samples (black)

.

ABC basics

MA example

Consider the MA(q) model
q
xt = t+ ϑi t−i
i=1

Simple prior: uniform prior over the identiﬁability zone, e.g.
triangle for MA(2)

ABC basics

MA example (2)
ABC algorithm thus made of
1 picking a new value (ϑ1 , ϑ2 ) in the triangle
2 generating an iid sequence ( t )−q<t T
3 producing a simulated series (xt )1 t T

ABC basics

MA example (2)
ABC algorithm thus made of
1 picking a new value (ϑ1 , ϑ2 ) in the triangle
2 generating an iid sequence ( t )−q<t T
3 producing a simulated series (xt )1 t T
Distance: basic distance between the series
T
ρ((xt )1 t T , (xt )1 t T) = (xt − xt )2
t=1

or between summary statistics like the ﬁrst q autocorrelations
T
τj = xt xt−j
t=j+1

ABC basics

Comparison of distance impact

Evaluation of the tolerance on the ABC sample against both
distances ( = 100%, 10%, 1%, 0.1%) for an MA(2) model

ABC basics

Comparison of distance impact

4

1.5
3

1.0
2

0.5
1

0.0
0

0.0 0.2 0.4 0.6 0.8 −2.0 −1.0 0.0 0.5 1.0 1.5

θ1 θ2

Evaluation of the tolerance on the ABC sample against both
distances ( = 100%, 10%, 1%, 0.1%) for an MA(2) model

ABC basics

ABC advances

Simulating from the prior is often poor in eﬃciency

ABC basics

ABC advances

Either modify the proposal distribution on θ to increase the density
of x’s within the vicinity of y...
[Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007]

ABC basics

ABC advances


...or by viewing the problem as a conditional density estimation
and by developing techniques to allow for larger
[Beaumont et al., 2002]

ABC basics

ABC advances


...or by viewing the problem as a conditional density estimation
and by developing techniques to allow for larger

.....or even by including in the inferential framework [ABCµ ]
[Ratmann et al., 2009]

Alphabet soup

ABC-NP

Better usage of [prior] simulations by
adjustement: instead of throwing away
θ such that ρ(η(z), η(y)) > , replace
θs with locally regressed

θ∗ = θ − {η(z) − η(y)}T β
ˆ
[Csill´ry et al., TEE, 2010]
e

ˆ
where β is obtained by [NP] weighted least square regression on
(η(z) − η(y)) with weights

Kδ {ρ(η(z), η(y))}

[Beaumont et al., 2002, Genetics]

Alphabet soup

ABC-MCMC

Markov chain (θ(t) ) created via the transition function

θ ∼ Kω (θ |θ(t) ) if x ∼ f(x|θ ) is such that x = y


π(θ )Kω (t) |θ )
θ(t+1)
= and u ∼ U(0, 1) π(θ(t) )K (θ |θ(t) ) ,

 (t) ω (θ
θ otherwise,

Alphabet soup

ABC-MCMC

Markov chain (θ(t) ) created via the transition function

θ ∼ Kω (θ |θ(t) ) if x ∼ f(x|θ ) is such that x = y


π(θ )Kω (t) |θ )
θ(t+1)
= and u ∼ U(0, 1) π(θ(t) )K (θ |θ(t) ) ,

 (t) ω (θ
θ otherwise,

has the posterior π(θ|y) as stationary distribution
[Marjoram et al, 2003]

Alphabet soup

ABC-MCMC (2)

Algorithm 2 Likelihood-free MCMC sampler
Use Algorithm 1 to get (θ(0) , z(0) )
for t = 1 to N do
Generate θ from Kω ·|θ(t−1) ,
Generate z from the likelihood f(·|θ ),
Generate u from U[0,1] ,
π(θ )Kω (θ(t−1) |θ )
if u I
π(θ(t−1) Kω (θ |θ(t−1) ) A ,y (z ) then
set (θ(t) , z(t) ) = (θ , z )
else
(θ(t) , z(t) )) = (θ(t−1) , z(t−1) ),
end if
end for

Alphabet soup

ABCµ


Use of a joint density

f(θ, |y) ∝ ξ( |y, θ) × πθ (θ) × π ( )

where y is the data, and ξ( |y, θ) is the prior predictive density of
ρ(η(z), η(y)) given θ and x when z ∼ f(z|θ)

Alphabet soup

ABCµ


Use of a joint density

f(θ, |y) ∝ ξ( |y, θ) × πθ (θ) × π ( )

where y is the data, and ξ( |y, θ) is the prior predictive density of
ρ(η(z), η(y)) given θ and x when z ∼ f(z|θ)
Warning! Replacement of ξ( |y, θ) with a non-parametric kernel
approximation.

Alphabet soup

A PMC version
Use of the same kernel idea as ABC-PRC but with IS correction
[Beaumont et al., 2009; Toni et al., 2009]
Generate a sample at iteration t by
N
(t−1) (t−1)
ˆ
πt (θ (t)
)∝ ωj Kt (θ(t) |θj )
j=1

modulo acceptance of the associated xt , and use an importance
(t)
weight associated with an accepted simulation θi
(t) (t) (t)
ωi ∝ π(θi ) πt (θi ) .
ˆ

c Still likelihood-free

Alphabet soup

Sequential Monte Carlo

SMC is a simulation technique to approximate a sequence of
related probability distributions πn with π0 “easy” and πT target.
Iterated IS as PMC: particles moved from time n to time n via
kernel Kn and use of a sequence of extended targets πn˜
n
˜
πn (z0:n ) = πn (zn ) Lj (zj+1 , zj )
j=0

where the Lj ’s are backward Markov kernels [check that πn (zn ) is
a marginal]
[Del Moral, Doucet & Jasra, 2006]

Alphabet soup

ABC-SMC

True derivation of an SMC-ABC algorithm
Use of a kernel Kn associated with target π n and derivation of
the backward kernel
π n (z )Kn (z , z)
Ln−1 (z, z ) =
πn (z)

Update of the weights
M
m=1 IA n
(xm )
in
win ∝ wi(n−1) M
m=1 IA n−1
(xm
i(n−1) )

when xm ∼ K(xi(n−1) , ·)
in

Alphabet soup

Properties of ABC-SMC

The ABC-SMC method properly uses a backward kernel L(z, z ) to
simplify the importance weight and to remove the dependence on
the unknown likelihood from this weight. Update of importance
weights is reduced to the ratio of the proportions of surviving
particles
Major assumption: the forward kernel K is supposed to be invariant
against the true target [tempered version of the true posterior]

Alphabet soup

Properties of ABC-SMC

The ABC-SMC method properly uses a backward kernel L(z, z ) to
simplify the importance weight and to remove the dependence on
the unknown likelihood from this weight. Update of importance
weights is reduced to the ratio of the proportions of surviving
particles
Major assumption: the forward kernel K is supposed to be invariant
against the true target [tempered version of the true posterior]
Adaptivity in ABC-SMC algorithm only found in on-line
construction of the thresholds t , slowly enough to keep a large
number of accepted transitions
[Del Moral, Doucet & Jasra, 2009]

Calibration of ABC

Which summary statistics?

Fundamental difficulty of the choice of the summary statistic when
there is no non-trivial sufficient statistic [except when done by the
experimenters in the field]

Calibration of ABC


Starting from a large collection of summary statistics is available,
Joyce and Marjoram (2008) consider the sequential inclusion into
the ABC target, with a stopping rule based on a likelihood ratio
test.

Calibration of ABC


Starting from a large collection of summary statistics is available,
Joyce and Marjoram (2008) consider the sequential inclusion into
the ABC target, with a stopping rule based on a likelihood ratio
test.
Does not taking into account the sequential nature of the tests
Depends on parameterisation
Order of inclusion matters.

Calibration of ABC

Point estimation vs....

In the case of the computation of E[h(θ)|y], Fearnhead and
Prangle [12/14/2011] demonstrate that the optimal summary
statistic is
η (y) = E[h(θ)|y]

Calibration of ABC

Point estimation vs....

In the case of the computation of E[h(θ)|y], Fearnhead and
Prangle [12/14/2011] demonstrate that the optimal summary
statistic is
η (y) = E[h(θ)|y]
Unavailable but approximated by a prior ABC run and ABC-NP
corrections

Calibration of ABC

...vs. model choice

In the case of the computation of a Bayes factor B12 (y), ABC
approximation
T T
Im(t) =1 Im(t) =2
t=1 t=1

may fail to converge
[Robert et al., 2011]

Calibration of ABC

...vs. model choice

In the case of the computation of a Bayes factor B12 (y), ABC
approximation
T T
Im(t) =1 Im(t) =2
t=1 t=1

may fail to converge
[Robert et al., 2011]

Separation conditions on the summary statistics for convergence to
occur
[Marin et al., 2011]

WSC 2011, advanced tutorial on simulation in Statistics

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to WSC 2011, advanced tutorial on simulation in Statistics (20)

More from Christian Robert (20)

Recently uploaded (20)

WSC 2011, advanced tutorial on simulation in Statistics