Multiple Imputation of Missing Blood Pressure Covariates in
Survival Analysis
STATS 756: Topics in Biostatistics - Fall 2021
Lim, Kyuson
Department of Mathematics and Statistics
McMaster University
December 1st, 2021
Kyuson Lim 1 / 25
Outline
1 Introduction/Motivation
2 Multiple imputation
3 Linear regression: multiple imputation
4 Fully conditional specification (FCS): Multivariate imputation
5 Pooling: parameters
6 Simulation Study
7 Bibliography
Background of research: question of interest
Main interest: determine how the measured variables influence the relation between mortality and blood pressure (BP)
in people aged 85 and over: 1236 citizens of Leiden (1986), examined between 1987 and 1989.
Concern: whether prescribing anti-hypertensive drugs inadvertently shortens life.
Variables: BP, age (85-89, 90-94, 95+), type of residence, activities of daily living (independent,
dependent), history of hypertension, use of diuretics, blood samples.
Two Cox regressions:
  Relation between mortality and BP adjusted for age & sex
  Relation between mortality and BP adjusted for age, sex & health
⇒ difference in health between the BP groups
Suspicion: individuals with lower BP and higher mortality risk had fewer BP measurements.
Excluding incomplete cases (≈ 12.5%) produced deflated mortality estimates for the lower BP groups
⇒ distortion of the influence of BP on survival.
Background of research: assumption about non-response
Under missing at random (MAR): create a small number of completed data matrices, in which the missing values are
replaced by plausible values.
Valid inference ⇒ the number of imputations depends on the amount of missing information (e.g. 3 or 5).
Proper procedure ⇒ the variability among imputations reflects the uncertainty about the unknown value that
would hypothetically have been observed.
Goal: provide a Bayesian approach, based on switching linear regression models, for multiple
imputation under different types of missing values.
1 What information to use for choosing between non-response mechanisms.
2 How to choose the set of predictors.
3 How to generate imputations (e.g. for mixed variable types).
4 How to specify different models for non-response.
Study of data and problems
Observational problem: the group without a BP measurement has a much higher mortality rate.
BP was not measured for 121 individuals.
BP was measured more often when it was suspected to be too high (hypertension).
BP was measured less frequently for very old people and for subjects too ill to be measured.
Data collection period: the rate of missing BP first increases (from 5% to 40%) and then drops to a constant level (10-15%).
                      History of previous hypertension
Survived > 3 years    No                 Yes               Total
Yes                   8.7% (34/390)      8.1% (10/124)     8.6% (44/514)
No                    19.2% (69/360)     9.8% (8/82)       17.4% (77/442)
Total                 13.7% (103/750)    8.7% (18/206)     12.7% (121/956)
Table 1. Proportion of individuals with no BP measured
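Individuals without a BP measurement have higher mortality: BP is missing for 17.4% of those who died within 3 years, versus 8.6% of those who survived.
A relatively large group of individuals without hypertension and with high mortality risk has no BP measurement (19.2%).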
The goal of the sensitivity analysis is to explore the result of the analysis under alternative scenarios for
the missing data.
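As a quick check of Table 1, the percentages can be recomputed from the counts shown in parentheses. This is a minimal Python sketch (the slides themselves use R); the dictionary keys and the helper `pct_missing` are illustrative names, not part of the original analysis.

```python
# Counts from Table 1: (number without a BP measurement, group size) per cell.
# Rows: survived > 3 years; columns: history of previous hypertension.
counts = {
    ("survived", "no hypertension"): (34, 390),
    ("survived", "hypertension"): (10, 124),
    ("died", "no hypertension"): (69, 360),
    ("died", "hypertension"): (8, 82),
}


def pct_missing(missing, total):
    """Percentage of a group without a BP measurement."""
    return 100.0 * missing / total


cell_pct = {cell: round(pct_missing(m, n), 1) for cell, (m, n) in counts.items()}

# The overall figure is the sum of the four cells.
total_missing = sum(m for m, _ in counts.values())
total_n = sum(n for _, n in counts.values())
overall_pct = round(pct_missing(total_missing, total_n), 1)

for cell, pct in cell_pct.items():
    print(cell, pct)
print("overall", total_missing, total_n, overall_pct)
```

The four cells add up to the 121 of 956 individuals reported in the "Total" row.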
Selection of variables: chi-square (χ²) test of independence
Variables related to non-response include age, type of residence, activities of daily living,
and use of diuretics (year of interview and blood samples are not categorical and are therefore excluded here).
Figure: Chi-square tests of independence for 835 individuals
BP was measured less frequently for very old (95+) people and for those with
health problems.
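To illustrate what a chi-square test of independence computes, the sketch below applies it to one 2x2 table that can be read off the "Total" column of Table 1 (survival beyond 3 years against whether BP is missing). It is a Python illustration only, not the set of tests shown in the figure; no continuity correction is applied, and the function name `chi_square_2x2` is made up for this example.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 df) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: survived > 3 years (yes, no); columns: BP missing, BP measured.
# Counts follow from Table 1 (44 of 514 and 77 of 442 without BP).
table = [[44, 514 - 44], [77, 442 - 77]]
stat, p_value = chi_square_2x2(table)
print(round(stat, 2), p_value)
```

A small p-value here indicates that missingness of BP is related to survival, which is the kind of evidence used to decide which variables belong in the imputation model.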
Missing data mechanisms: MCAR, MAR and MNAR
MCAR (missing completely at random): the probability of being missing is the same for all cases ⇒ the cause
of the missing data is unrelated to the data.
p(R = 0|Yobs, Ymis, Z) = p(R = 0|Z)
Unrealistic here: survival of individuals with and without a BP measurement differs systematically.
MAR (missing at random): the probability of being missing is the same only within groups defined by
the observed data.
p(R = 0|Yobs, Ymis, Z) = p(R = 0|Yobs, Z)
MAR on Yobs: the probability of a BP measurement depends on survival ⇒ correction for non-response.
MAR on Z: the probability of non-response is related to the covariates (χ² test) ⇒ correction for non-response.
MNAR (missing not at random): the probability of being missing also depends on unobserved information,
including Ymis itself.
p(R = 0|Yobs, Ymis, Z) does not simplify.
Investigation: the probability of non-response may be related to the (unobserved) BP ⇒ distribution of Ymis, sensitivity
analysis.
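The three mechanisms can be illustrated with a small simulation. The Python sketch below uses entirely hypothetical data (a covariate `z`, a BP-like outcome `y`, and made-up missingness probabilities) and compares the mean of the cases that remain observed with the mean of the full sample.

```python
import math
import random

random.seed(1)
n = 20000
# Hypothetical data: z is a fully observed covariate, y is a BP-like outcome.
z = [random.gauss(0, 1) for _ in range(n)]
y = [150 + 10 * zi + random.gauss(0, 20) for zi in z]


def logistic(x):
    return 1 / (1 + math.exp(-x))


def observed_mean(values, prob_missing):
    """Mean of the values that remain observed (R = 1)."""
    kept = [v for v, p in zip(values, prob_missing) if random.random() >= p]
    return sum(kept) / len(kept)


true_mean = sum(y) / n
# MCAR: the same probability of being missing for every case.
mcar = observed_mean(y, [0.3] * n)
# MAR: the probability depends only on the observed covariate z.
mar = observed_mean(y, [logistic(-1 - 1.5 * zi) for zi in z])
# MNAR: the probability depends on the value of y itself (low y is missing more often).
mnar = observed_mean(y, [logistic(-1 - (yi - 150) / 15) for yi in y])

print(round(true_mean, 1), round(mcar, 1), round(mar, 1), round(mnar, 1))
```

In this setup the complete-case mean is close to the truth only under MCAR. Under MAR the bias can be corrected by conditioning on `z`; under MNAR it cannot be corrected from the observed data alone, which is why a sensitivity analysis is needed.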
Appropriate selection of variables: Influx and outflux
Influx and outflux are summaries of the missing data pattern intended to aid in the construction of
imputation models.
The influx of a variable quantifies how well its missing data connect to the observed data on
other variables.
Variables with higher influx depend strongly on the imputation model.
The outflux of a variable quantifies how well its observed data connect to the missing data on
other variables.
A variable with higher outflux is better connected to the missing data, and thus potentially more useful for
imputing other variables.
Figure: Global influx-outflux pattern of the Leiden 85+ Cohort data
All points are relatively close to the diagonal, which indicates that influx and outflux are balanced.
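The two coefficients can be computed directly from the response indicator matrix. The Python sketch below follows the definitions as I understand them from Van Buuren's textbook (influx: pairs where the variable is missing and another variable is observed, divided by the total number of observed cells; outflux: pairs where the variable is observed and another is missing, divided by the total number of missing cells). The function name `flux` and the tiny matrix are illustrative, and returning 0 when nothing is missing is an assumption of this sketch.

```python
def flux(r):
    """Influx and outflux per column of a response matrix r (1 = observed, 0 = missing)."""
    n, p = len(r), len(r[0])
    total_obs = sum(sum(row) for row in r)
    total_mis = n * p - total_obs
    influx, outflux = [], []
    for j in range(p):
        # Pairs (j, k) in the same row: j missing and k observed.
        mis_obs = sum((1 - row[j]) * row[k] for row in r for k in range(p))
        # Pairs (j, k) in the same row: j observed and k missing.
        obs_mis = sum(row[j] * (1 - row[k]) for row in r for k in range(p))
        influx.append(mis_obs / total_obs)
        outflux.append(obs_mis / total_mis if total_mis else 0.0)
    return influx, outflux


# Hypothetical response pattern: 4 cases, 3 variables; the first is complete.
r = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]
influx, outflux = flux(r)
print(influx, outflux)
```

A completely observed variable has influx 0 and outflux 1, which places it in the upper-left corner of the influx-outflux plot.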
Generating multiple imputations
The MICE (Multivariate Imputation by Chained Equations) algorithm is an MCMC method built from
univariate imputation models.
It starts by filling in the incomplete data with random draws from the observed data.
One iteration consists of one cycle through all Yj.
It then samples from the conditional distributions in order to obtain samples from the joint
distribution.
The m multiple imputations are generated as m parallel streams.
Response mechanism:
Outcome variables: Y1 = systolic BP, Y2 = diastolic BP, Y3 = survival or censoring time, Y4 = censoring indicator (0, 1)
⇒ p(Y3, Y4|Y1, Z) and p(Y3, Y4|Y2, Z)
Y = (Yobs, Ymis), with Rij = 1 if Yij is observed
⇒ p(R = 1|Yobs, Ymis, Z) (NMAR) or p(R = 1|Yobs, Z) (MAR)
Yobs, Ymis and Z define the different types of response mechanism.
Observed covariates: 31 columns
Multiple imputation: algorithm
1 Specify the posterior predictive density p(Ymis|X, R) (X is the set of predictors), given the non-response mechanism
p(R|Y, Z) and p(Y, Z).
  p(Ymis|X): linear regression of the incomplete BP on the predictor variables X.
  Select a suitable subset of the data containing no more than 15-25 variables, X = [Yobs, Z, U, V].
    1 Yobs, Z: include all variables, especially if the complete-data model contains strong predictive relations.
    2 U: variables that differ between the response and non-response groups; inspect by correlation.
    3 V: variables that explain a considerable amount of variance, to reduce uncertainty.
    4 U and V: remove those with many missing values (%) within the incomplete cases.
2 Draw imputations from p(Ymis|X, R) to produce m completed datasets.
3 Fit the Cox regression model to each of the m completed datasets.
4 Pool the m analysis results and variance estimates.
Selection of variables: 24 variables
1 Include the variables that appear in the complete-data model: blood pressure, survival, sex, and age.
2 Include variables related to non-response: type of residence, activities of daily living, previous hypertension, use of diuretics, year of interview,
and blood sample.
3 Select variables whose absolute correlation with SBP or DBP exceeds 0.15.
4 Remove variables with fewer than 50% usable cases.
Figure: Correlations between the cumulative death hazard H0(T), survival time T, log(T), SBP and DBP
The high correlation may be caused by the fact that nearly everyone in this cohort has died, so the percentage of censoring is low.
Observe that the correlation between log(T) and blood pressure is higher than for H0(T) or T, so it makes sense to add log(T) as an
additional predictor.
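Rules 3 and 4 can be sketched as a simple filter. The Python example below is an illustration on made-up data, not the selection actually run for the Leiden data. It assumes that "usable cases" means the share of cases with a missing target for which the candidate predictor is observed; the names `pearson`, `usable_cases` and `select_predictors` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation over the cases where both values are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


def usable_cases(target, predictor):
    """Share of cases with a missing target for which the predictor is observed."""
    rows = [p for t, p in zip(target, predictor) if t is None]
    return sum(p is not None for p in rows) / len(rows)


def select_predictors(data, target, min_cor=0.15, min_usable=0.5):
    """Keep predictors that pass both the correlation and usable-cases thresholds."""
    chosen = []
    for name, column in data.items():
        if name == target:
            continue
        if (abs(pearson(data[target], column)) > min_cor
                and usable_cases(data[target], column) >= min_usable):
            chosen.append(name)
    return chosen


# Hypothetical data; None marks a missing value.
data = {
    "sbp": [150, None, 130, 170, None, 140, 160, None],
    "log_t": [1.0, 0.2, 0.6, 1.5, 0.1, 0.8, 1.2, 0.3],  # correlated, always observed
    "noise": [5, 4, 1, 1, 3, 2, 2, 1],                   # uncorrelated with sbp
    "sparse": [15, 3, 13, 17, None, 14, 16, None],       # correlated, few usable cases
}
print(select_predictors(data, "sbp"))
```

Here `noise` fails the correlation threshold and `sparse` fails the usable-cases threshold, so only `log_t` is kept as a predictor for `sbp`.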
Multiple imputation: algorithm
1 Specify the posterior predictive density p(Ymis|X, R) (X is the set of predictors), given the non-response mechanism
p(R|Y, Z) and p(Y, Z).
2 Draw imputations from p(Ymis|X, R) to produce m completed datasets.
  $p(Y_{mis} \mid X, R) = \int p(Y_{mis} \mid X, R, \theta)\, p(\theta \mid X, R)\, d\theta, \quad \theta = (\beta, \log\sigma)$
    1 Draw a value $\theta^*$ from $p(\theta \mid X, R)$ ⇒ $p(Y_{mis} \mid X, R, \theta = \theta^*)$.
    2 Draw a value $Y^*_{mis}$ from its conditional posterior distribution given $\theta^*$.
    3 Multiple imputation: repeat m times to obtain m draws from the posterior distribution of Ymis.
3 Fit the Cox regression model to each of the m completed datasets.
4 Pool the m analysis results and variance estimates.
Multiple imputation of Ymis by linear regression
1 Obtain $\hat\beta$ and $\hat{Y}_{obs}$ from linear regression.
  Take $W = (X_{obs}' X_{obs})^{-1}$, so that $\hat\beta = W X_{obs}' Y_{obs}$ and $\hat{Y}_{obs} = X_{obs}\hat\beta$.
2 Take a random draw from the posterior distribution of β.
  Draw an r-dimensional normal random vector $D \sim N(0, I_r)$, where r = 23 is the number of predictors.
  Calculate $\hat\beta^* = \hat\beta + \sigma^* W^{1/2} D$.
  Similarity between cases is measured by the distance between the predicted mean BP of an incomplete case and that of cases with observed data.
  Take the predicted values $\hat{Y}_{mis} = X_{mis}\hat\beta^*$.
3 Repeat m = 3 to 5 times to create $Y^{(1)}_{mis}, Y^{(2)}_{mis}, \ldots, Y^{(m)}_{mis}$.
  The imputations incorporate the uncertainty due to deviations, and also reflect the variation due to finite sampling.
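The steps above can be sketched for the simplest case of an intercept and one predictor (r = 2), using only the Python standard library. Two pieces are assumptions that the slide does not spell out: σ* is drawn as sqrt(RSS / g) with g from a chi-square distribution with n − r degrees of freedom, and residual noise σ*·ε is added to the predicted value. W^{1/2} is taken to be the lower Cholesky factor of W. The data and the function name `draw_imputations` are hypothetical.

```python
import math
import random


def draw_imputations(x_obs, y_obs, x_mis, rng):
    """One set of imputations from a Bayesian simple linear regression."""
    n, r = len(x_obs), 2  # r = intercept + one slope
    # W = (X'X)^{-1} for design rows (1, x).
    sx, sxx = sum(x_obs), sum(x * x for x in x_obs)
    det = n * sxx - sx * sx
    w = [[sxx / det, -sx / det], [-sx / det, n / det]]
    # beta_hat = W X'y
    sy, sxy = sum(y_obs), sum(x * y for x, y in zip(x_obs, y_obs))
    b0 = w[0][0] * sy + w[0][1] * sxy
    b1 = w[1][0] * sy + w[1][1] * sxy
    rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(x_obs, y_obs))
    # Assumed step: sigma*^2 = RSS / g with g ~ chi-square(n - r).
    sigma = math.sqrt(rss / rng.gammavariate((n - r) / 2, 2))
    # W^(1/2) as the lower Cholesky factor L of W (L L' = W).
    l00 = math.sqrt(w[0][0])
    l10 = w[1][0] / l00
    l11 = math.sqrt(w[1][1] - l10 * l10)
    # beta* = beta_hat + sigma* W^(1/2) D with D ~ N(0, I_2).
    d0, d1 = rng.gauss(0, 1), rng.gauss(0, 1)
    b0_star = b0 + sigma * l00 * d0
    b1_star = b1 + sigma * (l10 * d0 + l11 * d1)
    # Predicted value plus residual noise (the noise term is an assumption).
    return [b0_star + b1_star * x + sigma * rng.gauss(0, 1) for x in x_mis]


rng = random.Random(2021)
x_obs = [85, 86, 88, 90, 92, 95, 97, 99]          # hypothetical ages
y_obs = [160, 158, 155, 150, 149, 141, 140, 135]  # hypothetical SBP values
x_mis = [87, 94]                                  # ages of cases with missing SBP
imputations = [draw_imputations(x_obs, y_obs, x_mis, rng) for _ in range(5)]
for draw in imputations:
    print([round(v, 1) for v in draw])
```

Each of the m = 5 calls redraws both the regression parameters and the noise, so the imputed values differ between the completed datasets; that spread is what later enters the between-imputation variance.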
Example multiple imputation
The goal is for the difference between two successive iterations to approach 0 under the specified model.
Step 0. Identify missing data
> head(nhanes)
age bmi hyp chl
1 1 NA NA NA
2 2 22.7 1 187
3 1 NA 1 187
4 3 NA NA NA
5 1 20.4 1 113
6 3 NA NA 184
⇒
Step 1. Model based imputation
> imp <- mice(nhanes, print = FALSE, m = 10, seed = 24415)
> fit <- with(imp, lm(bmi ~ age))
> head(imp$imp)
$age
[1] 1 2 3 4 5 6 7 8 9 10
<0 rows> (or 0-length row.names)
$bmi
1 2 3 4 5 6 7 8 9 10
1 27.2 21.7 25.5 22.5 28.7 30.1 27.4 22.5 22.5 27.2
3 22.0 30.1 20.4 33.2 27.2 35.3 29.6 22.0 27.2 28.7
4 21.7 20.4 27.2 25.5 21.7 25.5 22.7 22.5 24.9 22.5
$hyp
1 2 3 4 5 6 7 8 9 10
1 1 1 1 1 1 1 1 1 1 1
4 1 2 2 2 1 1 2 1 2 1
$chl
1 2 3 4 5 6 7 8 9 10
1 187 238 186 238 187 187 187 131 238 187
4 206 204 204 184 206 187 218 186 204 284
⇒
Step 2. Repeat m = 10 times
> est <- pool(fit)
> est
Class: mipo m = 10
term m estimate ubar b t dfcom df
1 (Intercept) 10 29.621111 3.4810048 1.4312926 5.055427
2 age 10 -1.802222 0.9257992 0.2759968 1.229396
The multivariate missing-data algorithm in mice differs from the model-based multiple imputation algorithm.
Circular dependence can occur: $Y^{mis}_j$ depends on $Y^{mis}_h$, which in turn depends on $Y^{mis}_j$ ($j \neq h$), because $Y_j$ and $Y_h$ are correlated.
With large p and small n, collinearity or empty cells can occur.
Non-linear relations are not considered, and combinations of variables are problematic.
Multivariate imputation method
Split the multivariate problem into a series of univariate problems.
Apply an iterative algorithm that draws samples from a sequence of univariate linear regressions.
Each incomplete entry is initialized by filling in a random draw from Yobs.
Regression switching: executed m times in parallel, where each Yj is imputed conditional on all other data
and Z, U, V.
Gibbs sampler: assuming that the draws converge to the multivariate posterior density
p(Ymis|Yobs, X, R), iterate for about 20 steps (partially incompatible MCMC).
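The regression-switching idea can be sketched for two incomplete variables. This Python example is a deliberately simplified illustration: it handles exactly two variables, and each univariate step uses a least-squares fit plus residual noise without redrawing the regression parameters, so it is not the full Bayesian step that mice performs. The data and function names are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares intercept, slope and residual SD of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, (rss / (n - 2)) ** 0.5


def chained_imputation(cols, n_iter=20, seed=0):
    """Impute two incomplete columns; cols maps names to lists with None for missing."""
    rng = random.Random(seed)
    missing = {k: [i for i, v in enumerate(col) if v is None] for k, col in cols.items()}
    # Initialise each missing entry with a random draw from the observed values.
    filled = {}
    for k, col in cols.items():
        observed = [v for v in col if v is not None]
        filled[k] = [v if v is not None else rng.choice(observed) for v in col]
    names = list(cols)
    for _ in range(n_iter):
        for target in names:  # one iteration = one cycle through all variables
            other = next(k for k in names if k != target)
            obs_rows = [i for i in range(len(cols[target])) if i not in missing[target]]
            a, b, sd = fit_line([filled[other][i] for i in obs_rows],
                                [filled[target][i] for i in obs_rows])
            # Re-impute the target from the current values of the other variable.
            for i in missing[target]:
                filled[target][i] = a + b * filled[other][i] + rng.gauss(0, sd)
    return filled


data = {
    "sbp": [150, 160, None, 135, 170, None, 145, 155],
    "dbp": [80, 85, 78, None, 92, 75, None, 83],
}
completed = chained_imputation(data)
print([round(v, 1) for v in completed["sbp"]])
print([round(v, 1) for v in completed["dbp"]])
```

Running the function m times with different seeds would give m completed datasets; observed values are never altered, only the entries that were missing.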
MICE: example (nhanes2)
The nhanes2 data in mice contain 27 missing values, 3 of which destroy the monotone pattern: one for hyp (in
row 6) and two for bmi (in rows 3 and 6).
> library(mice)
> data(nhanes2)
> nhanes2
age bmi hyp chl
1 20-39 NA <NA> NA
2 40-59 22.7 no 187
3 20-39 NA no 187
4 60-99 NA <NA> NA
5 20-39 20.4 no 113
6 60-99 NA <NA> 184
.
.
.
> length(nhanes2[is.na(nhanes2)])
[1] 27
⇒
> where <- make.where(nhanes2, "none")
> where[6, "hyp"] <- TRUE
> where[c(3, 6), "bmi"] <- TRUE
> imp1 <- mice(nhanes2, where = where,
+ method = "sample", seed = 21991, maxit = 1,
+ print = FALSE)
> data <- mice::complete(imp1)
> data
age bmi hyp chl
1 20-39 NA <NA> NA
2 40-59 22.7 no 187
3 20-39 26.3 no 187
4 60-99 NA <NA> NA
5 20-39 20.4 no 113
6 60-99 22.7 no 184
.
.
.
⇒
> imp2 <- mice(data, maxit = 1,
+ visitSequence = "monotone",
+ print = FALSE)
> data2 <- mice::complete(imp2)
> data2
age bmi hyp chl
1 20-39 35.3 no 206
2 40-59 22.7 no 187
3 20-39 26.3 no 187
4 60-99 24.9 no 186
5 20-39 20.4 no 113
6 60-99 22.7 no 184
.
.
.
1 The 3 values are imputed by a simple random sample, and the remaining missing data are then filled in by monotone data
multiple imputation.
2 Observe that the imputed value for the missing hyp entry in row 6 could also depend on bmi and chl, but in this
procedure both predictors are ignored.
NMAR method: δ-adjustment
Purpose: to investigate the robustness of the MAR assumption against violations.
To determine whether the relation between BP and mortality is affected by non-response.
1 Suppose the BP distribution is known; apply Bayes' rule to calculate the distributions
p(BP|R = 1) and p(BP|R = 0).
2 Both are normal, but their means differ by δ.
3 Generate imputations by adding δ to random draws from p(BP|R = 1) (δ ≤ 0, so the draws are lowered by |δ|).
Incorporate this into Y1 = Xβ + (1 − R1)δ + ε, where R1 is the response indicator for systolic BP.
This postulates a mean difference δ between responders and non-responders.
δ is chosen as 0 (same as MAR), −5, −10, −15, −20 mmHg.
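The core of the δ-adjustment is a shift of the draws made under MAR. The Python sketch below applies that shift once to a set of hypothetical imputations (121 values with an assumed mean and SD); the function name `delta_adjust` is illustrative.

```python
import random


def delta_adjust(mar_draws, delta):
    """Shift imputations drawn under MAR by delta (delta = 0 reproduces MAR)."""
    return [value + delta for value in mar_draws]


def mean(values):
    return sum(values) / len(values)


rng = random.Random(85)
# Hypothetical MAR imputations for 121 non-responders (mean and SD are assumed).
mar_draws = [rng.gauss(150, 25) for _ in range(121)]

shifts = {}
for delta in (0, -5, -10, -15, -20):
    adjusted = delta_adjust(mar_draws, delta)
    shifts[delta] = mean(adjusted) - mean(mar_draws)
    print(delta, round(shifts[delta], 1))
```

In this one-shot version the mean shift equals δ exactly. When the shift is applied inside every iteration of a chained-equations run, the adjusted values feed back into the imputation of other variables, so the realized difference need not equal δ.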
Pooling: parameters
1 Specify the posterior predictive density p(Ymis|X, R) (X is the set of predictors), given the non-response mechanism
p(R|Y, Z) and p(Y, Z).
2 Draw imputations from p(Ymis|X, R) to produce m completed datasets.
3 Fit the Cox regression model to each of the m completed datasets.
4 Pool the m analysis results and variance estimates.
  Combined point estimate: $\hat{Q} = \frac{1}{m}\sum_{i=1}^{m} \hat{Q}_i$, where $\hat{Q}_i$ is the k-dimensional column vector obtained from the i-th
  imputed dataset ($i = 1, \ldots, m$).
  Total covariance: $T = U + \left(1 + \frac{1}{m}\right) B$, with 3 sources of variation:
    U: complete-data (within-imputation) variance
    B: between-imputation variance, the standard unbiased estimate of the variance of the $\hat{Q}_i$
    B/m: simulation variance
  (Note that the posterior variance decomposes as
  $V(Q \mid Y_{obs}) = E[V(Q \mid Y_{obs}, Y_{mis}) \mid Y_{obs}] + V[E(Q \mid Y_{obs}, Y_{mis}) \mid Y_{obs}]$.)
  The 95% confidence interval for the relative risk in the proportional hazards model is
  $\exp(\hat{Q} \pm 1.96\sqrt{T})$.
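The pooling rules are short enough to write out. The Python sketch below implements them for a scalar parameter and checks the total-variance formula against the pooled nhanes output shown earlier in the deck (ubar = 3.4810048, b = 1.4312926, m = 10, t = 5.055427). The log hazard ratios in the second part are hypothetical numbers chosen for illustration, and the function names are made up.

```python
import math


def total_variance(u_bar, b, m):
    """T = U + (1 + 1/m) B."""
    return u_bar + (1 + 1 / m) * b


def pool_scalar(estimates, variances):
    """Rubin's rules for a scalar parameter from m completed-data analyses."""
    m = len(estimates)
    q_bar = sum(estimates) / m                              # combined point estimate
    u_bar = sum(variances) / m                              # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    return q_bar, u_bar, b, total_variance(u_bar, b, m)


def relative_risk_ci(q_bar, t, z=1.96):
    """95% interval for exp(Q), as used for hazard ratios."""
    half = z * math.sqrt(t)
    return math.exp(q_bar - half), math.exp(q_bar + half)


# Check against the pooled intercept and age rows of the nhanes example.
t_intercept = total_variance(3.4810048, 1.4312926, 10)
t_age = total_variance(0.9257992, 0.2759968, 10)
print(round(t_intercept, 6), round(t_age, 6))

# Hypothetical log hazard ratios and variances from m = 5 Cox fits.
q_bar, u_bar, b, t = pool_scalar([0.50, 0.60, 0.55, 0.58, 0.52], [0.04] * 5)
low, high = relative_risk_ci(q_bar, t)
print(round(q_bar, 3), round(t, 5), round(low, 2), round(high, 2))
```

The first two values reproduce the `t` column of the pooled output, which confirms that the printed total variance is ubar + (1 + 1/m) b.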
R code: Pooling
Realized difference in means of the observed and imputed SBP (mmHg) data under various δ-adjustments.
The mean of the observed SBP is 152.9 mmHg.
> delta <- c(0, -5, -10, -15, -20)
> post <- imp.qp$post
> imp.all.undamped <- vector("list", length(delta))
> for (i in 1:length(delta)) {
+ d <- delta[i]
+ # post-processing command: add d to each imputed SBP value
+ cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
+ post["rrsyst"] <- cmd
+ imp <- mice(data2, pred = pred, post = post, maxit = 10,
+ seed = i * 22)
+ imp.all.undamped[[i]] <- imp
+ }
⇒
δ for SBP (mmHg)    Avg. difference (mmHg)
  0                  -8.2
 -5                 -12.3
-10                 -20.7
-15                 -26.1
-20                 -31.5
Table 4. Realized difference in means
The strength of the effect on another variable depends on its correlation with SBP.
Under MAR, the imputations are on average 8.2 mmHg lower than the observed blood pressure.
For example, for δ = −10 mmHg the difference relative to the MAR case is −20.7 + 8.2 = −12.5 mmHg,
larger in size than δ.
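The comparison with the MAR case can be tabulated for every δ in Table 4. This is plain arithmetic on the table values, written as a short Python sketch; the variable names are illustrative.

```python
# Realized mean differences (imputed minus observed SBP, mmHg) from Table 4.
realized = {0: -8.2, -5: -12.3, -10: -20.7, -15: -26.1, -20: -31.5}

# Effect of each delta relative to the MAR case (delta = 0).
relative_to_mar = {d: round(diff - realized[0], 1) for d, diff in realized.items()}
for d, rel in relative_to_mar.items():
    print(d, rel)
```

Each value is the realized difference for that δ minus the −8.2 mmHg already present under MAR.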
Summary and reference
The standard multiple imputation scheme consists of three
phases:
1 Imputation of the missing data m times.
2 Analysis of the m imputed datasets.
3 Pooling of the parameters across m analyses.
The R code and output are given in the textbook ‘Flexible imputation of missing data’
(CRC Press) by Van Buuren, S., who is also an author of the paper.
Chapter 3 (p.97-101) and Chapter 9 (p.259-283) contain all the results stated and
interpreted here, based on the ‘Leiden 85+’ data.
The ‘mice’ package comes with documentation of the code and examples.
Generating imputation work
Figure: Scatterplot of systolic and diastolic blood pressure from the first imputation.
The left-hand-side plot was obtained after just running ‘mice’ on the data without any data screening.
The right-hand-side plot is the result after cleaning the data and setting up the predictor matrix with
‘quickpred()’ (quick selection of predictors) in mice.
Choose the column size and the correlation threshold such that the average number of predictors is
around 25.
Simulation study: Mean BP
                 N      δ      SBP              DBP
                               Mean    SD       Mean    SD
Observed BP      835           152.9   25.7     82.8    13.1
Imputed BP       121     0     151.1   26.2     81.5    14
                 121    -5     142.3   24.6     78.4    13.7
                 121   -10     135.9   24.7     78.2    12.8
                 121   -15     128.6   25       75.3    12.9
                 121   -20     122.3   25.2     74      12.1
Table 5. Imputed BP are pooled over m = 5 multiple imputations
Under MAR (δ = 0), the mean observed SBP is 152.9 and the mean imputed SBP is 151.1, a difference of 1.8 mmHg;
the mean observed DBP is 82.8 and the mean imputed DBP is 81.5, a difference of 1.3 mmHg.
The mean imputed SBP decreases with δ: 142.3, 135.9, 128.6, 122.3 for δ = −5, −10, −15, −20.
Only small differences in mortality exist, even among non-response models with different δ
⇒ the risk estimates are insensitive to the missing data.
Relative mortality risk estimates: SBP and DBP
Relative mortality risks are estimated with a Cox proportional hazards model adjusted for age and sex.
Figure: Relative mortality risk estimates with 95% confidence intervals: SBP and DBP
At δ = 0, the SBP group < 125 mmHg has a risk ratio of 1.76, meaning that the mortality risk (after correction for sex and age)
in this group is 1.76 times the risk of the reference group 125-140 mmHg.
The imputed BP values are lowered by δ, but the risk estimates do not change much.
The hazard ratio estimates for different δ are close.
The mortality differences between responders and non-responders are simply too small to have a serious impact on the estimates.
Conclusion: the missing data hardly influence the risk estimates.
NMAR non-response mechanism: relative mortality risks
The pattern-mixture model decomposes the joint density as P(Y, R) = P(Y|R)P(R), so that the marginal density at a point is
P(Y) = P(Y|R = 1)P(R = 1) + P(Y|R = 0)P(R = 0), emphasizing that the combined distribution is a mix
of the distributions of Y in the responders and non-responders.
By Bayes' rule, the response probability is computed as P(R = 1|Y = y) = P(Y = y|R = 1)P(R = 1)/P(Y = y), where the
marginal distribution of Y is P(Y = y) = P(Y = y|R = 1)P(R = 1) + P(Y = y|R = 0)P(R = 0).
The right-hand plot shows the distributions P(Y|R) in the observed (blue) and missing (red) data in the pattern-mixture
model. The hypothetically complete distribution is the black curve.
Figure: Graphic representation of the response mechanism for SBP
The distribution of blood pressure in the group with missing blood pressures is quite different, both in form and location.
The effect of missingness on the combined distribution shows only slight difference.
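The Bayes-rule computation can be sketched numerically. The Python example below plugs in the SBP summaries from Table 5 (observed: mean 152.9, SD 25.7, N = 835; imputed under δ = −20: mean 122.3, SD 25.2, N = 121). Treating both groups as exactly normal is an assumption of this sketch, and the function names are illustrative.

```python
import math


def normal_pdf(y, mean, sd):
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


# Table 5 summaries; the normal shapes are an assumption of this sketch.
p_resp = 835 / 956                  # P(R = 1)
obs_mean, obs_sd = 152.9, 25.7      # Y | R = 1
mis_mean, mis_sd = 122.3, 25.2      # Y | R = 0 under delta = -20


def marginal_pdf(y):
    """P(Y = y) as the mixture of responders and non-responders."""
    return (normal_pdf(y, obs_mean, obs_sd) * p_resp
            + normal_pdf(y, mis_mean, mis_sd) * (1 - p_resp))


def response_prob(y):
    """P(R = 1 | Y = y) by Bayes' rule."""
    return normal_pdf(y, obs_mean, obs_sd) * p_resp / marginal_pdf(y)


# Mean of the combined (hypothetically complete) distribution.
combined_mean = p_resp * obs_mean + (1 - p_resp) * mis_mean

for sbp in (100, 125, 150, 180):
    print(sbp, round(response_prob(sbp), 2))
print(round(combined_mean, 1))
```

Even under the most extreme scenario the combined mean stays within a few mmHg of the observed mean, because non-responders make up only about 13% of the mixture, while the implied response probability rises with SBP.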
References
Van Buuren, S., Boshuizen, H. C., & Knook, D. L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis.
https://www.sfu.ca/~jackd/Stat302/A4_Reading.pdf
Van Buuren, S. (2018). Flexible imputation of missing data. CRC Press.
https://stefvanbuuren.name/fimd/
Thank you for your participation and understanding!

More Related Content

PDF
Regularization and variable selection via elastic net
PDF
Survival analysis 1
PDF
BlUP and BLUE- REML of linear mixed model
PDF
Dag in mmhc
PDF
Lec 2 discrete random variable
PDF
Statistics symposium talk, Harvard University
PDF
ABC short course: survey chapter
PDF
random forests for ABC model choice and parameter estimation
Regularization and variable selection via elastic net
Survival analysis 1
BlUP and BLUE- REML of linear mixed model
Dag in mmhc
Lec 2 discrete random variable
Statistics symposium talk, Harvard University
ABC short course: survey chapter
random forests for ABC model choice and parameter estimation

What's hot (18)

PDF
Testing as estimation: the demise of the Bayes factor
PDF
ABC short course: introduction chapters
PDF
Bayesian inference on mixtures
PDF
testing as a mixture estimation problem
PDF
Tutorial on testing at O'Bayes 2015, Valencià, June 1, 2015
PDF
ISBA 2016: Foundations
PDF
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
PDF
Statistics (1): estimation, Chapter 2: Empirical distribution and bootstrap
PDF
Problem_Session_Notes
PDF
from model uncertainty to ABC
DOC
Chi square tests
PDF
Classification
PDF
A comparative analysis of predictve data mining techniques3
PDF
Discussion of Persi Diaconis' lecture at ISBA 2016
PDF
Econometrics, PhD Course, #1 Nonlinearities
PDF
Proba stats-r1-2017
PDF
Statistics (1): estimation, Chapter 1: Models
PDF
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
Testing as estimation: the demise of the Bayes factor
ABC short course: introduction chapters
Bayesian inference on mixtures
testing as a mixture estimation problem
Tutorial on testing at O'Bayes 2015, Valencià, June 1, 2015
ISBA 2016: Foundations
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
Statistics (1): estimation, Chapter 2: Empirical distribution and bootstrap
Problem_Session_Notes
from model uncertainty to ABC
Chi square tests
Classification
A comparative analysis of predictve data mining techniques3
Discussion of Persi Diaconis' lecture at ISBA 2016
Econometrics, PhD Course, #1 Nonlinearities
Proba stats-r1-2017
Statistics (1): estimation, Chapter 1: Models
DISCRETIZATION OF A MATHEMATICAL MODEL FOR TUMOR-IMMUNE SYSTEM INTERACTION WI...
Ad

Similar to Missing value imputation (slide) (20)

PDF
SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...
PPTX
Imputation techniques for missing data in clinical trials
PDF
Statistical Methods to Handle Missing Data
PPT
Imputation of Data For Business Analytics
PDF
STRATOS ISCB 2019: Ruth Keogh
PDF
missing-data-and-multiple-imputation-in-clinical-epidemiolog
PDF
CHE Seminar 20 November 2013
PDF
Biostatistics Workshop: Missing Data
DOC
Missing Value imputation, Poor man's
PPTX
A survey on missing information strategies and imputation methods in healthcare
DOC
Poor man's missing value imputation
PPTX
cc.pptx
PPTX
ppt1221[1][1].pptx
PDF
Machine learning with missing values
PPTX
Analysis-of-data-with-missing-values.pptx
PPTX
Imputation of missing data in clinical trials
PPTX
missingdatahandling-160923201313.pptx
PPTX
Clinicaldataanalysis in r
PDF
Hybrid prediction model with missing value imputation for medical data 2015-g...
PDF
How to manage your Experimental Protocol with Basic Statistics
SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...
Imputation techniques for missing data in clinical trials
Statistical Methods to Handle Missing Data
Imputation of Data For Business Analytics
STRATOS ISCB 2019: Ruth Keogh
missing-data-and-multiple-imputation-in-clinical-epidemiolog
CHE Seminar 20 November 2013
Biostatistics Workshop: Missing Data
Missing Value imputation, Poor man's
A survey on missing information strategies and imputation methods in healthcare
Poor man's missing value imputation
cc.pptx
ppt1221[1][1].pptx
Machine learning with missing values
Analysis-of-data-with-missing-values.pptx
Imputation of missing data in clinical trials
missingdatahandling-160923201313.pptx
Clinicaldataanalysis in r
Hybrid prediction model with missing value imputation for medical data 2015-g...
How to manage your Experimental Protocol with Basic Statistics
Ad

Recently uploaded (20)

PPTX
ch20 Database System Architecture by Rizvee
PPTX
GPS sensor used agriculture land for automation
PDF
Hikvision-IR-PPT---EN.pdfSADASDASSAAAAAAAAAAAAAAA
PPTX
9 Bioterrorism.pptxnsbhsjdgdhdvkdbebrkndbd
PPTX
Reinforcement learning in artificial intelligence and deep learning
PPT
Classification methods in data analytics.ppt
PDF
2025-08 San Francisco FinOps Meetup: Tiering, Intelligently.
PDF
NU-MEP-Standards معايير تصميم جامعية .pdf
PDF
book-34714 (2).pdfhjkkljgfdssawtjiiiiiujj
PPTX
Bussiness Plan S Group of college 2020-23 Final
PDF
9 FinOps Tools That Simplify Cloud Cost Reporting.pdf
PPTX
inbound2857676998455010149.pptxmmmmmmmmm
PDF
Delhi c@ll girl# cute girls in delhi with travel girls in delhi call now
PPTX
Basic Statistical Analysis for experimental data.pptx
PPTX
ifsm.pptx, institutional food service management
PPTX
Sheep Seg. Marketing Plan_C2 2025 (1).pptx
PPT
BME 301 Lecture Note 1_2.ppt mata kuliah Instrumentasi
PPTX
DIGITAL DESIGN AND.pptx hhhhhhhhhhhhhhhhh
PPTX
cyber row.pptx for cyber proffesionals and hackers
PDF
Nucleic-Acids_-Structure-Typ...-1.pdf 011
ch20 Database System Architecture by Rizvee
GPS sensor used agriculture land for automation
Hikvision-IR-PPT---EN.pdfSADASDASSAAAAAAAAAAAAAAA
9 Bioterrorism.pptxnsbhsjdgdhdvkdbebrkndbd
Reinforcement learning in artificial intelligence and deep learning
Classification methods in data analytics.ppt
2025-08 San Francisco FinOps Meetup: Tiering, Intelligently.
NU-MEP-Standards معايير تصميم جامعية .pdf
book-34714 (2).pdfhjkkljgfdssawtjiiiiiujj
Bussiness Plan S Group of college 2020-23 Final
9 FinOps Tools That Simplify Cloud Cost Reporting.pdf
inbound2857676998455010149.pptxmmmmmmmmm
Delhi c@ll girl# cute girls in delhi with travel girls in delhi call now
Basic Statistical Analysis for experimental data.pptx
ifsm.pptx, institutional food service management
Sheep Seg. Marketing Plan_C2 2025 (1).pptx
BME 301 Lecture Note 1_2.ppt mata kuliah Instrumentasi
DIGITAL DESIGN AND.pptx hhhhhhhhhhhhhhhhh
cyber row.pptx for cyber proffesionals and hackers
Nucleic-Acids_-Structure-Typ...-1.pdf 011

Missing value imputation (slide)

  • 1. Multiple Imputation of Missing Blood Pressure Covariates in Survival Analysis STATS 756: Topics in Biostatistics - Fall 2021 Lim, Kyuson Department of Mathematics and Statistics McMaster University December 1st, 2021 Kyuson Lim 1 / 25
  • 2. Outline 1 Introduction/Motivation 2 Multiple imputation 3 Linear regression: multiple imputation 4 Fully conditional specification (FCS): Multivariate imputation 5 Pooling: parameters 6 Simulation Study 7 Biblibiography Kyuson Lim 2 / 25
  • 3. Background of research- Question of interest Main interest: determine influence of measures on relation between mortality and Blood Pressure (BP), over 85 years old, 1236 citizens in Leiden (1986), examined between 1987 and 1989. Concern whether prescription of anti-hypertensive drugs inadvertently shorten life. Variables: BP, age (85-89, 90-94, 95+), types of resident, activities of daily living (independent, dependent), history of hypertension, uses of diuretics, blood samples. 2 Cox regression =    Relation between mortality and BP adjusted for age & sex Relation between mortality and BP adjusted for age, sex & health ⇒ difference between health between different BP groups Suspect: Individuals with lower BP and higher mortality risks, had fewer BP measurements. Excluding incomplete cases (≈ 12.5%) produced deflated mortality estimates for lower BP groups ⇒ distortion of influence of BP on survival. Kyuson Lim 3 / 25
  • 4. Background of research- Assumption of non-responses Missing At Random (MAR): create small number of complete matrices, where missing values are replaced by plausible values. Valid inference ⇒ Number of imputation depends on amount of missing information (ie. 3 or 5) Proper procedure ⇒ Variability, reflects uncertainty about hypothetically observed (unknown) value. Goal: provide Bayesian approach for switching linear regression model to handle with multiple imputation, under types of missing values. 1 What information to use for choosing between non-response mechanisms 2 How to choose set of predictors. 3 How to generate imputations (ie. Mixed types). 4 How to specify different models for non-responses. Kyuson Lim 4 / 25
  • 5. Study of data and problems Observational problem: groups without BP measure have much higher mortality rates. BP not measured for 121 individuals BP is measured more often if suspected that BP was too high (hypertension). BP is measured less frequently for very old people and subjects who are too ill to be measured. Data collection period: rate increase (5-40%) and then drops to constant level (10-15%). Survived > 3 years History of previous hypertension No Yes Total Yes 8.7% (34/390) 8.1% (10/124) 8.6% (44/514) No 19.2% (69/360) 9.8% (8/82) 17.4% (77/442) Total 13.7% (103/750) 8.7% (18/206) 12.7% (121/956) Table 1. Proportion of no BP measured Individuals without BP measures have higher mortality rates. Relatively large group of individuals without hypertension and with high mortality risk is missing. The goal of the sensitivity analysis is to explore the result of the analysis under alternative scenarios for the missing data. Kyuson Lim 5 / 25
  • 6. Selection of variables: Chi-square (χ2 ) of independence Variables related to non-response includes age, type of residence, activities of daily living, and uses of diuretics (year of interview, blood samples are not categorical to be excluded). Figure: For 835 individuals, the chi-square of independence BP was measured less frequently for very old (95+) people and for those who have a health problem (hypertension). Kyuson Lim 6 / 25
  • 7. Missing data mechanism- MCAR, MAR and MNAR MCAR (missing completely at random): the probability of being missing is the same for all cases ⇒ cause of missing data unrelated to data. p(R = 0|Yobs, Ymis, Z) = p(R = 0|Z) Unrealistic: survival model between BP measured and no BP measured shows systematic difference in mortality. MAR (missing at random): the probability of being missing is the same only within groups, defined by observed data. p(R = 0|Yobs, Ymis, Z) = p(R = 0|Yobs, Z) MAR on Yobs: the probability of BP measurement depends on the survival ⇒ correction for non-response. MAR on Z: probability of non-response related to covariates (χ2 test) ⇒ correction for non-response. MNAR (missing not at random): the probability to be missing also depends on unobserved information, including Ymis itself. p(R = 0|Yobs, Ymis, Z) Investigation: probability of non-response related to BP (unobserved) ⇒ distribution of Ymis, sensitivity analysis. Kyuson Lim 7 / 25
  • 8. Appropriate selection of variables: Influx and outflux Influx and outflux are summaries of the missing data pattern intended to aid in the construction of imputation models. The influx of a variable quantifies how well its missing data connect to the observed data on other variables. Variables with higher influx depend strongly on the imputation model. The outflux of a variable quantifies how well its observed data connect to the missing data on other variables. Variable with higher outflux is better connected to the missing data, and thus potentially more useful for imputing other variables. Figure: Global influx-outflux pattern of the Leiden 85+ Cohort data All points are relatively close to the diagonal, which indicates that influx and outflux are balanced. Kyuson Lim 8 / 25
  • 9. Generating multiple imputation A MICE (Multivariate Imputation by Chained Equations) algorithm is a MCMC method that is univariate optimal. Starts with a random draw from the observed data, and imputes the incomplete data One iteration consists of one cycle through all Yj . Then, samples from the conditional distributions in order to obtain samples from the joint distribution. Generates multiple imputations in parallel m times. Response mechanism: Outcome variables: Y1 = Systolic BP Y2 = Diastolic BP Y3 = Survival or censoring times Y4 = censoring indicator (0,1) ⇒ p(Y3, Y4|Y1, Z) p(Y3, Y4|Y2, Z) Y = (Yobs, Ymis), Rij = 1 if Yij is observed. ⇒ p(R = 1|Yobs, Ymis, Z) (NMAR) p(R = 1|Yobs, Z) (MAR) Yobs, Ymis, Z define different types of response mechanism Observed covariates: 31 columns Kyuson Lim 9 / 25
  • 10. Multiple imputation: algorithm 1 Posterior predictive density, p(Ymis|X, R) (X is set of predictors) given non-response mechanism p(R|Y, Z) and p(Y, Z). p(Ymis|X): linear regression, missing BP as predictor variable X. Select suitable subset of data containing no more than 15-25 variables, X = [Yobs, Z, U, V]. 1 Yobs, Z: Include all variables, especially if complete model contains strong predictive relations. 2 U: Variables that differ between the response and non-response groups, inspect by correlation. 3 V: Variance with considerable variability, to reduce uncertainty. 4 U and V: remove for those with many missing values (%) within incomplete cases. 2 Draw imputations from p(Ymis|X, R) to produce m complete datasets. 3 Perform m complete Cox regression model on each completed data. 4 Pool m analysis results and variance estimates. Kyuson Lim 10 / 25
  • 11. Selection of variables: 24 variables
    1. Include the variables that appear in the complete-data model: blood pressure, survival, sex, and age.
    2. Include the variables related to non-response: type of residence, activity of daily living, previous hypertension, use of diuretics, year of interview, and blood sample.
    3. Select variables with absolute correlation > 0.15 with SBP or DBP.
    4. Remove variables with fewer than 50% usable cases.
    Figure: Correlations between the cumulative death hazard H0(T), survival time T, log(T), SBP and DBP.
    The high correlation may be caused by the fact that nearly everyone in this cohort has died, so the percentage of censoring is low.
    The correlation between log(T) and blood pressure is higher than for H0(T) or T, so it makes sense to add log(T) as an additional predictor.
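    The two screening rules (absolute correlation above 0.15, at least 50% usable cases) can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `dat`, its column names and its values are hypothetical, and "usable cases" is taken here to mean the share of cases with missing target for which the candidate predictor is observed.

```r
# Hypothetical mini data set: 'sbp' is the incomplete target.
dat <- data.frame(
  sbp    = c(120, 135, NA, 150, NA, 165, 180, NA),
  age    = c(85, 87, 88, 90, 91, 93, 95, 96),   # complete, related to sbp
  noise  = c(2, 1, 2, 3, 1, 1, 2, 3),           # complete, unrelated to sbp
  sparse = c(120, 135, NA, 150, NA, 165, 180, NA) # missing whenever sbp is missing
)

target <- "sbp"
candidates <- setdiff(names(dat), target)
target_missing <- is.na(dat[[target]])

# Rule 3: absolute pairwise correlation with the target.
abs_cor <- sapply(candidates, function(v) {
  abs(cor(dat[[target]], dat[[v]], use = "pairwise.complete.obs"))
})

# Rule 4: proportion of usable cases, i.e. how often the candidate is
# observed among the cases where the target is missing.
usable <- sapply(candidates, function(v) mean(!is.na(dat[[v]][target_missing])))

keep <- candidates[abs_cor > 0.15 & usable >= 0.5]
print(keep)
```

    Here `sparse` correlates perfectly with the target but is never observed when the target is missing, so it cannot help the imputation and is dropped by the usable-cases rule.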
  • 12. Multiple imputation: algorithm
    1. Specify the posterior predictive density $p(Y_{mis} \mid X, R)$, where X is the set of predictors, given the non-response mechanism $p(R \mid Y, Z)$ and $p(Y, Z)$.
    2. Draw imputations from $p(Y_{mis} \mid X, R)$ to produce m complete datasets.
       $p(Y_{mis} \mid X, R) = \int p(Y_{mis} \mid X, R, \theta)\, p(\theta \mid X, R)\, d\theta$, with $\theta = (\beta, \log \sigma)$.
       a. Draw a value $\theta^*$ from $p(\theta \mid X, R)$ ⇒ $p(Y_{mis} \mid X, R, \theta = \theta^*)$.
       b. Draw a value $Y^*_{mis}$ from its conditional posterior distribution given $\theta^*$.
       c. Multiple imputation: repeat m times to obtain m draws from the posterior distribution of $Y_{mis}$.
    3. Fit the Cox regression model to each of the m completed datasets.
    4. Pool the m analysis results and variance estimates.
  • 13. Multiple imputation of Ymis by linear regression
    1. Obtain $\hat\beta$ and $\hat Y_{obs}$ from linear regression: with $W = (X_{obs}' X_{obs})^{-1}$, take $\hat\beta = W X_{obs}' Y_{obs}$ and $\hat Y_{obs} = X_{obs} \hat\beta$.
    2. Take a random draw from the posterior distribution of β:
       Draw an r-dimensional normal random vector $D \sim N(0, I_r)$, where r = 23 is the number of predictors.
       Calculate $\hat\beta^* = \hat\beta + \sigma^* W^{1/2} D$.
       Take the predicted values $\hat Y_{mis} = X_{mis} \hat\beta^*$.
       Similarity between cases is measured by the distance between their predicted mean BP values.
    3. Repeat m = 3 to 5 times to create $Y^{(1)}_{mis}, Y^{(2)}_{mis}, \ldots, Y^{(m)}_{mis}$.
    The imputations incorporate the uncertainty due to residual deviations and also reflect the variation due to finite sampling.
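    A minimal base R sketch of these steps is given below, on simulated data with one predictor instead of the 23 used in the study. Two parts are assumptions rather than content of the slide: σ* is drawn from a scaled inverse chi-square distribution based on the residual sum of squares, and residual noise with standard deviation σ* is added to the predicted values. All object names are hypothetical.

```r
set.seed(756)

# Simulated data (hypothetical): intercept plus one predictor.
n <- 100
x <- cbind(1, rnorm(n))
y <- drop(x %*% c(150, 10)) + rnorm(n, sd = 5)
y[sample(n, 20)] <- NA                      # make 20 outcomes missing

obs   <- !is.na(y)
x_obs <- x[obs, , drop = FALSE]
x_mis <- x[!obs, , drop = FALSE]

draw_imputation <- function() {
  # Step 1: least squares fit on the observed cases.
  W        <- solve(crossprod(x_obs))       # (X'X)^{-1}
  beta_hat <- W %*% crossprod(x_obs, y[obs])
  resid    <- y[obs] - x_obs %*% beta_hat

  # Assumption: draw sigma* from its posterior (scaled inverse chi-square).
  df         <- sum(obs) - ncol(x_obs)
  sigma_star <- sqrt(sum(resid^2) / rchisq(1, df))

  # Step 2: beta* = beta_hat + sigma* W^{1/2} D, with D ~ N(0, I_r).
  # t(chol(W)) is a matrix square root of W (lower Cholesky factor).
  D         <- rnorm(ncol(x_obs))
  beta_star <- beta_hat + sigma_star * t(chol(W)) %*% D

  # Predicted values plus residual noise (the noise term is an assumption).
  drop(x_mis %*% beta_star) + rnorm(nrow(x_mis), sd = sigma_star)
}

# Step 3: repeat m = 5 times; one column per imputation.
imputations <- replicate(5, draw_imputation())
dim(imputations)
```

    Each column of `imputations` is one set of plausible values for the 20 missing outcomes; the spread across columns reflects both parameter and residual uncertainty.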
  • 14. Example multiple imputation
    The goal is for the difference between two successive iterations to approach 0 under the specified model.
    Step 0. Identify missing data
    > head(nhanes)
      age  bmi hyp chl
    1   1   NA  NA  NA
    2   2 22.7   1 187
    3   1   NA   1 187
    4   3   NA  NA  NA
    5   1 20.4   1 113
    6   3   NA  NA 184
    ⇒ Step 1. Model-based imputation
    > imp <- mice(nhanes, print = FALSE, m = 10, seed = 24415)
    > fit <- with(imp, lm(bmi ~ age))
    > head(imp$imp)
    $age
     [1] 1  2  3  4  5  6  7  8  9  10
    <0 rows> (or 0-length row.names)
    $bmi
         1    2    3    4    5    6    7    8    9   10
    1 27.2 21.7 25.5 22.5 28.7 30.1 27.4 22.5 22.5 27.2
    3 22.0 30.1 20.4 33.2 27.2 35.3 29.6 22.0 27.2 28.7
    4 21.7 20.4 27.2 25.5 21.7 25.5 22.7 22.5 24.9 22.5
    $hyp
      1 2 3 4 5 6 7 8 9 10
    1 1 1 1 1 1 1 1 1 1  1
    4 1 2 2 2 1 1 2 1 2  1
    $chl
        1   2   3   4   5   6   7   8   9  10
    1 187 238 186 238 187 187 187 131 238 187
    4 206 204 204 184 206 187 218 186 204 284
    ⇒ Step 2. Pool the results of the m = 10 repeated analyses
    > est <- pool(fit)
    > est
    Class: mipo    m = 10
             term  m  estimate      ubar         b        t dfcom df
    1 (Intercept) 10 29.621111 3.4810048 1.4312926 5.055427
    2         age 10 -1.802222 0.9257992 0.2759968 1.229396
    The multivariate missing data algorithm in mice differs from the model-based multiple imputation algorithm. With multivariate missing data:
    Circular dependence can occur: $Y_j^{mis}$ depends on $Y_h^{mis}$, which in turn depends on $Y_j^{mis}$ (j ≠ h), because $Y_j$ and $Y_h$ are correlated.
    With large p and small n, collinearity or empty cells can occur.
    Non-linear relations are not considered, and imputed values can form problematic combinations.
  • 15. Multivariate imputation method
    Split the multivariate problem into a series of univariate problems.
    Apply an iterative algorithm that draws samples from a sequence of univariate linear regressions.
    Each incomplete entry is initialized by filling in a random draw from $Y_{obs}$.
    Regression switching: executed m times in parallel, where each $Y_i$ is imputed conditional on all other data and Z, U, V.
    Gibbs sampler: under the condition that the draws converge to the multivariate posterior density $p(Y_{mis} \mid Y_{obs}, X, R)$, the algorithm iterates for about 20 steps (partially incompatible MCMC).
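    The regression switching idea can be sketched for two incomplete variables in base R. This is a simplified illustration and not the mice implementation: the regression parameters are fixed at their least squares estimates instead of being drawn from their posterior, only one imputation stream is run, and the simulated `sbp` and `dbp` values are hypothetical.

```r
set.seed(2021)

# Simulated, correlated blood pressure data (hypothetical values).
n   <- 200
sbp <- rnorm(n, 150, 25)
dbp <- 30 + 0.35 * sbp + rnorm(n, sd = 8)
original <- data.frame(sbp = sbp, dbp = dbp)
original$sbp[sample(n, 30)] <- NA
original$dbp[sample(n, 30)] <- NA

miss <- is.na(original)   # fixed record of which cells need imputation
dat  <- original

# Initialization: fill each missing entry with a random draw from the
# observed values of the same variable.
for (v in names(dat)) {
  dat[miss[, v], v] <- sample(original[!miss[, v], v], sum(miss[, v]),
                              replace = TRUE)
}

# One univariate step: regress the target on the other variable using the
# cases where the target was observed, then re-impute the missing targets.
impute_once <- function(dat, target, miss) {
  form <- reformulate(setdiff(names(dat), target), response = target)
  fit  <- lm(form, data = dat[!miss[, target], ])
  pred <- predict(fit, newdata = dat[miss[, target], , drop = FALSE])
  dat[miss[, target], target] <- pred + rnorm(length(pred),
                                              sd = summary(fit)$sigma)
  dat
}

# About 20 iterations, each cycling once through all incomplete variables.
for (iter in 1:20) {
  for (v in names(dat)) dat <- impute_once(dat, v, miss)
}

summary(dat)
```

    Running the whole loop m times with different seeds would give the m parallel streams mentioned on the slide.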
  • 16. MICE - example (nhanes2)
    The nhanes2 data in mice contains 3 out of 27 missing values that destroy the monotone pattern: one for hyp (in row 6) and two for bmi (in rows 3 and 6).
    > library(mice)
    > data(nhanes2)
    > nhanes2
        age  bmi  hyp chl
    1 20-39   NA <NA>  NA
    2 40-59 22.7   no 187
    3 20-39   NA   no 187
    4 60-99   NA <NA>  NA
    5 20-39 20.4   no 113
    6 60-99   NA <NA> 184
    ...
    > length(nhanes2[is.na(nhanes2)])
    [1] 27
    ⇒
    > where <- make.where(nhanes2, "none")
    > where[6, "hyp"] <- TRUE
    > where[c(3, 6), "bmi"] <- TRUE
    > imp1 <- mice(nhanes2, where = where,
    +              method = "sample", seed = 21991, maxit = 1,
    +              print = FALSE)
    > data <- mice::complete(imp1)
    > data
        age  bmi  hyp chl
    1 20-39   NA <NA>  NA
    2 40-59 22.7   no 187
    3 20-39 26.3   no 187
    4 60-99   NA <NA>  NA
    5 20-39 20.4   no 113
    6 60-99 22.7   no 184
    ...
    ⇒
    > imp2 <- mice(data, maxit = 1,
    +              visitSequence = "monotone",
    +              print = FALSE)
    > data2 <- mice::complete(imp2)
    > data2
        age  bmi hyp chl
    1 20-39 35.3  no 206
    2 40-59 22.7  no 187
    3 20-39 26.3  no 187
    4 60-99 24.9  no 186
    5 20-39 20.4  no 113
    6 60-99 22.7  no 184
    ...
    1. The procedure imputes these 3 values by a simple random sample, and then fills in the remaining missing data by monotone data multiple imputation.
    2. Observe that the imputed value for the missing hyp data in row 6 could also depend on bmi and chl, but in this procedure both predictors are ignored.
  • 17. NMAR method: δ-adjustment
    Purpose: to investigate the robustness of the results against violations of the MAR assumption, that is, to determine whether the relation between BP and mortality is affected by non-response.
    1. Suppose the BP distribution is known, and apply Bayes' rule to calculate the distributions p(BP | R = 1) and p(BP | R = 0).
    2. Both are assumed normal, with means that differ by δ.
    3. Generate imputations by adding the (negative) amount δ to a random draw from p(BP | R = 1).
    Incorporate this into the model $Y_1 = X\beta + (1 - R_1)\delta + \epsilon$, where $R_1$ is the response indicator for systolic BP.
    The model postulates a mean difference, δ, between responders and non-responders.
    δ is chosen as 0 (same as MAR), -5, -10, -15, -20 mmHg.
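    The effect of step 3 in isolation can be sketched in base R: imputations drawn under MAR are shifted by δ before they are used. The draws below are simulated stand-ins (a normal distribution with the observed SBP mean and SD reported later in the deck, and 121 non-responders), so the object names and values are illustrative assumptions only.

```r
set.seed(85)

# Stand-in for imputations drawn under MAR for the non-responders
# (hypothetical draws, not the study's imputations).
mar_draws <- rnorm(121, mean = 152.9, sd = 25.7)

deltas <- c(0, -5, -10, -15, -20)

# delta-adjustment: add delta to every MAR draw, one column per scenario.
adjusted <- sapply(deltas, function(d) mar_draws + d)
colnames(adjusted) <- paste0("delta_", deltas)

# In this univariate sketch the mean shift equals delta exactly.
shift <- colMeans(adjusted) - mean(mar_draws)
print(round(shift, 2))
```

    In the multivariate setting the shifted values also feed into the imputation of other variables, so the realized difference need not equal δ.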
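  • 18. Pooling: parameters
    1. Specify the posterior predictive density $p(Y_{mis} \mid X, R)$, where X is the set of predictors, given the non-response mechanism $p(R \mid Y, Z)$ and $p(Y, Z)$.
    2. Draw imputations from $p(Y_{mis} \mid X, R)$ to produce m complete datasets.
    3. Fit the Cox regression model to each of the m completed datasets.
    4. Pool the m analysis results and variance estimates.
    Combined point estimate: $\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i$, where $\hat{Q}_i$ is the k-dimensional column vector of estimates obtained from the i-th imputed dataset (i = 1, ..., m).
    Total covariance: $T = \bar{U} + \left(1 + \frac{1}{m}\right) B$, which combines 3 sources of variation:
      Complete-data variance $\bar{U}$: the average of the m within-imputation variance estimates.
      Between-imputation variance $B$: the standard unbiased estimate of the variance of the $\hat{Q}_i$.
      Simulation variance $B/m$: the extra variance from using a finite number of imputations.
    Note that the posterior variance decomposes as $V(Q \mid Y_{obs}) = E[V(Q \mid Y_{obs}, Y_{mis}) \mid Y_{obs}] + V[E(Q \mid Y_{obs}, Y_{mis}) \mid Y_{obs}]$.
    The 95% confidence interval for the relative risk in the proportional hazards model is given by $\exp(\bar{Q} \pm 1.96 \sqrt{T})$.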
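    As a quick check of the total variance formula, the sketch below plugs the `ubar` and `b` values printed by `pool()` in the nhanes example (slide 14) into T = Ū + (1 + 1/m)B and compares the result with the printed `t` column. The normal-quantile interval uses the 1.96 from this slide; that is a simplification, and the variable names are illustrative.

```r
# Values copied from the pool() output of the nhanes example (m = 10).
m        <- 10
estimate <- c(intercept = 29.621111, age = -1.802222)
ubar     <- c(intercept = 3.4810048, age = 0.9257992)  # within-imputation
b        <- c(intercept = 1.4312926, age = 0.2759968)  # between-imputation

# Rubin's rule for the total variance.
total <- ubar + (1 + 1 / m) * b
print(round(total, 6))

# Normal-approximation 95% interval for each coefficient.
ci <- cbind(lower = estimate - 1.96 * sqrt(total),
            upper = estimate + 1.96 * sqrt(total))
print(round(ci, 2))
```

    The computed totals agree with the `t` column of the printed output (5.055427 and 1.229396) up to rounding.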
  • 19. R code: Pooling
    Realized difference in means of the observed and imputed SBP (mmHg) data under various δ-adjustments. The mean of the observed SBP is 152.9 mmHg.
    delta <- c(0, -5, -10, -15, -20)
    post <- imp.qp$post
    imp.all.undamped <- vector("list", length(delta))
    for (i in 1:length(delta)) {
      d <- delta[i]
      # post-processing command that shifts the imputed SBP values by d
      cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
      post["rrsyst"] <- cmd
      imp <- mice(data2, pred = pred, post = post, maxit = 10,
                  seed = i * 22)
      imp.all.undamped[[i]] <- imp
    }
    ⇒
    δ for SBP   Avg. difference (mmHg)
        0            -8.2
       -5           -12.3
      -10           -20.7
      -15           -26.1
      -20           -31.5
    Table 4. Realized difference in means
    The strength of the effect depends on the correlation between SBP and the variable.
    Under MAR, the imputations are on average 8.2 mmHg lower than the observed blood pressure.
    For example, for δ = −10 mmHg the realized difference relative to the MAR case is −20.7 + 8.2 = −12.5 mmHg, which is larger in size than δ.
  • 20. Summary and reference
    The standard multiple imputation scheme consists of three phases:
    1. Imputation of the missing data m times.
    2. Analysis of the m imputed datasets.
    3. Pooling of the parameters across the m analyses.
    The R code and output are given in the textbook 'Flexible imputation of missing data' (CRC Press), written by Van Buuren, S., who is also an author of the paper.
    Chapter 3 (pp. 97-101) and Chapter 9 (pp. 259-283) contain the results stated and interpreted here for the 'Leiden 85+' data.
    The package 'mice' comes with documentation of the code and examples.
  • 21. Generating imputations
    Figure: Scatterplot of systolic and diastolic blood pressure from the first imputation.
    The left-hand plot was obtained by running 'mice' on the data without any data screening.
    The right-hand plot is the result after cleaning the data and setting up the predictor matrix with 'quickpred()' (quick selection of predictors) in mice.
    Choose the values for column size and correlation threshold such that the average number of predictors is around 25.
  • 22. Simulation study: Mean BP
                     N     δ   SBP Mean  SBP SD  DBP Mean  DBP SD
    Observed BP    835           152.9    25.7     82.8     13.1
    Imputed BP     121     0     151.1    26.2     81.5     14
                   121    -5     142.3    24.6     78.4     13.7
                   121   -10     135.9    24.7     78.2     12.8
                   121   -15     128.6    25       75.3     12.9
                   121   -20     122.3    25.2     74       12.1
    Table 5. Imputed BP values are pooled over m = 5 multiple imputations.
    Under MAR (δ = 0), the observed mean SBP is 152.9 and the imputed mean SBP is 151.1, a difference of 1.8 mmHg; the observed mean DBP is 82.8 and the imputed mean DBP is 81.5, a difference of 1.3 mmHg.
    The imputed mean SBP decreases for δ = −5, −10, −15, −20: 142.3, 135.9, 128.6, 122.3 mmHg.
    Only a small difference in mortality exists, even among non-response models with different δ's ⇒ risk estimates are insensitive to the missing data.
  • 23. Relative mortality risk estimates: SBP and DBP
    The relative mortality risks are estimated with a Cox proportional hazards model adjusted for age and sex.
    Figure: Relative mortality risk estimates with 95% confidence intervals: SBP and DBP.
    At δ = 0, the SBP group < 125 mmHg has a risk ratio of 1.76, meaning that the mortality risk (after correction for sex and age) in this group is 1.76 times the risk of the reference group 125-140 mmHg.
    The imputed BP values are lowered by δ, but the risk estimates do not change much: the hazard ratio estimates for different δ are close.
    The differences in mortality between responders and non-responders are simply too small to have a serious impact on the estimates.
    Conclusion: the missing data hardly influence the risk estimates.
  • 24. NMAR non-response mechanism - relative mortality risks
    The pattern-mixture model decomposes the joint density as P(Y, R) = P(Y | R) P(R), so that the marginal density of Y is P(Y) = P(Y | R = 1) P(R = 1) + P(Y | R = 0) P(R = 0), emphasizing that the combined distribution is a mix of the distributions of Y in the responders and non-responders.
    By Bayes' rule, the response probability is computed as P(R = 1 | Y = y) = P(Y = y | R = 1) P(R = 1) / P(Y = y), where the marginal distribution of Y is P(Y = y) = P(Y = y | R = 1) P(R = 1) + P(Y = y | R = 0) P(R = 0).
    The right-hand plot shows the distributions P(Y | R) in the observed (blue) and missing (red) data in the pattern-mixture model. The hypothetically complete distribution is the black curve.
    Figure: Graphic representation of the response mechanism for SBP.
    The distribution of blood pressure in the group with missing blood pressures is quite different, both in form and location.
    The effect of missingness on the combined distribution is only slight.
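    The Bayes' rule computation can be illustrated numerically in base R. The sketch assumes normal densities for responders and non-responders with equal SD and a mean shift of δ = −10 mmHg, with the group sizes (835 and 121) and the observed SBP mean and SD taken from Table 5. The equal-SD normal shape is an assumption for illustration, not the fitted distribution from the study.

```r
# Group sizes and observed SBP summary taken from Table 5.
p_resp <- 835 / (835 + 121)          # P(R = 1)

# Assumed component densities P(Y | R): normal, equal SD, shifted by delta.
delta <- -10
dens_resp    <- function(y) dnorm(y, mean = 152.9, sd = 25.7)
dens_nonresp <- function(y) dnorm(y, mean = 152.9 + delta, sd = 25.7)

# Marginal (hypothetically complete) density P(Y) as a two-part mixture.
dens_all <- function(y) {
  dens_resp(y) * p_resp + dens_nonresp(y) * (1 - p_resp)
}

# Bayes' rule: P(R = 1 | Y = y).
prob_resp_given_y <- function(y) dens_resp(y) * p_resp / dens_all(y)

sbp_grid <- seq(100, 200, by = 25)
print(round(prob_resp_given_y(sbp_grid), 3))
```

    Under these assumptions the response probability increases with SBP, which is the kind of mechanism the δ-adjustment is meant to probe.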
  • 25. References
    Van Buuren, S., Boshuizen, H. C., Knook, D. L. Multiple imputation of missing blood pressure covariates in survival analysis. https://www.sfu.ca/~jackd/Stat302/A4_Reading.pdf
    Van Buuren, S. (2018). Flexible imputation of missing data. CRC Press. https://stefvanbuuren.name/fimd/
    Thank you for your participation and understanding!