
1714 S. Serneels, T. Verdonck / Computational Statistics & Data Analysis 52 (2008) 1712 – 1727
terms this implies that the influence of contamination placed at z is approximately the same as the influence of a point
of contamination placed at z + with arbitrarily small. This property is also referred to as the local shift sensitivity:an
estimator should have a small local shift sensitivity. Whereas the influence function measures infinitesimal robustness,
the MaxBias curve (Rousseeuw, 1999) assesses global robustness. The MaxBias curve expresses how biased a robust
estimator is with respect to the fraction of contaminated data, given that these are situated at the worst possible position
in space. MaxBias curves all typically have an asymptote: there exists a fraction of contamination beyond which the
estimator is totally unreliable and breaks down. This fraction is the breakdown point and cannot exceed 0.5 (except
for estimators based on ranks or signs, see e.g. (Grize, 1978)or(Davies and Gather, 2005)). The breakdown point
is the most cited property derived from the MaxBias curve, but the shape of the curve should also be considered:
it is possible that an estimator breaks down at 0.5, but has at 0.2 a much higher bias than another estimator which
breaks down at 0.3. In practice data containing 50% outliers seldomly occur so the latter estimator would be preferable
for most applications. Finally, robust estimators invariably have a higher variance at the normal distribution than
classical estimators. Depending on the design of the estimator, the increase in variance compared to the maximum
likelihood (ML) estimator may or may not be drastic. A measure for the increase in variance is the statistical efficiency
(or efficacity). The efficiency of an estimator is the sampling variance of the classical estimator, divided by the sampling
variance of the robust estimator and lies between 0% and 100%.
3. Robust PCA for incomplete data based on an ER covariance matrix
As stated in the introduction, PCA corresponds to a spectral decomposition of the variance–covariance matrix as
= PP
T
, (2)
where the matrix P contains as columns the eigenvectors p
i
of and is a diagonal matrix where the diagonal elements
ii
are the eigenvalues of corresponding to p
i
. In order to construct a method for PCA which is robust and can deal
with missing data, it suffices to obtain an estimate
ˆ
which fulfils both requirements, which can then be decomposed
(2) in order to obtain the principal components.
The literature on robust covariance estimators for data containing missing elements is not abundant, but up to our
knowledge, four approaches exist. The earliest proposal consists of inserting an M estimator into the EM algorithm
(Little, 1988) (which is then called the ER algorithm). However, the M estimator for the covariance matrix used there
is monotonic. A monotonic M estimator is an M estimator where the dispersion matrix is given by
n
−1
n
i=1
W [(x
i
−ˆμ)
ˆ
−1
(x
i
−ˆμ)
T
](x
i
−ˆμ)
T
(x
i
−ˆμ) =
ˆ
, (3)
where x
i
are the rows of the data matrix X, μ is the location as a row vector and the function W(t)t must be nondecreasing
in t. Such M estimators are known to have a low breakdown point (i.e. the fraction of outliers the data may contain
before the estimator yields unreliable results). The breakdown point of monotonic M estimators for the covariance
matrix equals 1/(p + 1) which implies that if the data are for instance, 100-variate, the estimator can only resist
1% of outliers in the data. To remedy this drawback, recently three other estimators have been proposed (Cheng and
Victoria-Feser, 2002). The first proposal is inspired on results concerning robust covariance estimation, where it is
noted that the robustness of an M estimator can be improved drastically if in the algorithm a more resistant estimator
is used as a starting value for the iterative reweighting algorithm. Cheng and Victoria-Feser (2002) propose to use the
minimum covariance determinant MCD estimator (Rousseeuw, 1985) with breakdown 0.5 for these purposes. They
also note that alternatively, the MCD estimator could be extended to missing data as such, without being a part of the
M algorithm. However, they prefer to use the MCD as a starting value for the M estimator as the MCD is known to
have a rather poor statistical efficiency (Cheng and Victoria-Feser, 2002; Croux and Haesbroeck, 1999). Finally, Cheng
and Victoria-Feser also propose to insert a translate-biweight S (TBS) (Rousseeuw and Yohai, 1984) estimator into the
ER algorithm for missing data. Based on simulations and on an example from psychometrics they conclude that both
estimators produce similar results. As the TBS estimator is slightly more complex from the mathematical point of view,
we suggest to use the M estimator as a starting point for PCA. In our simulation study (Section 6) we include both the
original M (Little, 1988) estimator and the M estimator with MCD (k = 0.5) starting value (Cheng and Victoria-Feser,
2002). It is expected that the latter estimator outperforms the plain M estimator.