
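### Principal component analysis for data containing outliers and missing elements

#### Overview

Principal component analysis (PCA) is a data analysis technique widely used in statistics, data mining, pattern recognition and related fields. Its main purpose is to transform the original data, by a linear transformation, into a set of mutually uncorrelated new feature variables, the principal components. The principal components are ordered by how much of the variability of the original data they explain, so that a small number of them can approximate the information in the original data set.

#### Introduction

When a data set contains outliers or missing values, classical PCA may not give satisfactory results. Outliers distort the data distribution and make the estimated principal components inaccurate; missing values leave the data matrix incomplete and interfere with the PCA computation. A PCA method that can handle both problems at the same time is therefore important.

#### Handling data with outliers and missing elements

The study proposes two approaches to principal component analysis:

1. **Eigendecomposition of a covariance matrix**: a covariance matrix that can cope with outliers and missing values is constructed and then eigendecomposed. This approach works when the number of variables is smaller than the number of cases. For high-dimensional data (far more variables than cases) it may no longer be applicable.
2. **Expectation robust (ER) algorithm**: as an alternative, the authors propose an ER algorithm that combines the expectation maximization (EM) algorithm with robust statistics. The EM algorithm mainly addresses the missing data, whereas robust statistics limits the effect of outliers. By combining the two, the ER algorithm can perform robust PCA in the presence of both outliers and missing values.

#### The ER algorithm in detail

The core idea of the ER algorithm is to approach the parameter estimates iteratively. It consists of the following steps:

1. **Initialization**: choose initial parameter estimates.
2. **E-step**: fill in the missing values based on the current parameter estimates, following the idea of the EM algorithm.
3. **R-step**: update the parameter estimates from the filled-in data with a robust statistical method, to reduce the effect of outliers.
4. **Iteration**: repeat the E-step and the R-step until convergence.

#### Performance evaluation

In extensive simulation experiments on data sets of various sizes, the ER algorithm performed well in all cases considered. Simulations and an example also show that existing robust PCA methods keep their properties when the data contain missing values.

#### Keywords explained

- **Robustness**: the ability of an algorithm or model to keep performing well in the presence of outliers.
- **Principal component analysis** (PCA): a dimension reduction technique that projects the data onto the space spanned by the directions of variation of the data themselves.
- **Robust PCA**: PCA methods designed specifically to cope with outliers.
- **Missing data**: the part of the information in a data set that was not recorded.
- **Incomplete data**: a data set that contains missing values.
- **Expectation maximization** (EM): an iterative optimization algorithm often used for statistical models with missing data.
- **Expectation robust** (ER): the method proposed in this study, combining EM with robust statistics.

By introducing the ER algorithm, the authors extend the scope of existing robust PCA methods to data sets that contain both outliers and missing values. This matters for data preprocessing and exploratory data analysis in practical applications.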


Computational Statistics & Data Analysis 52 (2008) 1712 – 1727

www.elsevier.com/locate/csda

Principal component analysis for data containing outliers and missing elements

Sven Serneels (a,∗), Tim Verdonck (b)

(a) ChemometriX Group, Department of Chemistry, University of Antwerp, Belgium

(b) Agoras Group, Department of Mathematics and Informatics, University of Antwerp, Belgium

Received 22 August 2006; received in revised form 21 May 2007; accepted 21 May 2007

Available online 24 May 2007

Abstract

Two approaches are presented to perform principal component analysis (PCA) on data which contain both outlying cases and missing elements. First, an eigendecomposition of a covariance matrix which can deal with such data is proposed, but this approach is not fit for data where the number of variables exceeds the number of cases. Alternatively, an expectation robust (ER) algorithm is proposed so as to adapt the existing methodology for robust PCA to data containing missing elements. According to an extensive simulation study, the ER approach performs well for all data sizes concerned. Using simulations and an example, it is shown that by virtue of the ER algorithm, the properties of the existing methods for robust PCA carry through to data with missing elements.

Keywords: Robustness; Principal component analysis; Robust PCA; Missing data; Incomplete data; Expectation maximization; Expectation robust

1. Introduction

Principal component analysis (PCA) is one of the key tools in multivariate statistical analysis. It aims at constructing components, each of which contains a maximal amount of variation from the data unexplained by the other components. The user thus hopes that the information in the data can be summarized into a few principal components, which is often the case in practice. Once the principal components have been determined, all further analysis can be carried out on them instead of on the original data, as they carry the relevant information. PCA is thus frequently considered a first step of a statistical data analysis which aims at compression of the data: decreasing their dimensionality without losing much information. Further analysis on the principal components can consist of various methods, such as clustering, discriminant analysis, regression, etc.

∗ Corresponding author at: Shell Global Solutions International B.V., Shell Research and Technology Centre, Amsterdam, P.O. Box 38000, 1030 BN Amsterdam, The Netherlands. Tel.: +31 20 6303856.

E-mail addresses: [email protected] (S. Serneels), [email protected] (T. Verdonck).

doi:10.1016/j.csda.2007.05.024

Principal components contain a maximal amount of variation from the data. In mathematics this means that principal components are defined according to a maximization criterion of variance. Let $X \in \mathbb{R}^{n \times p}$ be the data, consisting of $n$ cases observed at $p$ variables. Then the principal components $t_i$ are defined as linear combinations of the data, $t_i = X p_i$, where

$$p_i = \arg\max_{a} \{\operatorname{var}(Xa)\} \tag{1a}$$

under the constraints that

$$\|p_i\| = 1 \quad \text{and} \quad \operatorname{cov}(Xp_i, Xp_j) = 0 \ \text{ for } j < i. \tag{1b}$$

Exact maximization of this criterion can be done by the Lagrange multiplier method and leads to the conclusion that the vectors $p_i$ defining the principal components are the eigenvectors of the variance–covariance matrix $\Sigma = n^{-1} X^{T} X$ (here and elsewhere, we will assume the data to be centred).
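As an illustration of criterion (1a)-(1b), the following minimal NumPy sketch computes classical principal components from the eigendecomposition of $n^{-1} X^{T} X$. The function name `pca_eig` and its return layout are choices made for this example only; they are not taken from the article.

```python
import numpy as np

def pca_eig(X):
    """Classical PCA via the eigendecomposition of the covariance matrix.

    Returns the loadings P (eigenvectors as columns), the eigenvalues in
    decreasing order, and the scores T = Xc @ P of the centred data.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)              # centre the data
    sigma = Xc.T @ Xc / Xc.shape[0]      # covariance matrix n^{-1} X^T X
    eigvals, eigvecs = np.linalg.eigh(sigma)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]    # reorder to decreasing variance
    eigvals, P = eigvals[order], eigvecs[:, order]
    T = Xc @ P                           # principal component scores
    return P, eigvals, T
```

The scores returned by this sketch are mutually uncorrelated and their variances equal the eigenvalues, which is what constraints (1b) and criterion (1a) require.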

Both the variance and the variance–covariance matrix are known to be sensitive to outliers. Hence, the same conclusion holds for PCA as a whole: it is a nonrobust method. A single bad outlier may distort the principal components so that they fit the outlier well, leading to a poor interpretation of the results. Outliers can also cause the so-called masking effect: due to their presence, the model is distorted in such a way that, based on the principal components, no outliers can be detected.
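The effect of a single outlier can be made concrete with a small simulated example. The sketch below uses synthetic data and hypothetical helper names (`first_pc`, `demo_outlier_effect`) chosen for this illustration; it compares the first principal direction of clean bivariate data with that of the same data after one extreme point is added.

```python
import numpy as np

def first_pc(X):
    """First principal direction: eigenvector of the largest eigenvalue."""
    Xc = X - X.mean(axis=0)
    sigma = Xc.T @ Xc / Xc.shape[0]
    _, vecs = np.linalg.eigh(sigma)      # eigenvalues in ascending order
    return vecs[:, -1]

def demo_outlier_effect(seed=0):
    """Return the first PC of clean data and of the data plus one outlier."""
    rng = np.random.default_rng(seed)
    # Clean data: standard deviation 3 along the first axis, 1 along the second.
    clean = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])
    # One extreme point far out along the second axis.
    contaminated = np.vstack([clean, [0.0, 1000.0]])
    return first_pc(clean), first_pc(contaminated)

if __name__ == "__main__":
    p_clean, p_bad = demo_outlier_effect()
    print("clean first PC:       ", p_clean)
    print("contaminated first PC:", p_bad)
```

In this setting the first direction of the clean data lies close to the first coordinate axis, whereas after contamination it points almost exactly at the outlier.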

The sensitivity of principal components to outliers is well known and various robust alternatives have been proposed in the literature. A topic which has, however, not yet been discussed in the context of robust principal components is how to deal with missing data. Especially in the biological and environmental sciences, missing data frequently occur. Data can be missing for different reasons. In what follows, we will assume that the reason why a data point is missing is not related to its actual value, i.e. the data are missing at random (MAR) in the sense of Rubin (1976). Missing completely at random (MCAR) is a stronger hypothesis, but we will assume that the data are at least MAR.

A good method to deal with data containing missing elements is the expectation maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm basically consists of an iterative scheme where in each iteration two steps are carried out: (i) the missing elements are filled in by the values which they are expected to be (the expectation step or E-step) and (ii) the desired entity (e.g. the variance–covariance matrix) is estimated from the data in which missing elements have been filled in (called the maximization step or M-step if the estimates are obtained via maximum likelihood and the robust estimation step or R-step if the estimates are obtained by means of a robust estimation technique). Since the true values of the missing elements are unknown, the procedure is repeated until some convergence criterion is fulfilled. The EM algorithm has been applied to PCA (Walczak and Massart, 2001) on the one hand as well as to robust estimation of the variance–covariance matrix (Cheng and Victoria-Feser, 2002; Little, 1988) on the other hand (in the latter context called expectation-robust, ER).
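The two-step scheme can be sketched for the simplest case, the maximum likelihood estimation of a mean vector and covariance matrix under a multivariate normal model. This is a minimal illustration, not the authors' implementation: the function name `em_mean_cov`, the convergence tolerance and the starting values are assumptions made for this example, and missing entries are assumed to be coded as NaN.

```python
import numpy as np

def em_mean_cov(X, n_iter=100, tol=1e-8):
    """EM estimates of mean and covariance for data with NaN entries.

    Assumes a multivariate normal model and data missing at random.
    Returns (mu, sigma, X_fill), where X_fill holds the last E-step values.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)                 # start: observed column means
    X_fill = np.where(miss, mu, X)
    sigma = np.cov(X_fill, rowvar=False, bias=True)
    for _ in range(n_iter):
        corr = np.zeros((p, p))                # conditional covariance terms
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():                    # row with nothing observed
                X_fill[i] = mu
                corr += sigma
                continue
            # E-step: conditional expectation of the missing part given
            # the observed part, under the current (mu, sigma).
            B = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)])
            X_fill[i, m] = mu[m] + (X[i, o] - mu[o]) @ B
            corr[np.ix_(m, m)] += sigma[np.ix_(m, m)] - sigma[np.ix_(m, o)] @ B
        # M-step: maximum likelihood update from the completed data.
        mu_new = X_fill.mean(axis=0)
        R = X_fill - mu_new
        sigma_new = (R.T @ R + corr) / n
        change = max(np.max(np.abs(mu_new - mu)),
                     np.max(np.abs(sigma_new - sigma)))
        mu, sigma = mu_new, sigma_new
        if change < tol:
            break
    return mu, sigma, X_fill
```

Replacing the maximum likelihood update in the second step by a robust estimator of location and scatter turns this scheme into an ER-type iteration in the sense described above; which robust estimator to plug in is left open here.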

In this article we investigate how the EM (or ER) approach to dealing with missing data can be extended to robust PCA. Two ways to solve this problem seem viable. On the one hand, it is possible to take the eigenvectors of a covariance matrix which has been estimated by the ER scheme. On the other hand, one can incorporate a robust PCA algorithm into the iterative EM scheme, thus obtaining a robust PCA method which can deal with missing elements. Either way, robust PCA for incomplete data always involves, in some sense, including a robust estimator in an iterative scheme to estimate missing elements. Thus it is important to know the properties of the robust estimator used; in Section 2 we present a brief description of properties a robust estimator ideally should possess. In Section 3 the approach to robust PCA based on an ER covariance matrix is discussed, whereas in Section 4 we introduce the EM algorithm for robust PCA. As there is no agreement in the literature on how PCA is best robustified, several robust PCA algorithms are considered. In Section 6 we present an extensive simulation study which enables us to compare the different approaches objectively. Finally, we give an example from the biological sciences.

2. Robustness properties of robust estimators

It is interesting to know which properties a robust method should have. These properties fall into three basic categories: properties related to the influence function, to the MaxBias curve and to the statistical efficiency. The influence function (Hampel et al., 1986) is a tool which measures the effect an infinitesimally small amount of contaminated data has on an estimator as a function of the contaminated data's position in space. For an estimator to be robust, its influence function has to be bounded. However, note that in some special cases a bounded influence function is not sufficient: e.g. a location M-estimator with a bounded $\psi$ function also has a bounded influence function, but a null breakdown point. A good property for a robust estimator is to have not only a bounded influence function, but also a smooth one. In practical terms this implies that the influence of contamination placed at $z$ is approximately the same as the influence of a point of contamination placed at $z + \varepsilon$ with $\varepsilon$ arbitrarily small. This property is also referred to as the local shift sensitivity: an estimator should have a small local shift sensitivity. Whereas the influence function measures infinitesimal robustness, the MaxBias curve (Rousseeuw, 1999) assesses global robustness. The MaxBias curve expresses how biased a robust estimator is with respect to the fraction of contaminated data, given that these are situated at the worst possible position in space. MaxBias curves typically all have an asymptote: there exists a fraction of contamination beyond which the estimator is totally unreliable and breaks down. This fraction is the breakdown point and cannot exceed 0.5 (except for estimators based on ranks or signs, see e.g. Grize (1978) or Davies and Gather (2005)). The breakdown point is the most cited property derived from the MaxBias curve, but the shape of the curve should also be considered: it is possible that an estimator breaks down at 0.5, but has at 0.2 a much higher bias than another estimator which breaks down at 0.3. In practice data containing 50% outliers seldom occur, so the latter estimator would be preferable for most applications. Finally, robust estimators invariably have a higher variance at the normal distribution than classical estimators. Depending on the design of the estimator, the increase in variance compared to the maximum likelihood (ML) estimator may or may not be drastic. A measure for the increase in variance is the statistical efficiency (or efficacity). The efficiency of an estimator is the sampling variance of the classical estimator, divided by the sampling variance of the robust estimator, and lies between 0% and 100%.
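Two of these notions can be explored numerically for univariate location estimators. The sketch below is an illustration under stated assumptions, not part of the article: it uses the finite-sample sensitivity curve as an empirical stand-in for the influence function, and a Monte Carlo estimate of the efficiency at the normal distribution. The function names and the simulation sizes are choices made for this example.

```python
import numpy as np

def sensitivity_curve(estimator, sample, z):
    """Finite-sample analogue of the influence function at position z.

    Measures n * (T(sample with z added) - T(sample)), with n the size of
    the augmented sample.
    """
    n = len(sample) + 1
    return n * (estimator(np.append(sample, z)) - estimator(sample))

def relative_efficiency(robust, classical, n=100, reps=5000, seed=0):
    """Monte Carlo efficiency at the normal distribution.

    Sampling variance of the classical estimator divided by the sampling
    variance of the robust estimator, both computed on the same draws.
    """
    rng = np.random.default_rng(seed)
    draws = rng.normal(size=(reps, n))
    var_classical = np.var([classical(row) for row in draws])
    var_robust = np.var([robust(row) for row in draws])
    return var_classical / var_robust

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(size=99)
    for z in (10.0, 1000.0):
        print(z,
              sensitivity_curve(np.mean, sample, z),
              sensitivity_curve(np.median, sample, z))
    print("efficiency of the median:", relative_efficiency(np.median, np.mean))
```

For the mean the sensitivity curve grows without bound in `z`, whereas for the median it stays constant once `z` lies beyond the sample. The price is a loss of efficiency: asymptotically the median has efficiency $2/\pi \approx 64\%$ relative to the mean at the normal distribution.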

3. Robust PCA for incomplete data based on an ER covariance matrix

As stated in the introduction, PCA corresponds to a spectral decomposition of the variance–covariance matrix as

$$\Sigma = P \Lambda P^{T}, \tag{2}$$

where the matrix $P$ contains as columns the eigenvectors $p_i$ of $\Sigma$ and $\Lambda$ is a diagonal matrix where the diagonal elements $\lambda_i$ are the eigenvalues of $\Sigma$ corresponding to $p_i$. In order to construct a method for PCA which is robust and can deal with missing data, it suffices to obtain an estimate $\hat{\Sigma}$ which fulfils both requirements, which can then be decomposed according to (2) in order to obtain the principal components.

The literature on robust covariance estimators for data containing missing elements is not abundant, but to our knowledge, four approaches exist. The earliest proposal consists of inserting an M estimator into the EM algorithm (Little, 1988) (which is then called the ER algorithm). However, the M estimator for the covariance matrix used there is monotonic. A monotonic M estimator is an M estimator where the dispersion matrix is given by

$$n^{-1} \sum_{i=1}^{n} W\!\left[(x_i - \hat{\mu})\, \hat{\Sigma}^{-1} (x_i - \hat{\mu})^{T}\right] (x_i - \hat{\mu})^{T} (x_i - \hat{\mu}) = \hat{\Sigma}, \tag{3}$$

where $x_i$ are the rows of the data matrix $X$, $\hat{\mu}$ is the location as a row vector and the function $W(t)\,t$ must be nondecreasing in $t$. Such M estimators are known to have a low breakdown point (i.e. the fraction of outliers the data may contain before the estimator yields unreliable results). The breakdown point of monotonic M estimators for the covariance matrix equals $1/(p + 1)$, which implies that if the data are, for instance, 100-variate, the estimator can only resist 1% of outliers in the data. To remedy this drawback, three other estimators have recently been proposed (Cheng and Victoria-Feser, 2002). The first proposal is inspired by results concerning robust covariance estimation, where it is noted that the robustness of an M estimator can be improved drastically if a more resistant estimator is used as a starting value for the iterative reweighting algorithm. Cheng and Victoria-Feser (2002) propose to use the minimum covariance determinant (MCD) estimator (Rousseeuw, 1985) with breakdown 0.5 for this purpose. They also note that, alternatively, the MCD estimator could be extended to missing data as such, without being a part of the M algorithm. However, they prefer to use the MCD as a starting value for the M estimator, as the MCD is known to have a rather poor statistical efficiency (Cheng and Victoria-Feser, 2002; Croux and Haesbroeck, 1999). Finally, Cheng and Victoria-Feser also propose to insert a translated-biweight S (TBS) estimator (Rousseeuw and Yohai, 1984) into the ER algorithm for missing data. Based on simulations and on an example from psychometrics they conclude that both estimators produce similar results. As the TBS estimator is slightly more complex from the mathematical point of view, we suggest using the M estimator as a starting point for PCA. In our simulation study (Section 6) we include both the original M estimator (Little, 1988) and the M estimator with MCD (k = 0.5) starting value (Cheng and Victoria-Feser, 2002). It is expected that the latter estimator outperforms the plain M estimator.
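To make the fixed-point equation (3) concrete, the following sketch solves it by iterative reweighting for complete data. Several ingredients are assumptions made for this illustration and are not taken from the article: the Huber-type weight $W(t) = \min(1, c/t)$ (for which $W(t)\,t = \min(t, c)$ is nondecreasing, so the estimator is monotonic), the tuning constant `c`, the use of the same weights for the location update, the classical starting values, and the absence of a consistency factor (so $\hat{\Sigma}$ is only determined up to scale, which does not affect its eigenvectors).

```python
import numpy as np

def huber_weight(t, c):
    """W(t) = min(1, c / t); W(t) * t = min(t, c) is nondecreasing in t."""
    t = np.maximum(t, 1e-12)             # guard against division by zero
    return np.minimum(1.0, c / t)

def m_estimator_scatter(X, c=None, n_iter=200, tol=1e-9):
    """Monotonic M estimator of location and scatter by iterative reweighting.

    Solves n^{-1} sum_i W(d_i^2) (x_i - mu)^T (x_i - mu) = Sigma, with d_i^2
    the squared Mahalanobis distance of row x_i. Returns (mu, sigma).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if c is None:
        c = p + 2.0 * np.sqrt(2.0 * p)   # hypothetical tuning constant
    mu = X.mean(axis=0)                  # classical starting values
    sigma = np.cov(X, rowvar=False, bias=True)
    for _ in range(n_iter):
        R = X - mu
        d2 = np.einsum("ij,ij->i", R @ np.linalg.inv(sigma), R)
        w = huber_weight(d2, c)
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()   # weighted mean
        R = X - mu_new
        sigma_new = (w[:, None] * R).T @ R / n            # weighted scatter
        change = max(np.max(np.abs(mu_new - mu)),
                     np.max(np.abs(sigma_new - sigma)))
        mu, sigma = mu_new, sigma_new
        if change < tol:
            break
    return mu, sigma
```

Robust principal directions then follow from `np.linalg.eigh(sigma)`. Note that this sketch starts from the classical estimates; the MCD starting value discussed above is not implemented here, and the fraction of outliers such a monotonic estimator tolerates remains limited by $1/(p + 1)$.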
