Principal Component Analysis
Mason Ziemer
12/2/16
Abstract
One problem that often crops up in data analysis is the presence of a high-dimensional dataset. In this report we will explore a specific dimension-reduction technique called Principal Component Analysis. The aim of this report is to use eigenvectors, eigenvalues, and orthogonality to understand the concept of Principal Component Analysis (PCA) and to show why PCA is useful.
Introduction
The aim of Principal Component Analysis lies in the title: finding the principal components of the data. PCA is used to project data into a new, lower-dimensional coordinate system whose axes correspond to the principal components. What is PCA useful for? It reduces the dimensionality of a dataset, which in turn improves the efficiency of running a machine learning algorithm; the simplified dataset also allows these algorithms to run faster. So, what is a principal component? A principal component is a direction along which the most variance in the data lies. To get a visual, the first principal component of data on the x-y plane is shown below.
As you can see above, the first principal component is the line along which the data varies the most. Say we want to project the data onto the first principal component only. This would effectively reduce the dimension of the dataset from two dimensions to one while retaining as much information as possible. Although the projection discards some information, it still combines information from both x and y. This is what the projection looks like.
[Figure: data on the x-y plane with the first principal component drawn through it, followed by the one-dimensional projection of the data onto that component.]
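This kind of projection can be sketched in R (toy data; the variable names here are illustrative, not part of the report's example):

```r
# A minimal sketch: project 2-D points onto their first principal
# component, reducing them to one dimension.
set.seed(1)
x = rnorm(100)
y = 2 * x + rnorm(100, sd = 0.5)          # correlated 2-D data
X = scale(cbind(x, y), center = TRUE, scale = FALSE)
v1 = eigen(cov(X))$vectors[, 1]           # first principal component (unit vector)
proj = X %*% v1                           # 1-D coordinates along v1
var(as.vector(proj))                      # equals the largest eigenvalue
```

The variance of the projected coordinates equals the largest eigenvalue of the covariance matrix, which is exactly the "most variance" the first principal component captures.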
Here is the data projected onto the first principal component. Now, looking back at the original graph, the second principal component must be orthogonal to the first in order to capture the most remaining variance that the first principal component did not. Here is what the second principal component looks like.
If we were to perform PCA and project the data onto the first two principal components, then no information would be lost in the transformation. This is because we are transforming the data from the x-y plane, which has two dimensions, into a new two-dimensional space whose axes are the two principal components. Completing this transformation merely rotates the data onto the new axes and looks like this:
[Figures: the original x-y scatter with both the first and second principal components drawn in, and the same data rotated so that the principal components become the new axes.]
As you can see, none of the data has changed; we are just looking at it from a different angle.
Eigenvalues and Eigenvectors
In mathematical terms, the principal components are the eigenvectors of the covariance matrix of the dataset. The example below illustrates how to obtain the covariance matrix along with its eigenvectors and eigenvalues. The eigenvectors of the covariance matrix point in the directions along which the most variance in the data lies. Each eigenvector has a corresponding eigenvalue, a scalar that denotes the amount of variance in the data along that eigenvector. There can only be as many eigenvector-eigenvalue pairs as there are variables in the dataset. The larger an eigenvalue, the more variance in the data its eigenvector accounts for. In the previous example, since the data lies on the x-y plane, there are only two eigenvectors with corresponding eigenvalues. For any dataset, the first principal component is the eigenvector that corresponds to the largest eigenvalue. It is also important to note that the matrix formed by the dataset does not have to be square; the variables make up the columns of the matrix, while the observations make up the rows. The data matrix need not be square because we take the eigenvalues of the covariance matrix, which is always square, as will be explained in the example below.
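As a small sketch of the defining property (the names here are illustrative), every eigenvector v of a covariance matrix A satisfies A v = λ v:

```r
# Sketch: verify A v = lambda v for the covariance matrix of toy 2-D data
set.seed(2)
A = cov(cbind(rnorm(50), rnorm(50)))      # 2x2 covariance matrix
e = eigen(A)
v = e$vectors[, 1]                        # first eigenvector
lambda = e$values[1]                      # its eigenvalue
max(abs(A %*% v - lambda * v))            # ~0, so A v = lambda v holds
```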
Example
For an example and implementation of PCA, I will refer to the iris dataset in R. iris contains measurements, in cm, of 150 iris flowers on four features: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. The dataset also records the species of each iris flower; the three species are setosa, versicolor, and virginica. The four features make up the columns of our matrix, while the 150 observations of each feature make up the rows. Here is what the first six rows of the data look like.
> head(iris[-5])
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
The next step is to find the covariance matrix, which can be computed by the following formula.
COV(X, Y) = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)
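As a quick sanity check (a sketch using two of the iris columns), this formula agrees with R's built-in cov():

```r
# Compute COV(X, Y) by the formula above and compare with cov()
x = iris$Sepal.Length
y = iris$Sepal.Width
n = length(x)
manual = sum((x - mean(x)) * (y - mean(y))) / (n - 1)
manual                                    # matches cov(x, y): -0.042434
```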
Since our dataset has four variables (a four-dimensional dataset), the covariance between every pair of variables can be measured. It is also important to remember that the covariance of a variable with itself, COV(X, X), just equals its variance, VAR(X). Suppose we use arbitrary variables W, X, Y, and Z to set up the covariance matrix for this example. The resulting 4x4 matrix will look like this:
VAR(W)     COV(W, X)  COV(W, Y)  COV(W, Z)
COV(X, W)  VAR(X)     COV(X, Y)  COV(X, Z)
COV(Y, W)  COV(Y, X)  VAR(Y)     COV(Y, Z)
COV(Z, W)  COV(Z, X)  COV(Z, Y)  VAR(Z)
It is also important to note that COV(X, Y) equals COV(Y, X); hence the matrix is symmetric about the diagonal, and the diagonal holds the variances of W, X, Y, and Z. The covariance matrix for our dataset can be obtained in R with the following command.
> cov(iris[-5])
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
Now that we have obtained the covariance matrix for the iris dataset, we can go ahead and find its eigenvectors and their corresponding eigenvalues. Remember, an eigenvector is a nonzero vector x⃗ such that Ax⃗ = λx⃗ for some scalar λ. The scalar λ is the eigenvalue for the corresponding eigenvector. Solving for the eigenvectors and eigenvalues (with X denoting the covariance matrix computed above), we get:
> eigen(X)$vectors
                     v1          v2          v3         v4
Sepal.Length 0.36138659 -0.65658877 -0.58202985 0.3154872
Sepal.Width -0.08452251 -0.73016143 0.59791083 -0.3197231
Petal.Length 0.85667061 0.17337266 0.07623608 -0.4798390
Petal.Width 0.35828920 0.07548102 0.54583143 0.7536574
As you can see, our first eigenvector, better known as the first principal component, is dominated by Petal.Length, with a loading of 0.85667. This means that Petal.Length contributes the most to the direction of greatest variation in the data. So, if we wanted to reduce our dataset to one variable, Petal.Length would be the best choice.
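Since the principal components must be mutually orthogonal, a quick check (a sketch) is that the eigenvector matrix is orthonormal:

```r
# The columns of V are orthonormal: t(V) %*% V is the identity matrix,
# so the principal components form a set of perpendicular unit axes.
V = eigen(cov(iris[-5]))$vectors
round(t(V) %*% V, 10)
```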
> eigen(X)$values
[1] 4.22824171 0.24267075 0.07820950 0.02383509
The eigenvalues of the covariance matrix tell us how much variance is explained by each eigenvector. Note that the first eigenvalue, 4.228, is much larger than the following three. Thus, the proportion of variance explained by the first eigenvector is equal to:

λ₁ / (λ₁ + λ₂ + λ₃ + λ₄) = 0.9246
This means that 92.46% of the variance in the data is captured by the first principal component. If we want the proportion of overall variance explained by the first two principal components, we just add λ₂ to the numerator, and the ratio then equals 97.77%. So, 97.77% of the variance in the data containing 4 variables can be explained by the first two principal components.
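The proportion-of-variance calculation above can be sketched directly from the eigenvalues:

```r
# Proportion of variance explained by each principal component
lambda = eigen(cov(iris[-5]))$values
prop = lambda / sum(lambda)
round(prop[1], 4)                         # 0.9246
round(sum(prop[1:2]), 4)                  # 0.9777
```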
Projection
Now that we have obtained the eigenvectors and eigenvalues, it is time to project the data onto fewer dimensions. Since we computed above that the first two principal components make up almost 98% of the variance in the data, we will project our data onto the first two principal components. We find the coordinates by solving the equation A = XV, where X is our original matrix with 4 columns and 150 rows (note that this matrix has to be centered so each column has mean 0), V is our matrix of eigenvectors, and A is the matrix of coordinates in the new principal component space spanned by the eigenvectors in V.
X = scale(iris[1:4], center = TRUE, scale = FALSE)
eig = eigen(cov(iris[1:4]))
scores = data.frame(X %*% eig$vectors)
colnames(scores) = c("Prin1", "Prin2", "Prin3", "Prin4")
scores[1:10, ]
Prin1 Prin2 Prin3 Prin4
1 -2.684126 -0.31939725 -0.02791483 0.002262437
2 -2.714142 0.17700123 -0.21046427 0.099026550
3 -2.888991 0.14494943 0.01790026 0.019968390
4 -2.745343 0.31829898 0.03155937 -0.075575817
5 -2.728717 -0.32675451 0.09007924 -0.061258593
6 -2.280860 -0.74133045 0.16867766 -0.024200858
7 -2.820538 0.08946138 0.25789216 -0.048143106
8 -2.626145 -0.16338496 -0.02187932 -0.045297871
9 -2.886383 0.57831175 0.02075957 -0.026744736
10 -2.672756 0.11377425 -0.19763272 -0.056295401
The commands above give the coordinates, or scores, for each principal component. Since we know about 98% of the variance in the data is captured by the first two principal components, we will use the first two columns of scores to plot our dataset in 2 dimensions with the following command in R. The axes of this new two-dimensional projection are the first two principal components.
plot(scores$Prin1, scores$Prin2,
     main = "Data Projected on First 2 Principal Components",
     xlab = "First Principal Component",
     ylab = "Second Principal Component",
     col = c("green", "red", "blue")[iris$Species])
Note: the three different colors represent the species of iris flower.
Conclusion
What was just accomplished was the exact goal of PCA. We were able to effectively reduce our iris dataset from four dimensions down to two while maintaining nearly 98% of the original variance. We were able to do this by using the concepts of eigenvalues and eigenvectors. To review: we start by setting up the data matrix, which has the observations as rows and the variables as columns. The next step is to compute the covariance matrix of the data, which is an NxN matrix, where N is the number of variables. We then find the eigenvectors and corresponding eigenvalues of the covariance matrix; the eigenvectors make up the principal components. Next, we analyze the eigenvectors and eigenvalues to see how much variability is accounted for by each component and which variable contributes the most to each eigenvector. Once you decide how many dimensions the projection should have, the scores, or coordinates, on the new axes need to be obtained. The final step is to plot the data to see what the reduced dimensions look like, and PCA is successfully completed!
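For reference, the entire pipeline reviewed above is also bundled into R's built-in prcomp() function. As a sketch (component signs may flip, since eigenvectors are only determined up to sign), it should reproduce our results:

```r
# PCA on iris via prcomp(): the same eigendecomposition of the covariance matrix
pca = prcomp(iris[1:4], center = TRUE, scale. = FALSE)
summary(pca)                  # PC1 explains ~92.46% of the variance
head(pca$x[, 1:2])            # scores on the first two principal components
```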
  • 9. C. (2015). Principal ComponentAnalysis4Dummies: Eigenvectors,EigenvaluesandDimensionReduction. RetrievedDecember13, 2016, from https://georgemdallas.wordpress.com/2013/10/30/prin cipal-component-analysis-4-dummies-eigenvectors- eigenvalues-and-dimension-reduction/ PRINCIPALCOMPONENTSANALYSIS.(n.d.).RetrievedDecember13,2016, from http://www.bing.com/cr?IG=BE9D91B52E12482181305171F3DE2744&CID=29E B5DDB568F60A634A7543057BE6161&rd=1&h=w3menlZrSNEgeE4CqkgKpvMgpi xKBovnov7bpaVv7sg&v=1&r=http://www4.ncsu.edu/~slrace/LinearAlgebra2016 /RChapters/PCA.pdf&p=DevEx,5037.1