DIMENSION REDUCTION
DEFINITION
• Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.
DIMENSIONS
• The number of input variables or features for a dataset is
referred to as its dimensionality.
• Dimensionality reduction refers to techniques that
reduce the number of input variables in a dataset.
• More input features often make a predictive modeling
task more challenging to model, more generally referred
to as the curse of dimensionality.
• High-dimensionality statistics and dimensionality
reduction techniques are often used for data
visualization. Nevertheless, these techniques can be used
in applied machine learning to simplify a classification or
regression dataset in order to better fit a predictive
model.
Problem With Many Input Variables
• The performance of machine learning algorithms can degrade with
too many input variables.
• If your data is represented using rows and columns, such as in a
spreadsheet, then the input variables are the columns that are fed
as input to a model to predict the target variable.
• Input variables are also called features.
• We can consider the columns of data representing dimensions on
an n-dimensional feature space and the rows of data as points in
that space.
• This is a useful geometric interpretation of a dataset.
• Having a large number of dimensions in the feature space can mean
that the volume of that space is very large, and in turn, the points
that we have in that space (rows of data) often represent a small
and non-representative sample.
As the number of features increases, the number of samples needed to cover the feature space also grows. The more features we have, the more samples we need for all combinations of feature values to be well represented in our sample.
As the number of features increases, the model becomes more complex.
The more features there are, the greater the chance of overfitting.
A machine learning model that is trained on a large number of features becomes increasingly dependent on the data it was trained on and, in turn, overfitted, resulting in poor performance on real data and defeating the purpose.
WHY ????
• This can dramatically impact the performance
of machine learning algorithms fit on data with
many input features, generally referred to as
the “curse of dimensionality.”
• Therefore, it is often desirable to reduce the
number of input features.
• This reduces the number of dimensions of the
feature space, hence the name “dimensionality
reduction.”
Advantages of Dimension reduction
• Less misleading data means model accuracy improves.
• Fewer dimensions mean less computation, and less data means that algorithms train faster.
• Less data means less storage space is required.
• Fewer dimensions allow the use of algorithms that are unsuitable for a large number of dimensions.
• Removes redundant features and noise.
Dimensionality Reduction
• High-dimensionality might mean hundreds, thousands,
or even millions of input variables.
• Fewer input dimensions often mean correspondingly
fewer parameters or a simpler structure in the machine
learning model, referred to as degrees of freedom. A
model with too many degrees of freedom is likely to
overfit the training dataset and therefore may not
perform well on new data.
• It is desirable to have simple models that generalize well,
and in turn, input data with few input variables. This is
particularly true for linear models where the number of
inputs and the degrees of freedom of the model are
often closely related.
CURSE OF DIMENSIONALITY
The fundamental reason for the curse of
dimensionality is that high-dimensional
functions have the potential to be much more
complicated than low-dimensional ones, and
that those complications are harder to discern.
The only way to beat the curse is to
incorporate knowledge about the data that is
correct.
When ???
• Dimensionality reduction is a data preparation
technique performed on data prior to modeling.
• It might be performed after data cleaning and
data scaling and before training a predictive
model.
• … dimensionality reduction yields a more
compact, more easily interpretable
representation of the target concept, focusing
the user’s attention on the most relevant
variables.
Which data to be considered???
any dimensionality reduction performed on
training data must also be performed on new
data, such as a test dataset, validation dataset,
and data when making a prediction with the
final model.
Techniques for Dimensionality Reduction
• Feature Selection Methods
– use scoring or statistical methods to select which features to
keep and which features to delete.
– … perform feature selection, to remove “irrelevant” features
that do not help much with the classification problem.
• Matrix Factorization
– matrix factorization methods can be used to reduce a dataset
matrix into its constituent parts.
– The parts can then be ranked and a subset of those parts can be
selected that best captures the salient structure of the matrix
that can be used to represent the dataset.
– The most common approach to dimensionality reduction is
called principal components analysis or PCA.
Techniques for Dimensionality Reduction
• Manifold Learning
– Techniques from high-dimensionality statistics can also be used for
dimensionality reduction.
– In mathematics, a projection is a kind of function or mapping that transforms
data in some way.
– Kohonen Self-Organizing Map (SOM).
• Autoencoder Methods
– Deep learning neural networks can be constructed to perform dimensionality
reduction.
– A popular approach is called autoencoders. This involves framing a self-
supervised learning problem where a model must reproduce the input
correctly.
– An auto-encoder is a kind of unsupervised neural network that is used for
dimensionality reduction and feature discovery. More precisely, an auto-
encoder is a feedforward neural network that is trained to predict the input
itself.
Linear Dimensionality Reduction Methods
• The most common and well-known dimensionality reduction methods are the ones that apply linear transformations, such as:
• PCA (Principal Component Analysis): popularly used for dimensionality reduction in continuous data.
• PCA rotates and projects data along the directions of increasing variance.
• The directions with the maximum variance are the principal components.
PCA
• The variables are transformed into a new set of variables, which are linear combinations of the original variables.
• This new set of variables is known as the principal components.
• They are obtained in such a way that the first principal component accounts for most of the possible variation in the original data, after which each succeeding component has the highest possible remaining variance.
• The second principal component must be orthogonal to the first principal component.
• In other words, it does its best to capture the variance in the data that is not captured by the first principal component.
• For a two-dimensional dataset, there can be only two principal components. The original slide shows a snapshot of the data and its first and second principal components.
• You can notice that the second principal component is orthogonal to the first principal component.
• STEP 1: STANDARDIZATION
– The aim of this step is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis.
– If there are large differences between the ranges of the initial variables, those variables with larger ranges will dominate over those with small ranges.
– For example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1, which will lead to biased results.
– So, transforming the data to comparable scales can prevent this problem.
– All the variables will be transformed to the same scale.
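A minimal sketch of this standardization step in Python; the array X and its values are illustrative, not taken from the slides:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 10.0],
              [2.0, 90.0],
              [4.0, 50.0]])                          # toy data, one row per sample

X_std = StandardScaler().fit_transform(X)            # scikit-learn route
X_std_manual = (X - X.mean(axis=0)) / X.std(axis=0)  # equivalent z-score by hand
# Both give zero-mean, unit-variance columns, so no variable dominates by scale.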
STEP 2: COVARIANCE MATRIX COMPUTATION
• The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them.
• Sometimes variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.
• The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables.
• For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:
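The matrix shown on the original slide is an image; for the three variables x, y, and z it takes the standard form

         | Cov(x,x)  Cov(x,y)  Cov(x,z) |
Cov  =   | Cov(y,x)  Cov(y,y)  Cov(y,z) |
         | Cov(z,x)  Cov(z,y)  Cov(z,z) |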
• Cov(a,a) = Var(a), so the main diagonal holds the variance of each initial variable.
• The covariance is commutative: Cov(a,b) = Cov(b,a).
• Therefore, the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.
• What do the covariances that we have as
entries of the matrix tell us about the
correlations between the variables?
• It’s actually the sign of the covariance that matters:
• If positive: the two variables increase or decrease together (correlated).
• If negative: one increases when the other decreases (inversely correlated).
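As a short sketch, the covariance matrix and the signs of its entries can be inspected with NumPy; the standardized toy array below is illustrative:

import numpy as np

X_std = np.array([[-1.2,  1.1, -0.9],     # toy standardized data: rows are samples,
                  [ 0.1, -0.3,  0.2],     # columns are the variables x, y, z
                  [ 1.1, -0.8,  0.7]])

cov = np.cov(X_std, rowvar=False)         # p x p symmetric matrix (here 3 x 3)
# cov[i, i] is the variance of variable i, and cov[i, j] == cov[j, i].
# A positive off-diagonal entry means the two variables move together;
# a negative one means one increases while the other decreases.
print(cov)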
• STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES
OF THE COVARIANCE MATRIX TO IDENTIFY THE PRINCIPAL
COMPONENTS
• What are the principal components of the data?
– Principal components are new variables that are constructed as
linear combinations or mixtures of the initial variables.
– These combinations are done in such a way that the new
variables (i.e., principal components) are uncorrelated and most
of the information within the initial variables is squeezed or
compressed into the first components.
– So, the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on.
• This allows you to reduce dimensionality without losing much information, by discarding the components with low information and considering the remaining components as your new variables.
• An important thing to realize here is that the principal components are less interpretable and don’t have any real meaning, since they are constructed as linear combinations of the initial variables.
• Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture most of the information in the data.
• The relationship between variance and information here is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it has.
• To put all this simply, just think of principal components as new axes that provide the best angle to see and evaluate the data, so that the differences between the observations are better visible.
HOW PCA CONSTRUCTS THE
PRINCIPAL COMPONENTS?
• there are as many principal components as
there are variables in the data,
• principal components are constructed in
such a manner that the first principal
component accounts for the largest
possible variance in the data set.
• For example, let’s assume that the scatter plot of our data set is as shown on the original slide; can we guess the first principal component?
• Yes, it’s approximately the line that
matches the purple marks because it goes
through the origin and it’s the line in which
the projection of the points (red dots) is
the most spread out.
• Or mathematically speaking, it’s the line
that maximizes the variance (the average
of the squared distances from the
projected points (red dots) to the origin).
• The second principal component is calculated
in the same way, with the condition that it is
uncorrelated with (i.e., perpendicular to) the
first principal component and that it accounts
for the next highest variance.
• This continues until a total of p principal
components have been calculated, equal to
the original number of variables.
eigenvectors and eigenvalues
• they always come in pairs, so that every
eigenvector has an eigenvalue.
• And their number is equal to the number of
dimensions of the data.
• For example, for a 3-dimensional data set,
there are 3 variables, therefore there are 3
eigenvectors with 3 corresponding
eigenvalues.
Eigenvector and Eigenvalue
• For a square matrix A, an eigenvector v and an eigenvalue λ make this equation true (if we can find them): Av = λv
• We start by finding the eigenvalue: we know this
equation must be true:
Av = λv
• Now let us put in an identity matrix so we are
dealing with matrix-vs-matrix:
Av = λ I v
• Bring all to left hand side:
Av − λIv = 0
• If v is non-zero then we can solve for λ using just
the determinant:
| A − λI | = 0
A * Eigenvector − Eigenvalue * Eigenvector = 0
Example
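The worked numerical example on the original slides is an image and is not reproduced here. As a stand-in, a small NumPy sketch (toy symmetric matrix, values illustrative) shows how eigenpairs are obtained and that they satisfy Av = λv:

import numpy as np

A = np.array([[0.6, 0.2],
              [0.2, 0.3]])                    # a small symmetric, covariance-like matrix

eigenvalues, eigenvectors = np.linalg.eig(A)  # columns of `eigenvectors` are the v's
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))            # True: A v = lambda v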
Eigenvectors & Covariance matrix
Relationship unleashed
• the eigenvectors of the Covariance matrix are
actually the directions of the axes where there is the
most variance (most information) and that we call
Principal Components.
• And eigenvalues are simply the coefficients attached
to eigenvectors, which give the amount of variance
carried in each Principal Component.
• By ranking your eigenvectors in order of their
eigenvalues, highest to lowest, you get the principal
components in order of significance.
• let’s suppose that our data set is 2-dimensional with 2 variables x, y and that the eigenvectors and eigenvalues of the covariance matrix are v1 with λ1 = 1.28 and v2 with λ2 = 0.04.
• If we rank the eigenvalues in descending order, we get λ1 > λ2, which means that the eigenvector that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the second component (PC2) is v2.
• After having the principal components, to compute the percentage of variance (information) accounted for by each component, we divide the eigenvalue of each component by the sum of eigenvalues.
• If we apply this to the example above, we find that PC1 and PC2 carry respectively about 96% and 4% of the variance of the data:
1.28/(1.28 + 0.04) ≈ 0.96 from PC1
0.04/(1.28 + 0.04) ≈ 0.04 from PC2
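The same percentage-of-variance calculation, as a short sketch using the slide’s eigenvalues:

import numpy as np

eigenvalues = np.array([1.28, 0.04])          # lambda_1 and lambda_2 from the example
explained = eigenvalues / eigenvalues.sum()   # [0.9697, 0.0303], i.e. roughly the 96% / 4% split
print(explained)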
STEP 4: FEATURE VECTOR
• As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance.
• In this step, we choose whether to keep all these components or discard those of lesser significance (of low eigenvalues), and form with the remaining ones a matrix of vectors that we call the feature vector.
• So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep.
• This makes it the first step towards dimensionality reduction, because if we choose to keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.
• Continuing with the example from the previous step, we can either form a feature vector with both of the eigenvectors v1 and v2, or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector with v1 only.
• Discarding the eigenvector v2 will reduce dimensionality by 1, and will consequently cause a loss of information in the final data set.
• But given that v2 was carrying only 4% of the information, the loss will therefore not be important, and we will still have 96% of the information that is carried by v1.
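A short sketch of forming the feature vector: sort the eigenpairs by eigenvalue and keep the top k eigenvectors as columns (the toy matrix and the names are illustrative):

import numpy as np

A = np.array([[0.6, 0.2],                    # toy covariance matrix
              [0.2, 0.3]])
eigenvalues, eigenvectors = np.linalg.eig(A)

order = np.argsort(eigenvalues)[::-1]        # eigenvalue indices, largest first
k = 1                                        # keep only PC1, as in the example
feature_vector = eigenvectors[:, order[:k]]  # shape (n_features, k): kept eigenvectors as columns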
LAST STEP : RECAST THE DATA ALONG THE
PRINCIPAL COMPONENTS AXES
• the aim is to use the feature vector formed
using the eigenvectors of the covariance
matrix, to reorient the data from the original
axes to the ones represented by the principal
components (hence the name Principal
Components Analysis).
• This can be done by multiplying the transpose
of the original data set by the transpose of the
feature vector.
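In code, this recasting is a single matrix product. A sketch, assuming X_std holds the standardized data (one row per sample) and feature_vector holds the kept eigenvectors as columns; the numbers are illustrative:

import numpy as np

X_std = np.array([[-1.63, -0.63],            # standardized data, rows = samples
                  [ 0.42,  1.05]])
feature_vector = np.array([[ 0.45],          # kept eigenvector(s) as columns
                           [-0.90]])

# Same operation as FeatureVector^T · StandardizedData^T from the slide, transposed back:
X_projected = X_std @ feature_vector         # shape (n_samples, k)
print(X_projected)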
EXAMPLE
TO CALCULATE THE PROJECTIONS
From the worked example, the eigenvectors are V1 = [0.45, -0.90] and V2 = [-0.90, -0.45], and D = [-1.63, -0.63] is one standardized data point.
Projection of D on V1: [-1.63, -0.63] · [0.45, -0.90] = (-1.63)(0.45) + (-0.63)(-0.90) = -0.1665
Projection of D on V2: [-1.63, -0.63] · [-0.90, -0.45] = (-1.63)(-0.90) + (-0.63)(-0.45) ≈ 1.75
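The same two dot products, checked with NumPy (eigenvector signs as implied by the slide’s arithmetic):

import numpy as np

v1 = np.array([ 0.45, -0.90])
v2 = np.array([-0.90, -0.45])
d  = np.array([-1.63, -0.63])

print(d @ v1)   # -0.1665
print(d @ v2)   # 1.7505, about 1.75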
Linear Dimensionality Reduction Methods
• Factor Analysis:
• A technique that is used to reduce a large number of variables into a smaller number of factors.
• The values of observed data are expressed as functions of a number of possible causes in order to find which are the most important.
• The observations are assumed to be caused by a linear transformation of lower-dimensional latent factors plus added Gaussian noise.
• LDA (Linear Discriminant Analysis):
• Projects data in a way that the class separability is maximised.
• Examples from the same class are put closely together by the projection.
• Examples from different classes are placed far apart by the projection (see the sketch below).
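A minimal scikit-learn sketch of LDA used as a supervised dimensionality reducer; the toy data, labels, and choice of one component are illustrative:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[1.0, 2.0], [1.2, 1.9],      # toy two-class data
              [3.0, 4.1], [3.1, 4.0]])
y = np.array([0, 0, 1, 1])                 # class labels

lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)            # 1-D projection chosen to maximise class separability
print(X_lda)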
Reference: https://builtin.com/data-science/step-step-explanation-principal-component-analysis
Tips for Dimensionality Reduction
• There is no best technique for dimensionality reduction and
no mapping of techniques to problems.
• Instead, the best approach is to use systematic controlled
experiments to discover what dimensionality reduction
techniques, when paired with your model of choice, result
in the best performance on your dataset.
• Typically, linear algebra and manifold learning methods
assume that all input features have the same scale or
distribution.
• This suggests that it is good practice to either normalize or
standardize data prior to using these methods if the input
variables have differing scales or units.
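Putting these tips together, a minimal end-to-end sketch with scikit-learn; the random placeholder data and the choice of three components are illustrative only:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))            # placeholder data: 100 samples, 10 features

pipeline = make_pipeline(StandardScaler(), PCA(n_components=3))
X_reduced = pipeline.fit_transform(X)     # standardize, then keep 3 principal components
print(X_reduced.shape)                    # (100, 3)
print(pipeline.named_steps["pca"].explained_variance_ratio_)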