TECHNIQUES FOR BIG DATA
FEATURE EXTRACTION USING
DISTANCE COVARIANCE
BASED PCA
Big Data
 'Big Data' is a blanket term for any collection of data
sets so large and complex that it becomes difficult to
process using on-hand database management tools or
traditional data processing applications.
 Big data requires exceptional technologies to efficiently
process large quantities of data within tolerable
elapsed times. A 2011 McKinsey report suggests
suitable technologies include crowdsourcing, data fusion
and integration, genetic algorithms, machine learning,
natural language processing, signal processing,
simulation, time series analysis and visualization.
How Big is Big Data?
 Very large, distributed aggregations of loosely structured data – often
incomplete and inaccessible.
 Petabytes/exabytes of data; millions/billions of people; billions/trillions of
records.
 Loosely-structured and often distributed data.
 Flat schemas with few complex interrelationships
 Often involving time-stamped events
 Often made up of incomplete data
 Often including connections between data elements that must be
probabilistically inferred.
 Applications that involve big data can be transactional (e.g., Facebook,
PhotoBox) or analytic (e.g., ClickFox, Merced Applications).
 (Reference: Wikibon.org)
Big Data
Big data can be of three types:
1. Large number of attributes (>16)
2. Large number of samples
3. Large numbers of both attributes and samples
I have tried to work on the first case.
What is Dimensionality Reduction?
 Dimensionality reduction or dimension reduction
is the process of reducing the number of random
variables under consideration (or attributes or
features or descriptors), and can be divided into
feature selection and feature extraction.
Feature Selection
 Filters: Pearson’s Correlation
 Wrappers: Run a classifier again and again, each
time with a new set of features selected using
backward selection or forward selection.
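The filter approach above can be sketched in a few lines. The `filter_select` helper and the toy data below are illustrative, not from the slides; they show ranking features by absolute Pearson correlation with a target:

```python
import numpy as np

def filter_select(X, y, k):
    """Rank features by absolute Pearson correlation with the
    target and keep the top k (a simple 'filter' method)."""
    Xc = X - X.mean(axis=0)            # center each feature
    yc = y - y.mean()                  # center the target
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return np.argsort(-np.abs(r))[:k]  # indices of the k best features

# toy data: feature 0 tracks y, feature 1 is noise
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
print(filter_select(X, y, 1))  # -> [0]
```

A wrapper method would instead re-train a classifier on each candidate feature subset, which is far more expensive than this one-pass filter.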
Feature Extraction
 Feature extraction transforms the data in the high-
dimensional space to a space of fewer dimensions.
The data transformation may be linear, as in
principal component analysis (PCA), but many
nonlinear dimensionality reduction techniques also
exist. For multidimensional data, tensor
representation can be used in dimensionality
reduction through multilinear subspace learning.
Feature Extraction
 The main linear technique for dimensionality
reduction, principal component analysis, performs a
linear mapping of the data to a lower-dimensional
space in such a way that the variance of the data in
the low-dimensional representation is maximized.
What is Principal Component Analysis?
 Principal component analysis (PCA) is a statistical procedure that
uses an orthogonal transformation to convert a set of observations
of possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components. The number of
principal components is less than or equal to the number of original
variables. This transformation is defined in such a way that the first
principal component has the largest possible variance (that is,
accounts for as much of the variability in the data as possible), and
each succeeding component in turn has the highest variance possible
under the constraint that it is orthogonal to (i.e., uncorrelated with)
the preceding components. Principal components are guaranteed to
be independent if the data set is jointly normally distributed. PCA is
sensitive to the relative scaling of the original variables.
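As a concrete illustration of the variance-maximizing transformation described above, here is a minimal NumPy sketch of classical PCA (the `pca` helper and the random data are illustrative):

```python
import numpy as np

def pca(X, k):
    """Classical PCA: center the data, eigendecompose the
    covariance matrix, project onto the k leading eigenvectors."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)             # covariance of the variables
    vals, vecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    W = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k principal directions
    return Xc @ W                            # scores in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```

The first column of `Z` carries at least as much variance as the second, matching the ordering property of principal components described above.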
That is fine, but show me the MATH!
 Online tutorial
(http://www.cs.otago.ac.nz/cosc453/student_tutori
als/principal_components.pdf)
PCA and BIG DATA
 BIG DATA containing thousands of attributes will require a lot of
computation time on an average computer.
 PCA becomes an important tool while drawing
inference from such large data sets.
What is Distance Correlation?
 Distance correlation is a measure of statistical
dependence between two random variables or two
random vectors of arbitrary, not necessarily equal
dimension. An important property is that this measure of
dependence is zero if and only if the random variables
are statistically independent. This measure is derived
from a number of other quantities that are used in its
specification, specifically: distance variance, distance
standard deviation and distance covariance. These
take the same roles as the ordinary moments with
corresponding names in the specification of the Pearson
product-moment correlation coefficient.
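The quantities above can be sketched for one-dimensional samples. The `distance_correlation` helper below is illustrative and follows the standard definition with Euclidean distances |ai − aj| (the worked example later in these slides uses a different distance formula):

```python
import numpy as np

def _centered(x):
    # pairwise Euclidean distances, double-centered
    d = np.abs(x[:, None] - x[None, :])
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation of two 1-D samples."""
    A = _centered(np.asarray(x, float))
    B = _centered(np.asarray(y, float))
    dcov2 = (A * B).mean()                 # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    if dvar_x * dvar_y == 0:
        return 0.0
    return np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar_x * dvar_y))

x = np.linspace(-1, 1, 50)
# Pearson correlation of x and x^2 is 0 here, but distance
# correlation is not: the quadratic dependence is detected.
print(distance_correlation(x, x ** 2) > 0)  # -> True
```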
Distance Covariance Solved Example
Sample Data

       Column 1   Column 2
       1          1
       2          0
      -1          2
       0          3
Mean   0.5        1.5
Distances
 For Column 1, distances are computed as aij = √|ai² − aj²|
(the variant used in these slides; the standard Euclidean distance
between scalars would be |ai − aj|):

0     1.73   0     1
1.73  0      1.73  2
0     1.73   0     1
1     2      1     0

Row/column means: 0.68, 1.37, 0.68, 1
Grand mean: 0.933
Similarly
 Distances for Column 2 (bij = √|bi² − bj²|):

0     1     1.73  2.83
1     0     2     3
1.73  2     0     2.24
2.83  3     2.24  0

Row/column means: 1.39, 1.5, 1.49, 2.02
Grand mean: 1.60
Centering both the columns
 Aij = aij − āi − āj + ā
 where
 āi = mean of row i
 āj = mean of column j
 ā = grand mean of the matrix
Aij
-0.433  0.616  -0.433  0.250
 0.616 -1.799   0.616  0.567
-0.433  0.616  -0.433  0.250
 0.250  0.567   0.250 -1.067
(each row and column of the double-centered matrix sums to zero)
Similarly
 We can calculate Bij
 Squared distance covariance: dCov² = (1/n²) Σi,j Aij · Bij
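The whole worked example can be reproduced in a few lines of NumPy, using the slides' √|ai² − aj²| distance (a sketch; the `centered` helper is an illustrative name):

```python
import numpy as np

def centered(col):
    # the slides' distance: sqrt(|ai^2 - aj^2|)
    d = np.sqrt(np.abs(col[:, None] ** 2 - col[None, :] ** 2))
    # double-centering: subtract row and column means, add grand mean
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

a = np.array([1.0, 2.0, -1.0, 0.0])   # Column 1
b = np.array([1.0, 0.0, 2.0, 3.0])    # Column 2
A, B = centered(a), centered(b)
n = len(a)
dcov2 = (A * B).sum() / n ** 2        # (1/n^2) * sum of Aij * Bij
print(round(dcov2, 3))  # -> 0.508
```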
Distance Covariance Principal
Component Analysis
 After we have obtained the distance covariance matrix, we
find the eigenvectors with the largest eigenvalues and use
those eigenvectors to extract new features.
 Multiplying the real dataset by these eigenvectors
generates the reduced dataset.
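A sketch of the full D-PCA pipeline as described above, assuming standard Euclidean distances per feature; `dcov2_matrix` and `dpca` are illustrative names, not an established API:

```python
import numpy as np

def dcov2_matrix(X):
    """Pairwise squared distance covariances between features;
    plays the role the covariance matrix plays in ordinary PCA."""
    n, p = X.shape
    C = []
    for j in range(p):
        d = np.abs(X[:, j][:, None] - X[:, j][None, :])
        C.append(d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean())
    S = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            S[i, j] = (C[i] * C[j]).sum() / n ** 2
    return S

def dpca(X, k):
    """Project X onto the k eigenvectors of the distance-covariance
    matrix with the largest eigenvalues."""
    vals, vecs = np.linalg.eigh(dcov2_matrix(X))
    W = vecs[:, np.argsort(vals)[::-1][:k]]
    return X @ W

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
print(dpca(X, 2).shape)  # -> (50, 2)
```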
PCA vs D-PCA
 The classical measure of dependence, the Pearson
correlation coefficient, is mainly sensitive to a linear
relationship between two variables. Distance correlation
was introduced in 2005 by Gábor J. Székely in several
lectures to address this deficiency of Pearson’s
correlation, namely that it can easily be zero for
dependent variables. Correlation = 0
(uncorrelatedness) does not imply independence while
distance correlation = 0 does imply independence. The
first results on distance correlation were published in
2007 and 2009.
Confusion Matrix
Modifications of D-PCA
 1. √|ai² − aj²| / (ai + aj)
 2. √|ai² − aj²| / ai
 These modifications can be used to scale the data,
which then eliminates the normalization step.
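A small sketch of the two scaled-distance variants, reading modification 1 as dividing by (ai + aj); it assumes strictly positive data so the denominators are nonzero (`scaled_dist` is an illustrative helper):

```python
import numpy as np

def scaled_dist(col, variant=1):
    """The two scaled distances from the slide; assumes strictly
    positive data so the denominators are never zero."""
    ai, aj = col[:, None], col[None, :]
    d = np.sqrt(np.abs(ai ** 2 - aj ** 2))
    return d / (ai + aj) if variant == 1 else d / ai

a = np.array([1.0, 2.0, 4.0])
# variant 1 reduces to sqrt(|ai - aj| / (ai + aj)), already in [0, 1)
print(np.round(scaled_dist(a, 1), 3))
```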
Results
Drawbacks
 Cannot handle time series data
 Cannot handle noisy data
 Assumes data distribution to be normal
 Sensitive to scaling of the data
Future work
 Rank correlation
 Distance based source separation