A Brief Presentation on Data Mining
Jason Rodrigues
Data Preprocessing
Agenda
• Introduction
• Why data preprocessing?
• Data Cleaning
• Data Integration and Transformation
• Data Reduction
• Discretization and Concept Hierarchy Generation
• Takeaways
Why Data Preprocessing?
Data in the real world is dirty

incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data

noisy: containing errors or outliers

inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!

Quality decisions must be based on quality data

Data warehouse needs consistent integration of quality data
A multi-dimensional measure of data quality

A well-accepted multi-dimensional view:

accuracy, completeness, consistency, timeliness, believability, value
added, interpretability, accessibility
Broad categories

intrinsic, contextual, representational, and accessibility
Data Preprocessing
Major Tasks of Data Preprocessing
Data cleaning

Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration

Integration of multiple databases, data cubes, files, or notes
Data transformation

Normalization (scaling to a specific range)

Aggregation
Data reduction

Obtains reduced representation in volume but produces the same or
similar analytical results

Data discretization: a form of data reduction of particular importance for numerical data

Data aggregation, dimensionality reduction, data compression,
generalization
Data Preprocessing
Major Tasks of Data Preprocessing
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Cleaning
Tasks of Data Cleaning

Fill in missing values

Identify outliers and smooth noisy data

Correct inconsistent data
Data Cleaning
Manage Missing Data

Ignore the tuple: usually done when class label is missing (assuming the
task is classification—not effective in certain cases)

Fill in the missing value manually: tedious + infeasible?

Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?!

Use the attribute mean to fill in the missing value

Use the attribute mean for all samples of the same class to fill in
the missing value: smarter

Use the most probable value to fill in the missing value: inference-based
methods such as regression, Bayesian formula, decision tree
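The two mean-based strategies above can be sketched in plain Python. This is a minimal illustration, not the presenter's implementation: the records, the `income` attribute, and the `label` field are hypothetical, and missing values are assumed to be encoded as `None`.

```python
from collections import defaultdict
from statistics import mean


def fill_with_attribute_mean(rows, attr):
    """Replace None in `attr` with the mean of all observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [dict(r, **{attr: fill if r[attr] is None else r[attr]}) for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace None in `attr` with the mean over rows of the same class.

    Assumes every class has at least one observed value for `attr`.
    """
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    class_mean = {c: mean(v) for c, v in by_class.items()}
    return [
        dict(r, **{attr: class_mean[r[label]] if r[attr] is None else r[attr]})
        for r in rows
    ]


# Hypothetical records; None marks a missing value.
rows = [
    {"income": 30, "label": "low"},
    {"income": 50, "label": "low"},
    {"income": None, "label": "low"},
    {"income": 100, "label": "high"},
    {"income": None, "label": "high"},
]
global_filled = fill_with_attribute_mean(rows, "income")
class_filled = fill_with_class_mean(rows, "income", "label")
```

With these records the global mean fills both gaps with the same value, while the class mean gives the "low" and "high" records different fills, which is why the slide calls the second option smarter.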
Data Cleaning
Manage Noisy Data
Binning Method:

first sort data and partition into (equi-depth) bins

then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc
Clustering:

detect and remove outliers
Semi Automated

Computer and Manual Intervention
Regression

Use regression functions
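The binning method above can be sketched as follows. This is a rough illustration under stated assumptions: the price list is made-up example data, the function names are hypothetical, and the number of bins is assumed not to exceed the number of values.

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Illustrative data (an assumption, not taken from the slides).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Smoothing by means pulls every value toward the local average, while smoothing by boundaries keeps the extremes of each bin and snaps interior values to the nearer one.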
Data Cleaning
Cluster Analysis
Data Cleaning
Regression Analysis
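[Figure: data points with the fitted regression line y = x + 1; the observed value Y1 at X1 is smoothed to the fitted value Y1′]
• Linear regression (best line to fit two variables)
• Multiple linear regression (more than two variables, fit to a multidimensional surface)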
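Smoothing by linear regression can be sketched with an ordinary least-squares fit of one variable against another. The sample points below are hypothetical values scattered around the line y = x + 1, and `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for two variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy observations near y = x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.2, 1.9, 3.1, 3.8, 5.0]
slope, intercept = fit_line(xs, ys)

# Smoothing: replace each observed y by the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
```

Each noisy observation is replaced by its fitted counterpart, which is the Y1 to Y1′ step shown in the figure.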
Data Cleaning
Inconsistent Data

Manual correction using external
references

Semi-automatic using various tools
− To detect violation of known functional
dependencies and data constraints
− To correct redundant data
Data integration and transformation
Tasks of Data Integration and transformation

Data integration:
− combines data from multiple sources into a coherent
store

Schema integration
− integrate metadata from different sources
− Entity identification problem: identify real world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#

Detecting and resolving data value conflicts
− for the same real world entity, attribute values from
different sources are different
− possible reasons: different representations, different
scales, e.g., metric vs. British units, different currency
Manage Data Integration
Data integration and transformation

Redundant data occur often when integrating multiple DBs
− The same attribute may have different names in different databases
− One attribute may be a “derived” attribute in another table, e.g., annual
revenue

Redundant data can often be detected by correlation analysis:
$$ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B} $$
• Careful integration can help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
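As a rough sketch, the correlation coefficient r_{A,B} can be computed directly from its definition. The attribute values below are hypothetical; the "annual" attribute is derived from the "monthly" one on purpose, so the coefficient comes out at 1 and flags the redundancy. Sample standard deviations are used to match the (n − 1) factor.

```python
from math import sqrt


def correlation(a, b):
    """Correlation coefficient r_{A,B} using sample standard deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * sd_a * sd_b)


# Hypothetical attributes from two sources; annual is derived from monthly.
monthly_revenue = [10, 12, 9, 15]
annual_revenue = [12 * m for m in monthly_revenue]
r = correlation(monthly_revenue, annual_revenue)
```

A value of r close to +1 or −1 suggests one attribute can be dropped or derived from the other; values near 0 give no such hint.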
Manage Data Transformation
Data integration and transformation
• Smoothing: remove noise from data (binning, clustering, regression)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
− min-max normalization
− z-score normalization
− normalization by decimal scaling
• Attribute/feature construction
− New attributes constructed from the given ones
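The three normalization methods listed above can be sketched as follows. This is a minimal illustration, assuming a non-constant numeric attribute; the sample values are hypothetical, and the z-score uses the population standard deviation (`pstdev`), which is one possible choice.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] of the data onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Hypothetical attribute values.
values = [200, 300, 400, 600, 1000]
scaled_01 = min_max(values)
standardized = z_score(values)
decimal = decimal_scaling([-986, 917])
```

Min-max keeps the original shape of the distribution within a fixed range, z-score is less sensitive to the exact minimum and maximum, and decimal scaling only moves the decimal point.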
Manage Data Reduction
Data reduction
Data reduction: reduced representation, while still retaining critical
information

Data cube aggregation

Dimensionality reduction

Data compression

Numerosity reduction

Discretization and concept hierarchy generation
Data Cube Aggregation
Data reduction

Multiple levels of aggregation in data cubes
− Further reduce the size of data to deal with

Reference appropriate levels: use the smallest representation capable
of solving the task
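Rolling a data cube up to a coarser level can be sketched as a group-by that sums a measure over the dimensions being dropped. The sales records, dimension names, and the `roll_up` helper below are all hypothetical.

```python
from collections import defaultdict


def roll_up(records, keep, measure="amount"):
    """Sum `measure` over every dimension that is not listed in `keep`."""
    totals = defaultdict(float)
    for rec in records:
        key = tuple(rec[dim] for dim in keep)
        totals[key] += rec[measure]
    return dict(totals)


# Hypothetical base cuboid: sales per year, quarter, and branch.
sales = [
    {"year": 2001, "quarter": "Q1", "branch": "A", "amount": 100},
    {"year": 2001, "quarter": "Q2", "branch": "A", "amount": 150},
    {"year": 2001, "quarter": "Q1", "branch": "B", "amount": 80},
    {"year": 2002, "quarter": "Q1", "branch": "A", "amount": 120},
    {"year": 2002, "quarter": "Q2", "branch": "B", "amount": 90},
]

by_year_branch = roll_up(sales, ["year", "branch"])  # drops the quarter level
by_year = roll_up(sales, ["year"])                   # smallest view for yearly totals
```

If the task only needs yearly totals, `by_year` has two entries instead of five records, which is the reduction the slide refers to.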
Data Compression
Data reduction

String compression
− There are extensive theories and well-tuned algorithms
− Typically lossless
− But only limited manipulation is possible without expansion

Audio/video, image compression
− Typically lossy compression, with progressive refinement
− Sometimes small fragments of signal can be reconstructed
without reconstructing the whole

Time sequence is not audio
− Typically short and varies slowly with time
Decision Tree
Data reduction
Similarities and Dissimilarities
Proximity

Proximity refers to either similarity or dissimilarity.
The proximity between two objects is a function
of the proximity between their
corresponding attributes.

Similarity: Numeric measure of the degree
to which the two objects are alike.

Dissimilarity: Numeric measure of the
degree to which the two objects are
different.
Dissimilarities between Data Objects?

Similarity
− Numerical measure of how alike two data
objects are.
− Is higher when objects are more alike.
− Often falls in the range [0,1]

Dissimilarity
− Numerical measure of how different two data
objects are
− Lower when objects are more alike
− Minimum dissimilarity is often 0
− Upper limit varies

Proximity refers to a similarity or dissimilarity
Euclidean Distance

$$ dist = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} $$
where n is the number of dimensions (attributes)
and p_k and q_k are, respectively, the kth attributes
(components) of data objects p and q.

Standardization is necessary if scales differ.
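The following sketch computes the Euclidean distance and illustrates why standardization matters when attribute scales differ. The (age, income) objects are hypothetical, and z-score standardization with the population standard deviation is one possible choice among several.

```python
from math import sqrt
from statistics import mean, pstdev


def euclidean(p, q):
    """Square root of the summed squared differences over all attributes."""
    return sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


def standardize(points):
    """Rescale every attribute (column) to zero mean and unit standard deviation."""
    columns = list(zip(*points))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]  # assumes no constant column
    return [tuple((v - m) / s for v, m, s in zip(p, mus, sigmas)) for p in points]


# Hypothetical objects: (age in years, income in dollars).
people = [(25, 30000), (65, 30500), (26, 30800)]

# Distances from the first object to the other two, before and after scaling.
raw = [euclidean(people[0], p) for p in people[1:]]
scaled_people = standardize(people)
scaled = [euclidean(scaled_people[0], p) for p in scaled_people[1:]]
```

On the raw values the income attribute dominates, so the 65-year-old appears closest to the first object. After standardization the age difference counts as well and the 26-year-old becomes the nearer neighbour.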
Euclidean Distance
[Figure: plot of example data points; x-axis from 1 to 8, y-axis from 0 to 6]
Minkowski Distance
Minkowski distance generalizes Euclidean distance with a parameter r:
$$ dist = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r} $$
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
− A common example of this is the Hamming distance, which is just
the number of bits that are different between two binary vectors
• r = 2. Euclidean distance
• r → ∞. "supremum" (Lmax norm, L∞ norm) distance.
− This is the maximum difference between any component of the
vectors
− Example: L∞ distance between (1, 0, 2) and (6, 0, 3) = ??
• Do not confuse r with n, i.e., all these distances are
defined for all numbers of dimensions.
Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1      p1      p2      p3      p4
p1      0       4       4       6
p2      4       0       2       4
p3      4       2       0       2
p4      6       4       2       0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1      p2      p3      p4
p1      0       2       3       5
p2      2       0       1       3
p3      3       1       0       2
p4      5       3       2       0
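The three distance matrices above can be reproduced with a short sketch. The `minkowski` and `distance_matrix` helpers are illustrative names; the points are the four from the table, and r = ∞ is handled as the maximum component difference.

```python
def minkowski(p, q, r):
    """Minkowski distance with parameter r; r = float('inf') gives the supremum distance."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


def distance_matrix(points, r):
    """Pairwise distances, rounded to three decimals as in the tables."""
    names = list(points)
    return [
        [round(minkowski(points[a], points[b], r), 3) for b in names]
        for a in names
    ]


points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

l1 = distance_matrix(points, 1)
l2 = distance_matrix(points, 2)
l_inf = distance_matrix(points, float("inf"))

# The exercise from the previous slide: largest component difference.
exercise = minkowski((1, 0, 2), (6, 0, 3), float("inf"))
```

The same function covers all three cases because only r changes; the number of dimensions n is simply the length of the input vectors.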
Euclidean Distance Properties
• Distances, such as the Euclidean distance,
have some well known properties.
1. d(x, y) ≥ 0 for all x and y and d(x, y) = 0 only if
x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
(Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between
points (data objects), x and y.
• A distance that satisfies these properties is a
metric, and a space equipped with such a distance is called a metric space
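The three properties can be checked numerically on a finite sample of points. A check like this can refute a candidate distance but cannot prove it is a metric in general. The helper names are hypothetical, and squared Euclidean distance is included here only as a contrasting example that fails the triangle inequality.

```python
from itertools import product
from math import isclose, sqrt


def euclidean(p, q):
    return sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


def squared_euclidean(p, q):
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q))


def satisfies_metric_axioms(points, d, tol=1e-9):
    """Check positive definiteness, symmetry, and the triangle inequality on a sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y):
            return False  # positive definiteness violated
        if not isclose(d(x, y), d(y, x), abs_tol=tol):
            return False  # symmetry violated
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False  # triangle inequality violated
    return True


points = [(0, 2), (2, 0), (3, 1), (5, 1)]
euclidean_ok = satisfies_metric_axioms(points, euclidean)
squared_ok = satisfies_metric_axioms(points, squared_euclidean)
```

On these points the squared distance from (2, 0) to (5, 1) is 10, which exceeds the sum 2 + 4 of the squared distances via (3, 1), so the triangle inequality fails.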
Non Metric Dissimilarities – Set Differences

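non-metric measures are often robust
(resistant to outliers, errors in objects, etc.)
− symmetry and, above all, the triangle inequality
are often violated

cannot be used directly
with metric access methods (MAMs)
[Figure: a triangle with sides a, b, c where a > b + c (triangle inequality violated), and a pair of objects whose two directed distances differ, a ≠ b (symmetry violated)]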
Non Metric Dissimilarities – Time

various k-median distances
− measure distance between the two (k-th) most
similar portions in objects

COSIMIR
− back-propagation network with single output
neuron serving as a distance, allows training

Dynamic Time Warping distance
− sequence alignment technique
− minimizes the sum of distances between
sequence elements

fractional Lp distances
− generalization of Minkowski distances (p<1)
− more robust to extreme differences in coordinates
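Dynamic Time Warping can be sketched with the classic dynamic-programming recurrence. This is a minimal version under stated assumptions: sequences are lists of numbers, the local cost is the absolute difference, and no warping-window constraint is applied. The example sequences are hypothetical and chosen to show a triangle inequality violation.

```python
def dtw_distance(s, t):
    """Minimum summed |a - b| cost over all monotone alignments of s and t."""
    inf = float("inf")
    n, m = len(s), len(t)
    # cost[i][j] = best alignment cost of the first i elements of s
    # with the first j elements of t
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(s[i - 1] - t[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # s advances, t element repeated
                cost[i][j - 1],      # t advances, s element repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# A stretched copy aligns at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])

# Hypothetical sequences for which the triangle inequality fails.
x, y, z = [0], [1, 2], [2, 2, 2]
d_xz = dtw_distance(x, z)
d_xy = dtw_distance(x, y)
d_yz = dtw_distance(y, z)
```

Here d_xz is larger than d_xy + d_yz, which is why DTW is listed among the non-metric dissimilarities.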
Jaccard Coefficient

Recall: Jaccard coefficient is a commonly used
measure of overlap of two sets A and B
jaccard(A,B) = |A ∩ B| / |A ∪ B|
jaccard(A,A) = 1
jaccard(A,B) = 0 if A ∩ B = ∅

A and B don’t have to be the same size.

JC always assigns a number between 0 and 1.
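The Jaccard coefficient translates directly into set operations. In this sketch the return value of 1.0 for two empty sets is an assumed convention, since the ratio itself is undefined in that case.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


identical = jaccard({1, 2, 3}, {1, 2, 3})
disjoint = jaccard({1, 2}, {3})
partial = jaccard({1, 2, 3}, {2, 3, 4, 5})  # sets of different sizes
```

The third call shows that the sets need not have the same size: two shared elements out of five distinct elements overall.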
Takeaways
Why Data Preprocessing?
Data Cleaning
Data Integration and Transformation
Data Reduction
Discretization and Concept Hierarchy Generation
