Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization
• Summary

Why Data Preprocessing?
• Data in the real world is dirty
  ◦ incomplete: missing attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., occupation=""
  ◦ noisy: containing errors or outliers
    e.g., Salary="-10"
  ◦ inconsistent: containing discrepancies in codes or names
    e.g., Age="42" Birthday="03/07/1997"
    e.g., Was rating "1,2,3", now rating "A, B, C"
    e.g., discrepancy between duplicate records

Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
  ◦ Quality decisions must be based on quality data
    e.g., duplicate or missing data may cause incorrect or even misleading statistics.
• Data preparation, cleaning, and transformation make up the majority of the work in a data mining application (an estimated 90%).

Forms of Data Preprocessing
[Figure: forms of data preprocessing]

Major Tasks in Data Preprocessing
• Data cleaning
  ◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration
  ◦ Integrate multiple databases or files.
• Data transformation
  ◦ Normalization and aggregation.
• Data reduction
  ◦ Obtain a representation that is reduced in volume but produces the same or similar analytical results.
• Data discretization (for numerical data)
  ◦ Part of data reduction, but of particular importance for numerical data.

Data Cleaning
• Importance
  ◦ "Data cleaning is the number one problem in data warehousing"
• Data cleaning tasks
  ◦ Fill in missing values
  ◦ Identify outliers and smooth out noisy data
  ◦ Correct inconsistent data
  ◦ Resolve redundancy caused by data integration


• Data is not always available
  ◦ E.g., many tuples have no recorded values for several attributes, such as customer income in sales data
• Missing data may be due to
  ◦ equipment malfunction
  ◦ data that was inconsistent with other recorded data and was therefore deleted
  ◦ data not entered due to misunderstanding
  ◦ data not considered important at the time of entry
  ◦ history or changes of the data not being recorded

How to Handle Missing Data?
• Ignore the tuple
• Fill in missing values manually: tedious and often infeasible
• Fill them in automatically with
  ◦ a global constant, e.g., "unknown" (but does this create a new class?)
  ◦ the attribute mean
  ◦ the most probable value: inference-based methods such as a Bayesian formula, a decision tree, or the EM algorithm
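
The simpler strategies above (ignoring the tuple, a global constant, the attribute mean) can be sketched in a few lines of Python. This is only an illustration: the function names, the use of None to mark a missing value, and the sample incomes are assumptions, not part of the slides. The inference-based option is not shown.

```python
from statistics import mean

def drop_incomplete(rows):
    """'Ignore the tuple': keep only rows that have no missing (None) field."""
    return [row for row in rows if all(v is not None for v in row)]

def fill_missing(values, strategy="mean", constant="unknown"):
    """Return a copy of values with None replaced by a constant or the mean."""
    if strategy == "constant":
        return [constant if v is None else v for v in values]
    if strategy == "mean":
        observed = [v for v in values if v is not None]
        if not observed:
            raise ValueError("no observed values to average")
        fill = mean(observed)  # mean of the values that are present
        return [fill if v is None else v for v in values]
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical customer incomes with two missing entries.
incomes = [30000, None, 50000, 40000, None]
filled_mean = fill_missing(incomes)
filled_const = fill_missing(incomes, strategy="constant")
complete_rows = drop_incomplete([(1, "clerk"), (2, None)])
```

Filling with the mean keeps every tuple but shrinks the variance of the attribute, which is one reason the inference-based methods may be preferred when the missing share is large.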

Noisy Data
• Noise: random error or variance in a measured variable.
• Incorrect attribute values may be due to
  ◦ faulty data collection instruments
  ◦ data entry problems
  ◦ data transmission problems
  ◦ etc.
• Other data problems that require data cleaning
  ◦ duplicate records, incomplete data, inconsistent data

How to Handle Noisy Data?
• Binning method
  ◦ first sort the data and partition it into (equi-depth) bins
  ◦ then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
  ◦ detect and remove outliers.
• Combined computer and human inspection
  ◦ detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers).

Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
  ◦ Bin 1: 4, 8, 9, 15
  ◦ Bin 2: 21, 21, 24, 25
  ◦ Bin 3: 26, 28, 29, 34
• Smoothing by bin means (rounded to the nearest dollar):
  ◦ Bin 1: 9, 9, 9, 9
  ◦ Bin 2: 23, 23, 23, 23
  ◦ Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
  ◦ Bin 1: 4, 4, 4, 15
  ◦ Bin 2: 21, 21, 25, 25
  ◦ Bin 3: 26, 26, 26, 34
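
The price example can be reproduced with a short Python sketch. The function names are illustrative, the bin count is assumed to divide the data evenly, and ties in boundary smoothing are sent to the lower boundary, which is a choice the slides do not specify.

```python
def equi_depth_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    if len(sorted_values) % n_bins != 0:
        raise ValueError("this sketch assumes the bins divide the data evenly")
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

The unrounded means of bins 2 and 3 are 22.75 and 29.25, which is why the smoothed values appear as 23 and 29.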

Outlier Removal
• Outliers: data points inconsistent with the majority of the data
• Different kinds of outliers
  ◦ Valid: a CEO's salary
  ◦ Noisy: a person's age = 200, widely deviating points
• Removal methods
  ◦ Clustering
  ◦ Curve-fitting
  ◦ Hypothesis-testing with a given model
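
As one simple reading of testing against a given model, the sketch below assumes the values are roughly normally distributed and flags any value that lies more than a chosen number of sample standard deviations from the mean. The function name, the 2.5 threshold, and the sample ages are assumptions made for illustration only.

```python
from statistics import mean, stdev

def split_outliers(values, threshold=2.5):
    """Separate values into (kept, flagged) using a z-score cutoff."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation
    if spread == 0:
        return list(values), []  # all values identical: nothing to flag
    kept, flagged = [], []
    for v in values:
        z = abs(v - center) / spread
        (flagged if z > threshold else kept).append(v)
    return kept, flagged

# Hypothetical ages containing the noisy value from the slide.
ages = [23, 25, 31, 35, 38, 41, 44, 52, 60, 200]
kept, flagged = split_outliers(ages)
```

A test like this cannot tell a noisy value from a valid one such as a CEO's salary, so flagged values are better treated as candidates for human inspection than deleted automatically. In small samples a single extreme value also inflates the standard deviation, which is why the threshold here is lower than the commonly quoted 3.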

Data Integration
• Data integration
  ◦ combines data from multiple sources.
• Schema integration
  ◦ integrate metadata from different sources
  ◦ Entity identification problem: identify real-world entities across multiple data sources, e.g., matching A.cust-id with B.cust-#
• Detecting and resolving data value conflicts
  ◦ for the same real-world entity, attribute values from different sources differ, e.g., because of different scales (metric vs. British units).
• Removing duplicates and redundant data
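
A toy sketch of these steps is shown below. Everything in it is hypothetical: two customer tables stored as lists of dicts, source A keyed by cust-id with weights in kilograms, source B keyed by cust-# with weights in pounds. It maps both keys to one name, converts units, and keeps only one record per customer.

```python
LB_TO_KG = 0.45359237  # kilograms per international pound

def integrate(source_a, source_b):
    """Merge two hypothetical customer tables into one list keyed by customer."""
    merged = {}
    for row in source_a:
        cust_id = row["cust-id"]
        merged[cust_id] = {"cust_id": cust_id, "weight_kg": row["weight_kg"]}
    for row in source_b:
        cust_id = row["cust-#"]  # schema integration: same entity, different name
        weight_kg = round(row["weight_lb"] * LB_TO_KG, 1)  # resolve the unit conflict
        # Duplicate removal: keep source A's record if the customer is already known.
        merged.setdefault(cust_id, {"cust_id": cust_id, "weight_kg": weight_kg})
    return sorted(merged.values(), key=lambda r: r["cust_id"])

table_a = [{"cust-id": 1, "weight_kg": 70.0}]
table_b = [{"cust-#": 1, "weight_lb": 154.3}, {"cust-#": 2, "weight_lb": 200.0}]
customers = integrate(table_a, table_b)
```

Preferring source A on a duplicate is an arbitrary policy chosen for brevity; a real integration step would usually compare the conflicting values and log or reconcile the difference.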

Data Transformation
• Smoothing: remove noise from the data.
• Normalization: scale values to fall within a small, specified range.
• Attribute/feature construction: new attributes are constructed from the given ones.
• Aggregation: summarization.
• Generalization: concept hierarchy climbing.
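
Normalization into a specified range is commonly done with min-max scaling. The slides do not name a particular method, so the sketch below is one reasonable choice, and the function name and sample values are illustrative.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant attribute")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

scaled = min_max_normalize([10, 20, 30])
scaled_wide = min_max_normalize([10, 20, 30], new_min=-1.0, new_max=1.0)
```

Min-max scaling is sensitive to outliers, because a single extreme value compresses all the others into a narrow part of the range.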

Data Reduction Strategies
• The data may be too big to work with
• Data reduction
  ◦ Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
  ◦ Dimensionality reduction: remove unimportant attributes
  ◦ Aggregation and clustering
  ◦ Sampling

Dimensionality Reduction
• Feature selection (i.e., attribute subset selection):
  ◦ Select a minimum set of attributes (features) that is sufficient for the data mining task.
• Heuristic methods (needed because the number of possible subsets is exponential):
  ◦ step-wise forward selection
  ◦ step-wise backward elimination
  ◦ combining forward selection and backward elimination
  ◦ etc.
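
Step-wise forward selection can be sketched as a greedy loop: start with no attributes and repeatedly add the one that improves a score the most, stopping when nothing helps. The scoring function is the caller's responsibility; the toy score below, with its attribute names and weights, is entirely made up to keep the example self-contained.

```python
def forward_selection(features, score):
    """Greedily add the feature that most improves score(subset) until none helps."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Score each candidate subset formed by adding one remaining feature.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected

# Hypothetical usefulness of each attribute; age and birth_year are redundant.
USEFULNESS = {"age": 5, "income": 3, "zip": 0, "birth_year": 5}

def toy_score(subset):
    total = sum(USEFULNESS[f] for f in subset)
    if "age" in subset and "birth_year" in subset:
        total -= 5  # the second of a redundant pair adds nothing
    return total

chosen = forward_selection(["age", "income", "zip", "birth_year"], toy_score)
```

In practice the score would be something like cross-validated accuracy of a model trained on the subset. The greedy loop evaluates only a quadratic number of subsets, at the cost of possibly missing the best combination.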

Clustering
• Partition the data set into clusters and store only the cluster representation
• Can be very effective if the data is clustered, but not if the data is "smeared"
• There are many choices of clustering definitions and clustering algorithms. We will discuss them later.

Sampling
• Choose a representative subset of the data
  ◦ Simple random sampling may perform poorly in the presence of skew.
• Develop adaptive sampling methods
  ◦ Stratified sampling:
    • approximate the percentage of each class (or subpopulation of interest) in the overall database
    • used in conjunction with skewed data
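
Stratified sampling can be sketched as follows: group the records by class, then draw the same fraction from every group so that rare classes stay represented. The function name, the fixed seed, and the 90/10 toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from each stratum defined by key(record)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        # Take at least one record so that no class disappears from the sample.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical skewed data set: 90 records of class "no", 10 of class "yes".
records = [(i, "no") for i in range(90)] + [(i, "yes") for i in range(90, 100)]
sample = stratified_sample(records, key=lambda r: r[1], fraction=0.1)
```

With simple random sampling at the same rate, a 10-record sample from this data would contain no "yes" record more than a third of the time; the stratified draw always contains exactly one.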

Discretization
• Three types of attributes:
  ◦ Nominal: values from an unordered set
  ◦ Ordinal: values from an ordered set
  ◦ Continuous: real numbers
• Discretization:
  ◦ divide the range of a continuous attribute into intervals, because some data mining algorithms only accept categorical attributes.
• Some techniques:
  ◦ Binning methods: equal-width, equal-frequency
  ◦ Entropy-based methods
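
The two binning variants differ only in how the cut points are chosen. The sketch below reuses the price data from the binning slide; the function names and the convention that a value equal to a cut point falls into the upper interval are assumptions of this sketch. Entropy-based methods are not shown.

```python
import bisect

def equal_width_cuts(values, n_bins):
    """Interior cut points that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def equal_frequency_cuts(values, n_bins):
    """Interior cut points chosen so each interval holds about the same number of values."""
    ordered = sorted(values)
    size = len(ordered) / n_bins
    return [ordered[int(i * size)] for i in range(1, n_bins)]

def discretize(values, cuts):
    """Map each value to the index of its interval (0 .. len(cuts))."""
    return [bisect.bisect_right(cuts, v) for v in values]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_cuts = equal_width_cuts(prices, 3)
freq_cuts = equal_frequency_cuts(prices, 3)
width_labels = discretize(prices, width_cuts)
freq_labels = discretize(prices, freq_cuts)
```

On this data the equal-width intervals hold 3, 3, and 6 values, while the equal-frequency intervals hold 4 each. Equal-frequency counts can still come out uneven when many values tie at a cut point.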

Summary
• Data preparation is a big issue for data mining.
• Data preparation includes
  ◦ Data cleaning and data integration
  ◦ Data reduction and feature selection
  ◦ Discretization
• Many methods have been proposed, but data preparation is still an active area of research.
Assignmentdatamining
