Data Preprocessing
Jun Du
The University of Western Ontario
jdu43@uwo.ca
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
1
What is Data?
• Collection of data objects
and their attributes
• Data objects → rows
• Attributes → columns
2
Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single          70K             No
 4   Yes     Married         120K            No
 5   No      Divorced        95K             Yes
 6   No      Married         60K             No
 7   Yes     Divorced        220K            No
 8   No      Single          85K             Yes
 9   No      Married         75K             No
 10  No      Single          90K             Yes
(rows = data objects; columns = attributes)
Data Objects
• A data object represents an entity.
• Examples:
– Sales database: customers, store items, sales
– Medical database: patients, treatments
– University database: students, professors, courses
• Also called examples, instances, records, cases,
samples, data points, objects, etc.
• Data objects are described by attributes.
3
Attributes
• An attribute is a data field, representing a
characteristic or feature of a data object.
• Example:
– Customer Data: customer_ID, name, gender, age, address,
phone number, etc.
– Product data: product_ID, price, quantity, manufacturer,
etc.
• Also called features, variables, fields, dimensions, etc.
4
Attribute Types (1)
• Nominal (Discrete) Attribute
– Has only a finite set of values (e.g., categories, states, etc.)
– E.g., Hair_color = {black, blond, brown, grey, red, white, …}
– E.g., marital status, zip codes
• Numeric (Continuous) Attribute
– Has real numbers as attribute values
– E.g., temperature, height, or weight.
• Question: what about student ID, SIN, year of birth?
5
Attribute Types (2)
• Binary
– A special case of nominal attribute: with only 2 states (0
and 1)
– Gender = {male, female};
– Medical test = {positive, negative}
• Ordinal
– Usually a special case of nominal attribute: values have a
meaningful order (ranking)
– Size = {small, medium, large}
– Army rankings
6
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
7
Data Preprocessing
• Why preprocess the data?
– Data quality is often poor in the real world.
– No quality data, no quality mining results!
• Measures for data quality
– Accuracy: noise, outliers, …
– Completeness: missing values, …
– Redundancy: duplicated data, irrelevant data, …
– Consistency: some data modified but not others, …
– ……
8
Typical Tasks in Data Preprocessing
• Data Cleaning
– Handle missing values, noisy / outlier data, resolve
inconsistencies, …
• Data Transformation
– Aggregation
– Type Conversion
– Normalization
• Data Reduction
– Data Sampling
– Dimensionality Reduction
• ……
9
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
10
Data Cleaning
• Missing value: lacking attribute values
– E.g., Occupation = “ ”
• Noise (Error): modification of original values
– E.g., Salary = “−10”
• Outlier: considerably different from most of the
other data (not necessarily error)
– E.g., Salary = “2,100,000”
• Inconsistency: discrepancies in codes or names
– E.g., Age=“42”, Birthday=“03/07/2010”
– Was rating “1, 2, 3”, now rating “A, B, C”
• ……
11
Missing Values
• Reasons for missing values
– Information is not collected
• E.g., people decline to give their age and weight
– Attributes may not be applicable to all cases
• E.g., annual income is not applicable to children
– Human / Hardware / Software problems
• E.g., Birthdate information is accidentally deleted for all
people born in 1988.
– ……
12
How to Handle Missing Value?
• Eliminate → ignore the missing value
• Eliminate → ignore the examples
• Eliminate → ignore the features
• Simple; not applicable when data is scarce
• Estimate missing value
– Global constant: e.g., “unknown”
– Attribute mean (median, mode)
– Predict the value based on features (data imputation)
• Estimate gender based on first name (name gender)
• Estimate age based on first name (name popularity)
• Build a predictive model based on other features
– Missing value estimation depends on the missing reason!
13
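As a concrete illustration of the options above, here is a minimal pandas sketch; the DataFrame and its columns are hypothetical, not from the slides.

```python
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "age":    [23, None, 35, 41, None],
    "gender": ["F", "M", None, "M", "F"],
})

# Eliminate: ignore the examples (rows) or the features (columns)
rows_dropped = df.dropna()        # drop incomplete examples
cols_dropped = df.dropna(axis=1)  # drop incomplete features

# Estimate: global constant, attribute mean, attribute mode
df_const = df.fillna({"gender": "unknown"})               # global constant
df_mean  = df.fillna({"age": df["age"].mean()})           # attribute mean
df_mode  = df.fillna({"gender": df["gender"].mode()[0]})  # attribute mode
```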
Demonstration
• ReplaceMissingValues
– Weka → Vote
– Replacing missing values for nominal and numeric
attributes
• More functions in RapidMiner
14
Noisy (Outlier) Data
• Noise: refers to modification of original values
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
15
How to Handle Noisy (Outlier) Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human
16
Binning
Sort data in ascending order: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into equal-frequency (equal-depth) bins:
– Bin 1: 4, 8, 9, 15
– Bin 2: 21, 21, 24, 25
– Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
– Bin 1: 9, 9, 9, 9
– Bin 2: 23, 23, 23, 23
– Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
– Bin 1: 4, 4, 4, 15
– Bin 2: 21, 21, 25, 25
– Bin 3: 26, 26, 26, 34
17
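The smoothing above can be reproduced in a short NumPy sketch; rounding the bin means to integers matches the slide's numbers.

```python
import numpy as np

data = np.sort([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(data, 3)  # equal-frequency (equal-depth) bins

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [np.full(len(b), round(b.mean())) for b in bins]
# -> [9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]

# Smoothing by bin boundaries: each value snaps to the nearest bin edge
by_bounds = [np.where(b - b.min() < b.max() - b, b.min(), b.max())
             for b in bins]
# -> [4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]
```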
Regression
18
[Figure: scatter plot with fitted regression line y = x + 1; an observed value Y1 at X1 is smoothed to its fitted value Y1′]
Cluster Analysis
19
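The cluster-analysis figure is omitted here; as one way to realize "detect and remove outliers", the sketch below uses DBSCAN from scikit-learn (the data and parameters are made up). Points that fall in no cluster are labeled −1 and treated as outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D data: two dense clusters plus one stray point
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
              [50.0, 50.0]])  # the outlier

labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
outliers = X[labels == -1]  # DBSCAN labels noise points as -1
```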
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
20
Data Transformation
• Aggregation:
– Attribute / example summarization
• Feature type conversion:
– Nominal → Numeric, …
• Normalization:
– Scaled to fall within a small, specified range
• Attribute/feature construction:
– New attributes constructed from the given ones
21
Aggregation
• Combining two or more attributes (examples) into a single
attribute (example)
• Combining two or more attribute values into a single attribute
value
• Purpose
– Change of scale
• Cities aggregated into regions, states, countries, etc.
– More “stable” data
• Aggregated data tends to have less variability
– More “predictive” data
• Aggregated data might have higher predictability
22
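A minimal pandas sketch of the change-of-scale case (the city-level sales table is hypothetical):

```python
import pandas as pd

# Hypothetical city-level sales, aggregated up to regions
sales = pd.DataFrame({
    "region": ["West", "West", "East", "East"],
    "city":   ["Vancouver", "Calgary", "Toronto", "Ottawa"],
    "amount": [120, 95, 210, 80],
})

# Cities -> regions: fewer, more stable aggregated values
by_region = sales.groupby("region", as_index=False)["amount"].sum()
```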
Demonstration
• MergeTwoValues
– Weka → contact-lenses
– Merge class values “soft” and “hard”
• Effective aggregation in real-world application
23
Feature Type Conversion
• Some algorithms can only handle numeric features; some can
only handle nominal features. Only a few can handle both.
• Features have to be converted to satisfy the requirement of
learning algorithms.
– Numeric → Nominal (Discretization)
• E.g., Age Discretization: Young 18-29; Career 30-40; Mid-Life 41-55;
Empty-Nester 56-69; Senior 70+
– Nominal → Numeric
• Introduce multiple numeric features for one nominal feature
• Nominal → Binary (Numeric)
• E.g., size = {L, M, S} → size_L: 0, 1; size_M: 0, 1; size_S: 0, 1
24
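Both conversion directions can be sketched in pandas; the age bands follow the slide, while the data itself is made up.

```python
import pandas as pd

df = pd.DataFrame({"age":  [22, 35, 48, 60, 75],
                   "size": ["L", "M", "S", "M", "L"]})

# Numeric -> Nominal: discretize age into the slide's bands
df["age_group"] = pd.cut(df["age"],
                         bins=[17, 29, 40, 55, 69, 120],
                         labels=["Young", "Career", "Mid-Life",
                                 "Empty-Nester", "Senior"])

# Nominal -> Binary: one 0/1 indicator column per value of "size"
df = pd.get_dummies(df, columns=["size"], prefix="size")
```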
Demonstration
• Discretize
– Weka → diabetes
– Discretize “age” (equal bins vs equal frequency)
• NumericToNominal
– Weka → diabetes
– Discretize “age” (vs “Discretize” method)
• NominalToBinary
– UCI → autos
– Convert “num-of-doors”
– Convert “drive-wheels”
25
Normalization
Scale the attribute values to a small, specified range
• Min-max normalization: to [new_min_A, new_max_A]
v′ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
– E.g., let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):
v′ = (v − μ_A) / σ_A
• ……
26
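Both normalizations in a short NumPy sketch, reusing the slide's income example:

```python
import numpy as np

income = np.array([12000.0, 45000.0, 73600.0, 98000.0])

# Min-max normalization to [0.0, 1.0]
minmax = (income - income.min()) / (income.max() - income.min())
# 73,600 -> (73600 - 12000) / (98000 - 12000) = 0.716

# Z-score normalization
zscore = (income - income.mean()) / income.std()
```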
Demonstration
• Normalize
– Weka → diabetes
– Normalize “age”
• Standardize
– Weka → diabetes
– Standardize “age” (vs “Normalize” method)
27
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
28
Sampling
• Big data era: too expensive (or even infeasible) to
process the entire data set
• Sampling: obtaining a small sample to represent the
entire data set (undersampling)
• Oversampling is also required in some scenarios,
such as the class imbalance problem
– E.g., 1,000 HIV test results: 5 positive, 995 negative
29
Sampling Principle
Key principle for effective sampling:
• Using a sample will work almost as well as using the
entire data set, if the sample is representative
• A sample is representative if it has approximately the
same property (of interest) as the original set of data
30
Types of Sampling (1)
• Random sampling without replacement
– As each example is selected, it is removed from the population
• Random sampling with replacement
– Examples are not removed from the population after being selected
• The same example can be picked up more than once
31
[Figure: raw data sampled with and without replacement]
Types of Sampling (2)
• Stratified sampling
– Split the data into several partitions; then draw random
samples from each partition
32
[Figure: raw data partitioned into strata, with random samples drawn from each]
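A pandas sketch of all three sampling types, using a made-up imbalanced data set shaped like the HIV example:

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000),
                   "label": ["pos"] * 5 + ["neg"] * 995})

without_repl = df.sample(n=100, replace=False, random_state=0)
with_repl    = df.sample(n=100, replace=True,  random_state=0)

# Stratified: sample 20% within each class, preserving the class ratio
stratified = (df.groupby("label", group_keys=False)
                .apply(lambda g: g.sample(frac=0.2, random_state=0)))
```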
Demonstration
• Resample
– UCI → waveform-5000
– Undersampling (with or without replacement)
33
Dimensionality Reduction
• Purpose:
– Reduce the amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
• Techniques
– Feature Selection
– Feature Extraction
34
Feature Selection
• Redundant features
– Duplicated information contained in different features
– E.g., “Age”, “Year of Birth”; “Purchase price”, “Sales tax”
• Irrelevant features
– Containing no information that is useful for the task
– E.g., a student's ID is irrelevant to predicting GPA
• Goal:
– A minimum set of features containing all (most)
information
35
Heuristic Search in Feature Selection
• Given d features, there are 2^d possible feature
combinations
– Exhaustive search won’t work
– Heuristics have to be applied
• Typical heuristic feature selection methods:
– Feature ranking
– Forward feature selection
– Backward feature elimination
– Bidirectional search (selection + elimination)
– Search based on evolutionary algorithms
– ……
36
Feature Ranking
• Steps:
1) Rank all the individual features according to certain criteria
(e.g., information gain, gain ratio, χ²)
2) Select / keep top N features
• Properties:
– Usually independent of the learning algorithm to be used
– Efficient (no search process)
– Hard to determine the threshold
– Unable to consider correlation between features
37
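A minimal scikit-learn sketch of feature ranking; mutual information stands in for information gain here, and the iris data is just a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Rank all features individually, then keep the top N = 2
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_top = selector.fit_transform(X, y)
print(selector.scores_)  # per-feature scores, independent of any learner
```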
Forward Feature Selection
• Steps:
1) First select the best single feature (according to the learning
algorithm)
2) Repeat (until some stop criterion is met):
Select the next best feature, given the already picked features
• Properties:
– Usually learning algorithm dependent
– Feature correlation is considered
– More reliable
– Inefficient
38
Backward Feature Elimination
• Steps:
1) First build a model based on all the features
2) Repeat (until some criterion is met):
Eliminate the feature that makes the least contribution.
• Properties:
– Usually learning algorithm dependent
– Feature correlation is considered
– More reliable
– Inefficient
39
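Both search directions are available in scikit-learn's SequentialFeatureSelector; this sketch wraps a k-NN learner (the data set, learner, and target of 2 features are arbitrary choices).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward feature selection: greedily add the most helpful feature
ffs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward").fit(X, y)

# Backward feature elimination: start with all, drop the least useful
bfe = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="backward").fit(X, y)
print(ffs.get_support(), bfe.get_support())  # masks of kept features
```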
Filter vs Wrapper Model
• Filter model
– Separating feature selection from learning
– Relying on general characteristics of data (information, etc.)
– No bias toward any learning algorithm, fast
– Feature ranking usually falls into this category
• Wrapper model
– Relying on a predetermined learning algorithm
– Using predictive accuracy as goodness measure
– High accuracy, computationally expensive
– FFS, BFE usually fall into this category
40
Demonstration
• Feature ranking
– Weka → weather
– ChiSquared, InfoGain, GainRatio
• FFS & BFE
– Weka → diabetes
– ClassifierSubsetEval + GreedyStepwise
41
Feature Extraction
• Map original high-dimensional data onto a lower-
dimensional space
– Generate a (smaller) set of new features
– Preserve all (most) information from the original data
• Techniques
– Principal Component Analysis (PCA)
– Canonical Correlation Analysis (CCA)
– Linear Discriminant Analysis (LDA)
– Independent Component Analysis (ICA)
– Manifold Learning
– ……
42
Principal Component Analysis (PCA)
• Find a projection that captures the largest amount of variation
in data
• The original data are projected onto a much smaller space,
resulting in dimensionality reduction.
43
[Figure: 2-D data on axes x1 and x2 projected onto the principal direction e]
Principal Component Analysis (Steps)
• Given data from n-dimensions (n features), find k ≤ n new
features (principal components) that can best represent data
– Normalize input data: each feature falls within the same range
– Compute k principal components (details omitted)
– Each input data is projected in the new k-dimensional space
– The new features (principal components) are sorted in order of
decreasing “significance” or strength
– Eliminate weak components / features to reduce dimensionality.
• Works for numeric data only
44
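The steps above map directly onto scikit-learn; in this sketch the data set and k = 2 are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# Normalize first so no single feature dominates the variance
X_std = StandardScaler().fit_transform(X)

# Keep the k = 2 principal components capturing the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)  # "significance" of each component
```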
PCA Demonstration
• UCI → breast-w
– Accuracy with all features
– PrincipalComponents (data transformation)
– Visualize/save transformed data (first two features, last
two features)
– Accuracy with all transformed features
– Accuracy with top 1 or 2 feature(s)
45
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
46
Summary
• Data (features and instances)
• Data Cleaning: missing values, noise / outliers
• Data Transformation: aggregation, type conversion,
normalization
• Data Reduction
– Sampling: random sampling with replacement, random
sampling without replacement, stratified sampling
– Dimensionality reduction:
• Feature Selection: Feature ranking, FFS, BFE
• Feature Extraction: PCA
47
Notes
• In real-world applications, data preprocessing usually
accounts for about 70% of the workload in a data mining task.
• Domain knowledge is usually required to do good
data preprocessing.
• To improve the predictive performance of a model:
– Improve the learning algorithms (different algorithms,
different parameters)
• Most data mining research focuses here
– Improve data quality: data preprocessing
• Deserves more attention!
48