Data preprocessing in Data Mining
• Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data
• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data transformation
  - Normalization and aggregation
• Data reduction
  - Obtains a representation that is much reduced in volume but produces the same or similar analytical results
• Data discretization
  - Part of data reduction, with particular importance for numerical data
Data preprocessing in Data Mining
• Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
• Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - failure to register the history or changes of the data
• Missing data may need to be inferred.
• Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the most probable value to fill in the missing value: inference-based, e.g., a Bayesian formula or decision tree
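As a rough sketch (not from the slides), here is how the attribute-mean and most-probable-value fills might look in pandas; the table, the column names, and the per-group mean used as a stand-in for an inference-based estimate are all invented:

```python
import pandas as pd

# Hypothetical sales data; 'income' has missing values (names are invented).
df = pd.DataFrame({
    "income": [52000, None, 61000, None, 48000],
    "region": ["N", "N", "S", "S", "S"],
})

# Strategy: fill with the attribute mean.
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Strategy: fill with the "most probable" value, here approximated by the
# per-region mean (a crude stand-in for a Bayesian or decision-tree estimate).
df["income_inferred"] = df["income"].fillna(
    df.groupby("region")["income"].transform("mean")
)
print(df)
```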
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
• Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data
• Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
  - detect and remove outliers
• Combined computer and human inspection
  - detect suspicious values and have a human check them
• Regression
  - smooth by fitting the data to regression functions
• Equal-width (distance) partitioning:
  - divides the range into N intervals of equal size: a uniform grid
  - if A and B are the lowest and highest values of the attribute, the interval width is W = (B - A)/N
  - the most straightforward approach
  - but outliers may dominate the presentation
  - skewed data is not handled well
• Equal-depth (frequency) partitioning:
  - divides the range into N intervals, each containing approximately the same number of samples
  - good data scaling
  - managing categorical attributes can be tricky
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
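A minimal Python sketch that reproduces the equi-depth bins and the two smoothings above; the helper functions and their names are our own:

```python
# Equi-depth binning with smoothing, reproducing the price example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    """Sort the data and split it into n_bins bins of equal size
    (any remainder is ignored in this sketch)."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the nearer of the two bin boundaries."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

bins = equi_depth_bins(prices, 3)
print(bins)                       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins)) # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```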
• Data integration:
  - combines data from multiple sources into a coherent store
• Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources differ
  - possible reasons: different representations, different scales, e.g., metric vs. British units
• Redundant data often occur when integrating multiple databases
  - The same attribute may have different names in different databases
  - Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
• Smoothing: remove noise from the data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
• min-max normalization:
    v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
• z-score normalization:
    v' = (v - mean_A) / stand_dev_A
• normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
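The three formulas translate directly into code. A small NumPy sketch, with invented example values:

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Min-max normalization: map values linearly onto [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Z-score normalization: zero mean, unit standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Divide by 10^j, with j the smallest integer making max(|v'|) < 1."""
    v = np.asarray(v, dtype=float)
    j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
    return v / 10 ** j

values = [200, 300, 400, 600, 1000]
print(min_max(values))          # [0.    0.125 0.25  0.5   1.   ]
print(z_score(values))
print(decimal_scaling(values))  # [0.02 0.03 0.04 0.06 0.1 ]
```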
• A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
  - obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
  - data cube aggregation
  - dimensionality reduction
  - numerosity reduction
  - discretization and concept hierarchy generation
• The lowest level of a data cube
  - the aggregated data for an individual entity of interest
  - e.g., a customer in a phone-call data warehouse
• Multiple levels of aggregation in data cubes
  - further reduce the size of the data to deal with
• Reference appropriate levels
  - use the smallest representation that is sufficient to solve the task
• Feature selection (i.e., attribute subset selection):
  - select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features
  - reduces the number of patterns, making them easier to understand
[Figure: decision-tree induction on the initial attribute set {A1, A2, A3, A4, A5, A6}; branching on A4, then A1 and A6, separates Class 1 from Class 2, yielding the reduced attribute set {A1, A4, A6}]
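One way to realize the induction idea in the figure (our choice of tool, not the slides'): fit a shallow decision tree and keep only the attributes it actually splits on. The data here is synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 6 attributes, but the class depends on only a few of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                    # columns A1..A6
y = (X[:, 0] + X[:, 3] - X[:, 5] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
keep = np.flatnonzero(tree.feature_importances_ > 0)
print([f"A{i + 1}" for i in keep])               # e.g. ['A1', 'A4', 'A6']
```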
• Linear regression: data are modeled to fit a straight line
  - often uses the least-squares method to fit the line
• Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
• Log-linear model: approximates discrete multidimensional probability distributions
• Linear regression: Y = α + β X
  - the two parameters α and β specify the line and are estimated from the data at hand
  - apply the least-squares criterion to the known values Y1, Y2, …, X1, X2, …
• Multiple regression: Y = b0 + b1 X1 + b2 X2
  - many nonlinear functions can be transformed into the above
• Log-linear models:
  - the multi-way table of joint probabilities is approximated by a product of lower-order tables
  - probability: p(a, b, c, d) = αab βac γad δbcd
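A brief NumPy illustration of the least-squares fit for Y = α + β X, plus a multiple regression solved with np.linalg.lstsq; the data points are made up:

```python
import numpy as np

# Least-squares fit of Y = alpha + beta * X (made-up data points).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Closed-form least-squares estimates for a straight line.
beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()
print(alpha, beta)       # approx. 0.09 and 1.99

# Multiple regression Y = b0 + b1*X1 + b2*X2 via the least-squares solver.
X2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
A = np.column_stack([np.ones_like(X), X, X2])
b0, b1, b2 = np.linalg.lstsq(A, Y, rcond=None)[0]
print(b0, b1, b2)
```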
• A popular data reduction technique
• Divide the data into buckets and store the average (or sum) for each bucket
• Can be constructed optimally in one dimension using dynamic programming
• Related to quantization problems
[Figure: histogram of prices, with counts (0-40) on the y-axis and buckets from 10,000 to 90,000 on the x-axis]
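A minimal sketch of the bucket idea using NumPy's equi-width histogram (the prices are synthetic; the optimal dynamic-programming construction mentioned above is not shown):

```python
import numpy as np

# Synthetic prices; summarize them as equi-width buckets, storing only
# a count and an average per bucket instead of the raw values.
rng = np.random.default_rng(0)
prices = rng.integers(10_000, 100_000, size=1_000)

counts, edges = np.histogram(prices, bins=5, range=(10_000, 100_000))
sums, _ = np.histogram(prices, bins=5, range=(10_000, 100_000), weights=prices)
for lo, hi, c, s in zip(edges[:-1], edges[1:], counts, sums):
    print(f"[{lo:>7,.0f}, {hi:>7,.0f}): count={c}, avg={s / c:,.0f}")
```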
• Partition the data set into clusters, and store only the cluster representations
• Can be very effective if the data is clustered, but not if the data is “smeared”
• Hierarchical clustering is possible, stored in multi-dimensional index tree structures
• There are many choices of clustering definitions and clustering algorithms, detailed further in Chapter 8
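The slide names no particular algorithm; as one possibility, a k-means sketch with scikit-learn that keeps only the centroids as the stored representation (the blob data is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three blobs; keep only the cluster centroids
# (3 points) as the reduced representation of 300 points.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in (0.0, 3.0, 6.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(km.cluster_centers_)   # the stored representation
print(km.labels_[:10])       # which cluster each original point maps to
```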
• Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
• Choose a representative subset of the data
  - simple random sampling may perform very poorly in the presence of skew
  - develop adaptive sampling methods instead
• Stratified sampling:
  - approximate the percentage of each class (or subpopulation of interest) in the overall database
  - used in conjunction with skewed data
[Figure: sampling a representative subset from the raw data]
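A short pandas sketch of stratified sampling on a skewed class distribution; the 90/10 split and the 20% sampling rate are invented:

```python
import pandas as pd

# Hypothetical data with a skewed class distribution (90% A, 10% B).
df = pd.DataFrame({
    "cls": ["A"] * 90 + ["B"] * 10,
    "x": range(100),
})

# Stratified 20% sample: draw 20% from each class, so class proportions
# in the sample match those of the overall database.
sample = df.groupby("cls").sample(frac=0.2, random_state=0)
print(sample["cls"].value_counts())   # A: 18, B: 2
```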
• Three types of attributes:
  - nominal: values from an unordered set
  - ordinal: values from an ordered set
  - continuous: real numbers
• Discretization: divide the range of a continuous attribute into intervals
  - some classification algorithms only accept categorical attributes
  - reduces data size
  - prepares for further analysis
• Discretization
  - reduce the number of values for a given continuous attribute by dividing its range into intervals; interval labels can then be used to replace actual data values
• Concept hierarchies
  - reduce the data by collecting low-level concepts (such as numeric values for the attribute age) and replacing them with higher-level concepts (such as young, middle-aged, or senior)
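For instance, discretizing age into the young / middle-aged / senior hierarchy might look like this in pandas; the cut points 30 and 55 are our assumption, not from the slides:

```python
import pandas as pd

ages = pd.Series([22, 35, 47, 58, 63, 29, 71])

# Discretization: replace numeric ages by interval labels, climbing one
# level up the concept hierarchy (cut points are assumed for illustration).
labels = pd.cut(ages, bins=[0, 30, 55, 120],
                labels=["young", "middle-aged", "senior"])
print(labels.tolist())
```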