Multi-Dimensional Measure of Data Quality
• A well-accepted multi-dimensional view:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Accessibility
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtain a representation that is reduced in volume but produces the same or similar analytical results
• Data discretization (for numerical data)
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization
• Summary
Data Cleaning
• Importance
• “Data cleaning is the number one problem in data warehousing”
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Missing Data
• Data is not always available
• E.g., many tuples have no recorded values for several attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• values that were inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• data not considered important at the time of entry
• history or changes of the data not being registered
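The "fill in missing values" task can be sketched as follows. This is only an illustration: the records are hypothetical, and replacing a missing value with the attribute mean is one common choice that the slides do not prescribe.

```python
from statistics import mean

# Hypothetical sales records; None marks a value that was never recorded.
records = [
    {"customer": "c1", "income": 52000},
    {"customer": "c2", "income": None},
    {"customer": "c3", "income": 61000},
    {"customer": "c4", "income": None},
]

def count_missing(rows, attribute):
    """Count the tuples that have no recorded value for the attribute."""
    return sum(1 for row in rows if row[attribute] is None)

def fill_with_mean(rows, attribute):
    """Return copies of the rows with missing values replaced by the attribute mean."""
    known = [row[attribute] for row in rows if row[attribute] is not None]
    fill_value = mean(known)  # assumption: mean imputation, not the only option
    return [
        {**row, attribute: fill_value if row[attribute] is None else row[attribute]}
        for row in rows
    ]

filled = fill_with_mean(records, "income")
print(count_missing(records, "income"))    # 2
print([row["income"] for row in filled])   # missing incomes become 56500
```

The original records are left unchanged, so the filled copy can be compared against them.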
Noisy Data
• Noise: random error or variance in a measured variable.
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• etc.
• Other data problems that require data cleaning
• duplicate records, incomplete data, inconsistent data
How to Handle Noisy Data?
• Binning method:
• first sort data and partition into (equi-depth) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means (rounded to the nearest dollar):
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries (each value is replaced by the closer bin boundary):
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34
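The price example can be reproduced with a short sketch. The function names are illustrative, the bin means are rounded to whole dollars as on the slide, and ties between the two boundaries are sent to the lower one (an assumption; the slide's data contains no tie).

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    depth = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * depth:(i + 1) * depth] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```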
Outlier Removal
• Data points inconsistent with the majority of data
• Different kinds of outliers
• Noisy: e.g., a person’s age = 200, or other widely deviating points
• Removal methods
• Clustering
• Curve-fitting
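A minimal sketch of the curve-fitting idea, under stated assumptions: a straight line is fitted by least squares, and a point is flagged when its residual exceeds k standard deviations of all residuals. The choice of a line, the threshold k = 2, and the data are illustrative, not taken from the slides.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def outlier_indices(xs, ys, k=2.0):
    """Indices of points whose residual from the fitted line exceeds k standard deviations."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > k * spread]

xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 60  # hypothetical corrupted measurement
print(outlier_indices(xs, ys))  # [5]
```

Because the outlier also pulls the fitted line toward itself, a robust fit or a second pass after removal may work better on real data.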
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization
Data Integration
• Data integration:
• combines data from multiple sources
• Schema integration
• integrate metadata from different sources
• Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
• for the same real-world entity, attribute values from different sources are different, e.g., different scales, metric vs. British units
• Removing duplicates and redundant data
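The three integration steps can be sketched on two small, hypothetical sources. The field names other than cust-id and cust-#, the sample rows, and the policy of keeping source A's record for a duplicate are all assumptions made for illustration.

```python
# Hypothetical source A: key is "cust-id", weight recorded in kilograms.
source_a = [
    {"cust-id": 1, "name": "Ann", "weight_kg": 60.0},
    {"cust-id": 2, "name": "Bob", "weight_kg": 82.0},
]
# Hypothetical source B: key is "cust-#", weight recorded in pounds.
source_b = [
    {"cust-#": 2, "name": "Bob", "weight_lb": 180.0},
    {"cust-#": 3, "name": "Cy", "weight_lb": 150.0},
]

LB_TO_KG = 0.45359237  # definition of the international pound

def integrate(a_rows, b_rows):
    """Merge both sources into one table keyed by customer id, with weights in kilograms."""
    merged = {}
    for row in a_rows:
        merged[row["cust-id"]] = {"name": row["name"], "weight_kg": row["weight_kg"]}
    for row in b_rows:
        key = row["cust-#"]  # entity identification: cust-# names the same entity as cust-id
        if key in merged:
            continue  # duplicate entity: keep source A's record (an arbitrary policy)
        # Value conflict: convert pounds to kilograms before storing.
        merged[key] = {"name": row["name"], "weight_kg": round(row["weight_lb"] * LB_TO_KG, 1)}
    return merged

customers = integrate(source_a, source_b)
print(sorted(customers))  # [1, 2, 3]
```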
Data Transformation
• Smoothing: remove noise from data
• Normalization: values are scaled to fall within a small, specified range (e.g., -1.0 to 1.0 or 0.0 to 1.0)
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: summarization
• Generalization: concept hierarchy climbing
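Normalization into a specified range can be illustrated with min-max scaling. The slides do not name a particular method, so min-max is an assumption here, and the sample values are made up.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that the minimum maps to new_min and the maximum to new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        raise ValueError("cannot normalize a constant attribute")
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

values = [10, 20, 30, 50]  # hypothetical attribute values
print(min_max_normalize(values))             # [0.0, 0.25, 0.5, 1.0]
print(min_max_normalize(values, -1.0, 1.0))  # [-1.0, -0.5, 0.0, 1.0]
```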
CS583, Bing Liu, UIC
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization
• Summary
Data Reduction Strategies
• Data may be too big to work with directly
• Data reduction
• Obtain a reduced representation of the data set that is much smaller
in volume yet produces the same (or almost the same) analytical
results
• Data reduction strategies
• Dimensionality reduction — remove unimportant attributes
• Aggregation and clustering
• Sampling
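Of the strategies listed, sampling is the simplest to sketch. The example below draws a simple random sample without replacement. The data set, the sample size, and the fixed seed are placeholders chosen for illustration.

```python
import random

def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

data = list(range(1000))  # stand-in for a large data set
sample = simple_random_sample(data, 50)
print(len(sample))  # 50
```

Whether a sample of a given size preserves the analytical results depends on the data and the task, so the size should be checked against the analysis it is meant to support.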
Dimensionality Reduction
• Feature selection (i.e., attribute subset selection):
• Select a minimum set of attributes (features) that is sufficient
for the data mining task.
Clustering
• Partition the data set into clusters
Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization
Discretization
• Three types of attributes:
• Nominal — values from an unordered set
• Ordinal — values from an ordered set
• Continuous — real numbers
• Discretization:
• divide the range of a continuous attribute into intervals because
some data mining algorithms only accept categorical attributes.
• Some techniques:
• Binning methods – equal-width, equal-frequency
• Entropy-based methods – entropy measures the uncertainty associated
with a set of data
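The two binning techniques and the entropy measure can be sketched as follows. The function names are illustrative. The entropy function only computes the uncertainty of a set of class labels and is not a full entropy-based discretization procedure. Equal-width binning here assumes the attribute is not constant.

```python
from collections import Counter
from math import log2

def equal_width_labels(values, n_bins):
    """Assign each value the index of its equal-width interval."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes high > low
    # The maximum value is clamped into the last interval.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

def equal_frequency_labels(values, n_bins):
    """Assign interval indices so that each interval holds about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_labels(prices, 3))      # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2]
print(equal_frequency_labels(prices, 3))  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
print(entropy(["a", "a", "b", "b"]))      # 1.0
```

On the same data, equal-width intervals can hold very different numbers of values, while equal-frequency intervals hold the same number but differ in width.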
Discretization and Concept Hierarchy
• Discretization
• reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval labels
can then be used to replace actual data values
• Concept hierarchies
• reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior)
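Replacing numeric ages by higher-level concepts can be sketched as a simple mapping. The cutoffs 35 and 60 are assumptions chosen for illustration. The slides name the concepts but not their boundaries.

```python
def age_concept(age):
    """Map a numeric age to a higher-level concept (cutoffs are illustrative)."""
    if age < 35:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [23, 41, 67, 35]
concepts = [age_concept(a) for a in ages]
print(concepts)  # ['young', 'middle-aged', 'senior', 'middle-aged']
```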
Summary of Data Preprocessing
• Data preparation is a big issue for data mining
• Data preparation includes
• Data cleaning and data integration
• Data reduction and feature selection
• Discretization
• Many methods have been proposed, but it is still an active
area of research.
