Elective-I
Examination Scheme:
In-semester Assessment: 30
End-semester Assessment: 70
Text Books:
Data Mining: Concepts and Techniques - Micheline Kamber
Introduction to Data Mining with Case Studies - G. K. Gupta
Reference Books:
Mining the Web: Discovering Knowledge from Hypertext Data - Soumen Chakrabarti
Reinforcement and Systemic Machine Learning for Decision Making - Parag Kulkarni
 Data mining described
 Need of data mining
 Kinds of pattern and technologies
 Issues in mining
 KDD vs. Data Mining
 Machine learning Concepts
 OLAP
 Knowledge Representation
 Data Preprocessing: Cleaning, Integration, Reduction, Transformation and Discretization
 Application with mining aspect (Weather Prediction)
 Data : Data are any facts, numbers, or text that can be processed by a computer.
 operational or transactional data, such as sales, cost, inventory, payroll, and accounting
 nonoperational data, such as industry sales, forecast data, and macroeconomic data
 metadata: data about the data itself, such as logical database design or data dictionary definitions
 Information: The patterns, associations, or relationships among all this data can provide information.
 Knowledge: Information can be converted into knowledge about historical patterns
and future trends. For example, summary information on retail supermarket sales
can be analyzed in terms of promotional efforts to provide knowledge of consumer
buying behavior.
 Thus, a manufacturer or retailer could determine which items are most susceptible
to promotional efforts.
 Data Warehouses: Data warehousing is defined as a process of centralized data
management and retrieval.
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
▪ Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data
▪ Business: Web, e-commerce, transactions, stocks, …
▪ Science: Remote sensing, bioinformatics, scientific
simulation, …
▪ Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
Data mining is the process of sorting through large amounts of data and picking out relevant information.
In other words…
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
 Other names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Searching through large amounts of data for
correlations, sequences, and trends.
Current “driving applications” in sales (targeted
marketing, inventory) and finance (stock picking)
[Figure: mining workflow. Select the information to be mined (e.g., sales data); choose a mining tool based on the type of results wanted (sequence, classify, inference, cluster); evaluate results. Example of a sequence result: “70% of customers who purchase comforters later purchase curtains.”]
Data Rich, Information Poor
Data Mining process
KDD process includes
 data cleaning (to remove noise and inconsistent data)
 data integration (where multiple data sources may be combined)
 data selection (where data relevant to the analysis task are retrieved from the database)
 data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations)
 data mining (an essential process where intelligent methods are applied in order to extract data patterns)
 pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
 knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
Data mining is the core of the knowledge discovery process.
Knowledge Discovery (KDD) Process
[Figure: KDD pipeline. Databases -> data cleaning and data integration -> data warehouse -> selection -> task-relevant data -> data mining -> pattern evaluation.]
1. Data cleaning – to remove noise and inconsistent data
2. Data integration – to combine multiple sources
3. Data selection – to retrieve relevant data for analysis
4. Data transformation – to transform data into appropriate form for data mining
5. Data mining
6. Evaluation
7. Knowledge presentation
 Steps 1 to 4 are different forms of data preprocessing
 Although data mining is only one step in the entire process, it is an essential one since it uncovers hidden patterns for evaluation
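The seven steps can be sketched end to end on a toy data set. This is a minimal illustration rather than a real mining system: the records, the function names (`clean`, `integrate`, `select`, `transform`, `mine`) and the `min_count` threshold are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
store_a = [
    {"cust": 1, "item": "computer", "price": 900.0},
    {"cust": 2, "item": "software", "price": None},   # missing value
]
store_b = [
    {"cust": 3, "item": "computer", "price": 950.0},
    {"cust": 3, "item": "computer", "price": 950.0},  # duplicate record
]

def clean(records):
    """Step 1: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several sources and drop exact duplicates."""
    seen, merged = set(), []
    for source in sources:
        for record in source:
            key = tuple(sorted(record.items()))
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

def select(records, fields):
    """Step 3: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate the records into per-item counts."""
    return Counter(r["item"] for r in records)

def mine(counts, min_count):
    """Step 5: keep items that occur at least min_count times."""
    return {item: n for item, n in counts.items() if n >= min_count}

data = integrate(clean(store_a), clean(store_b))
counts = transform(select(data, ["item"]))
patterns = mine(counts, min_count=2)

# Steps 6 and 7: evaluation and presentation are reduced to a printout here.
print(patterns)
```

Steps 1 to 4 only prepare the data; the pattern itself appears in step 5, which is why the slides call data mining the essential step.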
 Based on this view, the architecture of a typical data mining system
may have the following major components:
 Database, data warehouse, world wide web, or other information repository
 Database or data warehouse server
 Data mining engine
 Pattern evaluation module
 User interface
 Relational Database
 Data Warehouses
 Transactional Databases
 Advanced data and information systems
 Object-oriented database
 Temporal DB, Sequence DB and Time-series DB
 Spatial DB
 Text DB and Multimedia DB
 … and WWW
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.]
 In general, data mining tasks can be classified into two categories:
descriptive and predictive
 Descriptive mining tasks characterize the general properties of the data in the database
 Predictive mining tasks perform inference on the current data in order to make predictions
 Class Description: Characterization and Discrimination
 Mining Frequent Patterns, Associations and correlations
 Classification and Prediction
 Cluster Analysis
 Outlier Analysis
 Evolution Analysis
 Data Characterization: A data mining system should be able to produce a description summarizing the characteristics of a target class, such as customers.
 Example: The characteristics of customers who spend more than $1000 a year at a store called AllElectronics. The result can be a general profile covering attributes such as age, employment status or credit rating.
 Data Discrimination: A comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. The user can specify the target and contrasting classes.
 Example: The user may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by about 30% in the same period.
Frequent Patterns: as the name suggests, patterns that occur frequently in data.
Association Analysis: from a marketing perspective, determining which items are frequently purchased together within the same transaction.
Example: A rule mined from the AllElectronics transactional database:
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
 X represents a customer
 Confidence = 50% means that if a customer buys a computer, there is a 50% chance that he/she will buy software as well.
 Support = 1% means that 1% of all the transactions under analysis showed that computer and software were purchased together.
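Support and confidence can be computed directly from a list of transactions. The sketch below uses the standard definitions (support of a rule is the fraction of transactions containing all of its items; confidence is that support divided by the support of the left-hand side). The four transactions and the function names are made up for illustration and are not the AllElectronics data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    left = support(transactions, antecedent)
    return both / left if left else 0.0

# Hypothetical transactions (illustrative only).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```

In this toy data one of four transactions contains both items, and one of the two computer purchases also includes software.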
 Another example, a multidimensional rule:
 age(X, “20…29”) ∧ income(X, “20K…29K”) ⇒ buys(X, “CD player”) [support = 2%, confidence = 60%]
 For customers aged 20 to 29 with an income of $20,000 to $29,000, there is a 60% chance that they will purchase a CD player; 2% of all the transactions under analysis showed a customer in this age and income group buying a CD player.
 Classification is the process of finding a model that describes and distinguishes data classes or concepts. This model is then used to predict the class of objects whose class label is unknown.
 A classification model can be represented in various forms, such as
 IF-THEN rules
 A decision tree
 A neural network
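As a small illustration of the IF-THEN form, the sketch below learns one rule per value of a single attribute by predicting the majority class for that value (a one-attribute learner in the style of 1R). The slides do not name this method; the training rows, attribute names and function names are assumptions made for the example.

```python
from collections import Counter, defaultdict

def learn_rules(rows, attribute, label):
    """Build IF attribute == value THEN class rules using the majority class."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attribute]][row[label]] += 1
    return {value: counts.most_common(1)[0][0]
            for value, counts in by_value.items()}

def predict(rules, row, attribute, default=None):
    """Apply the learned rules to an object whose class label is unknown."""
    return rules.get(row[attribute], default)

# Hypothetical labelled training data (illustrative only).
training = [
    {"student": "yes", "buys_computer": "yes"},
    {"student": "yes", "buys_computer": "yes"},
    {"student": "yes", "buys_computer": "no"},
    {"student": "no", "buys_computer": "no"},
    {"student": "no", "buys_computer": "no"},
]

rules = learn_rules(training, "student", "buys_computer")
for value, cls in rules.items():
    print(f"IF student = {value} THEN buys_computer = {cls}")

print(predict(rules, {"student": "yes"}, "student"))
```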
 Clustering analyzes data objects without consulting a known class label.
 Example: Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing.
The figure shows a 2-D plot of
customers with respect to customer
locations in a city.
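One common way to form such groups is k-means, which repeatedly assigns each point to its nearest centroid and then moves each centroid to the mean of its points. The slides do not name a specific algorithm, so k-means is an assumption here, as are the customer coordinates and the starting centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                            + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customer locations in a city (illustrative only).
locations = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(locations, centroids=[(0, 0), (10, 10)])
print(clusters)
```

No class labels are used anywhere: the two groups emerge from the distances alone.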
 Outlier Analysis: A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
 Example: Detecting fraudulent usage of credit cards. Outlier analysis may uncover fraudulent usage by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase or the purchase frequency.
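A very simple version of the credit card check compares each charge with the account's typical charge. The rule below (flag anything above `factor` times the median) is a hypothetical threshold chosen for illustration, not a method given in the slides; the charges are made up.

```python
from statistics import median

def flag_large_charges(charges, factor=5.0):
    """Return charges that exceed factor times the account's median charge."""
    typical = median(charges)
    return [c for c in charges if c > factor * typical]

# Hypothetical charges for one account (illustrative only).
charges = [25.0, 40.0, 32.5, 18.0, 2900.0, 45.0]
suspicious = flag_large_charges(charges)
print(suspicious)
```

The median is used instead of the mean because a single extreme charge would pull the mean upward and partly hide itself.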
Data mining includes many techniques from the domains below:
 Statistics
 Machine Learning
 Database Systems and Data Warehouses
 Information Retrieval
 Visualization
 High performance computing
 Statistics: It studies the collection, analysis, interpretation and presentation of data.
 Statistical research develops tools for prediction and forecasting using data.
 Statistical methods can also be used to verify data mining results.
 Information Retrieval: It is the science of searching for documents or for information in documents.
 Database Systems and Data Warehouses: This research focuses on the creation, maintenance and use of databases for organizations and end users.
 Machine Learning: It investigates how computers can learn or improve their performance based on data.
 KDD (Knowledge Discovery in Databases) is a field of computer science that includes the tools and theories to help humans extract useful and previously unknown information (i.e., knowledge) from large collections of digitized data.
 KDD consists of several steps, and Data Mining is one of them.
 This process deals with mapping low-level data into other forms that are more compact, abstract and useful. This is achieved by creating short reports, modelling the process that generates the data, and developing predictive models that can predict future cases.
 Data Mining: the application of a specific algorithm in order to extract patterns from data.
 Although the terms KDD and Data Mining are often used interchangeably, they refer to two related yet slightly different concepts. KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process that deals with identifying patterns in data. In other words, Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process.
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization
 Summary
 Data in the real world is dirty
 incomplete: missing attribute values, lack of certain attributes of interest, or containing
only aggregate data
▪ e.g., occupation=“”
 noisy: containing errors or outliers
▪ e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
▪ e.g., Age=“42”, Birthday=“03/07/1997”
▪ e.g., was rating “1, 2, 3”, now rating “A, B, C”
▪ e.g., discrepancy between duplicate records
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
▪ e.g., duplicate or missing data may cause incorrect or even misleading statistics.
 Data preparation, cleaning, and transformation comprise the majority of the work in a data mining application (around 90%).
 A well-accepted multi-dimensional view of data quality:
 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Value added
 Accessibility
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve
inconsistencies
 Data integration
 Integration of multiple databases, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same or similar analytical results
 Data discretization (for numerical data)
 Importance
 “Data cleaning is the number one problem in data warehousing”
 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
 Data is not always available
 E.g., many tuples have no recorded values for several attributes, such as customer income in
sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of entry
 history or changes of the data not being registered
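One simple way to fill in missing values is to replace them with the mean of the recorded values for that attribute. The slides list filling in missing values as a cleaning task without prescribing a method, so mean imputation is an assumption here; the income figures and the function name are made up.

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the recorded values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing recorded, so nothing to impute from
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# Hypothetical customer incomes with two unrecorded values.
incomes = [30000, None, 50000, None, 40000]
filled = fill_with_mean(incomes)
print(filled)
```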
 Noise: random error or variance in a measured variable.
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 etc.
 Other data problems that require data cleaning
 duplicate records, incomplete data, inconsistent data
 Binning method:
 first sort data and partition into (equi-depth) bins
 then one can smooth by bin means, smooth by bin median, smooth by bin boundaries,
etc.
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal with possible outliers)
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Partition into (equi-depth) bins:
 Bin 1: 4, 8, 9, 15
 Bin 2: 21, 21, 24, 25
 Bin 3: 26, 28, 29, 34
 Smoothing by bin means:
 Bin 1: 9, 9, 9, 9
 Bin 2: 23, 23, 23, 23
 Bin 3: 29, 29, 29, 29
 Smoothing by bin boundaries:
 Bin 1: 4, 4, 4, 15
 Bin 2: 21, 21, 25, 25
 Bin 3: 26, 26, 26, 34
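The worked example above can be reproduced in a few lines. This sketch assumes the number of values divides evenly into the bins, rounds bin means to whole dollars as the slide does, and breaks ties toward the lower boundary (no tie occurs in this data); the function names are illustrative.

```python
def equal_depth_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins holding the same number of values."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```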
 Outliers: data points inconsistent with the majority of the data
 Different kinds of outliers
 Noisy: e.g., a person's age = 200; widely deviating points
 Removal methods
 Clustering
 Curve-fitting
 Data integration:
 combines data from multiple sources
 Schema integration
 integrate metadata from different sources
 Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from different sources are different, e.g.,
different scales, metric vs. British units
 Removing duplicates and redundant data
 Smoothing: remove noise from data
 Normalization: values are scaled to fall within a small, specified range (e.g., -1.0 to 1.0 or 0.0 to 1.0)
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: summarization
 Generalization: concept hierarchy climbing
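Min-max scaling is one standard way to normalize an attribute into a specified range: v' = (v - min) / (max - min) * (new_max - new_min) + new_min. The sketch below is an illustration under that formula; the function name and sample values are assumptions.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    return [
        (v - low) / (high - low) * (new_max - new_min) + new_min
        for v in values
    ]

# Hypothetical attribute values (illustrative only).
values = [10, 20, 30]
print(min_max(values))             # scaled into 0.0 to 1.0
print(min_max(values, -1.0, 1.0))  # scaled into -1.0 to 1.0
```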
CS583, Bing Liu, UIC 56
 Data is too big to work with
 Data reduction
 Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
 Data reduction strategies
 Dimensionality reduction — remove unimportant attributes
 Aggregation and clustering
 Sampling
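Sampling is the easiest of these strategies to show in code. The sketch below draws a simple random sample without replacement using the standard library; the fixed `seed` (for repeatability) and the row data are assumptions for illustration.

```python
import random

def simple_random_sample(rows, n, seed=0):
    """Draw n distinct rows uniformly at random (without replacement)."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(rows, n)

# Hypothetical data set of 100 row identifiers.
rows = list(range(100))
sample = simple_random_sample(rows, 10)
print(len(sample), "of", len(rows), "rows kept")
```

Analyses run on the 10 sampled rows are cheaper, at the cost of some accuracy compared with using all 100.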
 Feature selection (i.e., attribute subset selection):
 Select a minimum set of attributes (features) that is sufficient for the data mining task.
 Clustering: partition the data set into clusters
 Three types of attributes:
 Nominal — values from an unordered set
 Ordinal — values from an ordered set
 Continuous — real numbers
 Discretization:
 divide the range of a continuous attribute into intervals because
some data mining algorithms only accept categorical attributes.
 Some techniques:
 Binning methods – equal-width, equal-frequency
 Entropy-based methods
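The two binning variants differ in what they hold constant: equal-width fixes the interval size, equal-frequency fixes the number of values per interval. The sketch below applies both to the price list used earlier in these slides; the function names are illustrative, and the last equal-width interval is assumed to be closed on the right.

```python
def equal_width_edges(values, k):
    """Interval boundaries for k intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / k
    return [low + i * width for i in range(k + 1)]

def assign_bin(value, edges):
    """Index of the interval containing value (last interval closed on the right)."""
    for i in range(len(edges) - 2):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2

def equal_frequency_labels(values, k):
    """Interval index per value so that each interval holds about the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = len(values) / k
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = min(int(rank / size), k - 1)
    return labels

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
edges = equal_width_edges(prices, 3)
width_labels = [assign_bin(p, edges) for p in prices]
freq_labels = equal_frequency_labels(prices, 3)
print(edges)
print(width_labels)
print(freq_labels)
```

With equal width the intervals are 4 to 14, 14 to 24 and 24 to 34, so the third interval receives six of the twelve prices; equal frequency puts four in each.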
 Discretization
 reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval labels
can then be used to replace actual data values
 Concept hierarchies
 reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior)
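Climbing a concept hierarchy for age can be sketched as a mapping from numeric values to the higher-level labels named above. The cut-off ages (35 and 59) are hypothetical; the slides do not say where young, middle-aged and senior begin.

```python
def age_concept(age, young_max=35, middle_max=59):
    """Replace a numeric age by a higher-level concept (cut-offs are assumed)."""
    if age <= young_max:
        return "young"
    if age <= middle_max:
        return "middle-aged"
    return "senior"

# Hypothetical ages (illustrative only).
ages = [25, 45, 70]
concepts = [age_concept(a) for a in ages]
print(concepts)
```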
 Data preparation is a big issue for data mining
 Data preparation includes
 Data cleaning and data integration
 Data reduction and feature selection
 Discretization
 Many methods have been proposed, but this is still an active area of research.

More Related Content

PPTX
Data mining an introduction
PPT
Data Mining- Unit-I PPT (1).ppt
PPTX
Data warehousing and mining furc
PPTX
Introduction to Data Mining and Data Warehousing
PDF
Lect 1 introduction
PPTX
Data Mining Intro
PPT
Introduction.ppt
PPT
Data Mining Xuequn Shang NorthWestern Polytechnical University
Data mining an introduction
Data Mining- Unit-I PPT (1).ppt
Data warehousing and mining furc
Introduction to Data Mining and Data Warehousing
Lect 1 introduction
Data Mining Intro
Introduction.ppt
Data Mining Xuequn Shang NorthWestern Polytechnical University

Similar to 1328cvkdlgkdgjfdkjgjdfgdfkgdflgkgdfglkjgld8679 - Copy.ppt (20)

PPTX
Lect 1 introduction
PPT
Introduction
PPT
Data Mining-2023 (2).ppt
PPT
Sanjeev Kumar Dash D ata Mining-2023.ppt
PDF
Data mining chapter for students of university
PPT
Dma unit 1
PPT
PPT
PPT
Chapter 01Intro.ppt full explanation used
DOCX
Seminar Report Vaibhav
PPTX
Chapter 1 - Introduction to Data Mining Concepts and Techniques.pptx
PPTX
Data mining & Decison Trees
PDF
Data Mining and its detail processes with steps
PPTX
Data mining
PPTX
PPTX
Lect 1 2 Data Mining.pptx for the predictive ananlysis
PPT
Data mining final year project in ludhiana
PPT
Data mining final year project in jalandhar
PPT
Unit 1 (Chapter-1) on data mining concepts.ppt
PPTX
Data Mining
Lect 1 introduction
Introduction
Data Mining-2023 (2).ppt
Sanjeev Kumar Dash D ata Mining-2023.ppt
Data mining chapter for students of university
Dma unit 1
Chapter 01Intro.ppt full explanation used
Seminar Report Vaibhav
Chapter 1 - Introduction to Data Mining Concepts and Techniques.pptx
Data mining & Decison Trees
Data Mining and its detail processes with steps
Data mining
Lect 1 2 Data Mining.pptx for the predictive ananlysis
Data mining final year project in ludhiana
Data mining final year project in jalandhar
Unit 1 (Chapter-1) on data mining concepts.ppt
Data Mining
Ad

More from JITENDER773791 (20)

PPTX
jkthsjlfd lectsdfdsfdsfdsfsdfdssfsure.pptx
PPTX
Lectureerdjkldfgjkkjkjkjdfgjlmfdgdfgker.pptx
PPTX
Lecturekjkljkljlkjknklnjkghvblkbbkbkjb.pptx
PPTX
Lecturedsfndskfjdsklfjldsdsfdsgmjdflgmdflmg.pptx
PPTX
Lecture (Additional)sdfjksjfkldsfsdf.pptx
PPTX
Lecture (Additional)fghgfhdfghgfhgfhgfh.pptx
PPTX
Analysdsdsdfgdfgdfgdfsgdfis of Data_2.pptx
PPTX
Analysis of hgfhgfhgfjgfjmghjghjghData_1.pptx
PPT
VR_Unit-1_Lec(9)_B_3D_sdfdsfsdfScanner.ppt
PPT
nkllml;m;llkmlmljkjiuhihkjnklnjkhjgjk.ppt
PDF
Unit-4.-Chi-squjkljl;jj;ljl;jlm;lml;mare.pdf
PPT
15hjkljklj'jklj'kljkjkljkljkljkl95867.ppt
PPTX
Lecture dsfgidsjfhjknflkdnkldnklnfklfndls.pptx
PPT
Chghjgkgyhbygukbhyvuhbbubnubuyuvyyvivlh06.ppt
PPT
lghjghgggkgjhgjghhjgjhgkhjghjghjghjghect1.ppt
PPT
inmlk;lklkjlk;lklkjlklkojhhkljkbjlkjhbtroDM.ppt
PPT
sequf;lds,g;'dsg;dlld'g;;gldgence - Copy.ppt
PPTX
howweveautosdfdgdsfmateddatamininig-140715072229-phpapp01.pptx
PPT
Cfbcgdhfghdfhghggfhghghgfhgfhgfhhapter11.PPT
PPTX
2.2.1 2jjkl;jljl;j;l;l;ll;jlkjkljl;jl.2.2.pptx
jkthsjlfd lectsdfdsfdsfdsfsdfdssfsure.pptx
Lectureerdjkldfgjkkjkjkjdfgjlmfdgdfgker.pptx
Lecturekjkljkljlkjknklnjkghvblkbbkbkjb.pptx
Lecturedsfndskfjdsklfjldsdsfdsgmjdflgmdflmg.pptx
Lecture (Additional)sdfjksjfkldsfsdf.pptx
Lecture (Additional)fghgfhdfghgfhgfhgfh.pptx
Analysdsdsdfgdfgdfgdfsgdfis of Data_2.pptx
Analysis of hgfhgfhgfjgfjmghjghjghData_1.pptx
VR_Unit-1_Lec(9)_B_3D_sdfdsfsdfScanner.ppt
nkllml;m;llkmlmljkjiuhihkjnklnjkhjgjk.ppt
Unit-4.-Chi-squjkljl;jj;ljl;jlm;lml;mare.pdf
15hjkljklj'jklj'kljkjkljkljkljkl95867.ppt
Lecture dsfgidsjfhjknflkdnkldnklnfklfndls.pptx
Chghjgkgyhbygukbhyvuhbbubnubuyuvyyvivlh06.ppt
lghjghgggkgjhgjghhjgjhgkhjghjghjghjghect1.ppt
inmlk;lklkjlk;lklkjlklkojhhkljkbjlkjhbtroDM.ppt
sequf;lds,g;'dsg;dlld'g;;gldgence - Copy.ppt
howweveautosdfdgdsfmateddatamininig-140715072229-phpapp01.pptx
Cfbcgdhfghdfhghggfhghghgfhgfhgfhhapter11.PPT
2.2.1 2jjkl;jljl;j;l;l;ll;jlkjkljl;jl.2.2.pptx
Ad

Recently uploaded (20)

PPTX
Cite It Right: A Compact Illustration of APA 7th Edition.pptx
PDF
African Communication Research: A review
PPTX
growth and developement.pptxweeeeerrgttyyy
PDF
Compact First Student's Book Cambridge Official
PPTX
Diploma pharmaceutics notes..helps diploma students
PDF
Review of Related Literature & Studies.pdf
PDF
FYJC - Chemistry textbook - standard 11.
PDF
Health aspects of bilberry: A review on its general benefits
PDF
faiz-khans about Radiotherapy Physics-02.pdf
PDF
BSc-Zoology-02Sem-DrVijay-Comparative anatomy of vertebrates.pdf
PDF
Diabetes Mellitus , types , clinical picture, investigation and managment
PPTX
Copy of ARAL Program Primer_071725(1).pptx
PPTX
Thinking Routines and Learning Engagements.pptx
PPT
hsl powerpoint resource goyloveh feb 07.ppt
PDF
Unleashing the Potential of the Cultural and creative industries
PPTX
Key-Features-of-the-SHS-Program-v4-Slides (3) PPT2.pptx
PPTX
4. Diagnosis and treatment planning in RPD.pptx
PDF
LATAM’s Top EdTech Innovators Transforming Learning in 2025.pdf
PPSX
namma_kalvi_12th_botany_chapter_9_ppt.ppsx
PDF
FAMILY PLANNING (preventative and social medicine pdf)
Cite It Right: A Compact Illustration of APA 7th Edition.pptx
African Communication Research: A review
growth and developement.pptxweeeeerrgttyyy
Compact First Student's Book Cambridge Official
Diploma pharmaceutics notes..helps diploma students
Review of Related Literature & Studies.pdf
FYJC - Chemistry textbook - standard 11.
Health aspects of bilberry: A review on its general benefits
faiz-khans about Radiotherapy Physics-02.pdf
BSc-Zoology-02Sem-DrVijay-Comparative anatomy of vertebrates.pdf
Diabetes Mellitus , types , clinical picture, investigation and managment
Copy of ARAL Program Primer_071725(1).pptx
Thinking Routines and Learning Engagements.pptx
hsl powerpoint resource goyloveh feb 07.ppt
Unleashing the Potential of the Cultural and creative industries
Key-Features-of-the-SHS-Program-v4-Slides (3) PPT2.pptx
4. Diagnosis and treatment planning in RPD.pptx
LATAM’s Top EdTech Innovators Transforming Learning in 2025.pdf
namma_kalvi_12th_botany_chapter_9_ppt.ppsx
FAMILY PLANNING (preventative and social medicine pdf)

1328cvkdlgkdgjfdkjgjdfgdfkgdflgkgdfglkjgld8679 - Copy.ppt

  • 1. Elective-I Examination Scheme- In semester Assessment: 30 End semester Assessment :70 Text Books: Data Mining Concepts andTechniques- Micheline Kamber Introduction to Data Mining with case studies-G.k.Gupta Reference Books: Mining the Web Discovering Knowledge from Hypertext data- Saumen charkrobarti Reinforcement and systemic machine learning for decision making- Parag Kulkarni
  • 2.  Data mining described  Need of data mining  Kinds of pattern and technologies  Issues in mining  KDD vs. Data Mining  Machine learning Concepts  OLAP  Knowledge Representation  Data Preproccesing- Cleaning,integration,Reduction,Transformation and Discretization  Application with mining aspect (Weather Prediction)
  • 3.  Data : Data are any facts, numbers, or text that can be processed by a computer.  operational or transactional data such as, sales, cost, inventory, payroll, and accounting  nonoperational data, such as industry sales, forecast data, and macro economic data  meta data - data about the data itself, such as logical database design or data dictionary definitions  Information:The patterns, associations, or relationships among all this data can provide information.
  • 4.  Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in terms of promotional efforts to provide knowledge of consumer buying behavior.  Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.  Data Warehouses: Data warehousing is defined as a process of centralized data management and retrieval.
  • 5. 5  The Explosive Growth of Data: from terabytes to petabytes  Data collection and data availability ▪ Automated data collection tools, database systems, Web, computerized society  Major sources of abundant data ▪ Business: Web, e-commerce, transactions, stocks, … ▪ Science: Remote sensing, bioinformatics, scientific simulation, … ▪ Society and everyone: news, digital cameras,YouTube  **We are drowning in data, but starving for knowledge! **  “Necessity is the mother of invention”—Data mining— Automated analysis of massive data sets
  • 6. Data mining- is the principle of sorting through large amounts of data and picking out relevant information. In other words…  Data mining (knowledge discovery from data)  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data  Other names  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • 7. Searching through large amounts of data for correlations, sequences, and trends. Current “driving applications” in sales (targeted marketing, inventory) and finance (stock picking) Sales data Sequence Classify Inference Cluster “70% of customers who purchase comforters later purchase curtains” Select information to be mined Choose mining tool (based on type of results wanted) Evaluate results
  • 11. KDD process includes  data cleaning (to remove noise and inconsistent data)  data integration (where multiple data sources may be combined)  data selection (where data relevant to the analysis task are retrieved from the database)  data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations)
  • 12.  data mining (an essential process where intelligent methods are applied in order to extract data patterns.  pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)  knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user) Data mining is a core of knowledge discovery process
  • 13. Knowledge Discovery (KDD) Process  Data mining—core of knowledge discovery process Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation
  • 14. 1. Data cleaning – to remove noise and inconsistent data 2. Data integration – to combine multiple source 3. Data selection – to retrieve relevant data for analysis 4. Data transformation – to transform data into appropriate form for data mining 5. Data mining 6. Evaluation 7. Knowledge presentation
  • 15.  Step 1 to 4 are different forms of data preprocessing  Although data mining is only one step in the entire process, it is an essential one since it uncovers hidden patterns for evaluation
  • 16.  Based on this view, the architecture of a typical data mining system may have the following major components:  Database, data warehouse, world wide web, or other information repository  Database or data warehouse server  Data mining engine  Pattern evaluation model  User interface
  • 20.  Transactional Databases  Advanced data and information systems  Object-oriented database  Temporal DB, Sequence DB andTime serious DB  Spatial DB  Text DB and Multimedia DB  … andWWW
  • 21. Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithm Other Disciplines Visualization
  • 22.  In general, data mining tasks can be classified into two categories: descriptive and predictive  Descriptive mining tasks characterize the general properties of the data in database  Predictive mining tasks performs inference on the current data in order to make predictions
  • 23.  Class Description: Characterization and Discrimination  Mining Frequent Patterns, Associations and correlations  Classification and Prediction  Cluster Analysis  Outlier Analysis  Evolution Analysis
  • 24.  Data Characterization:A data mining system should be able to produce a description summarizing the characteristics of customers.  Example:The characteristics of customers who spend more than $1000 a year at (some store called ) AllElectronics.The result can be a general profile such as age, employment status or credit ratings.
  • 25.  Data Discrimination: It is a comparison of the general features of targeting class data objects with the general features of objects from one or a set of contrasting classes. User can specify target and contrasting classes.  Example:The user may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by about 30% in the same duration.
  • 26. Frequent Patterns : as the name suggests patterns that occur frequently in data. AssociationAnalysis: from marketing perspective, determining which items are frequently purchased together within the same transaction. Example:An example is mined from the (some store) AllElectronic transactional database. buys (X, “Computers”)  buys (X, “software”) [Support = 1%, confidence = 50% ]  X represents customer  confidence = 50% , if a customer buys a computer there is a 50% chance that he/she will buy software as well.  Support = 1%, means that 1% of all the transactions under analysis showed that computer and software were purchased together.
  • 27.  Another example: Multidimensional rule:  Age (X, 20…29) ^ income (X, 20K-29K)  buys(X, “CD Player”) [Support = 2%, confidence = 60% ]  Customers between 20 to 29 years of age with an income $20000-$29000.There is 60% chance they will purchase CD Player and 2% of all the transactions under analysis showed that this age group customers with that range of income bought CD Player.
  • 28.  Classification is the process of finding a model that describes and distinguishes data classes or concepts..> this model is used to predict the class of objects whose class label is unknown.  Classification model can be represented in various forms such as  IF-THEN Rules  A decision tree  Neural network
  • 30.  Clustering analyses data objects without consulting a known class label.  Example: Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers.These clusters may represent individual target groups for marketing.
  • 31. The figure shows a 2-D plot of customers with respect to customer locations in a city.
  • 32.  Outlier Analysis: A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.  Example: detecting fraudulent usage of credit cards. Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase or the purchase frequency.
  • 33. Data mining includes many techniques from the domains below:  Statistics  Machine Learning  Database Systems and Data Warehouses  Information Retrieval  Visualization  High-performance computing
  • 34.  Statistics: It studies the collection, analysis, interpretation, and presentation of data.  Statistical research develops tools for prediction and forecasting using data.  Statistical methods can also be used to verify data mining results.
  • 35.  Information Retrieval: It is the science of searching for documents or information in documents…
  • 36.  Database Systems and Data Warehouses: This research focuses on the creation, maintenance, and use of databases for organizations and end users.
  • 37.  Machine Learning: It investigates how computers can learn or improve their performance based on data.
  • 38.  KDD (Knowledge Discovery in Databases) is a field of computer science that includes the tools and theories to help humans extract useful and previously unknown information (i.e., knowledge) from large collections of digitized data.  KDD consists of several steps, and Data Mining is one of them.
  • 39.  The KDD process deals with the mapping of low-level data into other forms that are more compact, abstract, and useful. This is achieved by creating short reports, modelling the process that generates the data, and developing predictive models that can predict future cases.  Data Mining: the application of a specific algorithm in order to extract patterns from data.
  • 40.  Although the two terms KDD and Data Mining are often used interchangeably, they refer to two related yet slightly different concepts. KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process that deals with identifying patterns in data. In other words, Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process.
  • 41.  Why preprocess the data?  Data cleaning  Data integration and transformation  Data reduction  Discretization  Summary
  • 42.  Data in the real world is dirty  incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data ▪ e.g., occupation=“”  noisy: containing errors or outliers ▪ e.g., Salary=“-10”  inconsistent: containing discrepancies in codes or names ▪ e.g., Age=“42”, Birthday=“03/07/1997” ▪ e.g., was rating “1, 2, 3”, now rating “A, B, C” ▪ e.g., discrepancy between duplicate records
  • 43.  No quality data, no quality mining results!  Quality decisions must be based on quality data ▪ e.g., duplicate or missing data may cause incorrect or even misleading statistics.  Data preparation, cleaning, and transformation make up the majority of the work in a data mining application (around 90%).
  • 44.  A well-accepted multi-dimensional view of data quality:  Accuracy  Completeness  Consistency  Timeliness  Believability  Value added  Accessibility
  • 45.  Data cleaning  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies  Data integration  Integration of multiple databases or files  Data transformation  Normalization and aggregation  Data reduction  Obtains a reduced representation in volume that produces the same or similar analytical results  Data discretization (for numerical data)
  • 46.  Why preprocess the data?  Data cleaning  Data integration and transformation  Data reduction  Discretization  Summary
  • 47.  Importance  “Data cleaning is the number one problem in data warehousing”  Data cleaning tasks  Fill in missing values  Identify outliers and smooth out noisy data  Correct inconsistent data  Resolve redundancy caused by data integration
  • 48.  Data is not always available  E.g., many tuples have no recorded values for several attributes, such as customer income in sales data  Missing data may be due to  equipment malfunction  data that was inconsistent with other recorded data and thus deleted  data not entered due to misunderstanding  certain data not being considered important at the time of entry  history or changes of the data not being registered
  • 49.  Noise: random error or variance in a measured variable.  Incorrect attribute values may be due to  faulty data collection instruments  data entry problems  data transmission problems  etc.  Other data problems that require data cleaning  duplicate records, incomplete data, inconsistent data
  • 50.  Binning method:  first sort data and partition into (equi-depth) bins  then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.  Clustering  detect and remove outliers  Combined computer and human inspection  detect suspicious values and check by human (e.g., deal with possible outliers)
  • 51.  Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
     Partition into (equi-depth) bins:
      ▪ Bin 1: 4, 8, 9, 15
      ▪ Bin 2: 21, 21, 24, 25
      ▪ Bin 3: 26, 28, 29, 34
     Smoothing by bin means (rounded to the nearest dollar):
      ▪ Bin 1: 9, 9, 9, 9
      ▪ Bin 2: 23, 23, 23, 23
      ▪ Bin 3: 29, 29, 29, 29
     Smoothing by bin boundaries:
      ▪ Bin 1: 4, 4, 4, 15
      ▪ Bin 2: 21, 21, 25, 25
      ▪ Bin 3: 26, 26, 26, 34
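The worked example on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, the number of values is assumed to divide evenly by the number of bins, and ties in boundary smoothing are assumed to go to the lower boundary.

```python
def equal_depth_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins  # assumes an even split
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value with its bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        # Ties go to the lower boundary (an assumption)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_bounds)
```

The exact means of bins 2 and 3 are 22.75 and 29.25; the slide shows them rounded to 23 and 29.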
  • 52.  Outliers: data points inconsistent with the majority of the data  Different kinds of outliers  Noisy: one’s age = 200, widely deviated points  Removal methods  Clustering  Curve-fitting
  • 53.  Why preprocess the data?  Data cleaning  Data integration and transformation  Data reduction  Discretization
  • 54.  Data integration:  combines data from multiple sources  Schema integration  integrate metadata from different sources  Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#  Detecting and resolving data value conflicts  for the same real-world entity, attribute values from different sources are different, e.g., different scales, metric vs. British units  Removing duplicates and redundant data
  • 55.  Smoothing: remove noise from data  Normalization: values are scaled to fall within a small, specified range (e.g., -1.0 to 1.0 or 0.0 to 1.0)  Attribute/feature construction  New attributes constructed from the given ones  Aggregation: summarization  Generalization: concept hierarchy climbing
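Normalization to a specified range can be illustrated with min-max scaling. The slide does not say which normalization method is meant, so min-max is an assumption here, and the function name and sample values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical; map them to the lower bound
        return [new_min for _ in values]
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


# Hypothetical attribute values
values = [200, 300, 400, 600, 1000]
unit_range = min_max_normalize(values)             # scaled to 0.0 .. 1.0
signed_range = min_max_normalize(values, -1.0, 1.0)  # scaled to -1.0 .. 1.0
print(unit_range)
print(signed_range)
```

Min-max scaling preserves the relative spacing of the original values, but a single extreme value compresses all the others into a narrow part of the range.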
  • 56. CS583, Bing Liu, UIC 56  Why preprocess the data?  Data cleaning  Data integration and transformation  Data reduction  Discretization  Summary
  • 58.  Data is too big to work with.  Data reduction  Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results  Data reduction strategies  Dimensionality reduction: remove unimportant attributes  Aggregation and clustering  Sampling
  • 59.  Feature selection (i.e., attribute subset selection):  Select a minimum set of attributes (features) that is sufficient for the data mining task.
  • 60.  Partition data set into clusters..
  • 61.  Why preprocess the data?  Data cleaning  Data integration and transformation  Data reduction  Discretization
  • 62.  Three types of attributes:  Nominal: values from an unordered set  Ordinal: values from an ordered set  Continuous: real numbers  Discretization:  divide the range of a continuous attribute into intervals, because some data mining algorithms only accept categorical attributes.  Some techniques:  Binning methods: equal-width, equal-frequency  Entropy-based methods
  • 63.  Discretization  reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values  Concept hierarchies  reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)
  • 64.  Data preparation is a big issue for data mining  Data preparation includes  Data cleaning and data integration  Data reduction and feature selection  Discretization  Many methods have been proposed, but this is still an active area of research.
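The two binning methods named on this slide differ in how they place interval boundaries. The following sketch contrasts them on a small skewed sample; the function names and the sample values are assumptions made for illustration.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin k, so clamp it into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bin indices so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        # Rank-based split; equal values may fall into different bins
        labels[i] = rank * k // len(values)
    return labels


# Hypothetical skewed sample
values = [1, 2, 3, 4, 5, 6, 50, 100]
by_width = equal_width_bins(values, 3)
by_frequency = equal_frequency_bins(values, 3)
print(by_width)
print(by_frequency)
```

With skewed data, equal-width binning puts most values into a single interval, while equal-frequency binning spreads them evenly across intervals.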
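Replacing numeric ages with the higher-level concepts mentioned on this slide can be sketched as a lookup against interval boundaries. The cut-off ages 35 and 60 are assumptions chosen only for illustration; the slide does not define where young, middle-aged, and senior begin.

```python
from bisect import bisect_right

# Assumed cut-offs: below 35 is young, 35 to 59 is middle-aged, 60+ is senior
AGE_CUTS = [35, 60]
AGE_LABELS = ["young", "middle-aged", "senior"]


def generalize_age(age):
    """Map a numeric age to its higher-level concept label."""
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]


ages = [22, 35, 59, 60, 71]
concepts = [generalize_age(a) for a in ages]
print(concepts)
```

After this step the attribute has three distinct values instead of one per age, which is the data reduction the slide describes.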