INTRODUCTION 1
 Data mining is the extraction, or ‘mining’, of knowledge from large amounts of data
 Also known as knowledge mining from data, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging
 Most popular synonym: Knowledge Discovery from Data (KDD)
 Data is available in huge amounts -> imminent need to turn it into useful information
 Applications: market analysis, fraud detection, customer retention, production
control, science exploration
2
 Data cleaning (remove noise and inconsistent data)
 Data integration (combine multiple data sources)
 Data selection (data relevant to the analysis task is retrieved from the database)
 Data transformation (data is transformed or consolidated into forms appropriate for
mining, e.g. by summary or aggregation operations)
 Data mining (methods are applied to extract data patterns)
 Pattern evaluation (identify the interesting patterns that represent knowledge, using
interestingness measures)
 Knowledge presentation (visualization and presentation of mined knowledge)
3
4
 Database, Data Warehouse, WWW, Information Repositories – one or a set of
databases, data warehouses or other information repositories. Data cleaning and
data integration are performed on the data.
 Database / Data Warehouse server – responsible for fetching the relevant data based
on the user’s request
 Knowledge base – the domain knowledge that guides the search. Includes
concept hierarchies used to organize attributes, and user beliefs
 Data mining engine – consists of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis.
 Pattern evaluation module – employs interestingness measures and interacts
with the data mining modules to focus the search towards interesting patterns
 User interface – lets the user specify a data mining query or task, provide information
to help focus the search, and perform exploratory data mining based on intermediate
data mining results.
5
 Relational Databases
 Data Warehouses
 Transactional Databases
 Advanced Data and Information Systems and Advanced Applications
 Object-Relational Database
 Temporal Database/Sequence Database and Time-Series Database
 Spatial Databases and Spatiotemporal Databases
 Text Databases and Multimedia Databases
 Heterogeneous Databases and Legacy Databases
 Data Streams
 World Wide Web
6
 No coupling
 DM system does not utilize any function of the DB/DW system.
 Fetches data from a source such as a file and stores the results in a different file
 Drawbacks
 Without a DB system, the DM system spends time searching, collecting and transforming data.
 The DM system cannot use the tested, scalable algorithms and data structures already
implemented in DB/DW systems
 The DM system needs another tool to extract data
 Loose coupling
 DM system uses some facilities of the DB system: it fetches data from the database,
performs data mining, and stores the results in a file or a designated place in the database
 Advantage
 Fetches data from the database using query processing and indexing
 Gains the flexibility and efficiency provided by the DB system.
 Disadvantage – mining does not exploit the data structures and query optimization methods
of the DB system
7
 Semi-tight coupling
 Besides linking the DM system to the DB system, the DB provides efficient implementations
of a few essential data mining primitives
 These include sorting, indexing, aggregation, histogram analysis, and pre-computation of
statistical measures like sum, count, min, max, standard deviation
 Enhances performance of the DM system since some frequently used results are pre-computed
 Tight coupling
 DM system is smoothly integrated into the DB system.
 Data mining queries and functions are optimized based on mining query analysis,
data structures, indexing schemes and query processing methods.
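The difference between loose and semi-tight coupling can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the `sales` and `mining_result` tables and their rows are made up, and summing amounts per item stands in for a real mining step.

```python
import sqlite3

# Hypothetical in-memory table standing in for the DB system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("pen", 2.0), ("pen", 4.0), ("ink", 10.0), ("ink", 14.0)],
)

# Loose coupling: the DB only fetches rows; the DM side aggregates them itself.
totals_loose = {}
for item, amount in conn.execute("SELECT item, amount FROM sales"):
    totals_loose[item] = totals_loose.get(item, 0.0) + amount

# Semi-tight coupling: the aggregation primitive runs inside the DB.
totals_semi_tight = dict(
    conn.execute("SELECT item, SUM(amount) FROM sales GROUP BY item")
)

# Loose coupling also stores the result in a designated place in the database.
conn.execute("CREATE TABLE mining_result (item TEXT, total REAL)")
conn.executemany("INSERT INTO mining_result VALUES (?, ?)", totals_loose.items())
conn.commit()
stored_rows = conn.execute("SELECT COUNT(*) FROM mining_result").fetchone()[0]
```

Both routes give the same totals; what changes is where the work happens, which is the point of the coupling levels.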
8
 Why preprocess the data?
 Incomplete (lacking attribute values)
 Noisy (containing errors or outliers)
 Inconsistent (containing discrepancies, e.g. in the department codes used to categorize items)
 Redundant (repetition of the same data)
 Descriptive data summarization helps in studying the general characteristics of
the data and identifies the presence of noise or outliers, which is useful for
successful data cleaning and data integration.
 Measures of central tendency – mean, median, weighted arithmetic mean, mode
 Measures of data dispersion – quartiles, interquartile range, variance
9
 A distributive measure is a measure that can be computed for a given data set by
partitioning the data into smaller subsets, computing the measure for each subset
and then merging the results to arrive at the measure’s value for the
original data set. Examples: sum, count, min, max.
 An algebraic measure is a measure that can be computed by applying an algebraic
function to one or more distributive measures. Example: mean = sum / count.
 A holistic measure is a measure that must be computed on the entire data set as a
whole. It cannot be computed by partitioning the given data into subsets and
merging the values obtained for the measure in each subset. Example: median.
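The three kinds of measure can be contrasted on a small partitioned data set. The sketch below uses made-up values and only the standard library; it shows that sum and count merge exactly across partitions, that the mean follows from them, and that merging per-partition medians does not in general give the true median.

```python
from statistics import median

# Made-up data set, split into partitions as a partitioned store might hold it.
partitions = [[4, 8, 15], [21, 21, 24], [25, 28, 34, 100]]
all_values = [v for part in partitions for v in part]

# Distributive: sum and count merge exactly from per-partition results.
partial_sums = [sum(part) for part in partitions]
partial_counts = [len(part) for part in partitions]
total_sum = sum(partial_sums)
total_count = sum(partial_counts)

# Algebraic: the mean is an algebraic function of two distributive measures.
mean_value = total_sum / total_count

# Holistic: the median of per-partition medians differs from the true median here.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)
```

On this data the merged sum is 280 and the mean is 28.0, while the median of medians (21) differs from the true median (22.5).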
10
 The degree to which numerical data tend to spread is called the dispersion or
variance of the data.
 The most common measures of dispersion are the range, five-number summary,
interquartile range and standard deviation.
 Popular graphs for displaying data summaries and dispersion include
histograms, quantile plots, q-q plots, scatter plots, loess curves.
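The summary measures named above can be computed with Python's `statistics` module. This is a minimal sketch on made-up values. Two choices here are assumptions and not something the slides specify: the quartile interpolation method (`"inclusive"`) and the 1.5 × IQR rule of thumb for flagging suspected outliers.

```python
from statistics import mean, median, multimode, pstdev, quantiles

# Made-up sample with one suspiciously large value.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

center = {"mean": mean(values), "median": median(values), "modes": multimode(values)}

# Quartiles by linear interpolation between data points ("inclusive" method).
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
five_number_summary = (min(values), q1, q2, q3, max(values))
value_range = max(values) - min(values)
std_dev = pstdev(values)  # population standard deviation

# Rule of thumb (an assumption here): flag values beyond 1.5 * IQR from the quartiles.
suspected_outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
```

Here the mean (58) sits above the median (54) because of the single large value, which is also the only value the IQR rule flags.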
11
12
 Data cleaning attempts to fill in missing values, smooth out noise, identify outliers
and correct inconsistencies
 Missing values
 Ignore the tuple
 Fill in the missing value manually
 Use a global constant to fill in the missing value
 Use the attribute mean to fill in the missing value
 Use the attribute mean for samples belonging to the same class as the given tuple
 Use the most probable value to fill in the missing value
 The most probable value can be determined with regression, decision-tree induction
or a Bayesian formalism
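Two of the strategies above, filling with the attribute mean and filling with the mean of the same class, can be sketched as follows. The tuples, the `income` attribute and the class labels are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical tuples: (class label, income); None marks a missing value.
rows = [("safe", 50.0), ("safe", None), ("safe", 70.0),
        ("risky", 20.0), ("risky", None), ("risky", 30.0)]

# Attribute mean over all tuples where the value is known.
attribute_mean = mean(v for _, v in rows if v is not None)

# Attribute mean within each class.
class_means = {
    label: mean(v for lbl, v in rows if lbl == label and v is not None)
    for label in {lbl for lbl, _ in rows}
}

filled_global = [(lbl, v if v is not None else attribute_mean) for lbl, v in rows]
filled_by_class = [(lbl, v if v is not None else class_means[lbl]) for lbl, v in rows]
```

The class-conditional fill gives 60.0 for the missing "safe" value and 25.0 for the missing "risky" value, whereas the global fill uses 42.5 for both.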
13
 Noisy data
 Binning
 Smooths a sorted data value by consulting its neighborhood, i.e. the values around it
 Performs local smoothing
 Smoothing by bin means – each value in a bin is replaced by the mean value of the bin
 Smoothing by bin medians – each value in a bin is replaced by the bin median
 Smoothing by bin boundaries – the min and max values of a bin are its boundaries and each
value in the bin is replaced by the closest bin boundary
 Regression
 Smooths the data by fitting it to a function
 Linear regression finds the best line to fit two attributes
 Multiple linear regression involves more than two attributes
 Clustering
 Outliers are detected through clustering, where similar values are organized into clusters
 Values falling outside the set of clusters may be considered outliers
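The three binning variants listed above can be sketched in a few lines. The values are made up, bins are assumed to be equal-frequency (the same number of sorted values per bin), and ties in boundary smoothing are assumed to go to the lower boundary.

```python
from statistics import mean, median

def equal_frequency_bins(sorted_values, bin_size):
    """Split sorted values into consecutive bins of bin_size values each."""
    return [sorted_values[i:i + bin_size] for i in range(0, len(sorted_values), bin_size)]

def smooth_by_means(bins):
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Each value moves to the nearer of the bin's min and max (ties go to the min).
    # Bins are sorted, so b[0] is the min and b[-1] is the max.
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])  # made-up values
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With these values the first bin `[4, 8, 15]` becomes `[9, 9, 9]` under mean smoothing and `[4, 4, 15]` under boundary smoothing.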
14
 Data integration
 The entity identification problem is the matching of equivalent real-world entities from
multiple data sources
 Correlation analysis measures how strongly one attribute implies the other
 Data transformation
 Smoothing – binning, regression, clustering
 Aggregation
 Generalization – low-level data is replaced by higher-level concepts through the use of
concept hierarchies
 Normalization – data is scaled to fall within a small specified range
 Min-Max method
 Z-score normalization
 Decimal scaling
 Attribute construction
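The three normalization methods named above can be written out directly. The sketch uses the usual textbook formulas: min-max maps v to (v - min) / (max - min) scaled into a new range, z-score maps v to (v - mean) / standard deviation, and decimal scaling divides by the smallest power of ten that brings every absolute value below 1. The function names and sample values are made up, and the input is assumed to contain at least two distinct values so no division by zero occurs.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j for the smallest j that makes every |v| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]          # made-up values
scaled = min_max(incomes)
standardized = z_score([2, 4, 4, 4, 5, 5, 7, 9])
shifted = decimal_scaling([-986, 917])
```

Min-max keeps the original ordering and bounds the result, z-score is less sensitive to an unknown or drifting min and max, and decimal scaling only moves the decimal point.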
15
 Data reduction is applied to obtain a reduced representation of the data set
 Data cube aggregation
 Attribute subset selection reduces the data size by removing irrelevant or redundant
attributes.
 Dimensionality reduction involves data encoding or transformation to obtain
compressed data. Lossy dimensionality reduction methods include wavelet transforms and
principal component analysis
 Numerosity reduction
 Parametric methods use a model to estimate the data, e.g. the log-linear model
 Nonparametric methods include histograms, clustering and sampling for storing a reduced
representation
 Discretization and concept hierarchy generation reduce the number of values for a given
attribute by dividing the range of the attribute into intervals.
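Discretization by dividing an attribute's range into intervals can be sketched with equal-width intervals. Equal width is an assumption made for illustration (the slide does not name a particular scheme), the `ages` values are made up, and the attribute is assumed to take at least two distinct values.

```python
def equal_width_discretize(values, k):
    """Map each value to the index (0..k-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in interval k, so clamp it into the last one.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 45, 52, 70]  # made-up values
age_groups = equal_width_discretize(ages, 3)
```

Fourteen distinct ages are reduced to three interval labels, which could then be named (for example "young", "middle-aged", "senior") to form one level of a concept hierarchy.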
16
Data mining tasks fall into two categories: descriptive and predictive
Descriptive mining tasks characterize the general properties of the data
Predictive mining tasks perform inference on the current data in order to make predictions
17
 Data characterization is a summarization of the general characteristics or
features of a target class of data.
 Data corresponding to the user-specified class is typically collected by a database query
 Example: to study the characteristics of software products whose sales increased by 10%,
data related to those products is collected
 The data cube OLAP roll-up operation is used for data summarization
 Output is presented in forms such as pie charts and histograms
 Data discrimination is a comparison of the general features of target class data
objects with the general features of objects from one or a set of contrasting classes.
 Example: comparison of products whose sales increased by 10% with products
whose sales decreased by 30%
18
 Classification is the process of finding a model that describes and distinguishes data
classes or concepts, so that the model can be used to predict the
class of objects whose class label is unknown.
 Classifying a loan as ‘safe’ or ‘risky’
 Given a customer profile, predicting whether the customer will buy a new computer
 Decision tree induction
 Bayesian classification
 Rule-based classification
 Classification by backpropagation
 Support vector machines
 Classification by association rule analysis
19
 Prediction models continuous-valued functions. Numeric prediction is the task of
predicting continuous values for a given input.
 Regression analysis is a statistical methodology that is often used for numeric
prediction
 Linear/straight-line regression involves a response variable, y, and a single
predictor variable, x. It models y as a linear function of x: y = b + wx
 Multiple linear regression extends straight-line regression to involve more than
one predictor variable
 Nonlinear regression, such as polynomial regression, adds polynomial terms to the model
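A straight-line model y = b + wx can be fitted with the method of least squares, which is assumed here since the slide does not say how b and w are estimated. The data points and the `fit_line` helper are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of b and w in y = b + w*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    w = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    b = y_bar - w * x_bar
    return b, w

years = [1, 2, 3, 4, 5]          # made-up predictor values
salary = [32, 35, 38, 41, 44]    # made-up responses lying exactly on y = 29 + 3x
b, w = fit_line(years, salary)
predicted = b + w * 6            # numeric prediction for an unseen x
```

Because the sample points lie exactly on a line, the fit recovers b = 29 and w = 3 and predicts 47 for x = 6; real data would leave residuals.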
20
 The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
 A cluster is a collection of data objects that are similar to one another within the
same cluster and dissimilar to the objects in other clusters.
 Class labels are not present in the training data because they are not known to begin
with. Clustering can be used to generate such labels
 Applications: taxonomy formation (organization of observations into a hierarchy of
classes that group similar events together)
21
 Partitioning method
 A partitioning method creates k partitions of a database of n objects or data tuples
 Requirements
 Each group must contain at least one object
 Each object must belong to exactly one group
 Objects in the same cluster are close or related to each other whereas objects of different
clusters are far apart or very different
 k-means algorithm, where each cluster is represented by the mean value of its objects
 k-medoids algorithm, where each cluster is represented by one of the objects located near
the center of the cluster.
 Works well for small to medium-sized databases
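The k-means idea above can be sketched on one-dimensional data. This is a simplified version under assumptions: the initial centroids are passed in by hand, the distance is the absolute difference, and an empty cluster simply keeps its previous centroid (which a full implementation would handle differently, since each group must contain at least one object).

```python
def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps on 1-D points until the means settle."""
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each mean moves to the average of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # made-up 1-D data
centroids, clusters = k_means(points, centroids=[1.0, 2.0])
```

Even from a poor starting guess the two means settle at 2.0 and 11.0 on this data; in general the result depends on the initial centroids.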
22
 Hierarchical method
 Creates a hierarchical decomposition of the given set of data objects.
 Classified by how the hierarchical decomposition is formed
 Agglomerative/bottom-up approach starts with each object in its own group and merges objects
or groups that are close to one another, until all the groups are merged into one
 Divisive/top-down approach starts with all of the objects in the same cluster and splits it into
smaller clusters until eventually each object is in its own cluster
 Density-based method
 Can discover clusters of arbitrary shape
 Used to filter out noise
 Grid-based method
 Quantizes the object space into a finite number of cells that form a grid structure.
 Fast processing time
 Model-based clustering
 Hypothesizes a model for each cluster and finds the best fit of the data to the given
model
 Locates clusters by constructing a density function that reflects the spatial distribution
of the data
 Automatically determines the number of clusters based on standard statistics
 Example: self-organizing maps
23
 Clustering high-dimensional data
 Examines objects having a large number of features (dimensions)
 Subspace clustering methods search for clusters in subspaces of the data
 Frequent pattern-based clustering extracts distinct frequent patterns among subsets of
dimensions that occur frequently
 Constraint-based clustering
 Performs clustering by incorporating user-specified constraints
 A constraint expresses a user’s expectations or desired results
 Example: spatial clustering in the presence of obstacles and clustering under
user-specified constraints
24
 Outliers are data objects that do not comply with the general behavior or model of the data
 They are discarded as noise by most data mining applications. However, in applications
like fraud detection they are worth noting. Example: fraudulent usage of credit cards can
be uncovered by detecting purchases of extremely large amounts on a given day
 Outliers may be detected using statistical tests that assume a probability model for the
data, or using distance measures, where objects that are a substantial distance from any
cluster are considered outliers.
 Evolution analysis describes and models regularities or trends for objects whose
behavior changes over time.
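The distance-based view of outliers can be sketched with a simple neighbor count. This is a simplified stand-in, not the cluster-based procedure itself: a value is flagged when fewer than `min_neighbors` other values lie within `radius` of it. The amounts, the function name and both thresholds are made up.

```python
def distance_outliers(values, radius, min_neighbors):
    """Flag values with fewer than min_neighbors other values within radius."""
    outliers = []
    for i, v in enumerate(values):
        neighbors = sum(
            1 for j, u in enumerate(values) if j != i and abs(u - v) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(v)
    return outliers

daily_purchases = [25.0, 40.0, 32.0, 18.0, 2900.0, 35.0, 27.0]  # made-up amounts
flagged = distance_outliers(daily_purchases, radius=50.0, min_neighbors=2)
```

Only the 2900.0 purchase is flagged here. The pairwise comparison is quadratic in the number of values, so this form suits small data sets only.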
25
 Stream data is massive, temporally ordered, fast changing and potentially infinite.
 Stream data flows in and out of a computer system continuously and with varying
update rates.
 Examples – real-time surveillance systems, communication networks, internet
traffic, on-line transactions in financial markets or the retail industry, electric power
grids, industrial production processes and other dynamic environments.
 It is often impossible to store an entire data stream or to scan it more than once.
Moreover, stream data tends to be at a rather low level of abstraction.
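Since a stream cannot be stored in full, summaries have to be maintained in a single pass with bounded memory. The sketch below keeps a running mean and variance using Welford's update. The choice of technique is an assumption made for illustration, and `sensor_stream` is a hypothetical finite stand-in for a real, unbounded source.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's method) in constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0

def sensor_stream():
    # Stand-in for an unbounded source; a real stream would not terminate.
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)   # each reading is seen once and then discarded
```

Only three numbers are kept no matter how many readings arrive, which is the kind of constraint stream mining has to work within.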
26
 Mining time-series data
 A time-series database consists of sequences of values obtained over repeated measurements
of time.
 Time-series databases are popular in stock-market analysis, economic and sales
forecasting, budgetary analysis, utility studies, yield studies, work-load projections,
observation of natural phenomena
 Mining sequence patterns
 A sequence database consists of sequences of ordered elements or events, recorded with or
without a concrete notion of time. Sequential pattern mining is the discovery of
frequently occurring ordered events or subsequences as patterns.
 Applications include customer shopping sequences, web clickstreams, biological sequences,
sequences of events in science and engineering.
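The notion of a frequently occurring ordered pattern can be made concrete by counting how many sequences contain a candidate pattern in order. This sketch makes simplifying assumptions: each element is a single event and not a set of items, gaps between matched events are allowed, and the shopping sequences and the 0.5 support threshold are made up.

```python
def is_subsequence(pattern, sequence):
    """True if the pattern's events occur in the sequence in order (gaps allowed)."""
    events = iter(sequence)
    # Each membership test consumes the iterator up to and including the match,
    # so later pattern events must appear after earlier ones.
    return all(event in events for event in pattern)

def support(pattern, database):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database) / len(database)

# Made-up customer shopping sequences, one list of purchases per customer.
database = [
    ["laptop", "mouse", "printer"],
    ["laptop", "printer"],
    ["mouse", "laptop"],
    ["laptop", "bag", "mouse", "printer"],
]
laptop_then_printer = support(["laptop", "printer"], database)
mouse_then_laptop = support(["mouse", "laptop"], database)
frequent = laptop_then_printer >= 0.5
```

Order matters: "laptop then printer" is supported by three of the four customers, while "mouse then laptop" is supported by only one.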
27

Introduction to data mining

  • 2.  Extraction or ‘mining’ of large amount of data  Also known as knowledge mining from data / knowledge extraction / data or pattern analysis / data archaeology / data dredging  Most popular – Knowledge Discovery from Data (KDD)  Data available in huge amount -> Imminent need for turning into useful info  Application – market analysis, fraud detection, customer retention, production control, science exploration 2
  • 3.  Data cleaning (remove noise and inconsistent data)  Data integration (combine multiple data sources)  Data selection (relevant data is retrieved from database)  Data transformation (data is transformed or consolidated by mining/aggregation)  Data mining (extraction of data patterns)  Pattern evaluation (identifying interesting patterns representing knowledge using interestingness measures)  Knowledge presentation (visualization and presentation of mined knowledge) 3
  • 4. 4
  • 5.  Database, Data Warehouses, WWW, Information Repositories – It may be a set of databases/warehouses or any other information repositories. Data cleaning and data integration is performed.  Database / Data Warehouse servers – responsible for fetching relevant data based on user’s request  Knowledge base – it’s the domain knowledge that guides the search. Includes concept hierarchies used to organize attributes, user believes  Data mining engine – consist of functional modules for task such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis.  Pattern evaluation module – employs interestingness measures and interactive with data mining modules to focus the search towards interesting patterns  User interface – user specifies a data mining query or task, providing information to help focus search and perform exploratory data mining based on intermediate data mining results. 5
  • 6.  Relational Databases  Data Warehouses  Transactional Databases  Advanced Data and Information Systems and Advanced Application  Object-Relational Database  Temporal Database/Sequence Database and Time-Series Database  Spatial Databases and Spatiotemporal Databases  Text Databases and Multimedia Databases  Heterogeneous Databases and Legacy Databases  Data Streams  World Wide Web 6
  • 7.  No coupling  DM system does not utilize any function of DB/DW.  Fetches data from source and stores result in different file  Drawbacks  Without a DB system, a DM system spends time in searching, collecting, transforming data.  DM systems doesn’t have any tested, scalable algorithm or data structure implemented  DM systems needs another tool to extract data  Loose coupling  DM system will use some feature of DB system like fetching data, performing data mining and storing the results in a file/place in database  Advantage  Fetch data from database using query processing, indexing  Has advantages of flexibility, efficiency by the system.  Disadvantage – mining does not explore data structure/query optimization methods 7
  • 8.  Semi-tight coupling  Linking of DM system to DB system and efficient implementation of a few essential data mining primitives is provided by DB  Includes sorting, indexing, aggregation, histogram analysis, pre-computation of statistical measures like sum, count, min-max, standard deviation  Enhances performance of DM system since some frequently used results is pre-computed  Tight coupling  DM system is smoothly integrated into DB system.  data mining queries and functionalities are optimized based on mining query analysis, data structure, indexing schemes and query processing methods. 8
  • 9.  Why preprocess the data?  Incomplete (lacking attribute values)  Noisy (containing errors or outliers)  Inconsistent (containing discrepancies in department codes used to categorize them)  Redundancy (repetition of the same data)  Descriptive Data Summarization helps in the study of general characteristics of the data and identifies the presence of noise or outliers which is useful for successful for cleaning and data integration.  Measures of central tendency – mean, median, weighted arithmetic mean, mode  Measure of data dispersion – quartiles, interquartile range, variance 9
  • 10.  A distributive measure is a measure that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset and then merging the result in order to arrive at the measure’s value for the original dataset.  An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures.  A holistic measure is a measure that must be computed on the entire data set as a whole. It cannot be computed by partitioning the given data into subsets and merging the values obtained for the measure in each subset. 10
  • 11.  The degree to which the numerical data tend to spread is called dispersion or variance of the data.  Most common measure of dispersion are range, five-number summary, inter quartile range, standard deviation.  For displaying the data summary and dispersion popular graphs include – histograms, quantile plots, q-q plots, scatter plots, loess curves. 11
  • 12. 12
  • 13.  Data cleaning tends to fill missing values, smooth out noise, identify outliers, correct inconsistencies  Missing values  Ignore the tuple  Fill the missing value manually  Use a global constant to fill the missing value  Use the architecture mean to fill the missing value  Use the attribute mean for samples belonging to the same class as the given tuple  Use the most probable value to fill the missing value  Use regression, decision-tree induction, Bayesian formation 13
  • 14.  Noisy data  Binning  Consults the neighboring value  Performs local smoothing  Smoothing by bin means – each value of bin is replaced by mean value of the bin  Smoothing by bin median – each value of the bin is replaced by bin median  smoothing by bin boundaries – max and min value of bin is bin boundary and each value of bin is replaced by the closest bin boundary  Regression  Filters the data into functions  Linear regression finds the best line to fit two attributes  Multiple regression involves more than two variables  Clustering  Outliers is detected through clustering where similar values are organized into clusters  Values falling off the set is outlier 14
  • 15.  Data integration  Entity identification problem is matching of equivalent real-world entries from multiple data sources  Correlation analysis measures how strong one attribute implies the other  Data transformation  Smoothing – binning, regression, clustering  Aggregation  Generalization – low level data is replaced by higher level concept through the use of concept hierarchy  Normalization – data is scaled to fall within a small specified range  Min-Max method  Z-score normalization  Decimal scaling  Attribute construction 15
  • 16.  Applied to obtain a reduced representation of data set  Data cube aggregation  Attribute subset selection reduces the data size by removing irrelevant or redundant attribute.  Dimensionality reduction involves data encoding or transformation to obtain compressed data. Lossy dimensionality reduction – wavelet transform, principal component analysis  Numerosity reduction  Parametric methods use a model to estimate data ex. Log-Linear model  Nonparametric method include histogram, clustering and sample for storing reduced representation  Discretization and concept hierarchy reduces the number of values for a given attribute by dividing the range of the attribute into intervals. 16
  • 17. DM task is divided into two categories: descriptive and predictive Descriptive mining task characterizes general properties of the data Predictive mining task performs inference on current data in order to make predictions 17
  • 18.  Data characterization is summarization of the general characterization or features of the target class of data.  Data corresponding to user specific class are typically collected by database query  Example: to study the characteristics of software products whose sales increased by 10%, data related to the product is collected  Data cube OLAP roll-up operation is used for data summarization  Output is presented in the form of pie charts, histogram  Data discrimination is comparison of the general features of target class data objects with general features of the object from one or a set of contrasting class.  Example: comparison of a product whose sales increased by 10% with that of a product whose sales decreased by 30% 18
  • 19.  Classification is the process of finding a model that describes or distinguishes data classes or concepts for the purpose of being able to use the model to predict the class of object whose class label is unknown.  Classifying loan as ‘safe’ or ‘risky’  Given a customer profile, guess whether he will buy a new computer  Decision tree induction  Bayesian classification  Rule-based classification  Classification by backpropogation  Support vector machines  Classification by association rule analysis 19
  • 20.  Prediction models continuous-valued functions. Numeric prediction is the task of predicting continuous values for the given input.  Regression analysis is a statistical methodology that is often used for numeric prediction  Linear/straight-line regression involves a response variable, y, and a single predictor variable, x. It models y as a linear function of x. [y=b+wx]  Multiple linear regression extends straight-line regression to model more than one predictor variable  Nonlinear regression can be modeled by adding polynomial terms
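The straight-line model y=b+wx on slide 20 can be fitted as sketched below. The slide does not say how b and w are estimated; this sketch assumes ordinary least squares, and the name `fit_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b, w) for y = b + w*x using ordinary least squares (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    # The fitted line passes through the point of means.
    b = mean_y - w * mean_x
    return b, w


# Hypothetical data generated from y = 1 + 2x, so the fit should recover b=1, w=2.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b, w = fit_line(xs, ys)
prediction = b + w * 6  # numeric prediction for an unseen input x = 6
```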
  • 21.  The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.  A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.  Class labels are not present in the training data because they are not known to begin with. Clustering is used to generate such labels  Applications: taxonomy (organization of observations into a hierarchy of classes that group similar events together)
  • 22.  Partitioning method  A partitioning method creates k partitions of a database of n objects or data tuples  Requirements  Each group must contain at least one object  Each object must belong to exactly one group  Objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different  k-means algorithm, where each cluster is represented by the mean value of its objects  k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.  Works well for small to medium databases
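The k-means idea from slide 22 can be sketched as follows. This is an illustrative version under stated assumptions: Euclidean distance, the first k points as initial centroids, and a hypothetical function name `k_means`. None of these choices come from the slides.

```python
import math


def k_means(points, k, max_iters=100):
    """Partition points into k clusters, each represented by its mean."""
    # Naive initialisation (assumption): use the first k points as centroids.
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each object joins exactly one cluster, the nearest one.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D objects forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, 2)
```

This sketch does not enforce the requirement that every group contains at least one object; a fuller implementation would re-seed any cluster that becomes empty.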
  • 23.  Hierarchical method  Creates a hierarchical decomposition of the given set of data objects.  Classified according to how the hierarchical decomposition is formed  Agglomerative/Bottom-up approach merges objects or groups that are close to one another, until all the groups are merged into one  Divisive/Top-down approach starts with all of the objects in the same cluster. It splits them into smaller clusters until eventually each object is in its own cluster  Density-based method  Can easily determine clusters of arbitrary shape  Used to filter out noise  Grid-based method  Quantizes the object space into a finite number of cells that form a grid structure.  Faster processing  Model-based clustering  Hypothesizes a model for each cluster and finds the best fit of the data to the given model  Locates clusters by constructing a density function that reflects the spatial distribution of the data  Automatically determines the number of clusters based on standard statistics  Example: self-organizing maps
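The agglomerative (bottom-up) approach on slide 23 can be sketched as below. The slide does not name a distance between groups; this sketch assumes single linkage on one-dimensional values, and the `target_clusters` stopping parameter and function name are hypothetical additions for illustration.

```python
def agglomerative(values, target_clusters=1):
    """Merge the two closest groups repeatedly until target_clusters remain."""
    # Start with every object in its own group.
    clusters = [[v] for v in values]

    def linkage(a, b):
        # Single linkage (assumption): distance between the closest members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > target_clusters:
        # Find the pair of groups with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical one-dimensional objects.
values = [1.0, 1.5, 5.0, 5.2, 9.0]
three_groups = agglomerative(values, target_clusters=3)
one_group = agglomerative(values)  # merge until all groups are one, as on the slide
```

Stopping early at a chosen number of groups corresponds to cutting the hierarchy at one level; the default of 1 follows the slide's description of merging until a single group remains.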
  • 24.  Clustering high-dimensional data  Examines objects having a large number of features  Subspace clustering methods search for clusters in subspaces  Frequent pattern-based clustering extracts distinct frequent patterns among subsets of dimensions that occur frequently  Constraint-based clustering  Performs clustering by incorporating user-specified constraints  A constraint expresses a user’s expectations or desired results  Example: spatial clustering in the presence of obstacles and clustering under user-specified constraints
  • 25.  Outliers are data that do not comply with the general behavior or model of the data  They are discarded by most data mining applications. However, in applications like fraud detection, they are worth noting. Example: detecting fraudulent usage of credit cards by spotting purchases of extremely large amounts on a given day  Outliers may be detected by using a statistical test that assumes a probability model, or by using distance measures, where objects that are a substantial distance from any other cluster are considered outliers.  Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
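The distance-based detection mentioned on slide 25 can be sketched as follows. This is one simple reading of the idea, assuming an object is an outlier when too few other objects lie within a given radius; the names `distance_outliers`, `radius` and `min_neighbors` and the purchase amounts are hypothetical.

```python
def distance_outliers(values, radius, min_neighbors):
    """Return values that have fewer than min_neighbors other values within radius."""
    outliers = []
    for i, v in enumerate(values):
        # Count the other objects that lie close to this one.
        neighbors = sum(
            1 for j, other in enumerate(values) if j != i and abs(v - other) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(v)
    return outliers


# Hypothetical credit card purchase amounts for one day.
purchases = [20, 35, 25, 40, 30, 5000, 28]
suspicious = distance_outliers(purchases, radius=50, min_neighbors=1)
```

The thresholds here are picked by hand for the example; in practice they depend on the data and the application.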
  • 26.  Stream data is massive, temporally ordered, fast changing and potentially infinite.  Stream data flows in and out of a computer system continuously and with varying update rates.  Examples – real-time surveillance systems, communication networks, internet traffic, on-line transactions in financial markets or the retail industry, electric power grids, industrial production processes and other dynamic environments.  It is impossible to store an entire data stream. Moreover, stream data tends to be at a rather low level of abstraction.
  • 27.  Mining time-series data  A time-series database consists of sequences of values obtained over repeated measurements of time.  Time-series databases are popular in stock-market analysis, economic and sales forecasting, budgetary analysis, utility studies, yield studies, work-load projections and observation of natural phenomena  Mining sequence patterns  A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time. Sequential pattern mining is the discovery of frequently occurring ordered events or subsequences as patterns.  Applications include customer shopping sequences, web clickstreams, biological sequences, and sequences of events in science and engineering.
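The sequential pattern mining idea on slide 27 can be sketched for the smallest case, ordered pairs of events. This is a simplified illustration rather than a full algorithm; the function name `frequent_ordered_pairs`, the `min_support` threshold and the shopping sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_ordered_pairs(sequences, min_support):
    """Count in how many sequences event a occurs before event b; keep frequent pairs."""
    support = Counter()
    for seq in sequences:
        # combinations() keeps positional order, so (a, b) means a precedes b.
        # The set ensures each pattern is counted once per sequence.
        support.update(set(combinations(seq, 2)))
    return {pair: count for pair, count in support.items() if count >= min_support}


# Hypothetical customer shopping sequences, one list per customer.
shopping = [
    ["laptop", "mouse", "bag"],
    ["laptop", "bag"],
    ["mouse", "laptop", "bag"],
    ["laptop", "mouse"],
]
patterns = frequent_ordered_pairs(shopping, min_support=2)
```

Here a pattern such as ("laptop", "bag") reads as "a laptop purchase is later followed by a bag purchase", and its support is the number of customers whose sequence contains it.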

Editor's Notes

  • #4: Steps 1-4 are different forms of data preprocessing
  • #6: From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP).