Introduction to data mining

 Extraction or ‘mining’ of knowledge from large amounts of data
 Also known as knowledge mining from data / knowledge extraction / data or
pattern analysis / data archaeology / data dredging
 Most popular synonym – Knowledge Discovery from Data (KDD)
 Data is available in huge amounts -> imminent need to turn it into useful information
 Applications – market analysis, fraud detection, customer retention, production
control, science exploration

 Data cleaning (remove noise and inconsistent data)
 Data integration (combine multiple data sources)
 Data selection (data relevant to the analysis task is retrieved from the database)
 Data transformation (data is transformed or consolidated into forms appropriate for mining, e.g. by summary or aggregation operations)
 Data mining (extraction of data patterns)
 Pattern evaluation (identifying interesting patterns representing knowledge using
interestingness measures)
 Knowledge presentation (visualization and presentation of mined knowledge)

 Database, Data Warehouses, WWW, Information Repositories – one or a set of
databases/warehouses or any other information repositories. Data cleaning and
data integration are performed on this data.
 Database / Data Warehouse servers – responsible for fetching relevant data based
on the user’s request
 Knowledge base – the domain knowledge that guides the search. Includes
concept hierarchies used to organize attributes, and user beliefs
 Data mining engine – consists of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis.
 Pattern evaluation module – employs interestingness measures and interacts
with the data mining modules to focus the search towards interesting patterns
 User interface – lets the user specify a data mining query or task, provide information
to help focus the search, and perform exploratory data mining based on intermediate
data mining results.

 Relational Databases
 Data Warehouses
 Transactional Databases
 Advanced Data and Information Systems and Advanced Applications
 Object-Relational Database
 Temporal Database/Sequence Database and Time-Series Database
 Spatial Databases and Spatiotemporal Databases
 Text Databases and Multimedia Databases
 Heterogeneous Databases and Legacy Databases
 Data Streams
 World Wide Web

 No coupling
 The DM system does not utilize any function of a DB/DW system.
 It fetches data from a particular source (such as a file system) and stores the results in another file
 Drawbacks
 Without a DB system, the DM system spends substantial time searching, collecting and transforming data.
 The DM system cannot reuse the tested, scalable algorithms and data structures implemented in DB/DW systems
 The DM system needs other tools to extract data
 Loose coupling
 The DM system uses some facilities of the DB system: it fetches data from the database,
performs data mining and stores the results in a file or in a designated place in the database
 Advantage
 Fetches data from the database using query processing and indexing
 Gains the flexibility and efficiency provided by the DB system.
 Disadvantage – mining does not exploit the data structures and query optimization methods of the DB system

 Semi-tight coupling
 Besides linking the DM system to the DB system, efficient implementations of a few essential data
mining primitives are provided by the DB system
 These primitives include sorting, indexing, aggregation, histogram analysis and pre-computation of
statistical measures such as sum, count, max, min and standard deviation
 Enhances performance of the DM system since some frequently used results are pre-computed
 Tight coupling
 The DM system is smoothly integrated into the DB system.
 Data mining queries and functions are optimized based on mining query analysis,
data structures, indexing schemes and query processing methods.
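As a rough illustration of the difference between loose and semi-tight coupling, the sketch below uses Python's built-in sqlite3 module. The `sales` table, its columns and its rows are invented for this example. In the loosely coupled variant the mining code only fetches rows and aggregates them itself; in the semi-tight variant the database computes the aggregate.

```python
import sqlite3

def make_demo_db():
    """Create an in-memory database with a small, made-up sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("pen", 2.0), ("book", 12.0), ("pen", 3.0), ("book", 8.0)],
    )
    return conn

def totals_loose(conn):
    """Loose coupling: fetch raw rows, aggregate inside the mining program."""
    totals = {}
    for item, amount in conn.execute("SELECT item, amount FROM sales"):
        totals[item] = totals.get(item, 0.0) + amount
    return totals

def totals_semi_tight(conn):
    """Semi-tight coupling: let the database compute the aggregate."""
    rows = conn.execute("SELECT item, SUM(amount) FROM sales GROUP BY item")
    return dict(rows)

conn = make_demo_db()
loose = totals_loose(conn)
semi_tight = totals_semi_tight(conn)
```

Both variants return the same totals; the difference is where the work is done.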

 Why preprocess the data?
 Incomplete (lacking attribute values)
 Noisy (containing errors or outliers)
 Inconsistent (containing discrepancies, e.g. in the department codes used to categorize items)
 Redundant (repeating the same data)
 Descriptive data summarization helps in the study of the general characteristics of
the data and identifies the presence of noise or outliers, which is useful for
successful data cleaning and data integration.
 Measures of central tendency – mean, median, weighted arithmetic mean, mode
 Measures of data dispersion – quartiles, interquartile range, variance
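The measures of central tendency listed above can be computed with Python's standard statistics module. This is a minimal sketch; the salary values and the `weighted_mean` helper are illustrative and not part of the original slides.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Made-up salaries (in thousands), used only for illustration.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(salaries)        # arithmetic mean
median = statistics.median(salaries)    # average of the two middle values here
modes = statistics.multimode(salaries)  # all most frequent values
wmean = weighted_mean([1, 2, 3], [1, 1, 2])
```

Note that the single large value (110) pulls the mean above the median, which is one reason the median is often preferred for skewed data.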

 A distributive measure is a measure that can be computed for a given data set by
partitioning the data into smaller subsets, computing the measure for each subset
and then merging the results in order to arrive at the measure’s value for the
original data set. Examples: sum, count, min, max.
 An algebraic measure is a measure that can be computed by applying an algebraic
function to one or more distributive measures. Example: the mean, computed as sum / count.
 A holistic measure is a measure that must be computed on the entire data set as a
whole. It cannot be computed by partitioning the given data into subsets and
merging the values obtained for the measure in each subset. Example: the median.
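The three kinds of measures can be contrasted on a small partitioned data set. In this sketch the partitions and helper names are invented: the mean is rebuilt exactly from per-partition sums and counts, while merging per-partition medians fails to reproduce the true median.

```python
import statistics

def partial_sum_count(chunk):
    """Distributive pieces: each partition reports its own sum and count."""
    return sum(chunk), len(chunk)

def merged_mean(chunks):
    """Algebraic measure: mean = (sum of partial sums) / (sum of partial counts)."""
    parts = [partial_sum_count(c) for c in chunks]
    total = sum(s for s, _ in parts)
    count = sum(n for _, n in parts)
    return total / count

chunks = [[1, 2, 3], [10, 20], [4]]           # made-up partitions of one data set
all_values = [x for c in chunks for x in c]

mean_from_parts = merged_mean(chunks)
true_median = statistics.median(all_values)
# Merging per-partition medians does not, in general, give the true median.
median_of_medians = statistics.median(statistics.median(c) for c in chunks)
```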

 The degree to which numerical data tend to spread is called the dispersion or
variance of the data.
 The most common measures of dispersion are the range, the five-number summary, the
interquartile range and the standard deviation.
 Popular graphs for displaying data summaries and dispersion include
histograms, quantile plots, q-q plots, scatter plots and loess curves.
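The dispersion measures named above can be sketched with the standard statistics module. The data values are made up, and the choice of the "inclusive" quartile method is an assumption; other quartile conventions give slightly different Q1 and Q3.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # made-up prices

summary = five_number_summary(data)
value_range = max(data) - min(data)          # range = max - min
iqr = summary[3] - summary[1]                # interquartile range = Q3 - Q1
std_dev = statistics.pstdev(data)            # population standard deviation
```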

 Data cleaning attempts to fill in missing values, smooth out noise, identify outliers and
correct inconsistencies
 Missing values
 Ignore the tuple
 Fill in the missing value manually
 Use a global constant to fill in the missing value
 Use the attribute mean to fill in the missing value
 Use the attribute mean for samples belonging to the same class as the given tuple
 Use the most probable value to fill in the missing value, determined
by regression, decision-tree induction or a Bayesian formalism
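Two of the strategies above, filling with the attribute mean and filling with the class-conditional attribute mean, can be sketched as follows. The tuples, attribute names and helper functions are hypothetical, and the sketch assumes every class has at least one known value for the attribute.

```python
from statistics import mean

def fill_with_attribute_mean(rows, attr):
    """Replace None in `attr` with the mean of the known values of `attr`."""
    known = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(known)
    return [dict(r, **{attr: fill}) if r[attr] is None else dict(r) for r in rows]

def fill_with_class_mean(rows, attr, class_attr):
    """Replace None in `attr` with the mean of `attr` over rows of the same class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    means = {c: mean(v) for c, v in by_class.items()}
    return [
        dict(r, **{attr: means[r[class_attr]]}) if r[attr] is None else dict(r)
        for r in rows
    ]

# Hypothetical customer tuples; None marks a missing income value.
customers = [
    {"income": 30, "risk": "low"},
    {"income": 50, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 100, "risk": "high"},
]

by_attribute = fill_with_attribute_mean(customers, "income")
by_class = fill_with_class_mean(customers, "income", "risk")
```

The class-conditional fill (40) is closer to the other low-risk customers than the global fill (60), which is skewed by the high-risk tuple.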

 Noisy data
 Binning
 Smooths a sorted data value by consulting its neighborhood, i.e. the values around it
 Performs local smoothing
 Smoothing by bin means – each value in a bin is replaced by the mean value of the bin
 Smoothing by bin medians – each value in a bin is replaced by the bin median
 Smoothing by bin boundaries – the min and max values of a bin are the bin boundaries and each value in the
bin is replaced by the closest bin boundary
 Regression
 Smooths the data by fitting it to a function
 Linear regression finds the best line to fit two attributes, so that one can be used to predict the other
 Multiple linear regression involves more than two attributes
 Clustering
 Outliers are detected through clustering, where similar values are organized into clusters
 Values that fall outside the set of clusters may be considered outliers
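The binning methods described above can be sketched on a small sorted list. The price values, the bin size of 3 and the function names are assumptions made for this example; ties in boundary smoothing are resolved toward the lower boundary here.

```python
def equal_frequency_bins(sorted_values, bin_size):
    """Split sorted values into consecutive bins of `bin_size` values each."""
    return [sorted_values[i:i + bin_size] for i in range(0, len(sorted_values), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])   # illustrative values
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```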

 Data integration
 The entity identification problem is the matching of equivalent real-world entities from multiple
data sources
 Correlation analysis measures how strongly one attribute implies the other, which helps detect redundant attributes
 Data transformation
 Smoothing – binning, regression, clustering
 Aggregation
 Generalization – low-level data is replaced by higher-level concepts through the use of
concept hierarchies
 Normalization – data is scaled to fall within a small specified range, such as 0.0 to 1.0
 Min-Max method
 Z-score normalization
 Decimal scaling
 Attribute construction
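The three normalization methods listed above can be written as short functions. This is a sketch using the usual textbook formulas; the function names and the sample values are illustrative.

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v from [old_min, old_max] linearly onto [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std_dev):
    """Express v as the number of standard deviations from the mean."""
    return (v - mean) / std_dev

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Made-up income figures for illustration.
scaled_income = min_max(73600, 12000, 98000)
standardized_income = z_score(73600, 54000, 16000)
scaled_values = decimal_scaling([-986, 917])
```

Min-max normalization needs the attribute's minimum and maximum, so a later value outside that range would fall outside the target interval; z-score normalization avoids this at the cost of an unbounded output range.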

 Data reduction is applied to obtain a reduced representation of the data set
 Data cube aggregation
 Attribute subset selection reduces the data size by removing irrelevant or redundant
attributes.
 Dimensionality reduction involves data encoding or transformation to obtain
compressed data. Lossy dimensionality reduction – wavelet transform, principal
component analysis
 Numerosity reduction
 Parametric methods use a model to estimate the data, so that only the model parameters need to be stored, e.g. log-linear models
 Nonparametric methods for storing reduced representations include histograms, clustering and
sampling
 Discretization and concept hierarchy generation reduce the number of values for a given attribute
by dividing the range of the attribute into intervals.
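Discretization by dividing an attribute's range into intervals can be sketched with equal-width binning, one of several possible schemes. The ages, the choice of three intervals and the function names are assumptions; the sketch also assumes the values are not all identical.

```python
def equal_width_intervals(values, k):
    """Return k (low, high) intervals of equal width covering the value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

def discretize(values, k):
    """Map each value to the index (0..k-1) of the interval it falls into."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value is placed in the last interval rather than a k-th one.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [13, 15, 16, 19, 20, 25, 30, 33, 35, 45, 52, 70]   # made-up ages
intervals = equal_width_intervals(ages, 3)
labels = discretize(ages, 3)
```

Twelve distinct ages are reduced to three interval labels, which could then be named (for example youth, middle-aged, senior) to form one level of a concept hierarchy.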

DM tasks are divided into two categories: descriptive and predictive
Descriptive mining tasks characterize the general properties of the data
Predictive mining tasks perform inference on the current data in order to make predictions

 Data characterization is a summarization of the general characteristics or
features of a target class of data.
 Data corresponding to the user-specified class is typically collected by a database query
 Example: to study the characteristics of software products whose sales increased by 10%,
data related to such products is collected
 The data cube-based OLAP roll-up operation can be used for data summarization
 Output is presented in forms such as pie charts and histograms
 Data discrimination is a comparison of the general features of target class data
objects with the general features of objects from one or a set of contrasting classes.
 Example: comparison of products whose sales increased by 10% with products
whose sales decreased by 30%

 Classification is the process of finding a model that describes and distinguishes data
classes or concepts, for the purpose of being able to use the model to predict the
class of objects whose class label is unknown.
 Classifying a loan as ‘safe’ or ‘risky’
 Given a customer profile, predicting whether the customer will buy a new computer
 Decision tree induction
 Bayesian classification
 Rule-based classification
 Classification by backpropagation
 Support vector machines
 Classification by association rule analysis
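To make one of the listed methods concrete, here is a minimal sketch of Bayesian classification in its naive form (attributes assumed conditionally independent given the class) with add-one smoothing. The loan tuples, attribute values and function names are hypothetical and chosen to echo the safe/risky loan example.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    domains = defaultdict(set)            # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains

def predict_label(model, row):
    """Return the class maximizing P(class) * product of P(value | class)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total
        for i, value in enumerate(row):
            # Laplace (add-one) smoothing avoids zero probabilities.
            score *= (value_counts[(i, label)][value] + 1) / (n + len(domains[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training tuples: (income, credit history) -> loan class.
rows = [("high", "good"), ("high", "good"), ("medium", "good"),
        ("low", "bad"), ("medium", "bad"), ("low", "bad")]
labels = ["safe", "safe", "safe", "risky", "risky", "risky"]

model = train_naive_bayes(rows, labels)
first = predict_label(model, ("high", "good"))
second = predict_label(model, ("low", "bad"))
```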

 Prediction models continuous-valued functions. Numeric prediction is the task of
predicting continuous values for a given input.
 Regression analysis is a statistical methodology that is often used for numeric
prediction
 Linear/straight-line regression involves a response variable, y, and a single
predictor variable, x. It models y as a linear function of x: y = b + wx, where b (the
intercept) and w (the slope) are the regression coefficients
 Multiple linear regression extends straight-line regression to involve more than
one predictor variable
 Nonlinear regression can be modeled by adding polynomial terms to the linear model
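The coefficients of the straight-line model y = b + wx can be estimated by the method of least squares. The sketch below uses the standard closed-form estimates; the data points and function names are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b, w) for the model y = b + w*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    w = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    b = y_mean - w * x_mean   # the fitted line passes through (x_mean, y_mean)
    return b, w

def predict_y(x, b, w):
    """Predicted response for predictor value x."""
    return b + w * x

xs = [1, 2, 3, 4]   # made-up predictor values
ys = [3, 5, 7, 9]   # made-up responses lying exactly on y = 1 + 2x
b, w = fit_line(xs, ys)
y_at_5 = predict_y(5, b, w)
```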

 The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
 A cluster is a collection of data objects that are similar to one another within the
same cluster and are dissimilar to the objects in other clusters.
 Class labels are not present in the training data because they are not known to begin
with. Clustering can be used to generate such labels
 Applications: taxonomy formation (organization of observations into a hierarchy of classes that
group similar events together)

 Partitioning method
 Given a database of n objects or data tuples, a partitioning method creates k partitions of the data (k <= n), each representing a cluster
 Requirements
 Each group must contain at least one object
 Each object must belong to exactly one group
 Objects in the same cluster are close or related to each other, whereas objects of different
clusters are far apart or very different
 k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster
 k-medoids algorithm, where each cluster is represented by one of the objects located near
the center of the cluster.
 Works well for small to medium-sized databases
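The k-means idea can be sketched in a few lines for 2-D points. The points and the starting centroids are made up, and passing the initial centroids in explicitly is a simplification; real implementations usually choose them randomly or with a seeding heuristic. If a cluster becomes empty, this sketch simply keeps its old centroid.

```python
def k_means(points, centroids, max_iter=100):
    """Lloyd's k-means for 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c                      # keep an empty cluster's centroid
            for cl, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:        # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]   # made-up data
centroids, clusters = k_means(points, [(1, 1), (8, 8)])
```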

 Hierarchical method
 Creates a hierarchical decomposition of the given set of data objects.
 Classified as agglomerative or divisive, based on how the hierarchical decomposition is formed
 Agglomerative/bottom-up approach starts with each object in its own group and merges the objects or groups that are close to one another, until
all the groups are merged into one or a termination condition holds
 Divisive/top-down approach starts with all of the objects in the same cluster. It splits clusters into
smaller clusters until eventually each object is in its own cluster or a termination condition holds
 Density-based method
 Can discover clusters of arbitrary shape
 Can be used to filter out noise
 Grid-based method
 Quantizes the object space into a finite number of cells that form a grid structure.
 Fast processing time
 Model-based method
 Hypothesizes a model for each cluster and finds the best fit of the data to the given
model
 May locate clusters by constructing a density function that reflects the spatial distribution of the
data points
 Can automatically determine the number of clusters based on standard statistics
 Example: self-organizing maps
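The agglomerative (bottom-up) approach described above can be sketched for 1-D values. Using single-link distance (the distance between the closest members of two clusters) and a target number of clusters as the termination condition are assumptions of this example, as are the data values.

```python
def agglomerative(values, target_clusters=1):
    """Single-link agglomerative clustering of 1-D values.

    Starts with one cluster per value and repeatedly merges the two closest
    clusters until `target_clusters` remain. Returns the final clusters and
    the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[v] for v in sorted(values)]
    history = []
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history

final_clusters, merges = agglomerative([1, 2, 6, 7, 20], target_clusters=2)
```

The merge history is the information a dendrogram would display: which groups were joined and at what distance.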

 Clustering high-dimensional data
 Examines objects described by a large number of features (dimensions)
 Subspace clustering methods search for clusters in subspaces of the data
 Frequent pattern-based clustering extracts distinct frequent patterns among subsets of
dimensions
 Constraint-based clustering
 Performs clustering by incorporating user-specified constraints
 A constraint expresses a user’s expectations or desired results
 Example: spatial clustering in the presence of obstacles and clustering under
user-specified constraints

 Outliers are data objects that do not comply with the general behavior or model of the data
 They are discarded as noise by most data mining applications. However, in applications like
fraud detection, they are worth noting. Example: uncovering fraudulent usage of credit cards by
detecting purchases of extremely large amounts on a given day
 Outliers may be detected by using statistical tests that assume a probability model for the data, or by using
distance measures, where objects that are a substantial distance from any other
cluster are considered outliers.
 Evolution analysis describes and models regularities or trends for objects whose
behavior changes over time.
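A simple distance-based outlier check can be sketched as follows: a value is flagged when too few other values lie within a given radius of it. The purchase amounts, the radius and the neighbor threshold are all made-up parameters for illustration.

```python
def distance_based_outliers(values, radius, min_neighbors):
    """Flag values with fewer than `min_neighbors` other values within `radius`."""
    outliers = []
    for i, v in enumerate(values):
        neighbors = sum(
            1 for j, other in enumerate(values) if j != i and abs(v - other) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(v)
    return outliers

# Hypothetical purchase amounts on one credit card.
purchases = [25, 40, 32, 18, 55, 47, 5000, 38]
suspicious = distance_based_outliers(purchases, radius=50, min_neighbors=2)
```

This brute-force version compares every pair of values, so it is only practical for small data sets.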

 Stream data is massive, temporally ordered, fast changing and potentially
infinite.
 Stream data flows in and out of a computer system continuously and with varying
update rates.
 Examples – real-time surveillance systems, communication networks, Internet
traffic, online transactions in financial markets or the retail industry, electric power
grids, industrial production processes and other dynamic environments.
 It is usually impossible to store an entire data stream. Moreover, stream data tends to be at a rather
low level of abstraction.
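Because a stream cannot be stored, summaries have to be maintained in a single pass with bounded memory. The sketch below keeps a running mean and variance using Welford's update rule; the `readings` generator is a stand-in for a real stream and its values are invented.

```python
class RunningStats:
    """One-pass mean and variance (Welford's method) in constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0

def readings():
    """Stand-in for an unbounded stream; here it just yields a few numbers."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for value in readings():
    stats.update(value)   # each element is seen once and then discarded
```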

 Mining time-series data
 A time-series database consists of sequences of values obtained over repeated measurements
of time.
 Time-series databases are popular in stock-market analysis, economic and sales
forecasting, budgetary analysis, utility studies, yield studies, work-load projections,
observation of natural phenomena
 Mining sequence patterns
 A sequence database consists of sequences of ordered elements or events, recorded with or
without a concrete notion of time. Sequential pattern mining is the discovery of
frequently occurring ordered events or subsequences as patterns.
 Applications include customer shopping sequences, web clickstreams, biological sequences,
sequences of events in science and engineering.
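The notion of a frequently occurring ordered pattern can be illustrated by counting how many sequences contain a pattern as a subsequence (its support). This sketch simplifies each sequence element to a single item instead of an itemset, only enumerates patterns of length two, and uses invented shopping sequences.

```python
def contains_in_order(sequence, pattern):
    """True if the pattern's items occur in `sequence` in the same order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so each match must come after the previous one.
    return all(item in it for item in pattern)

def support(sequences, pattern):
    """Fraction of sequences that contain the pattern as a subsequence."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)

def frequent_pairs(sequences, min_support):
    """Ordered item pairs (a, b) whose support is at least min_support."""
    items = sorted({x for s in sequences for x in s})
    return [
        (a, b) for a in items for b in items
        if support(sequences, (a, b)) >= min_support
    ]

# Hypothetical shopping sequences, one per customer.
shopping = [
    ["pc", "printer", "ink"],
    ["pc", "ink"],
    ["printer", "pc", "ink"],
    ["ink", "pc"],
]

pc_then_ink = support(shopping, ("pc", "ink"))
frequent = frequent_pairs(shopping, min_support=0.75)
```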