Data Mining Techniques
UNIT-I
Introduction to Data Mining Systems
• What is Data?
• What is a Database?
• What is a Database Management System?
Introduction to Data Mining Systems
• Why Data Mining?
• Data Collection and Data Availability
• Major sources of abundant data
Data Mining
• What is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• Is everything “data mining”? No; the following are not:
– Simple search and query processing
– (Deductive) expert systems
Examples
• Examples of Data Mining
1. Marketing
2. Banking
3. Government
4. Health Care
5. Education
6. Retail Industry
7. Logistics and supply chain
Steps involved in Data Mining Process
Large-scale Data is Everywhere!
• There has been enormous data growth in both commercial and scientific
databases due to advances in data generation and collection technologies
• Examples: cyber security, e-commerce, traffic patterns, social networking
(Twitter), sensor networks, computational simulations
Why Data Mining? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data
• Yahoo has petabytes of web data
• Facebook has billions of active users
– Purchases at department/grocery stores, e-commerce
• Amazon handles millions of visits/day
– Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g., in Customer
Relationship Management)
Great Opportunities to Solve Society’s Major Problems
• Improving health care and reducing costs
• Finding alternative/green energy sources
• Predicting the impact of climate change
• Reducing hunger and poverty by increasing agricultural production
Evolution of Data Mining
Evolution of Database Technology: Summary
• 1960s: Data collection, database creation, network DBMS
• 1970s: Relational data model, relational DBMS
• 1980s: RDBMS, advanced data models, application-oriented DBMS
• 2000s: Stream data management and mining, data mining and its applications
Basic Concepts
• Classification
• Clustering
• Supervised Learning
• Unsupervised Learning
Knowledge Discovery Process
Steps in the Process of Knowledge
Discovery (KDD Process)
• Data Cleaning
• Data Integration
• Data Selection
• Data Transformation
• Data Mining
• Pattern Evaluation
• Knowledge Presentation
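The seven steps above can be sketched as a small pipeline. This is only an illustration under assumed data: the records, field names, the age grouping, and the `min_spend` threshold are all hypothetical and not from the slides.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "age": 25, "spend": 120.0},
           {"customer": "c2", "age": None, "spend": 80.0}]
store_b = [{"customer": "c3", "age": 41, "spend": 560.0},
           {"customer": "c4", "age": 38, "spend": 610.0}]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: replace raw age with a coarser age group."""
    return [{"age_group": "under_30" if r["age"] < 30 else "30_plus",
             "spend": r["spend"]} for r in records]

def mine(records):
    """Data mining: here, simply the average spend per age group."""
    groups = {}
    for r in records:
        groups.setdefault(r["age_group"], []).append(r["spend"])
    return {group: mean(values) for group, values in groups.items()}

def evaluate(patterns, min_spend):
    """Pattern evaluation: keep only patterns above a chosen threshold."""
    return {g: s for g, s in patterns.items() if s >= min_spend}

def present(patterns):
    """Knowledge presentation: render the surviving patterns as text."""
    return [f"{g}: average spend {s:.2f}" for g, s in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
data = transform(select(data, ["age", "spend"]))
report = present(evaluate(mine(data), min_spend=200.0))
print(report)
```

Each function stands in for one KDD step; in practice every step is far more involved, and the steps are often iterated rather than run once in sequence.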
Kinds of Data
• What kinds of Data can be mined?
• Database Data
• Data Warehouses
• Transactional Data
• Other Kinds of Data
Database Data
• Database management system (DBMS)
• Relational database: tables, attributes, tuples
• Entity-relationship (ER) model
• Database queries
• Mining relational databases
Data Warehouses
• What is a data warehouse?
• What is a data cube?
• OLAP (Online Analytical Processing)
operations: drill-down, roll-up
Typical framework of a data warehouse for All Electronics
A multidimensional data cube, commonly used for data warehousing: (a)
showing summarized data for All Electronics and (b) showing summarized
data resulting from drill-down and roll-up operations on the cube.
Transactional Data
Other Kinds of Data
• Time-related or sequence data
• Data streams
• Spatial data
• Engineering design data
• Hypertext and Multimedia data
• Graph and Networked data
Kinds of Patterns (Data Mining
Functionalities)
• Data mining tasks: descriptive and predictive
• Data mining functionalities include:
• Characterization and discrimination
• Mining frequent patterns, associations, and
correlations
• Classification and regression
• Clustering analysis
• Outlier analysis
• Are all patterns interesting?
Class/Concept Description: Characterization
and Discrimination
• E.g., in the All Electronics store, classes of items for sale include computers and
printers, and concepts of customers include big spenders and budget
spenders
• Data characterization
• Methods for data summarization and characterization: simple data
summaries based on statistical measures and plots, data cube-based OLAP
operations, and attribute-oriented induction techniques
• Output of data characterization and an example of data characterization
• Data discrimination
• Output of data discrimination and an example of data discrimination
Mining Frequent Patterns,
Associations and Correlations
• Frequent patterns: frequent itemsets, frequent subsequences, frequent
substructures
• Association analysis:
• E.g., association rule: buys(X, “computer”) => buys(X, “software”)
[support = 1%, confidence = 50%]
– buys is the predicate; confidence measures the certainty of the rule; support is the
share of all transactions under analysis in which both items are bought together
• Single-dimensional association rule: involves a single predicate (here, buys)
• Multidimensional association rule:
age(X, “20..29”) ^ income(X, “40..49K”) => buys(X, “laptop”)
[support = 2%, confidence = 60%]
• An association rule should satisfy both a minimum support threshold and a minimum
confidence threshold
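The support and confidence figures attached to the rules above can be computed directly from a list of transactions. The following is a minimal sketch; the five transactions and the threshold values are made up for illustration.

```python
# Hypothetical market-basket transactions (one set of items per purchase).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # rule never applies, so treat confidence as zero
    return support(antecedent | consequent, transactions) / base

def is_strong(antecedent, consequent, transactions, min_sup, min_conf):
    """A rule is kept only if it meets both minimum thresholds."""
    sup = support(antecedent | consequent, transactions)
    conf = confidence(antecedent, consequent, transactions)
    return sup >= min_sup and conf >= min_conf

sup = support({"computer", "software"}, transactions)
conf = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software [support={sup:.0%}, confidence={conf:.0%}]")
```

Here two of five transactions contain both items (support 40%), and two of the four transactions containing a computer also contain software (confidence 50%).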
Classification and Regression for
Predictive Analysis
• What is classification? Example
• Training data and test data
• Derived models can be represented as
1. Classification rules (if-then rules)
2. Decision trees
3. Mathematical formulae
4. Neural networks
• Regression analysis
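To make the roles of training data, test data, and an if-then rule concrete, here is a small sketch that learns a single-threshold rule (a one-level decision tree) from training data and measures its accuracy on separate test data. The income values, labels, and the choice of a one-rule learner are assumptions for illustration only.

```python
# Hypothetical labeled data: (income in thousands, buys_computer label).
training_data = [(20, "no"), (25, "no"), (32, "no"),
                 (45, "yes"), (52, "yes"), (60, "yes")]
test_data = [(28, "no"), (48, "yes"), (38, "yes")]

def predict(threshold, income):
    """If-then rule: IF income >= threshold THEN 'yes' ELSE 'no'."""
    return "yes" if income >= threshold else "no"

def accuracy(threshold, data):
    """Fraction of records in data that the rule labels correctly."""
    correct = sum(1 for income, label in data
                  if predict(threshold, income) == label)
    return correct / len(data)

def learn_threshold(data):
    """Try midpoints between consecutive sorted incomes and keep the
    one with the highest accuracy on the training data."""
    values = sorted(income for income, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(t, data))

threshold = learn_threshold(training_data)
train_acc = accuracy(threshold, training_data)
test_acc = accuracy(threshold, test_data)
print(f"IF income >= {threshold} THEN buys_computer = yes")
print(f"training accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

The rule fits the training data perfectly but misclassifies one test record, which is why a model is evaluated on data it was not trained on.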
Cluster analysis and outlier analysis
Are All Patterns Interesting?
• support(X => Y) = P(X ∪ Y)
• confidence(X => Y) = P(Y | X)
• Accuracy
• Coverage
• Unexpected vs. expected
Data mining Technologies
Statistics
• Statistics is the collection, analysis, interpretation or
explanation, and presentation of data.
• Statistical model
• Statistical description
• Inferential statistics or predictive statistics
• Statistical hypothesis test
Machine Learning
• What is machine learning?
• Classic problems in machine learning are:
• Supervised learning
• Unsupervised learning
• Semi-supervised learning
• Active learning
Database Systems, Data Warehouses,
and Information Retrieval
• Database systems research
• Data warehouse
• Information retrieval
• Language model
• Topic model
Data Mining Applications
• Business Intelligence
• Web Search Engines
Issues in Data Mining
• Mining Methodology
• User Interaction
• Efficiency and Scalability
• Diversity of database types
• Data Mining and society
Mining Methodology
• Mining various and new kinds of knowledge
• Mining knowledge in multidimensional space
• Data Mining-an interdisciplinary effort
• Boosting the power of discovery in a networked
environment
• Handling uncertainty, noise or incompleteness of data
• Pattern evaluation and pattern- or constraint-guided
mining
User Interaction
• Interactive mining
• Incorporation of background knowledge
• Ad hoc data mining and data mining query
languages
• Presentation and visualization of data mining
results
Efficiency and Scalability
• Efficiency and scalability of data mining
algorithms
• Parallel, distributed and incremental mining
algorithms
• Cloud computing and cluster computing
Diversity of database types
• Handling complex types of data
• Mining dynamic, networked and global data
repositories
Data Mining and Society
• Social impacts of data mining
• Privacy-preserving data mining
• Invisible data mining
Summary
• Data mining: discovering interesting patterns and knowledge from
massive amounts of data
• A natural evolution of database technology, in great demand, with
wide applications
• A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
• Data mining technologies and applications
• Major issues in data mining
DATA WAREHOUSE: BASIC
CONCEPTS
• What is a data warehouse?
• Subject-oriented, integrated, time-variant,
nonvolatile
• How are organizations using the information from
data warehouses?
- Knowledge workers
• Query-driven approach (traditional database
approach)
• Update-driven approach (data warehousing approach)
Difference between operational database
systems and data warehouse
• What are OLTP and OLAP?
– Online transaction processing (OLTP)
– Online analytical processing (OLAP)
• Major features/differences between OLTP and OLAP
systems
– User and system orientation
– Data contents
– Database design
– View
– Access patterns
Why have a separate Data
Warehouse?
• DBMS
• Data Warehouse
• Different functions and different data
-Missing data
-Data consolidation
-Data Quality
Data Warehousing: A Multitiered
Architecture
• Bottom tier: data warehouse server
– Data sources
– Gateways
• Middle tier: OLAP server
– ROLAP (relational OLAP) server
– MOLAP (multidimensional OLAP) server
• Top tier: front-end tools
A three-tier data warehousing
architecture
Data Warehouse Models
• Enterprise warehouse
• Data Mart
• Virtual warehouse
• Types of data mart
– Independent data mart
– Dependent data mart
• Data warehouse development
– Top-down and bottom-up approaches to
data warehouse development
A recommended approach for
data warehouse development
Data Warehouse Models
• A high-level corporate data model is defined
within a short period
• Enterprise and Department Data Marts
• Distributed Data Marts
• Multitier Data Warehouse
Extraction, Transformation and
Loading
• Data Extraction
• Data Cleaning
• Data Transformation
• Load
• Refresh
Metadata Repository
• Description of the data warehouse structure
• Operational metadata
-Data lineage
-Currency of data
-Monitoring Information
• Algorithms used for summarization
• Mapping from the operational environment to data
warehouse
• Data related to system performance
• Business metadata
Data warehouse modeling: Data
Cube and OLAP
• What is a data cube?
• Facts
• Fact table
• Lattice of cuboids
• Base cuboid
• Apex cuboid
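The lattice of cuboids can be enumerated by taking every subset of the cube's dimensions. The sketch below assumes three hypothetical dimensions and no concept hierarchies, in which case n dimensions give 2^n cuboids.

```python
from itertools import combinations

def lattice_of_cuboids(dimensions):
    """Group cuboids by level: level k holds all k-dimensional cuboids,
    each identified by the tuple of dimensions it groups by."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

dims = ("time", "item", "location")  # hypothetical dimensions
lattice = lattice_of_cuboids(dims)

apex_cuboid = lattice[0][0]          # no dimensions: the grand total
base_cuboid = lattice[len(dims)][0]  # all dimensions: least summarized
total_cuboids = sum(len(level) for level in lattice.values())

for level, cuboids in lattice.items():
    print(f"{level}-D cuboids: {cuboids}")
print("total cuboids:", total_cuboids)
```

The apex cuboid is the empty grouping (highest level of summarization), and the base cuboid groups by every dimension (lowest level of summarization).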
2-D, 3-D, and 4-D Data Cubes
3-D Data Cube
4-D Data Cube
Schemas for multidimensional
data models
• Star schema
• Snowflake schema
• Fact constellation schema
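As an illustration of the first of these schemas, the sketch below builds a tiny star schema in an in-memory SQLite database: one central fact table whose foreign keys point to two dimension tables. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per member of each dimension.
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);

-- Fact table: foreign keys to each dimension plus the numeric measure.
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    dollars_sold REAL
);

INSERT INTO dim_time VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO dim_item VALUES (1, 'laptop', 'computer'), (2, 'inkjet', 'printer');
INSERT INTO fact_sales VALUES (1, 1, 1000.0), (1, 2, 200.0),
                              (2, 1, 1500.0), (2, 2, 300.0);
""")

# A typical star join: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_time AS t ON f.time_key = t.time_key
    JOIN dim_item AS i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()

for quarter, item_type, dollars in rows:
    print(quarter, item_type, dollars)
```

A snowflake schema would further normalize a dimension table (for example, moving `type` into its own table), and a fact constellation would add further fact tables sharing these dimension tables.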
Star Schema
Snowflake Schema
Fact Constellation
Schema Hierarchy vs. Set-Grouping Hierarchy
• Data warehouse vs. data mart
• Dimensions: the role of concept hierarchies
– A concept hierarchy maps a set of low-level concepts to higher-level,
more general concepts
Set grouping Hierarchy
• Discretizing or grouping values for a given
dimension or attribute
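A set-grouping hierarchy can be sketched as a function that maps each raw value to a named range. The price boundaries and group labels below are assumptions chosen for illustration.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical grouping of the price attribute into three ranges:
# price <= 100, 100 < price <= 500, and price > 500.
UPPER_BOUNDS = [100, 500]
LABELS = ["inexpensive", "moderately_priced", "expensive"]

def price_group(price):
    """Map a numeric price to the label of the range it falls into."""
    return LABELS[bisect_left(UPPER_BOUNDS, price)]

prices = [40, 100, 250, 900]
groups = [price_group(p) for p in prices]
counts = Counter(groups)
print(groups)
print(counts)
```

Once values are grouped this way, measures can be aggregated per group instead of per distinct price, which is what a roll-up along this hierarchy does.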
Measures: Their Categorization and
Computation
• Distributive
• Algebraic
• Holistic
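The three categories differ in whether a measure can be computed from partial results on partitions of the data. A small sketch, using made-up numbers split into two partitions:

```python
from statistics import median

# Hypothetical measure values stored in two separate partitions.
partitions = [[4, 8], [1, 3, 100]]

# Distributive: the total is the sum of the per-partition sums.
total = sum(sum(p) for p in partitions)

# Algebraic: the average is derived from a bounded number of
# distributive results (here, sum and count).
count = sum(len(p) for p in partitions)
average = total / count

# Holistic: the median generally needs all the values at once;
# combining per-partition medians gives a different answer.
true_median = median(v for p in partitions for v in p)
median_of_medians = median(median(p) for p in partitions)

print("sum:", total)
print("average:", average)
print("median over all data:", true_median)
print("median of partition medians:", median_of_medians)
```

In this data the median over all values differs from the median of the partition medians, which is why holistic measures are harder to compute efficiently in a data cube.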
Typical OLAP operations
• Roll-up (drill-up)
• Drill-down (reverse of roll-up)
• Slice and dice
• Pivot
• Other operations: drill-across, drill-through
• OLAP systems vs. statistical databases
– Starnet query model for querying
multidimensional databases: radial lines, footprints
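The main OLAP operations listed above can be sketched on a cube stored as a plain dictionary. The dimensions (city, quarter, item), the city-to-country hierarchy, and the unit counts are hypothetical.

```python
from collections import defaultdict

# Hypothetical base cuboid: (city, quarter, item) -> units sold.
cube = {
    ("Chicago", "Q1", "computer"): 10,
    ("Chicago", "Q2", "computer"): 12,
    ("Toronto", "Q1", "computer"): 7,
    ("Toronto", "Q1", "printer"): 3,
    ("Toronto", "Q2", "printer"): 5,
    ("Vancouver", "Q1", "computer"): 4,
}
CITY_TO_COUNTRY = {"Chicago": "USA", "Toronto": "Canada", "Vancouver": "Canada"}

def roll_up_location(cube):
    """Roll-up: climb the location hierarchy from city to country."""
    result = defaultdict(int)
    for (city, quarter, item), units in cube.items():
        result[(CITY_TO_COUNTRY[city], quarter, item)] += units
    return dict(result)

def slice_quarter(cube, quarter):
    """Slice: fix one dimension to a single value, yielding a 2-D view."""
    return {(city, item): units
            for (city, q, item), units in cube.items() if q == quarter}

def dice(cube, cities, quarters):
    """Dice: select a sub-cube by restricting two dimensions."""
    return {key: units for key, units in cube.items()
            if key[0] in cities and key[1] in quarters}

def pivot(table):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): units for (a, b), units in table.items()}

by_country = roll_up_location(cube)
q1_view = slice_quarter(cube, "Q1")
toronto_q1 = dice(cube, {"Toronto"}, {"Q1"})
q1_rotated = pivot(q1_view)
print(by_country[("Canada", "Q1", "computer")])
```

Drill-down is the reverse of roll-up: it returns from the country-level view to the city-level cube, which requires the more detailed data to still be available.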
Slice, dice, and pivot
Starnet query model
END OF UNIT-I
