This document provides an overview of data mining techniques and concepts. It defines data mining as the process of discovering interesting patterns and knowledge from large amounts of data. The key steps involved are data cleaning, integration, selection, transformation, mining, evaluation, and presentation. Common data mining techniques include classification, clustering, association rule mining, and anomaly detection. The document also discusses data sources, major applications of data mining, and challenges.
Introduction to Data Mining Systems
• What is Data?
• What is Database?
• What is Database Management System?
• Why Data Mining?
• Data Collection and Data Availability
• Major sources of abundant data
Data Mining
• What is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• Is everything “data mining”? No. The following are not:
– Simple search and query processing
– (Deductive) expert systems
Examples
• Examples of Data Mining
1. Marketing
2. Banking
3. Government
4. Health Care
5. Education
6. Retail Industry
7. Logistics and supply chain
Large-scale Data is Everywhere!
There has been enormous data growth in both commercial and scientific
databases due to advances in data generation and collection technologies.
Examples: cyber security, e-commerce, traffic patterns, social networking (Twitter), sensor networks, computational simulations
Why Data Mining? Commercial Viewpoint
• Lots of data is being collected and warehoused
  – Web data
    • Yahoo has petabytes of web data
    • Facebook has billions of active users
  – Purchases at department/grocery stores, e-commerce
    • Amazon handles millions of visits/day
  – Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  – Provide better, customized services for an edge (e.g., in Customer Relationship Management)
Great Opportunities to Solve Society’s Major Problems
• Improving health care and reducing costs
• Finding alternative/green energy sources
• Predicting the impact of climate change
• Reducing hunger and poverty by increasing agriculture production
Evolution of Database Technology: Summary
• 1960s: Data collection, database creation, network DBMS
• 1970s: Relational data model, relational DBMS
• 1980s: RDBMS, advanced data models, application-oriented DBMS
• 2000s: Stream data management and mining, data mining and its applications
Steps in the Process of Knowledge Discovery (KDD Process)
• Data Cleaning
• Data Integration
• Data Selection
• Data Transformation
• Data Mining
• Pattern Evaluation
• Knowledge Presentation
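To make the flow of these steps more concrete, here is a minimal Python sketch that chains them over a toy data set. The record fields, the two sample sources, the `min_count` threshold, and the helper name `kdd_pipeline` are all assumptions made for illustration; frequency counting merely stands in for a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records from two sources (assumed for illustration).
store_a = [{"customer": "c1", "item": "Computer "}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c3", "item": "computer"}, {"customer": "c4", "item": "printer"}]


def kdd_pipeline(*sources, min_count=2):
    # Data cleaning: drop records with missing values in each source.
    cleaned = [[r for r in src if r["item"] is not None] for src in sources]
    # Data integration: combine the cleaned sources into one collection.
    records = [r for src in cleaned for r in src]
    # Data selection: keep only the attribute relevant to the analysis.
    items = [r["item"] for r in records]
    # Data transformation: normalize values into a consistent form.
    items = [i.strip().lower() for i in items]
    # Data mining: count item frequencies (a stand-in for a real algorithm).
    counts = Counter(items)
    # Pattern evaluation: keep only patterns that meet the threshold.
    interesting = {item: n for item, n in counts.items() if n >= min_count}
    # Knowledge presentation: return a sorted, readable summary.
    return sorted(interesting.items())


result = kdd_pipeline(store_a, store_b)
print(result)
```

In practice the steps are often iterative rather than strictly linear; the sketch only shows one pass.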
Kinds of Data
• What kinds of data can be mined?
• Database Data
• Data Warehouses
• Transactional Data
• Other Kinds of Data
Figure: A multidimensional data cube, commonly used for data warehousing, (a) showing summarized data for All Electronics and (b) showing summarized data resulting from drill-down and roll-up operations on the cube.
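The roll-up and drill-down operations mentioned in the caption can be sketched in a few lines of Python. The sales facts, the dimension layout (quarter, city, country, item type), and the helper `aggregate` are hypothetical and exist only to illustrate the idea; a real warehouse would use an OLAP engine rather than in-memory tuples.

```python
from collections import defaultdict

# Hypothetical sales facts: (quarter, city, country, item_type, units_sold).
sales = [
    ("Q1", "Vancouver", "Canada", "computer", 10),
    ("Q1", "Toronto", "Canada", "computer", 5),
    ("Q1", "Chicago", "USA", "computer", 7),
    ("Q2", "Vancouver", "Canada", "printer", 3),
    ("Q2", "Chicago", "USA", "computer", 4),
]


def aggregate(facts, key):
    """Sum units_sold over all facts that share the same key(fact)."""
    totals = defaultdict(int)
    for fact in facts:
        totals[key(fact)] += fact[-1]
    return dict(totals)


# Detailed view: units sold by (quarter, city).
by_city = aggregate(sales, lambda f: (f[0], f[1]))

# Roll-up on location: climb the hierarchy from city to country.
by_country = aggregate(sales, lambda f: (f[0], f[2]))

# Drill-down is the reverse step: moving from by_country back to by_city.
print(by_city)
print(by_country)
```

Both views summarize the same facts, so their grand totals agree; only the level of detail differs.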
Other Kinds of Data
• Time related or sequence data
• Data streams
• Spatial data
• Engineering design data
• Hypertext and Multimedia data
• Graph and Networked data
Kinds of Patterns (Data Mining Functionalities)
• Data mining tasks: descriptive and predictive
• Data mining functionalities include:
• Characterization and discrimination
• Mining frequent patterns, associations, and correlations
• Classification and regression
• Clustering analysis
• Outlier analysis
• Are all patterns interesting?
Class/Concept Description: Characterization and Discrimination
• E.g., in the All Electronics store, classes of items for sale include computers and
printers, and concepts of customers include big spenders and budget spenders
• Data characterization
• Methods for data summarization and characterization: simple data
summaries based on statistical measures and plots, data cube-based OLAP
operations, and attribute-oriented induction techniques
• Output of data characterization and an example of data characterization
• Data discrimination
• Output of data discrimination and an example of data discrimination
Mining Frequent Patterns, Associations, and Correlations
• Frequent patterns: frequent itemsets, frequent subsequences, frequent substructures
• Association analysis:
• E.g., association rule: buys(x, "computer") => buys(x, "software")
[support = 1%, confidence = 50%]
– Here buys is a predicate; confidence expresses the certainty of the rule, and
support is the share of all transactions under analysis that satisfy it
• Single-dimensional association rule (one predicate, such as buys)
• Multidimensional association rule (more than one predicate):
age(x, "20..29") ^ income(x, "40..49K") => buys(x, "laptops")
[support = 2%, confidence = 60%]
• An association rule should satisfy both a minimum support threshold and a
minimum confidence threshold
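As a small worked illustration of support and confidence, the following Python sketch evaluates the rule computer => software on a toy transaction list. The transactions and the two threshold values are invented for the example, and the function names `support` and `confidence` are simply illustrative helpers.

```python
# Hypothetical transactions (assumed data, one set of items per purchase).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    left = sum(1 for t in transactions if antecedent <= t)
    return both / left


s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)

# Assumed thresholds; a rule is kept only if it meets both.
min_support, min_confidence = 0.3, 0.6
is_strong = s >= min_support and c >= min_confidence
print(s, c, is_strong)
```

Here 2 of 5 transactions contain both items (support 40%), and 2 of the 3 transactions containing a computer also contain software (confidence about 67%), so the rule passes both assumed thresholds.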
Classification and Regression for Predictive Analysis
• What is classification and its example?
• Training data and test data
• Derived models can be presented as:
1. Classification rules (IF-THEN rules)
2. Decision tree
3. Mathematical formulae
4. Neural networks
• Regression analysis
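The roles of training data, test data, and an IF-THEN rule as the derived model can be sketched as follows. The income figures, class labels, and the single-threshold learner `learn_threshold` are assumptions chosen to keep the example tiny; real classifiers such as decision trees or neural networks are considerably more involved.

```python
# Hypothetical labeled data: (income in $K, bought_computer).
training = [(20, "no"), (25, "no"), (30, "no"), (45, "yes"), (50, "yes"), (60, "yes")]
test = [(22, "no"), (48, "yes"), (35, "no"), (55, "yes")]


def classify(income, threshold):
    # The derived model expressed as an IF-THEN rule:
    # IF income >= threshold THEN "yes" ELSE "no".
    return "yes" if income >= threshold else "no"


def learn_threshold(data):
    """Pick the income threshold that classifies the most training
    examples correctly (a one-rule learner, assumed for illustration)."""
    best_t, best_correct = None, -1
    for t in sorted(x for x, _ in data):
        correct = sum(classify(x, t) == y for x, y in data)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


# Learn the model on the training data only.
threshold = learn_threshold(training)

# Estimate accuracy on the held-out test data.
accuracy = sum(classify(x, threshold) == y for x, y in test) / len(test)
print(threshold, accuracy)
```

Keeping the test data out of the learning step is what makes the accuracy estimate meaningful for unseen records.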
Statistics
• Statistics studies the collection, analysis, interpretation or
explanation, and presentation of data.
• Statistical model
• Statistical description
• Inferential statistics or predictive statistics
• Statistical hypothesis test
Database Systems, Data Warehouses, and Information Retrieval
• Database systems research
• Data warehouse
• Information retrieval
• Language model
• Topic model
Issues in Data Mining
• Mining Methodology
• User Interaction
• Efficiency and Scalability
• Diversity of database types
• Data Mining and society
Mining Methodology
• Mining various and new kinds of knowledge
• Mining knowledge in multidimensional space
• Data mining: an interdisciplinary effort
• Boosting the power of discovery in a networked environment
• Handling uncertainty, noise, or incompleteness of data
• Pattern evaluation and pattern- or constraint-guided mining
User Interaction
• Interactive mining
• Incorporation of background knowledge
• Ad hoc data mining and data mining query
languages
• Presentation and visualization of data mining
results
Efficiency and Scalability
• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining
algorithms
• Cloud computing and cluster computing
Diversity of Database Types
• Handling complex types of data
• Mining dynamic, networked and global data
repositories
Data Mining and Society
• Social impacts of data mining
• Privacy-preserving data mining
• Invisible data mining
Summary
• Data mining: discovering interesting patterns and knowledge from
massive amounts of data
• A natural evolution of database technology, in great demand, with
wide applications
• A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
• Data mining technologies and applications
• Major issues in data mining
DATA WAREHOUSE: BASIC CONCEPTS
• What is a data warehouse?
• Subject-oriented, integrated, time-variant, nonvolatile
• How are organizations using the information from
data warehouses?
- Knowledge workers
• Query-driven approach (traditional database approach)
• Update-driven approach (data warehousing approach)
Differences between Operational Database Systems and Data Warehouses
• What is OLTP and OLAP?
- Online transaction processing (OLTP)
- Online analytical processing (OLAP)
• Major features/differences between OLTP and OLAP systems
- User and system orientation
- Data contents
- Database design
- View
- Access patterns
Why Have a Separate Data Warehouse?
• DBMS
• Data Warehouse
• Different functions and different data
- Missing data
- Data consolidation
- Data quality
Data Warehousing: A Multitiered Architecture
• Bottom tier: data warehouse server
- Data sources
- Gateways
• Middle tier: OLAP server
- ROLAP (relational OLAP) server
- MOLAP (multidimensional OLAP) server
• Top tier: Front-end tools
Data Warehouse Models
• Enterprise warehouse
• Data Mart
• Virtual warehouse
• Types of Data Mart
- Independent data mart
- Dependent data mart
• Data warehouse development
- Top-down approach and bottom-up approach to data warehouse development
Data Warehouse Models
• A high-level corporate data model is defined
within a short period
• Enterprise and Department Data Marts
• Distributed Data Marts
• Multitier Data Warehouse
Metadata Repository
• Description of the data warehouse structure
• Operational metadata
- Data lineage
- Currency of data
- Monitoring information
• Algorithms used for summarization
• Mapping from the operational environment to data
warehouse
• Data related to system performance
• Business metadata
Data Warehouse Modeling: Data Cube and OLAP
• What is a data cube?
• Facts
• Fact table
• Lattice of cuboids
• Base cuboid
• Apex cuboid
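The relationship between the lattice of cuboids, the base cuboid, and the apex cuboid can be illustrated with a short Python sketch. The three dimension names are hypothetical, and the sketch assumes no concept hierarchies on any dimension, in which case n dimensions yield 2^n cuboids.

```python
from itertools import combinations

# Hypothetical dimensions of a sales data cube (assumed for illustration).
dimensions = ("time", "item", "location")


def lattice_of_cuboids(dims):
    """Return every group-by subset of dims, ordered from the 0-D apex
    cuboid up to the n-D base cuboid."""
    return [combo for k in range(len(dims) + 1) for combo in combinations(dims, k)]


cuboids = lattice_of_cuboids(dimensions)

# Apex cuboid: no dimensions, i.e. a single grand total over all facts.
apex = cuboids[0]
# Base cuboid: all dimensions, i.e. the lowest level of summarization.
base = cuboids[-1]

for c in cuboids:
    print(len(c), "-D cuboid:", c)
```

Each cuboid corresponds to one way of grouping the fact table; the ones in between the apex and the base summarize over some dimensions while keeping others.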
Schema Hierarchy vs. Set-Grouping Hierarchy
• Data warehouse vs. data mart
• Dimensions: the role of concept hierarchies
- A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts
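The two kinds of hierarchy named in the slide title can be contrasted with a brief Python sketch. The city-to-country mapping, the price cut-offs, and the group labels are all assumptions for illustration: the first shows a schema hierarchy (an ordering among attributes, here city below country), the second a set-grouping hierarchy (value ranges grouped into named concepts).

```python
# Schema hierarchy: an ordering among attributes, here city < country.
# The mapping below is assumed sample data.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}


# Set-grouping hierarchy: ranges of values grouped into higher-level
# concepts. The cut-offs are hypothetical.
def price_group(price):
    if price < 100:
        return "inexpensive"
    if price < 500:
        return "moderately priced"
    return "expensive"


# Low-level records: (city, price).
records = [("Vancouver", 80), ("Toronto", 250), ("Chicago", 650)]

# Generalize each record to the higher-level concepts.
generalized = [(city_to_country[city], price_group(price)) for city, price in records]
print(generalized)
```

Either kind of mapping lets the same data be viewed at a coarser level, which is what roll-up along a dimension relies on.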