UNIT-1 NOTES Data warehousing and data mining.pptx

Data Ware Housing
and Data Mining
UNIT 1
BY T.MUKTHAR AHAMED
& CHAITANYA(4TH
CSE)

Data Warehouse: Basic Concepts
 A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data used
to support management's decision-making process.
 Subject-Oriented: Data warehouses focus on major subjects of the enterprise, such as customers,
products, and sales, rather than on the day-to-day operations. This is achieved by excluding data not
relevant to decision support and organizing data around subjects to facilitate analysis.
 Integrated: A data warehouse integrates data from multiple, heterogeneous sources. This involves
ensuring consistency in naming conventions, encoding structures, and attribute measures. For
example, different units of measure for the same data (e.g., currency) must be reconciled.
 Time-Variant: Data in a data warehouse provides information from a historical perspective. It is
typically stored to provide insights over a long period (e.g., 5-10 years or more). Every data structure
in the warehouse implicitly or explicitly contains a time element.
 Nonvolatile: The data in a data warehouse is stable. Once data is loaded, it generally remains constant
and is not updated or deleted. This helps ensure that the historical data for analysis is consistent and
reliable.

 Differences between Operational Database Systems and Data Warehouses
Feature Operational Database Systems Data Warehouses
Primary Purpose Run day-to-day business operations Support decision making and analysis
Data Content Current data Historical, summarized, and consolidated data
Data Organization Application-oriented Subject-oriented
Data Type Detailed, frequently updated Summarized, stable, read-only
Processing Online Transaction Processing (OLTP) Online Analytical Processing (OLAP)
Focus Transaction throughput, concurrency control Query performance, complex analytical queries

Why Have a Separate Data Warehouse?
A separate data warehouse is crucial for several reasons:
 Separation of Concerns: Separating analytical processing from operational databases prevents
performance degradation on transactional systems due to complex analytical queries.
 Data Consistency: It allows for the integration of data from various sources into a single,
consistent format, resolving inconsistencies and redundancies present in source systems.
 Historical Data: Operational databases usually store only current data, whereas data warehouses
maintain historical data necessary for trend analysis, forecasting, and long-term decision
making.
 Query Complexity: Data warehouses are designed to handle complex, ad-hoc analytical queries
efficiently, which would be inefficient or impractical on operational systems.

Data Warehousing: A Multitiered Architecture
A typical data warehousing architecture consists of three tiers:
 Bottom Tier (Data Warehouse Server): This is usually a relational database system that stores
the data warehouse. It includes back-end tools for data extraction, cleaning, transformation,
loading, and refreshing.
 Middle Tier (OLAP Server): This tier acts as a bridge between the user and the bottom-tier
database. It can be implemented using:
o Relational OLAP (ROLAP): An extended relational DBMS that maps OLAP operations to
standard relational operations.
o Multidimensional OLAP (MOLAP): A specialized multidimensional database (MDDB) that
directly implements multidimensional data and operations.
 Top Tier (Client Layer): This tier includes front-end tools for querying, reporting, analysis, and
data mining. These tools allow users to interact with the data warehouse and perform various
analytical tasks.

Data Warehouse Models
 Enterprise Warehouse: A comprehensive corporate-wide data warehouse that collects
information about all subjects spanning the entire organization.
 Data Mart: A subset of the enterprise data warehouse that focuses on a specific
department, subject area, or business function (e.g., sales, marketing, finance). Data
marts can be dependent (sourced from an enterprise warehouse) or independent (sourced
directly from operational data or external data).
 Virtual Warehouse: A set of operational views over operational databases. It is easy to
construct but may not be as efficient for complex queries as a physical data warehouse,
nor does it typically store historical data.

Extraction, Transformation, and Loading (ETL)
ETL are crucial processes for populating and refreshing the data warehouse:
 Extraction: Gathering data from multiple, heterogeneous, and external sources.
 Transformation: Cleaning and transforming the extracted data into a consistent format
suitable for the data warehouse. This includes data cleaning (detecting and rectifying
errors), data integration (combining data from different sources), and data reduction
(reducing the volume of data).
 Loading: Loading the transformed data into the data warehouse.

Metadata Repository
A metadata repository stores metadata, which is "data about data." It is essential for the
effective use and management of a data warehouse. It includes:
 Operational Metadata: Data lineage (source, transformation, destination), currency
of data, and monitoring information.
 Business Metadata: Data definitions, ownership, and business rules.
 Technical Metadata: Schema, mapping, and ETL process details

September 3, 2025
9
Cube: A Lattice of Cuboids
all
time item location supplier
time,item time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,location
time,item,supplier
time,location,supplier
item,location,supplier
time, item, location, supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid

Data Warehouse Modeling: Data Cube and OLAP
Data warehouse modeling primarily employs a multidimensional view of data, often
represented as a data cube. This model is critical for Online Analytical Processing (OLAP),
which allows users to analyze data from different perspectives.
1. Data Cube: A Multidimensional Data Model
The data cube is a core concept in multidimensional data modeling. It allows data to be
viewed from multiple dimensions, such as time, item, location, and supplier, and measures
(e.g., sales, revenue, average rating).
 Dimensions: These are the perspectives or attributes along which an organization wants
to keep records. For example, for a sales data warehouse, dimensions might include
time, item, branch, and location.
 Measures: These are the numerical values that are the subject of analysis, such as
sales_amount, quantity_sold, or profit. Measures are aggregated (e.g., sum, average,
count) across dimensions.

 Fact Table: In a star schema (discussed below), the fact table contains the measures
and foreign keys to the dimension tables. It stores the facts or measurements of interest.
Concept Hierarchy: Each dimension can have a hierarchy, which allows for data analysis
at different levels of abstraction. For example, the location dimension might have a
hierarchy: street < city < province_or_state < country. This enables drilling down
(viewing more detailed data) and rolling up (viewing more aggregated data).

September 3, 2025
A Sample Data Cube
Total annual sales
of TV in U.S.A.
Date
P
r
o
d
u
c
t
Country
sum
sum
TV
VCR
PC
1Qtr 2Qtr 3Qtr 4Qtr
U.S.A
Canada
Mexico
sum

September 3, 2025
13
A Concept Hierarchy: Dimension
(location)
all
Europe North_America
Mexico
Canada
Spain
Germany
Vancouver
M. Wind
L. Chan
...
...
...
... ...
...
all
region
office
country
Toronto
Frankfurt
city

2. Star, Snowflake, and Fact Constellation Schemas
These are common schema designs for data warehouses, optimizing for querying and
analytical performance:
 Star Schema:
o The most common and simplest model.
o Consists of a large fact table in the center, connected to a set of smaller dimension
tables.
o Each dimension is represented by a single dimension table.
o Simple and easy to understand, making it good for query performance.
o Example: A sales fact table connected to time, item, branch, and location dimension
tables.

September 3, 2025
15
Example of Star Schema
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city
province_or_street
country
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_type
item
branch_key
branch_name
branch_type
branch

 Snowflake Schema:
o A variation of the star schema where dimension tables are normalized.
o Dimension tables are further broken down into multiple related tables, forming a
snowflake-like structure.
o Example: The location dimension table might be normalized into city, province, and
country tables.
o Reduces data redundancy but increases the number of joins required for queries,
potentially impacting performance.

September 3, 2025
17
Example of Snowflake Schema
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_key
item
branch_key
branch_name
branch_type
branch
supplier_key
supplier_type
supplier
city_key
city
province_or_street
country
city

Fact Constellation Schema (Galaxy Schema):
o Consists of multiple fact tables sharing some common dimension tables.
o More complex than star or snowflake schemas.
o Suitable for highly intricate data warehouse designs with multiple business
processes that share common dimensions.
o Example: A sales fact table and a shipping fact table sharing the time and item
dimensions, but having their own specific dimensions (e.g., customer for sales,
shipper for shipping).

September 3, 2025
19
Example of Fact Constellation
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city
province_or_street
country
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_type
item
branch_key
branch_name
branch_type
branch
Shipping Fact Table
time_key
item_key
shipper_key
from_location
to_location
dollars_cost
units_shipped
shipper_key
shipper_name
location_key
shipper_type
shipper

3. OLAP Operations
Online Analytical Processing (OLAP) provides the analytical capabilities for exploring data
in a multidimensional way. Key OLAP operations include:
 Roll-up (Drill-up): Aggregates data by climbing up a concept hierarchy or by
dimension reduction.
o Example: Moving from city to country in a location hierarchy, or summarizing sales
from individual products to product categories.
 Drill-down: The reverse of roll-up, providing more detailed data by stepping down a
concept hierarchy or introducing a new dimension.
o Example: Moving from country to city in a location hierarchy, or viewing sales by
individual products after previously viewing by product categories.

 Slice: Selects a single dimension from the cube, resulting in a 2D view (a sub cube) for a specific value of one
or more dimensions.
o Example: Fixing the time dimension to "Q1 2024" and viewing sales data across item and location.
 Dice: Defines a sub cube by performing a selection on two or more dimensions.
o Example: Selecting sales for "Q1 2024" (time dimension) and "New York" (location dimension) and
viewing results for item and customer.
 Pivot (Rotate): Rotates the axes of the data cube to provide a different multidimensional perspective of the
data. This allows users to view the data from different orientations.
o Example: Swapping the item and location dimensions on a 2D slice to view location by item instead of
item by location.
 Other OLAP Operations:
o Drill-across: Involves querying across multiple fact tables.
o Drill-through: Allows users to go from the aggregated data in the data warehouse to the detailed
operational data in the source systems.

4. OLAP vs. OLTP
 OLAP (Online Analytical Processing):
o Focuses on analysis, decision support, and complex queries involving aggregations
over large data volumes.
o Characterized by fewer transactions, but these transactions are complex and involve
many records.
o Primarily read-only operations.
o Uses historical data.
o Optimized for query retrieval speed.

ROLAP SERVER
September 3, 2025
27

MOLAP SERVER
September 3, 2025
28

HOLAP SERVER
September 3, 2025
29

September 3, 2025
30
An OLAM Architecture
Data
Warehouse
Meta
Data
MDDB
OLAM
Engine
OLAP
Engine
User GUI API
Data Cube API
Database API
Data cleaning
Data integration
Layer3
OLAP/OLAM
Layer2
MDDB
Layer1
Data
Repository
Layer4
User Interface
Filtering&Integration Filtering
Databases
Mining query Mining result

 OLTP (Online Transaction Processing):
o Focuses on day-to-day operations, rapid processing of transactions, and data
integrity.
o Characterized by many short, atomic transactions.
o Primarily update, insert, and delete operations.
o Uses current data.
o Optimized for transaction throughput.

Data Warehouse Design and Usage
Designing and using a data warehouse involves a systematic approach, from gathering requirements
to implementation and deployment.
1. Data Warehouse Design Process
The design of a data warehouse is a complex process that involves careful planning and execution.
Key steps include:
 Top-Down vs. Bottom-Up Approaches:
o Top-Down Approach (Inmon's approach): Starts with creating a corporate-wide, normalized
enterprise data model and then builds data marts from it. This approach emphasizes data
consistency and integration across the entire organization.
o Bottom-Up Approach (Kimball's approach): Focuses on building individual data marts first,
based on specific business processes or departmental needs, and then integrating them to form
a larger enterprise data warehouse. This approach is often quicker to implement and delivers
early value.

 Combined Approach (Hybrid Approach): Many organizations adopt a hybrid
approach, combining the benefits of both top-down and bottom-up strategies. This might
involve an enterprise-wide data warehouse providing integrated data, with departmental
data marts built on top for specific analytical needs.
 Business Requirements Gathering: This is a crucial initial step where designers work
closely with business users to understand their analytical needs, reporting requirements,
and decision-making processes. This helps define the scope, dimensions, and measures
for the data warehouse.

 Data Modeling: This involves designing the logical and physical structures of the data
warehouse, typically using star, snowflake, or fact constellation schemas. This step
defines dimension tables, fact tables, their attributes, and relationships.
 Physical Design: This phase focuses on physical storage considerations, indexing
strategies, partitioning, and aggregation techniques to optimize query performance and
data loading.

2. Data Warehouse Usage for Information Processing
Data warehouses support various types of information processing, moving beyond simple
querying to sophisticated analytical capabilities:
 Information Processing Stages:
o Querying: Basic data retrieval using standard SQL queries, allowing users to select,
project, and join data.
o Basic Online Analytical Processing (OLAP): Involves operations like drill-down,
roll-up, slice, dice, and pivot for interactive multidimensional analysis.
o Advanced Analytical Processing (Data Mining): Goes beyond traditional OLAP to
discover hidden patterns, correlations, clusters, and predictions from the data. This
includes tasks like association rule mining, classification, clustering, and prediction.

 OLAP Applications: Data warehouses are extensively used for various OLAP
applications, including:
o Trend Analysis: Identifying patterns and trends over time.
o Exception Reporting: Spotting unusual deviations or outliers in business
performance.
o Ad-hoc Querying: Allowing users to pose flexible, unanticipated questions to explore
data.
o Reporting and Dashboards: Generating static or interactive reports and dashboards
for monitoring key performance indicators (KPIs) and business metrics.

3. Data Warehouse Implementation
Implementation involves the technical execution of the design:
 ETL (Extraction, Transformation, and Loading): This is a critical component of data
warehouse implementation. It involves extracting data from source systems, transforming
it into a consistent format suitable for the data warehouse, cleaning it, and loading it into
the target database. ETL processes also handle data refreshing and incremental updates.
 Tool Selection: Choosing appropriate ETL tools, database management systems (DBMS),
OLAP servers, and front-end analytical tools.
 Performance Optimization: Strategies such as indexing, materialized views (pre-
computed aggregates), partitioning, and efficient storage mechanisms are employed to
ensure fast query response times.
 Security and Access Control: Implementing robust security measures to protect sensitive
data and control user access based on roles and permissions.

4. Data Warehouse Metadata
Metadata (data about data) is crucial for the effective design, implementation, and usage of
a data warehouse. It includes:
 Business Metadata: Defines data elements in business terms, providing context and
meaning to users. Examples include data definitions, ownership, and business rules.
 Operational Metadata: Describes the data's origin, transformation history, loading
frequency, and data quality information. This helps in managing and monitoring the
ETL processes.
 Technical Metadata: Details the data warehouse schema, data mapping from sources to
the warehouse, and ETL process logic. This information is primarily for developers and
administrators.
 Importance of Metadata: It helps users understand the data, facilitates data
governance, supports data quality initiatives, and aids in system maintenance and
evolution.

5. Data Warehouse Development Methodologies
Various methodologies can guide the development of a data warehouse:
 Waterfall Model: A traditional, sequential approach where each phase (requirements,
design, implementation, testing, deployment) is completed before moving to the next.
While structured, it can be less flexible for evolving business needs.
 Iterative and Incremental Development: This approach involves developing the data
warehouse in small, manageable iterations or phases. Each iteration delivers a functional
subset of the data warehouse, allowing for early user feedback and adaptation to changing
requirements. This is often preferred for its flexibility.

UNIT-1 NOTES Data warehousing and data mining.pptx

September 3, 2025
41
Multi-Tiered Architecture
Data
Warehouse
Extract
Transform
Load
Refresh
OLAP Engine
Analysis
Query
Reports
Data mining
Monitor
&
Integrator
Metadata
Data Sources Front-End Tools
Serve
Data Marts
Operational
DBs
other
source
s
Data Storage
OLAP Server

Data Warehouse Implementation
Implementing a data warehouse involves a series of critical steps and considerations to ensure its
effectiveness and efficiency. This chapter focuses on the techniques and challenges associated with
materializing and managing the data warehouse.
1. Efficient Data Cube Computation
Building data cubes, especially for large datasets, can be computationally intensive. Efficient methods
are crucial:
 Materialization of Views/Cubes: Pre-computing and storing certain aggregated views or subcubes
can significantly speed up query processing. However, fully materializing all possible views can be
impractical due to storage costs and maintenance overhead.
o Partial Materialization: A common strategy is to materialize only a select set of views that are
frequently queried or are computationally expensive to derive on the fly.
o Selection of Views: The challenge is to choose which views to materialize. This involves trade-
offs between query response time, space requirements, and the time needed to update the
materialized views.

 Techniques for Cube Computation:
o MultiWay Array Aggregation: An efficient method for computing data cubes from a
base cuboid (the smallest cuboid containing all dimensions). It involves sorting and
aggregating data in multiple passes.
o BUC (Bottom-Up Cube): A scalable method that computes cuboids in a bottom-up
fashion, starting from the most aggregated cuboid and moving towards more detailed
ones. It prunes redundant computations and effectively handles sparse data.
o Star-Cubing: Integrates the advantages of both MultiWay and BUC, offering improved
performance by using a "star-tree" data structure for simultaneous aggregation.

o Shell-Fragments: Computes only a small portion (shell) of the cube and derives
other parts as needed, reducing computation and storage.
o Computing Cubes with Constraints: Techniques for computing cubes where
certain conditions or constraints must be satisfied by the aggregated data.
o Computing Iceberg Cubes: Focuses on computing only those cuboids or cells
that satisfy a minimum support (or measure) threshold, effectively pruning cells
with low aggregate values. This is particularly useful for very large or sparse
datasets.
o High-Dimensional OLAP: Addresses the challenges of data cubes with many
dimensions, where traditional methods might struggle due to the "curse of
dimensionality."

2. Indexing OLAP Data
Indexing is fundamental for accelerating query performance in data warehouses. Traditional B+-tree
indexes are often used, but specialized indexing techniques for multidimensional data are more
effective:
 Bitmap Indexing:
o Creates a bitmap for each distinct value of a dimension.
o Highly effective for low-cardinality attributes (attributes with a small number of distinct
values), as bitwise operations are very fast.
o Can also be used for higher-cardinality attributes with binning.
o Example: For a gender dimension (Male, Female), there would be two bitmaps.
 Join Indexing:
o A pre-computed join between a dimension table and a fact table.
o Speeds up join operations by storing the row IDs of related entries, avoiding runtime joins.
o Particularly useful in star schemas.

3. OLAP Query Processing
Optimizing the processing of complex OLAP queries is crucial:
 Query Rewrite: Transforming user queries into equivalent but more efficient forms that
can leverage materialized views or indexes.
 Query Optimization: Selecting the most efficient execution plan for a query, considering
factors like available indexes, materialized views, and data distribution.
 Caching: Storing frequently accessed query results or data blocks in memory to reduce
disk I/O.

4. Data Warehouse Metadata Management
Metadata is vital for understanding, managing, and using the data warehouse effectively:
 Metadata Repository: A centralized repository that stores all types of metadata
(business, operational, technical).
 Importance of Metadata:
o Data Understanding: Helps users and analysts understand the meaning, source, and
lineage of data.
o Data Governance: Supports data quality, compliance, and security efforts.
o ETL Management: Tracks the status and history of ETL processes.
o System Maintenance: Aids in troubleshooting, system evolution, and impact
analysis.

5. Data Warehouse Back-End Tools and Utilities
A suite of tools is required to manage the data warehouse environment:
 Data Extraction: Tools to pull data from various source systems (relational databases, flat files,
ERP systems, etc.).
 Data Cleaning and Transformation: Tools to identify and correct data errors, resolve
inconsistencies, aggregate, and summarize data according to the data warehouse schema. This is
a crucial and often time-consuming part of the ETL process.
 Data Loading: Tools to load the transformed data into the data warehouse, including initial full
loads and incremental updates.
 Data Refreshing: Processes for periodically updating the data warehouse to reflect changes in
source systems, including handling new data and changes to existing data.
 Warehouse Manager: A central component responsible for managing and coordinating the
various back-end tasks, including scheduling ETL jobs, monitoring performance, and
administering the warehouse.

6. Data Warehouse Administration and Management
Ongoing administration is essential for the smooth operation of a data warehouse:
 Security Management: Defining and enforcing access controls, authentication, and authorization
policies to protect sensitive data.
 Performance Monitoring and Tuning: Continuously monitoring system performance, identifying
bottlenecks, and applying optimizations to maintain fast query response times.
 Backup and Recovery: Implementing robust backup and disaster recovery plans to protect against
data loss.
 Archiving: Managing the archival of old data to optimize storage and performance while retaining
historical information.
 User Training and Support: Providing training and support to end-users on how to effectively use
the data warehouse and its analytical tools.

Data Mining and Pattern Mining: Technologies
Data mining is a blend of multiple disciplines, each contributing essential techniques and concepts.
1. Interdisciplinary Nature of Data Mining
Data mining draws upon a wide range of fields to achieve its objectives of discovering knowledge
from data. These include:
 Database Systems and Data Warehouse Technology: These provide the infrastructure for
storing, managing, and retrieving large datasets. Data warehouses, in particular, are designed for
analytical processing, making them ideal sources for data mining.
 Statistics: Statistical methods are fundamental to data mining for tasks such as hypothesis testing,
regression analysis, correlation analysis, and data summarization. They provide the mathematical
rigor for understanding data distributions and relationships.

 Machine Learning: This field provides algorithms that enable systems to learn from data
without being explicitly programmed. Key machine learning techniques used in data
mining include classification, clustering, regression, and anomaly detection.
 Information Retrieval: Techniques from information retrieval are used for handling
unstructured or semi-structured data, particularly text, and for efficient data access and
relevance ranking.
 Pattern Recognition: This discipline focuses on the automatic discovery of patterns and
regularities in data. It encompasses methods for classification, clustering, and feature
extraction.

 Artificial Intelligence (AI): AI contributes search algorithms, knowledge representation
techniques, and reasoning mechanisms that are applied in various data mining tasks.
 Visualization: Tools and techniques for visually representing data and mining results help
users understand complex patterns and insights.
 High-Performance Computing: Given the massive scale of data, high-performance
computing, including parallel and distributed computing, is essential for efficient data
mining algorithm execution.

2. Relationship with Other Fields
Data mining is not an isolated discipline but rather interacts closely with and leverages advances
from other scientific and commercial fields:
 Database Management Systems (DBMS) and Data Warehousing: Data mining often operates
on data stored in traditional relational databases or, more commonly, in data warehouses. Data
warehousing provides the integrated, historical, and subject-oriented data necessary for effective
analysis.
 Statistics: While both fields deal with data analysis, data mining often focuses on finding
patterns in very large datasets with less emphasis on statistical significance testing for specific
hypotheses, instead favoring algorithmic approaches for pattern discovery.
 Machine Learning: Data mining adopts many algorithms from machine learning, such as
decision trees, neural networks, and support vector machines, for tasks like prediction and
classification.
 Knowledge-Based Systems: Data mining can integrate with knowledge-based systems by
extracting new knowledge that can then be incorporated into expert systems or intelligent agents.

3. Technologies Contributing to Data Mining
Several specific technological advancements have significantly contributed to the emergence
and growth of data mining:
 Massive Data Collection: The ability to collect and store vast amounts of data
electronically (e.g., from transactions, web logs, sensor networks) has created the "data
rich, information poor" problem, which data mining aims to solve.
 Powerful Multiprocessor Computers: The availability of powerful and affordable
parallel and distributed computing systems allows for the processing of immense datasets
in reasonable timeframes.
 Advanced Data Storage Mechanisms: Technologies like sophisticated database
management systems, data warehouses, and cloud storage provide efficient and scalable
ways to manage diverse data types.

 Improved Data Access and Connectivity: The internet and advanced networking
technologies facilitate the aggregation of data from disparate sources.
 Development of Sophisticated Algorithms: Continuous research and development in
statistics, machine learning, and AI have led to increasingly sophisticated algorithms
capable of uncovering complex patterns.
 User-Friendly Data Mining Software: The development of user-friendly tools and
software packages has made data mining accessible to a wider audience, including
business analysts.
In essence, data mining is a multidisciplinary field that combines robust computational
power with advanced analytical techniques to extract valuable, hidden knowledge from large
datasets, enabling informed decision-making across various domains.

Applications of Data Mining
Data mining has a vast array of applications across numerous industries and domains, driven by the
increasing availability of data and the need to extract actionable insights.
Applications and Trends
 Business Intelligence:
o Market Analysis and Management: Understanding customer buying behavior, identifying
profitable customers, cross-marketing, market segmentation, and target marketing.
o Risk Management: Fraud detection (e.g., credit card fraud, insurance fraud), anti-money
laundering.
o Customer Relationship Management (CRM): Predicting customer churn, improving
customer retention, personalized marketing, and customer service optimization.
o Financial Data Analysis: Loan prediction, financial forecasting, stock market analysis, and
risk assessment.

 Scientific and Engineering Applications:
o Bioinformatics and Medical Data Analysis: Gene pattern analysis, protein function
prediction, disease diagnosis, drug discovery, and personalized medicine.
o Astronomy: Discovering celestial objects, analyzing astronomical images and sensor data.
o Geospatial Data Mining: Analyzing spatial data for urban planning, environmental
monitoring, and resource management.
o Scientific Data Analysis: Identifying patterns and anomalies in experimental data from
various scientific disciplines.

 Other Application Areas:
o Telecommunications: Network fault diagnosis, call detail record analysis, customer churn
prediction.
o Retail: Sales forecasting, inventory management, customer shopping pattern analysis, product
placement optimization.
o Education: Predicting student performance, identifying at-risk students, optimizing learning
paths.
o Web Mining: Analyzing web usage patterns, personalized recommendations, search engine
optimization, and community discovery.
o Text Mining: Analyzing large collections of text documents for information extraction, topic
modeling, and sentiment analysis.
o Image and Video Data Mining: Object recognition, surveillance, content-based image/video
retrieval.
o Social Network Analysis: Understanding relationships, influence, and community structures in
social media.

Data Mining Applications
 Financial Data Analysis:
o Loan Payment Prediction and Customer Credit Policy Analysis: Building models to
predict loan default rates and optimize credit policies.
o Classification and Clustering for Loan Application: Categorizing loan applicants
into risk groups.
o Fraud Detection and Management: Detecting anomalous transactions (e.g., credit
card fraud, insurance fraud, stock manipulation).
o Financial Planning and Asset Management: Identifying investment opportunities,
optimizing portfolios, and predicting market trends.

 Retail Industry:
o Customer Retention and Churn Analysis: Predicting which customers are likely to
leave and devising strategies to retain them.
o Market Basket Analysis: Discovering associations between products frequently
bought together to inform product placement and promotional strategies.
o Sales Forecasting: Predicting future sales to optimize inventory and staffing.
o Target Marketing: Identifying specific customer segments for personalized
promotions.

 Telecommunications Industry:
o Network Fault Management: Identifying network anomalies and predicting
potential failures.
o Fraud Detection: Detecting fraudulent call patterns.
o Customer Churn Prediction: Identifying customers likely to switch service
providers.
o Call Detail Record Analysis: Analyzing call patterns to understand customer
behavior and optimize service plans.

 Biological Data Analysis (Bioinformatics):
o Gene Expression Analysis: Identifying genes that are co-expressed or involved in
specific biological processes.
o Protein Function Prediction: Inferring the function of unknown proteins based on
sequence or structural similarities.
o Drug Discovery: Identifying potential drug candidates and understanding their
interactions with biological targets.
o Disease Diagnosis and Prognosis: Developing models to diagnose diseases and
predict disease progression.

 Other Scientific Applications:
o Astronomy and Astrophysics: Categorizing celestial objects, detecting unusual
phenomena in telescope data.
o Geospatial Data Mining: Discovering patterns in geographical data, such as land-
use changes, environmental impact assessment, and resource management.
o Materials Science: Discovering new materials properties from experimental data.

 Health Care:
o Patient Diagnosis and Prognosis: Assisting doctors in diagnosing diseases and
predicting patient outcomes.
o Medical Image Analysis: Identifying abnormalities in medical scans.
o Personalized Medicine: Tailoring treatments based on individual patient
characteristics.
 Intrusion Detection:
o Detecting anomalies and malicious activities in computer networks.
o Classifying network traffic as normal or intrusive.

 Recommender Systems:
o Providing personalized recommendations for products, movies, music, or news based on user
preferences and past behavior.
 Social Media Analysis:
o Community Detection: Identifying groups of users with similar interests or connections.
o Influence Analysis: Identifying influential users in a network.
o Sentiment Analysis: Determining the emotional tone or opinion expressed in text data.
These applications demonstrate that data mining is not just a theoretical field but a practical
discipline with the power to transform industries and advance scientific discovery by extracting
valuable, actionable knowledge from vast amounts of data.

Major Issues in Data Mining
The key challenges and research frontiers in data mining, categorizing them into mining methodology issues,
user interaction issues, performance issues, and diverse data type issues.
1. Mining Methodology and User Interaction Issues
These issues relate to the effectiveness and usability of data mining techniques:
 Mining Various Kinds of Knowledge in Databases:
o Data mining systems should be able to discover different types of knowledge, including
characterization, discrimination, association, classification, prediction, clustering, and outlier analysis.
o The ability to mine multiple kinds of patterns requires diverse methodologies and algorithms.
 Interactive Mining of Knowledge at Multiple Levels of Abstraction:
o Users often need to explore data at different levels of granularity (e.g., from overall sales trends to
specific product sales).
o Data mining systems should allow for interactive drilling down, rolling up, and pivoting through data,
similar to OLAP capabilities. This helps users focus on relevant patterns.

 Incorporation of Background Knowledge:
o Domain-specific knowledge (background knowledge) can guide the mining process, improve
the quality of discovered patterns, and help in pattern evaluation.
o Integrating such knowledge, often in the form of concept hierarchies or expert rules, is crucial
for meaningful results.
 Handling Noisy or Incomplete Data:
o Real-world data is often imperfect, containing noise, errors, or missing values.
o Robust data mining methods are needed to handle these imperfections without significantly
compromising the accuracy or reliability of the discovered patterns. This includes data
cleaning and imputation techniques.

 Pattern Evaluation—The Problem of Interestingness Measures:
o Not all discovered patterns are equally interesting or useful to a user.
o Developing effective "interestingness measures" (e.g., support, confidence, lift for
association rules; accuracy for classifiers) to identify truly novel, actionable, and
significant patterns is a major challenge.
o Subjective interestingness (relevance to the user's goals) is also important.
 Mining Information from Heterogeneous Databases and Global Information
Systems:
o Data often resides in diverse sources with different formats, schemas, and semantic
meanings.
o Integrating and mining data from heterogeneous sources (e.g., relational databases,
data warehouses, Web, text, spatial, multimedia) is a complex task.

2. Performance Issues
These issues address the scalability and efficiency of data mining algorithms:
 Efficiency and Scalability of Data Mining Algorithms:
o Data mining algorithms must be efficient enough to handle massive datasets.
o Scalability refers to the ability of an algorithm to maintain good performance as the data
volume increases linearly or super linearly.
o This requires optimized algorithms, parallel processing, and incremental mining techniques.
 Parallel, Distributed, and Incremental Mining Methods:
o To cope with huge datasets, parallel and distributed data mining algorithms that run on
multiple processors or machines are essential.
o Incremental mining methods allow updating existing patterns or discovering new ones
without re-mining the entire dataset from scratch when new data arrives.

3. Issues Relating to the Diversity of Data Types
Data mining is increasingly applied to complex and varied data types beyond traditional relational tables:
 Mining Complex Types of Data:
o Traditional data mining often focuses on relational and transactional data.
o New challenges arise from mining complex data types such as:
 Text and Web Data: Unstructured and semi-structured data requiring natural language processing
and graph mining techniques.
 Multimedia Data: Images, audio, and video, where content-based retrieval and feature extraction
are critical.
 Spatial and Spatio-temporal Data: Geographical information with location and time components.
 Time-Series Data: Sequential data like stock prices or sensor readings.
 Stream Data: Data that arrives continuously and rapidly, requiring real-time processing.
 Graph and Network Data: Social networks, biological networks, and the Web graph.

4. Social Impacts of Data Mining
Beyond technical challenges, data mining raises important ethical and societal concerns:
 Privacy and Security Issues:
o The collection and analysis of personal data raise concerns about individual privacy.
o Ensuring data anonymity, developing privacy-preserving data mining (PPDM) techniques, and
securing sensitive information are crucial.
 Societal Impacts:
o Potential for misuse of mined knowledge (e.g., discrimination, surveillance).
o Impact on employment and decision-making processes.
o Issues of fairness, accountability, and transparency in algorithmic decision-making.
Addressing these major issues is vital for the continued advancement and responsible application of
data mining technology.

Data Objects and Attribute Types
This introduces fundamental concepts related to the types of data that data mining techniques
operate on, focusing on data objects (records) and their attributes (features).
1. Data Objects
 Definition: Data objects (also called samples, examples, instances, data points, or objects)
are the individual entities or items about which data is collected.
 Representation: In a database, data objects correspond to rows or records.
 Example: In a sales database, data objects could be customers, store items, or sales
transactions. If analyzing customer data, each customer would be a data object.

2. Attributes
 Definition: An attribute (also known as a dimension, feature, or variable) is a data field representing a
characteristic or feature of a data object.
 Representation: In a database, attributes correspond to columns.
 Types of Attributes: Attributes can be classified into different types based on the nature of the values they
can take. Understanding attribute types is crucial because different types of attributes require different data
mining techniques.
3. Attribute Types
Attributes are broadly classified into Nominal, Binary, Ordinal, and Numeric types.
A. Nominal Attributes
 Definition: Values are categories, symbols, or names of things. They are qualitative and do not have a
meaningful order or quantitative meaning.
 Operations: Equality or inequality can be determined (e.g., color = red).
 Examples: hair_color (black, brown, blonde, red), marital_status (single, married, divorced), occupation, ID
numbers (although numeric, they are nominal as they just identify).

B. Binary Attributes
 Definition: A nominal attribute with only two categories or states: 0 or 1.
 Types:
o Symmetric Binary: Both states are equally important and carry no preference (e.g.,
gender (male, female)).
o Asymmetric Binary: The two states are not equally important. For example, in a
medical test, positive (presence of a condition) is typically more important than
negative (absence). In many data mining applications, only the presence (1) of a
characteristic is recorded.
 Example: smoker (yes/no), true/false, customer_qualified (0/1).

C. Ordinal Attributes
 Definition: Values have a meaningful order or ranking among them, but the magnitude
of the difference between successive values is not known or meaningful.
 Operations: Order relationships can be determined (e.g., > or <).
 Examples: size (small, medium, large), education_level (high school, bachelor's,
master's, PhD), ratings (good, better, best).

D. Numeric Attributes
 Definition: Quantitative attributes that are measurable, integer-valued or real-valued. They provide quantitative
measurements.
 Types:
o Interval-Scaled Attributes:
 Measured on a scale of equal-sized units.
 Order matters, and the difference between values is meaningful.
 No true zero point (a value of zero does not mean the absence of the quantity).
 Ratios are not meaningful.
 Examples: temperature in Celsius or Fahrenheit (0°C doesn't mean no temperature), calendar_dates.
o Ratio-Scaled Attributes:
 Have all the properties of interval-scaled attributes.
 Have a true zero point, indicating the complete absence of the quantity.
 Ratios are meaningful (e.g., 20 is twice as much as 10).
 Examples: height, weight, age, monetary _ amount, temperature in Kelvin (0 Kelvin means absolute zero, no
thermal energy).

4. Discrete vs. Continuous Attributes
Attributes can also be broadly categorized based on the number of values they can take:
 Discrete Attributes:
o Has a finite or countably infinite set of values.
o Can be represented as integers.
o Examples: zip_codes, number_of_cars (integer), number_of_children (integer),
attributes of nominal, binary, or ordinal type.
 Continuous Attributes:
o Has real values, typically represented as floating-point numbers.
o The number of possible values is infinite.
o Examples: height, weight, temperature, income.
Understanding these distinctions is crucial for selecting appropriate data preprocessing techniques, similarity
measures, and data mining algorithms, as different attribute types behave differently and require specific
handling.

Basic Statistical Descriptions of Data
These descriptions are crucial for understanding data quality, detecting outliers, and guiding further
data analysis.
1. Measuring the Central Tendency
Measures of central tendency indicate the "middle" or "center" of a data set.
 Mean (Arithmetic Mean):
o The most common measure, calculated as the sum of all values divided by the number of
values.
o Formula: For a set of n observations {x1
,x2
,...,xn
}, the mean is xˉ=n1
∑i=1n
xi
.
o Weighted Arithmetic Mean: Used when values have different importance, given by $ bar{x}
= frac{sum_{i=1}^{n} w_i x_i}{sum_{i=1}^{n} w_i} $, where wiis the weight of xi
.
o Robustness: Sensitive to outliers. A single extreme value can significantly shift the mean.

 Median:
o The middle value in a dataset that has been ordered (sorted).
o If n is odd, the median is the middle value.
o If n is even, the median is the average of the two middle values.
o Robustness: Less sensitive to outliers than the mean. It's a better measure of central tendency for skewed distributions.
 Mode:
o The value that occurs most frequently in a dataset.
o A dataset can have:
 No mode: If all values occur only once. Data: 1, 2, 3, 4, 5
Mode: None (all values occur once)
 Unimodal:Onemode.Data:2,4,4,6,7
Mode: 4 (because it appears twice)
 Bimodal: Two modes. Data: 3, 3, 6, 6, 9
Modes: 3 and 6 (both appear twice)
 Multimodal: More than two modes.
o Applicability: Can be used for numerical and categorical data.

 Midrange:
o The average of the largest and smallest values in a dataset.
o Formula: Midrange = (Max+Min)/2
.
o Robustness: Highly sensitive to outliers.

2. Measuring the Dispersion of Data
Measures of dispersion indicate how spread out the data values are.
 Range:
o The difference between the maximum and minimum values in a dataset.
o Formula: Range = Max−Min.
o Robustness: Highly sensitive to outliers.
 Quartiles and Interquartile Range (IQR):
o Quartiles: Divide an ordered dataset into four equal parts.
 Q1(first quartile): At the 25th percentile.
 Q2(second quartile): The median (50th percentile).
 Q3(third quartile): At the 75th percentile.
o Interquartile Range (IQR): The difference between the third and first quartiles.
o Formula: IQR = Q3
−Q1
.
o Robustness: Less sensitive to outliers than the range. It represents the spread of the middle 50% of the
data.

How to find quartiles:
Order the data: Arrange all the data points in ascending order (from smallest to
largest).
Find the Median (Q2):
If the number of data points (n) is odd, the median is the middle value.
If n is even, the median is the average of the two middle values.
Find Q1: This is the median of the lower half of the data (all values before Q2).
Find Q3: This is the median of the upper half of the data (all values after Q2).

Example: Consider the dataset: 2, 5, 7, 8, 10, 12, 15, 18, 20, 22
Ordered data: 2, 5, 7, 8, 10, 12, 15, 18, 20, 22
Q2 (Median): Since there are 10 data points (even), the median is the average of the 5th
and 6th values. (10+12)/2=11.
So, Q2 = 11.
Lower half: 2, 5, 7, 8, 10
Q1: The median of the lower half is 7.
So, Q1 = 7.
Upper half: 12, 15, 18, 20, 22
Q3: The median of the upper half is 18.
So, Q3 = 18.

Interquartile Range (IQR)
The Interquartile Range (IQR) is a measure of statistical dispersion, or spread, and is
equal to the difference between the upper and lower quartiles. It represents the middle
50% of the data.
Formula for IQR:
IQR = Q3 - Q1
Example (using the previous dataset):
Q3 = 18
Q1 = 7
IQR = 18 - 7 = 11

 Five-Number Summary:
The five-number summary is a set of five descriptive statistics that provide a concise
summary of the distribution of a dataset. It is particularly useful for understanding the shape,
spread, and central tendency of data, and for identifying potential outliers.
The five numbers are:
Minimum Value: The smallest observation in the dataset.
First Quartile (Q1): The value below which 25% of the data falls.
Median (Q2): The middle value of the dataset, with 50% of the data falling below it.
Third Quartile (Q3): The value below which 75% of the data falls.
Maximum Value: The largest observation in the dataset.

These five values are often represented visually in a box plot (also known as a box-and-
whisker plot), which provides a clear graphical representation of the data's distribution.
Example:
Using the dataset from our previous discussion: 2, 5, 7, 8, 10, 12, 15, 18, 20, 22
Minimum Value: 2
Q1: 7
Median (Q2): 11
Q3: 18
Maximum Value: 22
So, the five-number summary for this dataset is (2, 7, 11, 18, 22).

 Boxplot (Box-and-Whisker Plot):
o A graphical representation of the five-number summary.
o The box spans from Q1to Q3
, with a line at the median.
o "Whiskers" extend from the box to the minimum and maximum values (or to 1.5×IQR from Q1and Q3to
identify potential outliers).
o Useful for comparing distributions across different groups or identifying outliers.
 Variance (σ2):
o Measures the average of the squared differences from the mean. It quantifies how much individual data points
vary from the mean.
o Formula (Population): σ2=N1
∑i=1N
(xi
−μ)2.
o Formula (Sample): s2=n−11
∑i=1n
(xi
−xˉ)2.
o Robustness: Sensitive to outliers, as it uses the mean in its calculation.
 Standard Deviation (σ):
o The square root of the variance. It is expressed in the same units as the data, making it more interpretable than
variance.
o Formula: σ=σ2
.

3. Graphic Displays of Basic Statistical Descriptions
Visualizing data is crucial for understanding its characteristics and distributions.
 Histogram:
o Displays the distribution of a numerical attribute by dividing the data into bins and showing the frequency (or count) of values
falling into each bin.
o Reveals the shape of the distribution (e.g., symmetric, skewed, bimodal).
 Quantile Plot:
o Plots each data value xiagainst its corresponding quantile (fi
), where fi
=ni−0.5
.
o Shows the overall distribution of the data and helps identify outliers or clusters.
 Quantile-Quantile Plot (Q-Q Plot):
o Plots the quantiles of one univariate distribution against the corresponding quantiles of another univariate distribution (often a
theoretical distribution like the normal distribution).
o Used to check if a dataset follows a particular distribution or to compare two empirical distributions. If points lie on a 45-degree
line, the distributions are similar.
 Scatter Plot:
o Displays the relationship between two numerical attributes. Each data object is represented as a point in a 2D plane.
o Helps identify trends, correlations (positive, negative, no correlation), clusters, and outliers.
These basic statistical descriptions and graphical displays provide foundational insights into the data, which are indispensable steps in any
data mining process.

Data Visualization
Visualization is a powerful tool for initial data exploration and presenting mining results.
1. Why Data Visualization?
 Human Perception: Humans are highly skilled at recognizing patterns, trends, and anomalies in
visual representations. Visualization leverages this strength.
 Exploratory Data Analysis: It helps in understanding data distribution, relationships between
variables, identifying outliers, and discovering structures that might be hard to find through
purely statistical methods.
 Communication: Effectively communicates insights and findings from data analysis to a diverse
audience.

2. General Categories of Data Visualization Methods
Data visualization techniques can be broadly categorized based on the type of data and the purpose of visualization.
A. Visualizing Positional and Nominal Data
 Pie Chart:
o Used to display the proportion of categories in a whole. Each slice represents a category, and its size is
proportional to the percentage it represents.
o Best for a small number of categories.
 Bar Chart:
o Compares the values of different categories. Each bar represents a category, and its length corresponds to the
value.
o Can be used for nominal, ordinal, or discrete numeric data.
 Stacked Bar Chart:
o Shows the composition of categories within a larger group.
 Table (Listing):
While not strictly a "chart," tabular representation is a fundamental way to present data, allowing for precise value
lookup.

B. Visualizing Time-Series Data
 Line Graphs:
o Connect data points over time to show trends, patterns, and fluctuations.
o The x-axis typically represents time, and the y-axis represents the measured value.
 Area Graphs:
o Similar to line graphs but the area between the line and the x-axis is filled, emphasizing
magnitude.
 Stacked Area Graphs:
o Show how the composition of a total changes over time.

C. Visualizing Data Distributions (for single attributes)
 Histograms:
o Displays the distribution of a numerical attribute by dividing the data into bins and showing the
frequency or count of values in each bin.
o Reveals the shape of the distribution (e.g., normal, skewed, bimodal).
 Boxplots (Box-and-Whisker Plots):
o Summarizes the five-number summary (minimum, Q1, median, Q3, maximum) and identifies potential
outliers.
o Useful for comparing distributions across multiple groups.
 Quantile Plots:
o Plots data values against their corresponding quantiles, providing insight into the distribution's shape and
density.
 Q-Q Plots (Quantile-Quantile Plots):
o Compares the quantiles of two distributions (e.g., an empirical distribution against a theoretical one) to
assess if they follow a similar pattern.

D. Visualizing Relationships between Two Variables
 Scatter Plots:
o Displays the relationship between two numerical attributes. Each point represents a data object
with coordinates corresponding to its values for the two attributes.
o Helps identify correlations (positive, negative, none), clusters, and outliers.
 Bubble Charts:
o An extension of scatter plots where a third numerical variable is represented by the size of the
points (bubbles).
 Heatmaps:
o Represent values in a matrix using a color gradient. Often used for correlation matrices or
visualizing gene expression levels.

E. Visualizing Multi-Dimensional Data
Visualizing data with more than two or three dimensions is challenging, and specialized techniques are used:
 Parallel Coordinates:
o Each dimension is represented by a vertical axis. A data object is represented as a polyline that intersects each axis at the value of
the corresponding dimension.
o Reveals clusters, correlations, and relationships between multiple dimensions.
 RadViz / Star Plots / Spider Charts:
o Each axis radiates from a central point. A data object is represented by a polygon connecting its values on each axis.
o Useful for comparing multiple variables for a small number of data objects.
 Chernoff Faces:
o Maps data attribute values to features of a human face (e.g., eye size, mouth curve, nose length). Different facial expressions
represent different data objects.
o Can be effective for recognizing patterns and similarities but can be subjective.
 Pixel-Oriented Visualization:
o Maps attribute values to colored pixels and arranges them in a specific order (e.g., space-filling curves).
o Can display very large datasets.
 Hierarchical Visualization:
o Uses a hierarchical subdivision of a display area to represent data relationships (e.g., treemaps, sunburst diagrams).

3. Interactive Data Visualization
Modern visualization tools offer interactivity, allowing users to:
 Zoom and Pan: Focus on specific areas of interest.
 Filtering and Brushing: Select subsets of data and highlight them across multiple views.
 Drill-Down and Roll-Up: Change the level of detail or aggregation.
 Linked Views: Changes in one visualization are reflected in others.
Data visualization is an indispensable part of the data mining process, aiding in exploratory
analysis, pattern discovery, and effective communication of insights.

Measuring Data Similarity and Dissimilarity
This focuses on the crucial concepts of data similarity and dissimilarity (distance measures), which are
fundamental for many data mining tasks, especially clustering, classification, and outlier analysis.
1. Introduction to Similarity and Dissimilarity
 Similarity: A numerical measure of how alike two data objects are. Higher similarity values
indicate a stronger resemblance.
 Dissimilarity (or Distance): A numerical measure of how different two data objects are. Lower
dissimilarity values (closer to zero) indicate a stronger resemblance, while higher values indicate
greater difference.
 Relationship: Similarity and dissimilarity are inversely related. Often, one can be converted to the
other (e.g., similarity=1−dissimilarity).

2. Proximity Measures for Nominal Attributes
For nominal attributes (categories without order), similarity is often measured by the simple
matching approach.
 Simple Matching Coefficient:
o Counts the number of attributes for which two objects have the same value.
o Formula: sim(x,y)=total number of attributes/number of matches
.
o Dissimilarity (Simple Matching Distance): d(x,y)=number of mismatches
/total
number of attributes

3. Proximity Measures for Binary Attributes
Binary attributes have only two states (e.g., 0 or 1). Measures distinguish between symmetric and asymmetric cases.
 Let x and y be two binary objects.
 q: number of attributes where x=1 and y=1 (co-occurrence of 1s).
 r: number of attributes where x=1 and y=0.
 s: number of attributes where x=0 and y=1.
 t: number of attributes where x=0 and y=0 (co-occurrence of 0s).
 Total attributes: p=q+r+s+t.
 Symmetric Binary Attributes: Both states are equally important.
o Simple Matching Coefficient: sim(x,y)=q+r+s+tq+t
.
o Dissimilarity (Simple Matching Distance): d(x,y)=q+r+s+tr+s
.
 Asymmetric Binary Attributes: One state (typically '1' for presence) is considered more important than the other
('0' for absence).
o Jaccard Coefficient (for similarity): Focuses only on the presence of attributes (1s).
 Formula: simJaccard
(x,y)=q+r+sq
. (Ignores t, the number of 0-0 matches).
o Asymmetric Binary Distance: d(x,y)=q+r+sr+s
. (Ignores t).

Asymmetric Binary Distance is a special type of dissimilarity measure used for binary attributes
where only one of the binary outcomes (usually '1') is important.
When to Use Asymmetric Binary Distance?
Use it when:
•Presence (1) is more meaningful than absence (0)
Example: medical symptoms, market basket data, keyword presence
Y = 1 Y = 0
X = 1 a b
X = 0 c d
Binary Attribute Representation
Given two binary vectors X and Y, the values are compared in terms of:
•1 = presence
•0 = absence

4. Proximity Measures for Numeric Attributes (Dissimilarity/Distance Measures)
These measures are often called "distances" in metric spaces.
 Minkowski Distance: A generalized distance metric that includes Euclidean and Manhattan distances as special cases.
o Formula: d(x,y)=(∑k=1d xk
−yk p)1/p.
∣ ∣
o d: number of dimensions (attributes).
o xk
,yk
: values of the k-th attribute for objects x and y.
o Manhattan Distance (L1 norm, p=1):
 Formula: d(x,y)=∑k=1d xk
−yk .
∣ ∣
 Represents the sum of the absolute differences of their coordinates. Often called city-block distance.
o Euclidean Distance (L2 norm, p=2):
 The most common distance measure.
 Formula: d(x,y)=∑k=1d
(xk
−yk
)2
.
 Represents the straight-line distance between two points in Euclidean space.
o Supremum Distance (L-infinity norm, p→∞):
 Formula: d(x,y)=maxk xk
−yk .
∣ ∣
 Represents the maximum difference between any attribute of the two objects.

 Standardization: When attributes have different scales or units, it's crucial to standardize them (e.g., using
Z-score normalization) before computing distances to prevent attributes with larger ranges from dominating
the distance calculation.
5. Cosine Similarity
 Definition: Measures the cosine of the angle between two vectors. It is often used for high-dimensional data,
particularly in text mining and information retrieval, where data objects are represented as term-frequency
vectors.
 Formula: simcosine
(x,y)= x y x y
=∑k=1d
xk2

∑k=1d
yk2

∑k=1d
xk
yk

.
∣∣ ∣∣⋅∣∣ ∣∣ ⋅
 Range: Values range from -1 (opposite) to 1 (identical). 0 indicates orthogonality (no linear relationship).
 Note: Cosine similarity is a measure of orientation, not magnitude. Two vectors can be very far apart in
Euclidean distance but have high cosine similarity if they point in the same direction.

6. Proximity Measures for Ordinal Attributes
 Ordinal attributes have a meaningful order but unknown interval magnitudes.
 Steps:
1. Map the ordinal values to ranks (e.g., small=0, medium=1, large=2).
2. Normalize the ranks to a range (e.g., [0, 1]) if different attributes have different numbers of
states.
3. Treat the normalized ranks as numeric values and use standard distance measures (e.g.,
Euclidean distance).

7. Proximity Measures for Mixed-Type Attributes
When data objects have a mixture of attribute types (nominal, binary, numeric, ordinal), a weighted
approach is typically used.
 Combined Dissimilarity Measure:
o Formula: d(x,y)=∑k=1d
δxy(k)
∑k=1d
δxy(k)
dxy(k)

.
o dxy(k)
: The dissimilarity calculated for the k-th attribute (e.g., 0 for matches, 1 for mismatches
for nominal; Euclidean distance for numeric).
o δxy(k)
: A weight or indicator that is 1 if the measurement for attribute k is not missing for
both objects x and y, and 0 otherwise (or 0 if the attribute is asymmetric binary and both
values are 0).
Understanding and correctly applying these similarity and dissimilarity measures are foundational
for many data mining algorithms that rely on computing relationships between data objects.

What is a Cloud Data Warehouse?
•A Cloud Data Warehouse is a data warehouse built on cloud infrastructure,
offering:
Scalability (compute and storage)
High availability
Disaster recovery
Managed services
•Data is accessed via web interfaces, APIs, or SQL clients.

Feature Description
Elasticity Automatically scale up/down resources
Separation of Compute & Storage Allows independent scaling and pricing
Serverless Options No need to manage servers (e.g., Google BigQuery)
Pay-as-you-go Pricing Pay only for resources used
Multi-tenancy Supports multiple users securely
Fault Tolerance Built-in backups and redundancy
Global Access Accessible from anywhere via the internet
Fully Managed Vendor handles maintenance, updates, patching
Characteristics of Cloud Data Warehouses

Architecture of Cloud Data Warehouse
Traditional vs. Cloud DW
•Traditional DW:
On-premise servers
Manual maintenance
•Cloud DW:
Hosted on cloud
Auto-scalable and distributed
Managed services
No Fixed capacity

General Components
•Data Sources – CRM, ERP, IoT, social media, etc.
•ETL/ELT Tools – Extract, transform, load or load then
transform
(e.g., Talend, Apache Nifi, AWS Glue)
•Query Interface – SQL, APIs, BI tools (e.g., Tableau, Power BI)
•Storage Layer – Object storage (e.g., S3, Blob, GCS)
•Compute Layer – For query processing and transformation
•Cloud DW Engine – Redshift, BigQuery, Snowflake

Platform Provider Key Features
Amazon Redshift AWS
Columnar storage, MPP engine,
integration with AWS
ecosystem
Google BigQuery Google Cloud
Serverless, SQL-based, supports
ML
Snowflake
Independent (AWS, Azure,
GCP)
Multi-cloud, separates
compute/storage, strong security
Azure Synapse Microsoft Azure
Unified analytics with Spark
and SQL engines
Oracle Autonomous DW Oracle Cloud
Self-managing, self-securing,
self-repairing DB
.
Popular Cloud Data Warehouse Platforms

Benefit Description
Cost Efficiency No hardware costs, pay for usage
High Performance Distributed query engines, parallel processing
Scalability Instantly scale storage or compute
Speed of Deployment No provisioning delays
Data Integration Easy integration with cloud apps, data lakes
Disaster Recovery Built-in backup and restore
Global Access Anytime, anywhere access
Advantages of Cloud Data Warehousing

Aspect Cloud Data Warehouse Data Lake
Data Type Structured
All types (structured, semi,
unstructured)
Schema Schema-on-write Schema-on-read
Performance High (for queries) Slower for queries
Storage Costlier Cheaper
Examples Redshift, BigQuery
Amazon S3, Azure Data
Lake, Hadoop
Cloud Data Warehouse vs. Data Lake

A Data Lake is a centralized repository that allows you to store all your structured,
semi-structured, and unstructured data at any scale — raw and in its native format —
until it is needed for analysis.

UNIT-1 NOTES Data warehousing and data mining.pptx

More Related Content

Similar to UNIT-1 NOTES Data warehousing and data mining.pptx (20)

Recently uploaded (20)

UNIT-1 NOTES Data warehousing and data mining.pptx