A. D. Patel Institute Of Technology
Data Mining And Business Intelligence (2170715): A. Y. 2019-20
Data Compression – Numerosity Reduction
Prepared By :
Dhruv V. Shah (160010116053)
B.E. (IT) Sem - VII
Guided By :
Prof. Ravi D. Patel
(Dept Of IT , ADIT)
Department Of Information Technology
A.D. Patel Institute Of Technology (ADIT)
New Vallabh Vidyanagar, Anand, Gujarat
Outline
 Introduction
 Data Reduction Strategies
 Numerosity Reduction
 Numerosity Reduction Methods
1) Parametric Methods
1.1) Regression
1.2) Log-Linear Model
2) Non-Parametric Methods
2.1) Histograms
2.2) Clustering
2.3) Sampling
2.4) Data Cube Aggregation
 References
Introduction
 Why Need Data Reduction?
 A database/data warehouse may store terabytes of data.
 Complex data analysis/mining may take a very long time to run on the complete data set.
 Data Reduction:
 Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data.
 That is, mining on the reduced data set should be more efficient yet produce the same
analytical results.
Data Reduction Strategies
 Data cube aggregation
 Attribute Subset Selection
 Numerosity reduction — e.g., fit data into models
 Dimensionality reduction - Data Compression
 Discretization and concept hierarchy generation
Numerosity Reduction
 What is Numerosity Reduction?
 These techniques replace the original data volume by alternative, smaller forms of data
representation.
 There are two categories of numerosity reduction methods:
1) Parametric
2) Non-Parametric
Numerosity Reduction Methods
1) Parametric Methods :
 A model is used to estimate the data, so that only the model parameters need to be stored,
not the actual data.
 These methods assume the data fits some model, estimate the model parameters, and store
only those parameters.
 The Regression and Log-Linear methods are used for creating such models.
 Regression :
 Regression can be simple linear regression or multiple linear regression.
 When there is only a single independent attribute, the regression model is called simple linear
regression; if there are multiple independent attributes, it is called multiple linear
regression.
 In linear regression, the data are modeled to fit a straight line.
Cont.…
 For example,
a random variable y can be modeled as a linear function of another random variable x with the
equation y = ax + b, where a and b (the regression coefficients) specify the slope and y-intercept
of the line, respectively.
In multiple linear regression, y is modeled as a linear function of two or more
predictor (independent) variables.
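The coefficients a and b can be computed with ordinary least squares. A minimal sketch in Python (the sample points are invented for illustration):

```python
# Least-squares fit of y = a*x + b.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope a = covariance(x, y) / variance(x); intercept b = mean_y - a * mean_x.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# After fitting, the data set can be replaced by just the two parameters (a, b).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # exactly y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)             # 2.0 1.0
```

This is the sense in which parametric reduction works: the five (x, y) pairs are discarded and only (a, b) is kept.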
 Log-Linear Model :
 Log-linear model can be used to estimate the probability of each data point in a
multidimensional space for a set of discretized attributes, based on a smaller subset of
dimensional combinations.
 This allows a higher-dimensional data space to be constructed from lower-dimensional
attributes.
 Regression and log-linear models can both be used on sparse data, although their application
may be limited.
2) Non-Parametric Methods :
 Do not assume a model for the data.
 Major methods for storing reduced representations of the data include histograms,
clustering, sampling, and data cube aggregation.
Cont.…
1) Histograms :
 Divide data into buckets and store average (sum) for each bucket.
 Partitioning rules:
1) Equal-width:
Equal bucket range
2) Equal-frequency (or equal-depth) :
Each bucket holds roughly the same number of contiguous data samples
 Binning Method :
 Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Smoothing by bin means:
Bin 1: 9, 9, 9, 9 ((4 + 8 + 9 + 15) / 4 = 9)
Bin 2: 23, 23, 23, 23 ((21 + 21 + 24 + 25) / 4 ≈ 23)
Bin 3: 29, 29, 29, 29 ((26 + 28 + 29 + 34) / 4 ≈ 29)
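The binning method above can be sketched in a few lines of Python (reproducing the price example, with bin depth 4):

```python
# Equal-frequency (equal-depth) binning with smoothing by bin means.
def smooth_by_bin_means(sorted_values, depth):
    out = []
    for i in range(0, len(sorted_values), depth):
        bin_ = sorted_values[i:i + depth]
        mean = round(sum(bin_) / len(bin_))  # slides round to the nearest integer
        out.extend([mean] * len(bin_))
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 4))
# [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
```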
Cont.…
3) V-optimal:
The histogram with the least variance (histogram variance is a weighted sum of the
original values that each bucket represents)
4) MaxDiff:
Consider the difference between each pair of adjacent values. A bucket boundary is
placed between each of the β − 1 pairs with the largest differences, where β is the
number of buckets
Cont.…
 Multi-dimensional histogram
Fig. Histogram with Singleton buckets
Cont.…
Fig. Equal-width Histogram
 List of prices:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
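An equal-width histogram for this price list (bucket width 10, matching the figure's $1–10, $11–20, $21–30 ranges) can be computed as:

```python
# Equal-width histogram: count prices per width-10 bucket.
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

counts = Counter((p - 1) // 10 for p in prices)  # bucket 0: 1-10, bucket 1: 11-20, ...
for b in sorted(counts):
    print(f"${b * 10 + 1}-{(b + 1) * 10}: {counts[b]} prices")
# $1-10: 13 prices
# $11-20: 25 prices
# $21-30: 14 prices
```

Only the three bucket counts need to be stored, not the 52 individual prices.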
2) Clustering :
 Clustering divides the data into groups/clusters.
 This technique partitions the whole data into different clusters.
 In data reduction, the cluster representation of the data is used to replace the actual data.
 It also helps to detect outliers in data.
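A tiny 1-D k-means sketch shows how cluster representatives replace the raw data (the data points and centroid seeds are invented for illustration):

```python
# Replace data with cluster representatives via Lloyd's algorithm in 1-D.
def kmeans_1d(values, centroids, iters=10):
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        groups = {c: [] for c in centroids}
        for v in values:
            nearest = min(centroids, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        # Recompute centroids as group means (empty clusters are dropped).
        centroids = [sum(g) / len(g) for g in groups.values() if g]
    return centroids

data = [1, 2, 3, 20, 21, 22, 40, 41]
reps = kmeans_1d(data, centroids=[0.0, 15.0, 35.0])
print(reps)   # the 8 values reduce to 3 representatives: [2.0, 21.0, 40.5]
```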
Fig. Clustering
3) Sampling :
 Sampling: obtaining a small sample s to represent the whole data set N
 Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the
data
 Choose a representative subset of the data
 Simple random sampling may have very poor performance in the presence of skew
 Develop adaptive sampling methods
 Stratified sampling
 Approximate the percentage of each class (or subpopulation of interest) in the
overall database.
 Used in conjunction with skewed data.
 Sampling may not reduce database I/Os (page at a time).
Sampling Techniques :
 Simple Random Sample Without Replacement (SRSWOR)
 Simple Random Sample With Replacement (SRSWR)
 Cluster Sample
 Stratified Sample
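The two simple random sampling variants map directly onto the standard library (sample size and data are illustrative):

```python
# SRSWOR vs. SRSWR with the random module.
import random

N = list(range(1, 101))            # the "whole data set"
random.seed(7)

srswor = random.sample(N, 10)      # without replacement: no duplicates possible
srswr = random.choices(N, k=10)    # with replacement: duplicates possible

print(len(set(srswor)) == 10)      # True: all 10 drawn tuples are distinct
```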
Sampling: Random Sample with or without Replacement
Fig. SRSWOR & SRSWR
Cluster Sample
 Tuples are grouped into M mutually disjoint clusters
 An SRS of m clusters is taken, where m < M
 Tuples in a database are retrieved a page at a time
 Each page can be treated as a cluster
 Apply SRSWOR to the pages
Stratified Sample
 Data is divided into mutually disjoint parts called strata
 SRS at each stratum
 Representative samples ensured even in the presence of skewed data
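Stratified sampling can be sketched as: group records by class, then take an SRS within each stratum so rare classes stay represented under skew (the classes and rates here are invented):

```python
# Stratified sample: ~10% SRS per stratum, at least one record each.
import random
from collections import defaultdict

records = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]

strata = defaultdict(list)
for cls, rec in records:
    strata[cls].append(rec)

random.seed(1)
sample = {cls: random.sample(rows, max(1, len(rows) // 10))
          for cls, rows in strata.items()}

print({cls: len(rows) for cls, rows in sample.items()})
# {'young': 9, 'senior': 1} -- the rare "senior" stratum is still represented
```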
Cluster and Stratified Sampling
Fig. Cluster & Stratified Sampling
Features of Sampling :
 Cost depends on size of sample.
 Sub-linear on size of data.
 Linear with respect to dimensions.
 Estimates answer to an aggregate query.
4) Data Cube Aggregation :
 A data cube is generally used to easily interpret data: it organizes measures of business
interest along the dimensions of the data.
 Every dimension of a cube represents a certain characteristic of the database.
 Data Cubes store multidimensional aggregated information.
 Data cubes provide fast access to precomputed, summarized data, thereby benefiting online
analytical processing (OLAP) as well as data mining.
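Aggregation up a dimension can be sketched as a roll-up: pre-aggregating quarterly sales into yearly totals shrinks the data while answering the coarser query directly (the figures are invented):

```python
# Data cube aggregation sketch: roll quarterly sales up to yearly totals.
from collections import defaultdict

quarterly_sales = [
    ("2018", "Q1", 224), ("2018", "Q2", 408), ("2018", "Q3", 350), ("2018", "Q4", 586),
    ("2019", "Q1", 310), ("2019", "Q2", 402), ("2019", "Q3", 390), ("2019", "Q4", 512),
]

yearly = defaultdict(int)
for year, _quarter, amount in quarterly_sales:
    yearly[year] += amount   # roll up: drop the quarter dimension

print(dict(yearly))          # {'2018': 1568, '2019': 1614}
```

Queries at the yearly level then read the two precomputed totals instead of rescanning the eight base rows.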
Categories of Data Cube :
 Dimensions:
 Represent categories of data, such as time or location.
 Each dimension includes different levels of categories.
 Example :
Categories of Data Cube :
 Measures:
 These are the actual data values that occupy the cells as defined by the dimensions selected.
 Measures include facts or variables typically stored as numerical fields.
 Example :
References
 https://en.wikipedia.org/wiki/Data_cube
 https://www.geeksforgeeks.org/numerosity-reduction-in-data-mining/
 http://www.lastnightstudy.com/Show?id=44/Data-Reduction-In-Data-Mining