MEMBERS:
Dheeraj Pachauri(1809113042)
Himanshu Bharti(1809113052)
Shahnawaz Khan(1900910139007)
Abhay Kumar Mishra(1900910139001)
• Clustering
• Data Stream
• Stream Clustering
• Requirements for clustering algorithms
• Stream clustering steps & algorithms
• Prototype array
• Window models
• Outliers & their detection
• Applications of clustering
• CLUSTERING: A method of identifying groups of similar data in a data set.
• Entities in each group are more similar to one another than to entities in other groups.
• Common methods include K-means, K-medoids, DBSCAN, etc.
• STREAM: Data that arrives continuously, such as Google queries, telephone records, multimedia data, financial transactions, etc.
• Storing all of it in a database is not feasible, and data can be lost if it is not processed immediately.
• DATA STREAM: A continuous, massive, unbounded sequence of data objects generated at a rapid rate.
• The problem of data stream clustering is defined as:
  Input: a sequence of n points in a metric space & an integer k.
  Output: k centers chosen from the n points so as to minimize the sum of distances from the data points to their closest cluster centers.
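The objective in the problem definition can be written down directly. The following is a minimal sketch, assuming Euclidean distance as the metric; the function name `kmedian_cost` and the sample points are illustrative and not part of the original slides.

```python
import math


def kmedian_cost(points, centers):
    """Sum, over all points, of the distance to the closest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)


# Four points and k = 2 centers chosen from among them.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
cost = kmedian_cost(points, centers)
```

A clustering algorithm for this problem searches for the k centers that make this cost as small as possible.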
• ONLINE PHASE: Summarize the data into memory-efficient data structures.
• OFFLINE PHASE: Use a clustering algorithm to find the data partition.
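The two phases can be sketched as follows. This is only an illustration under stated assumptions: the online summary is a list of (centroid, count) pairs with a hypothetical merge threshold `radius`, and the offline step is a plain weighted k-means seeded with the first k summaries. None of these choices come from the slides.

```python
import math


class OnlineSummary:
    """Online phase: keep (centroid, count) pairs instead of raw points."""

    def __init__(self, radius):
        self.radius = radius   # hypothetical merge threshold
        self.summaries = []    # each entry is [centroid, count]

    def insert(self, point):
        # Find the closest summary that lies within the merge radius.
        best = None
        for s in self.summaries:
            d = math.dist(s[0], point)
            if d <= self.radius and (best is None or d < best[0]):
                best = (d, s)
        if best is None:
            self.summaries.append([tuple(point), 1])
            return
        s = best[1]
        n = s[1] + 1
        # Incremental mean update; the raw point is not stored.
        s[0] = tuple(c + (x - c) / n for c, x in zip(s[0], point))
        s[1] = n


def offline_kmeans(summaries, k, iterations=10):
    """Offline phase: weighted k-means over the summaries."""
    centers = [s[0] for s in summaries[:k]]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for centroid, count in summaries:
            j = min(range(len(centers)),
                    key=lambda i: math.dist(centers[i], centroid))
            groups[j].append((centroid, count))
        for j, group in enumerate(groups):
            if group:
                total = sum(w for _, w in group)
                dim = len(group[0][0])
                centers[j] = tuple(
                    sum(c[d] * w for c, w in group) / total
                    for d in range(dim)
                )
    return centers


summary = OnlineSummary(radius=1.0)
for p in [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0), (0.1, 0.1)]:
    summary.insert(p)
centers = offline_kmeans(summary.summaries, k=2)
```

The online phase touches each object once and keeps only a small summary; the offline phase can run on demand over that summary.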
• REQUIREMENTS: A data stream clustering algorithm should:
• Provide timely results by performing fast & incremental processing of data objects.
• Rapidly adapt to the changing dynamics of the data, i.e. detect when new clusters appear or existing ones disappear.
• Scale to the number of objects that are continuously arriving.
• Provide a compact model representation.
• Rapidly detect the presence of outliers & act accordingly.
• Handle high dimensionality while remaining interpretable & usable.
• Deal with different data types, e.g. XML trees, DNA sequences, GPS information, etc.
• ALGORITHM STEPS:
• Data abstraction: Summarize the data into memory-efficient data structures.
• Clustering phase: Use a clustering algorithm to find the data partition.
There are five main classes of stream clustering algorithms:
• HIERARCHICAL BASED ALGORITHMS: Use the dendrogram data structure, which is binary-tree based. Useful for summarizing & visualizing the data.
• Examples are BIRCH, CHAMELEON, ODAC, E-Stream & HUE-Stream.
• PARTITIONING BASED ALGORITHMS: Split the data instances into a predefined number of clusters based on similarity to the cluster centroids.
• Examples are CluStream, HPStream, SWClustering, StreamKM++ & CLARA.
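Assigning each instance to its most similar centroid can be illustrated with sequential (online) k-means. This is a generic sketch, not the procedure of any algorithm named above; the function name and the choice of Euclidean distance are assumptions.

```python
import math


def sequential_kmeans(stream, initial_centroids):
    """Assign each arriving point to its nearest centroid and nudge it."""
    centroids = [list(c) for c in initial_centroids]
    counts = [1] * len(centroids)  # each initial centroid counts as one point
    for x in stream:
        j = min(range(len(centroids)),
                key=lambda i: math.dist(centroids[i], x))
        counts[j] += 1
        step = 1.0 / counts[j]
        # Move the winning centroid toward the new point (running mean).
        centroids[j] = [c + step * (xi - c) for c, xi in zip(centroids[j], x)]
    return centroids


result = sequential_kmeans([(1.0,)], [(0.0,), (10.0,)])
```

The number of clusters is fixed in advance by the number of initial centroids, which is the defining trait of this class.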
• GRID BASED ALGORITHMS: Use a multi-resolution grid data structure.
• The workspace is divided into a number of cells in a grid structure, and each instance is assigned to a cell.
• The grid cells are then clustered.
• Examples include GCHDS, GSCDS, DGClust, CLIQUE, WaveCluster & STING.
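The grid idea can be sketched in a few lines. The version below assumes a single-resolution grid with a hypothetical `cell_size`, treats a cell as dense when it holds at least `min_points` instances, and joins dense cells that share a face. It does not reproduce any of the listed algorithms.

```python
from collections import Counter


def grid_clusters(points, cell_size, min_points):
    """Map points to grid cells, then group adjacent dense cells."""
    counts = Counter(
        tuple(int(c // cell_size) for c in p) for p in points
    )
    dense = {cell for cell, n in counts.items() if n >= min_points}

    clusters, seen = [], set()
    for cell in sorted(dense):
        if cell in seen:
            continue
        # Flood fill over neighbouring dense cells.
        stack, group = [cell], set()
        seen.add(cell)
        while stack:
            cur = stack.pop()
            group.add(cur)
            for d in range(len(cur)):
                for step in (-1, 1):
                    nb = cur[:d] + (cur[d] + step,) + cur[d + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(group)
    return clusters


points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.5),
          (5.5, 5.5), (5.6, 5.7), (9.0, 9.0)]
clusters = grid_clusters(points, cell_size=1.0, min_points=2)
```

Only the per-cell counts need to be kept in memory, so the cost depends on the number of occupied cells rather than the number of instances.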
• DENSITY BASED ALGORITHMS: Keep a summary of the input data in a large number of micro-clusters.
• A micro-cluster is a set of data instances that are very close to each other.
• The synopsis of each micro-cluster is kept as a feature vector. The micro-clusters are then merged to form the final clusters.
• Examples are DBSCAN, LDBSCAN, DSCLU, SOStream & MR-Stream.
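One common form of micro-cluster feature vector stores the count, the linear sum and the squared sum of its instances. The sketch below assumes that form; the class and method names are illustrative, and the slides do not specify the exact contents of the vector.

```python
import math


class MicroCluster:
    """Feature vector (n, ls, ss): count, linear sum, squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = [0.0] * dim

    def insert(self, x):
        self.n += 1
        for d, v in enumerate(x):
            self.ls[d] += v
            self.ss[d] += v * v

    def merge(self, other):
        # Feature vectors are additive, so merging is component-wise.
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Root of the summed per-dimension variance.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))


a = MicroCluster(dim=2)
a.insert((0.0, 0.0))
a.insert((2.0, 0.0))
radius_before_merge = a.radius()

b = MicroCluster(dim=2)
b.insert((4.0, 0.0))
a.merge(b)
```

Because the vector is additive, micro-clusters can be updated and merged without revisiting the original instances.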
• MODEL BASED ALGORITHMS: Find the data distribution model that best fits the input data.
• They attempt to optimize the fit between the data & some mathematical model.
• They adopt statistical & AI approaches.
• Examples are COBWEB, CluDistream & SWEM.
• Some data stream clustering algorithms use a simplified summarization structure called a prototype array.
• It is an array of prototypes that summarizes the data partition.
• To summarize the stream, the data stream is divided into chunks of size m, and each chunk is summarized by prototypes.
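Chunking the stream and keeping one summary per chunk can be sketched as below. As a deliberate simplification, each chunk is reduced to a single weighted prototype (its mean); real algorithms typically keep several prototypes per chunk. The function names are hypothetical.

```python
def chunked(stream, m):
    """Yield consecutive chunks of at most m objects from the stream."""
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) == m:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def prototype_array(stream, m):
    """Summarize each chunk as a (prototype, weight) pair."""
    prototypes = []
    for chunk in chunked(stream, m):
        dim = len(chunk[0])
        mean = tuple(sum(p[d] for p in chunk) / len(chunk) for d in range(dim))
        prototypes.append((mean, len(chunk)))
    return prototypes


stream = [(0.0,), (2.0,), (10.0,), (12.0,), (20.0,)]
array = prototype_array(stream, m=2)
```

The weight records how many objects each prototype stands for, so a later clustering step can treat the array as a weighted data set.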
• WINDOW MODELS: In most data stream scenarios, more recent information from the stream can reflect emerging trends or changes in the data distribution.
• This information can be used to explain the evolution of the process under observation.
• Moving window techniques have been proposed to partially address this problem.
• SLIDING WINDOW: Only the most recent information from the data stream is stored, in a data structure whose size can be variable or fixed.
• This is usually a first in, first out (FIFO) structure, which considers the objects from the current time back to a certain point in the past.
• The organization & manipulation of objects are based on the principle of queue processing.
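A fixed-size FIFO window maps naturally onto a bounded queue. This minimal sketch assumes a fixed window size and numeric objects; the class name is illustrative.

```python
from collections import deque


class SlidingWindow:
    """Keep only the `size` most recent objects, FIFO order."""

    def __init__(self, size):
        # A deque with maxlen drops the oldest item when a new one arrives.
        self.window = deque(maxlen=size)

    def insert(self, x):
        self.window.append(x)

    def mean(self):
        return sum(self.window) / len(self.window)


w = SlidingWindow(size=3)
for value in [1, 2, 3, 4, 5]:
    w.insert(value)
```

After the five insertions only the last three objects remain in the window, so any statistic is computed over recent data alone.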
• DAMPED WINDOW: Considers the most recent information by associating weights with objects from the data stream.
• More recent objects receive a higher weight than older objects, & the weights of the objects decrease with time.
• The weight of an object decays exponentially with its age (drawn in the original figure as a fade from black to white).
• Adopted in density based clustering algorithms.
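Exponential decay of weights can be sketched with a fading function of the form 2^(-lambda * age). This form and the decay rate `lam` are assumptions for illustration; the slides only state that the decay is exponential.

```python
def decay_weight(age, lam):
    """Weight of an object that arrived `age` time units ago."""
    return 2.0 ** (-lam * age)


def damped_mean(timed_values, now, lam):
    """Weighted mean in which older objects count for less."""
    num = den = 0.0
    for t, x in timed_values:
        w = decay_weight(now - t, lam)
        num += w * x
        den += w
    return num / den


# (arrival time, value) pairs; the value seen at t=2 dominates at now=2.
result = damped_mean([(0, 0.0), (2, 10.0)], now=2, lam=1.0)
```

With lam = 1.0 the weight halves for every time unit of age, so no object is ever discarded outright but old objects contribute very little.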
• LANDMARK WINDOW: The last of the window models. It considers the data in the data stream from the beginning until now.
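Considering everything from a starting point until now means statistics are accumulated without forgetting. The sketch below keeps a running mean and offers a `reset` method to start a new landmark; the reset is an assumption added for illustration.

```python
class LandmarkMean:
    """Running mean over every object seen since the landmark."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def insert(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def reset(self):
        # Start a new landmark: forget everything seen so far.
        self.n = 0
        self.mean = 0.0


lm = LandmarkMean()
for value in [1.0, 2.0, 3.0]:
    lm.insert(value)
mean_since_start = lm.mean

lm.reset()
lm.insert(10.0)
mean_after_reset = lm.mean
```

All objects since the landmark carry equal weight, in contrast to a model that favours recent objects.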
• CORESET TREE: The coreset tree structure is responsible for reducing 2m objects to m objects. It is constructed as follows:
• First, the tree has only the root node v, which contains all 2m objects in the set E_v, so N_v = |E_v| = 2m. The prototype X_pv of the root node is chosen randomly from E_v.
• Afterwards, two child nodes of v, v1 & v2, are created.
• To create these nodes, the object that is farthest away from the prototype object is selected.
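The first split of the root node can be sketched as follows. The random prototype and the farthest-object selection follow the description above; assigning every object to the nearer of the two prototypes to form the children is an assumption, since the slides stop before that step.

```python
import math
import random


def split_root(objects, rng):
    """Split the root's objects into two child nodes."""
    prototype = rng.choice(objects)  # prototype of the root, chosen randomly
    # Select the object farthest away from the prototype.
    far = max(objects, key=lambda o: math.dist(o, prototype))
    left, right = [], []
    for o in objects:
        # Assumed rule: each object goes to the nearer of the two prototypes.
        if math.dist(o, prototype) <= math.dist(o, far):
            left.append(o)
        else:
            right.append(o)
    return (prototype, left), (far, right)


objects = [(0.0,), (1.0,), (9.0,), (10.0,)]
(p1, child1), (p2, child2) = split_root(objects, random.Random(0))
```

Repeating such splits on the children grows the tree until enough leaves exist to serve as the reduced set of objects.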
• OUTLIERS: A set of objects that are considerably dissimilar from the remainder of the data.
• PROBLEM: Find the top n outlier points.
• APPLICATIONS:
  • Credit card fraud detection
  • Telecom fraud detection
  • Customer segmentation
  • Medical analysis
• Besides being incremental & fast, data stream clustering algorithms should also be able to properly handle outliers throughout the stream.
• Outliers are objects that deviate from the general behaviour of a data model. They occur for different reasons, such as problems in data collection, storage & transmission errors, fraudulent activities or changes in the behaviour of the system.
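One simple way to rank candidate outliers is by distance to the nearest cluster center. This is a sketch under that assumption, using Euclidean distance; the slides do not prescribe a scoring rule, and the function name is illustrative.

```python
import math


def top_n_outliers(points, centers, n):
    """Return the n points farthest from their closest cluster center."""
    scored = [(min(math.dist(p, c) for c in centers), p) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:n]]


points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (3.0, 4.0)]
outliers = top_n_outliers(points, centers=[(0.0, 0.0)], n=2)
```

In a streaming setting the same score can be computed for each arriving object against the current cluster summaries.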
• APPLICATIONS OF CLUSTERING:
  • Pattern recognition
  • Spatial data analysis
  • Image processing
  • Economic science (especially market research)
  • WWW
Clustering for Stream and Parallelism (DATA ANALYTICS)
