A Comprehensive Study of Clustering
Algorithms for Big Data Mining with
MapReduce Capability
Kamlesh Kumar Pandey
Research Scholar
Dept. of Computer Science & Applications
Dr. HariSingh Gour Vishwavidyalaya (A Central University), Sagar, M.P.
E-mail: kamleshamk@gmail.com
Paper presentation at the International Conference on Social Networking and Computational Intelligence
(Paper ID: 173)
Content
• Objectives
• Big Data
• Big Data Mining
• Clustering taxonomy
• Analysis of Clustering Algorithm for Big Data Mining
• Summary of Clustering Algorithms Based on the Three Dimensions of Big Data
• Proposed MapReduce Framework for the Clustering Algorithm
• Experimental
Objectives
• The objective of this study is to identify traditional clustering algorithms
that suit big data with respect to volume, variety, and velocity, and to build a
common executable framework for clustering algorithms using the MapReduce
approach to big data mining.
Big Data
• Technology is growing very fast. Organizations, industries, and individuals are
moving towards the Internet of Things, cloud computing, wireless sensor networks, social
media, and the internet. These sources generate data that grows every second, minute, or
hour, reaching terabytes or petabytes in size.
• Diebold et al. (2000) were the first authors to discuss the term Big Data in a research
paper. They define Big Data by size: a data set larger than a gigabyte
is known as Big Data.
• Doug Laney (2001) was the first person to give a proper definition of Big Data.
He gave three characteristics of Big Data, namely Volume, Variety, and Velocity, which are
known as the 3 V’s of Big Data management. Data that meets two of these
characteristics at the same time is treated as Big Data.
• Gartner (2012), “Big data is high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making”
Big Data V’s
• At present, seven V’s are used for Big Data. The first three, Volume,
Variety, and Velocity, are the main characteristics of big data. The remaining four,
Veracity, Variability, Value, and Visualization, depend on the organization.
Big Data Mining
• Big Data Mining means fetching requested information, uncovering
hidden relationships or patterns, or extracting needed information or
knowledge from a dataset that meets the three V’s of Big
Data and has higher complexity.
Clustering
• Clustering is one of the approaches for analyzing and discovering
complex relations and patterns in data, in the form of underlying groups of
unlabeled objects. From the Big Data perspective, a clustering algorithm must
deal with high volume, high variety, and high velocity in a scalable way.
Clustering Taxonomy
• Partitioning-Based Clustering: These methods divide the dataset into
K partitions based on a distance function.
• Hierarchical-Based Clustering: In this approach, large data are organized in a
hierarchical manner based on proximity, which makes the relationships
between data points easy to detect.
• Density-Based Clustering: These methods divide the dataset
based on regions of higher density in the data space.
• Grid-Based Clustering: The core idea of grid clustering algorithms is that the original
data space is converted into a grid format that defines the cell size used for clustering.
• Model-Based Clustering: These methods divide the dataset based
on models, such as mathematical models and statistical distributions.
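As a rough illustration of the grid-based and density-based ideas in the taxonomy above, the following Python sketch bins 2-D points into square cells and keeps only the cells whose point count reaches a threshold. The function name `dense_cells`, the cell size, and the threshold are assumptions made for illustration. It does not reproduce any specific algorithm from the slides.

```python
from collections import Counter
import math

def dense_cells(points, cell_size, min_points):
    """Bin 2-D points into a square grid and keep the dense cells.

    Returns a dict mapping (column, row) cell indices to point counts,
    restricted to cells holding at least `min_points` points.
    """
    counts = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size))
        for x, y in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_points}

# Hypothetical sample data: two dense regions and one isolated point.
points = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.9),
          (5.1, 5.2), (5.4, 5.9), (9.0, 1.0)]
print(dense_cells(points, cell_size=1.0, min_points=2))
```

The isolated point at (9.0, 1.0) falls in a cell of its own and is dropped, which is how a density threshold separates sparse regions from candidate clusters.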
Analysis of Clustering Algorithm for Big Data Mining
• Designing clustering algorithms for big data mining needs criteria
defined in terms of Volume, Velocity, and Variety, which increase the
efficiency of the clustering.
• Volume-related criteria: the algorithm must deal with huge dataset size, high
dimensionality, and noisy data.
• Variety-related criteria: the algorithm must recognize the dataset
type and the cluster shape.
• Velocity-related criteria define the complexity, scalability, and performance of
the clustering algorithm when it runs on a real dataset.
Summary of Clustering Algorithms Based on the Three Dimensions of Big Data
Category: Partition-based clustering
Algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity
K-Means | Large | No | High | Numerical | Convex | Medium | O(knt)
K-Medoids | Small | No | Low | Categorical | Convex | Low | O(k(n-k)^2)
PAM | Small | No | Low | Numerical | Convex | Low | O(k^3 n^2)
CLARA | Large | No | Low | Numerical | Convex | High | O(ks^2 + k(n-k))
CLARANS | Large | No | Low | Numerical | Convex | Medium | O(n^2)
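A summary table like the one above can be used for a first screening of candidate algorithms. The Python sketch below encodes the dataset-size and scalability columns of the partition-based rows and filters them. The screening rule (large dataset size and at least medium scalability) and the names `Row` and `candidates` are assumptions for illustration, not the selection procedure used in the slides.

```python
from typing import NamedTuple

class Row(NamedTuple):
    algorithm: str
    dataset_size: str   # "Small" or "Large", from the Volume column
    scalability: str    # "Low", "Medium", or "High", from the Velocity column

# Values copied from the partition-based rows of the summary table.
PARTITION_BASED = [
    Row("K-Means", "Large", "Medium"),
    Row("K-Medoids", "Small", "Low"),
    Row("PAM", "Small", "Low"),
    Row("CLARA", "Large", "High"),
    Row("CLARANS", "Large", "Medium"),
]

SCALABILITY_ORDER = {"Low": 0, "Medium": 1, "High": 2}

def candidates(rows, min_scalability="Medium"):
    """Keep algorithms rated for large datasets with enough scalability."""
    threshold = SCALABILITY_ORDER[min_scalability]
    return [
        row.algorithm
        for row in rows
        if row.dataset_size == "Large"
        and SCALABILITY_ORDER[row.scalability] >= threshold
    ]

print(candidates(PARTITION_BASED))
print(candidates(PARTITION_BASED, min_scalability="High"))
```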
Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (2)
Category: Hierarchical-based clustering
Algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity
BIRCH | Large | No | Low | Numerical | Convex | High | O(n)
CURE | Large | Yes | High | Numerical | Arbitrary | High | O(n^2 log n)
ROCK | Small | Yes | Low | Numerical/Categorical | Arbitrary | Medium | O(n^2 log n)
Chameleon | Small | No | Low | All data types | Arbitrary | High | O(n^2)
ECHIDNA | Large | No | Low | Multivariate | Convex | High | O(nb(1 + log_b m))
Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (3)
Category: Density-based clustering
Algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity
DBSCAN | Large | No | Low | Numerical | Arbitrary | Medium | O(n log n)
OPTICS | Large | No | Low | Numerical | Arbitrary | Medium | O(n log n)
Mean-shift | Small | No | Low | Numerical | Arbitrary | Low | O(kernel)
DENCLUE | Large | Yes | High | Numerical | Arbitrary | Medium | O(log |d|)
GDBSCAN | Large | No | Low | Numerical | Arbitrary | Medium | -
Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (4)
Category: Grid-based clustering
Algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity
STING | Large | Yes | Small | Spatial | Arbitrary | High | O(n)
CLIQUE | Small | Yes | Medium | Numerical | Convex | High | O(n + k^2)
WaveCluster | Large | No | High | Spatial | Arbitrary | Medium | O(n)
OptiGrid | Large | Yes | High | Spatial | Arbitrary | Medium | O(nd) to O(nd log n)
MAFIA | Large | No | High | Numerical | Arbitrary | High | O(c^p + pn)
Summary of Clustering Algorithms Based on the Three Dimensions of Big Data (5)
Category: Model-based clustering
Algorithm | Volume: dataset size | Volume: high-dimensional data | Volume: handling noisy data | Variety: dataset type | Variety: cluster shape | Velocity: scalability | Velocity: time complexity
COBWEB | Large | No | Medium | Numerical | Arbitrary | Medium | O(n^2)
SLINK | Large | No | Medium | Numerical | Arbitrary | Medium | O(n^2)
SOM | Small | Yes | Low | Multivariate | Arbitrary | Low | O(n^2 m)
ART | Large | No | High | Multivariate | Arbitrary | High | O(type + layer)
EM | Large | Yes | Low | Spatial | Convex | | O(knp)
Proposed MapReduce Framework for the Clustering
Algorithm
• A clustering algorithm is suitable for big data mining if it works on huge or high-dimensional datasets, scales well, and
handles heterogeneous data with clusters of arbitrary shape.
• A clustering algorithm designed for big data mining should be capable of parallel and distributed
computing. MapReduce is one of the programming models for implementing big data mining.
• MapReduce is built on two functions, Map and Reduce.
• The Map function breaks a task down into independent sub-tasks and executes them in
parallel without any sub-task disturbing another. The Map function also assigns an appropriate key/value pair
to every data item.
• The Reduce function collects all Map results, combines the values that share the same key, and produces
the final result of the MapReduce computation. This approach reduces the computational time
for big data mining.
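The Map and Reduce roles described above can be sketched in plain Python. This is a single-process illustration of the data flow only. A real MapReduce runtime distributes the map and reduce calls across machines. The helper names (`map_phase`, `shuffle`, `reduce_phase`, `run_mapreduce`) and the sensor records are hypothetical.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, producing key -> list of values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(grouped, reducer):
    """Combine all values that share a key into one result per key."""
    return {key: reducer(key, values) for key, values in grouped.items()}

def run_mapreduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)

# Hypothetical input: (sensor id, reading) records; sum the readings per sensor.
records = [("s1", 2.0), ("s2", 3.0), ("s1", 4.0)]
result = run_mapreduce(
    records,
    mapper=lambda rec: [(rec[0], rec[1])],
    reducer=lambda key, values: sum(values),
)
print(result)  # {'s1': 6.0, 's2': 3.0}
```

Because each mapper call reads only its own record and each reducer call reads only the values for its own key, both phases can run in parallel without coordination.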
Proposed MapReduce Framework for the Clustering
Algorithm(2)
Step 1: The big data set is transformed into <key, value> pairs, because MapReduce works on
HDFS with parallel and distributed computing.
Step 2: The Mapper function takes <key, value> pairs as input and executes in parallel
according to the existing clustering algorithm.
Step 3: The Combiner function combines all Map results, sorts every <value> according to its
<key>, and outputs them in <key, list(value)> format.
Step 4: The Reduce function takes the output of the Combiner function, maps each <key,
list(value)> to another <key, list(value)> according to the existing clustering algorithm, and
calculates the final cluster result.
Step 5: The Reduce function gives the accurate and unique number of clusters.
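The steps above can be sketched for K-Means as one iteration in plain Python. This is a common way to express K-Means in map/combine/reduce form and is not necessarily the implementation used for the slides. The input splits stand in for HDFS blocks, everything runs in one process, and all function names and sample points are assumptions.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mapper(point, centroids):
    """Map step: emit <cluster id, (point, 1)> for one input point."""
    return nearest(point, centroids), (point, 1)

def combine(pairs):
    """Combine step: per split, sum the points and counts for each cluster id."""
    partial = {}
    for key, (point, count) in pairs:
        if key in partial:
            sums, total = partial[key]
            partial[key] = ([a + b for a, b in zip(sums, point)], total + count)
        else:
            partial[key] = (list(point), count)
    return partial

def reducer(partials):
    """Reduce step: merge partial sums from all splits and compute new centroids."""
    totals = {}
    for partial in partials:
        for key, (sums, count) in partial.items():
            if key in totals:
                t_sums, t_count = totals[key]
                totals[key] = ([a + b for a, b in zip(t_sums, sums)], t_count + count)
            else:
                totals[key] = (list(sums), count)
    return {key: tuple(v / count for v in sums) for key, (sums, count) in totals.items()}

def kmeans_iteration(splits, centroids):
    """Run one map/combine/reduce pass; empty clusters keep their old centroid."""
    partials = [combine(mapper(p, centroids) for p in split) for split in splits]
    updated = reducer(partials)
    return [updated.get(i, centroids[i]) for i in range(len(centroids))]

# Hypothetical data: two input splits and two starting centroids.
splits = [[(1.0, 1.0), (1.0, 3.0)], [(9.0, 9.0), (11.0, 9.0)]]
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(kmeans_iteration(splits, centroids))  # [(1.0, 2.0), (10.0, 9.0)]
```

Repeating `kmeans_iteration` until the centroids stop changing gives the full algorithm. The combine step matters because it sends only one partial sum per cluster from each split to the reducer, instead of every point.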
Proposed MapReduce Framework for the Clustering
Algorithm(3)
Experimental
• K-Means, BIRCH, CLARA, CURE, DBSCAN, DENCLUE, and WaveCluster are
good clustering algorithms for big data mining because they fulfill the
criteria of big data clustering.
• Dataset: Power (512,320 real data points with 7 dimensions)
• System: Intel i3 processor, 4 GB RAM, 320 GB hard disk, Windows 7.
• The table below shows the execution time of the existing K-Means and the MapReduce-based K-Means
clustering algorithm.
Algorithm | Execution time (seconds)
K-Means (existing) | 60
K-Means (proposed, MapReduce-based) | 20
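The two reported timings can be turned into a speedup figure with simple arithmetic, as in the Python sketch below. Only the values 60 and 20 seconds come from the slides. The function name `speedup` is an assumption, and the slides give no repeated runs or variance, so the result is a single-measurement ratio.

```python
def speedup(baseline_seconds, new_seconds):
    """Ratio of the baseline execution time to the new execution time."""
    return baseline_seconds / new_seconds

existing, proposed = 60, 20  # seconds, as reported in the table
ratio = speedup(existing, proposed)
saved = 1 - proposed / existing
print(f"speedup: {ratio:.1f}x, time saved: {saved:.0%}")  # speedup: 3.0x, time saved: 67%
```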
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability

More Related Content

What's hot (20)

PDF
A survey on Efficient Enhanced K-Means Clustering Algorithm
ijsrd.com
 
PDF
Clustering Using Shared Reference Points Algorithm Based On a Sound Data Model
Waqas Tariq
 
PPTX
Qiu bosc2010
BOSC 2010
 
PDF
Skytree big data london meetup - may 2013
bigdatalondon
 
PDF
K-means Clustering Method for the Analysis of Log Data
idescitation
 
PDF
IEEE Datamining 2016 Title and Abstract
tsysglobalsolutions
 
PDF
An Empirical Evaluation of RDF Graph Partitioning Techniques
Adnan Akhter
 
PDF
Query evaluation over network of data aggregators
IAEME Publication
 
PDF
EFFICIENT R-TREE BASED INDEXING SCHEME FOR SERVER-CENTRIC CLOUD STORAGE SYSTEM
Nexgen Technology
 
PDF
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
IJMER
 
PPTX
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 
PDF
Clustering Algorithm with a Novel Similarity Measure
IOSR Journals
 
PPTX
Clustering
Ganesh Satpute
 
PDF
Interactive Latency in Big Data Visualization
bigdataviz_bay
 
PDF
Vol 16 No 2 - July-December 2016
ijcsbi
 
PPTX
On the Support of a Similarity-Enabled Relational Database Management System ...
Universidade de São Paulo
 
PPTX
Graphlab Ted Dunning Clustering
MapR Technologies
 
PDF
Data stream mining techniques: a review
TELKOMNIKA JOURNAL
 
PDF
Clustering
Kiran Bhowmick
 
PPTX
Distributed approximate spectral clustering for large scale datasets
Bita Kazemi
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
ijsrd.com
 
Clustering Using Shared Reference Points Algorithm Based On a Sound Data Model
Waqas Tariq
 
Qiu bosc2010
BOSC 2010
 
Skytree big data london meetup - may 2013
bigdatalondon
 
K-means Clustering Method for the Analysis of Log Data
idescitation
 
IEEE Datamining 2016 Title and Abstract
tsysglobalsolutions
 
An Empirical Evaluation of RDF Graph Partitioning Techniques
Adnan Akhter
 
Query evaluation over network of data aggregators
IAEME Publication
 
EFFICIENT R-TREE BASED INDEXING SCHEME FOR SERVER-CENTRIC CLOUD STORAGE SYSTEM
Nexgen Technology
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
IJMER
 
Yarn spark next_gen_hadoop_8_jan_2014
Vijay Srinivas Agneeswaran, Ph.D
 
Clustering Algorithm with a Novel Similarity Measure
IOSR Journals
 
Clustering
Ganesh Satpute
 
Interactive Latency in Big Data Visualization
bigdataviz_bay
 
Vol 16 No 2 - July-December 2016
ijcsbi
 
On the Support of a Similarity-Enabled Relational Database Management System ...
Universidade de São Paulo
 
Graphlab Ted Dunning Clustering
MapR Technologies
 
Data stream mining techniques: a review
TELKOMNIKA JOURNAL
 
Clustering
Kiran Bhowmick
 
Distributed approximate spectral clustering for large scale datasets
Bita Kazemi
 

Similar to A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability (20)

PDF
Chapter 5.pdf
DrGnaneswariG
 
PPTX
K- means clustering method based Data Mining of Network Shared Resources .pptx
SaiPragnaKancheti
 
PPTX
K- means clustering method based Data Mining of Network Shared Resources .pptx
SaiPragnaKancheti
 
PDF
A fuzzy clustering algorithm for high dimensional streaming data
Alexander Decker
 
PPTX
Azure Databricks for Data Scientists
Richard Garris
 
PDF
Current clustering techniques
Poonam Kshirsagar
 
PDF
Big Data Clustering Model based on Fuzzy Gaussian
IJCSIS Research Publications
 
PDF
Volume 2-issue-6-2143-2147
Editor IJARCET
 
PDF
Volume 2-issue-6-2143-2147
Editor IJARCET
 
PPTX
big data analytics unit 2 notes for study
DIVYADHARSHINISDIVYA
 
PPTX
Machine Learning : Clustering - Cluster analysis.pptx
tecaviw979
 
PDF
Survey on classification algorithms for data mining (comparison and evaluation)
Alexander Decker
 
PDF
A Novel Approach for Clustering Big Data based on MapReduce
IJECEIAES
 
PDF
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
csandit
 
DOCX
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
Nexgen Technology
 
PDF
Review of Existing Methods in K-means Clustering Algorithm
IRJET Journal
 
PPTX
Optimal Chain Matrix Multiplication Big Data Perspective
পল্লব রায়
 
PDF
Parallel KNN for Big Data using Adaptive Indexing
IRJET Journal
 
PDF
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
NAVER Engineering
 
PPTX
PCA-LDA-Lobo.pptxttvertyuytreiopkjhgftfv
Sravani477269
 
Chapter 5.pdf
DrGnaneswariG
 
K- means clustering method based Data Mining of Network Shared Resources .pptx
SaiPragnaKancheti
 
K- means clustering method based Data Mining of Network Shared Resources .pptx
SaiPragnaKancheti
 
A fuzzy clustering algorithm for high dimensional streaming data
Alexander Decker
 
Azure Databricks for Data Scientists
Richard Garris
 
Current clustering techniques
Poonam Kshirsagar
 
Big Data Clustering Model based on Fuzzy Gaussian
IJCSIS Research Publications
 
Volume 2-issue-6-2143-2147
Editor IJARCET
 
Volume 2-issue-6-2143-2147
Editor IJARCET
 
big data analytics unit 2 notes for study
DIVYADHARSHINISDIVYA
 
Machine Learning : Clustering - Cluster analysis.pptx
tecaviw979
 
Survey on classification algorithms for data mining (comparison and evaluation)
Alexander Decker
 
A Novel Approach for Clustering Big Data based on MapReduce
IJECEIAES
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
csandit
 
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
Nexgen Technology
 
Review of Existing Methods in K-means Clustering Algorithm
IRJET Journal
 
Optimal Chain Matrix Multiplication Big Data Perspective
পল্লব রায়
 
Parallel KNN for Big Data using Adaptive Indexing
IRJET Journal
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
NAVER Engineering
 
PCA-LDA-Lobo.pptxttvertyuytreiopkjhgftfv
Sravani477269
 
Ad

Recently uploaded (20)

PDF
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
PDF
R Cookbook - Processing and Manipulating Geological spatial data with R.pdf
OtnielSimopiaref2
 
PPTX
apidays Singapore 2025 - From Data to Insights: Building AI-Powered Data APIs...
apidays
 
PDF
Product Management in HealthTech (Case Studies from SnappDoctor)
Hamed Shams
 
PPTX
apidays Singapore 2025 - The Quest for the Greenest LLM , Jean Philippe Ehre...
apidays
 
PPT
Growth of Public Expendituuure_55423.ppt
NavyaDeora
 
PPTX
apidays Helsinki & North 2025 - Vero APIs - Experiences of API development in...
apidays
 
PPTX
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
PDF
JavaScript - Good or Bad? Tips for Google Tag Manager
📊 Markus Baersch
 
PDF
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
PDF
OPPOTUS - Malaysias on Malaysia 1Q2025.pdf
Oppotus
 
PDF
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
 
PPTX
apidays Munich 2025 - Building Telco-Aware Apps with Open Gateway APIs, Subhr...
apidays
 
PDF
Merits and Demerits of DBMS over File System & 3-Tier Architecture in DBMS
MD RIZWAN MOLLA
 
PPTX
ER_Model_Relationship_in_DBMS_Presentation.pptx
dharaadhvaryu1992
 
PDF
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
PPTX
Numbers of a nation: how we estimate population statistics | Accessible slides
Office for National Statistics
 
PPTX
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
PPTX
Exploring Multilingual Embeddings for Italian Semantic Search: A Pretrained a...
Sease
 
PDF
Driving Employee Engagement in a Hybrid World.pdf
Mia scott
 
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
R Cookbook - Processing and Manipulating Geological spatial data with R.pdf
OtnielSimopiaref2
 
apidays Singapore 2025 - From Data to Insights: Building AI-Powered Data APIs...
apidays
 
Product Management in HealthTech (Case Studies from SnappDoctor)
Hamed Shams
 
apidays Singapore 2025 - The Quest for the Greenest LLM , Jean Philippe Ehre...
apidays
 
Growth of Public Expendituuure_55423.ppt
NavyaDeora
 
apidays Helsinki & North 2025 - Vero APIs - Experiences of API development in...
apidays
 
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
JavaScript - Good or Bad? Tips for Google Tag Manager
📊 Markus Baersch
 
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
OPPOTUS - Malaysias on Malaysia 1Q2025.pdf
Oppotus
 
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
 
apidays Munich 2025 - Building Telco-Aware Apps with Open Gateway APIs, Subhr...
apidays
 
Merits and Demerits of DBMS over File System & 3-Tier Architecture in DBMS
MD RIZWAN MOLLA
 
ER_Model_Relationship_in_DBMS_Presentation.pptx
dharaadhvaryu1992
 
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
Numbers of a nation: how we estimate population statistics | Accessible slides
Office for National Statistics
 
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
Exploring Multilingual Embeddings for Italian Semantic Search: A Pretrained a...
Sease
 
Driving Employee Engagement in a Hybrid World.pdf
Mia scott
 
Ad

A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability

  • 1. A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability Kamlesh Kumar Pandey Research Scholar Dept. of Computer Science & Applications Dr. HariSingh Gour Vishwavidyalaya (A Central University), Sagar, M.P. E-mail: [email protected] International Conference on Social Networking and Computational Intelligence (Paper ID : 173) Paper Presentation on
  • 2. Content • Objectives • Big Data • Big Data Mining • Clustering taxonomy • Analysis of Clustering Algorithm for Big Data Mining • Summarization of Clustering Algorithm based on Three-Dimensional of Big Data • Proposed MapReduce Framework for the Clustering Algorithm • Experimental
  • 3. Objectives • The objective of this study is identifying a traditional clustering algorithms for big data respect to volume, variety, and velocity and built the common executable framework for clustering algorithm with the MapReduce approach under big data mining.
  • 4. Big Data • Present time technology is growing very fast. Every originations, industries or person moving towards Internet of things, cloud computing, warless sensor networks, social media, internet. These sources generated a data growing fast in per second, minutes or per hour in size of Terabytes or Petabytes . • Diebold et Al. (2000) is a first writer who discussed the word Big Data in his research paper. All of these authors define Big Data there means if the data set is large then gigabyte then these type of data set is known as Big Data. • Doug Laney et al (2001) was the first person who gave a proper definition for Big Data. He gave three characteristics Volume, Variety, and Velocity of Big Data and these characteristics known as 3 V’s of Big Data Management. If traditional data have met two basic characteristic at a time these data are come to under Big data. • Gartner (2012), “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”
  • 5. Big Data V’s • In present time seven V’s used for Big Data where the first three V’s Volume, Variety, and Velocity are the main characteristics of big data. In addition to Veracity, Variability, Value, and Visualization are depending on the organization.
  • 6. Big Data Mining • Big Data Mining fetching on the requested information, uncovering hidden relationship or patterns or extracting for the needed information or knowledge from a dataset these datasets have to meet three V’s of Big Data with higher complexity.
  • 7. Clustering • Clustering is the one of the approaches for analysis and discovering the complex relation, pattern, and data in the form of underlying groups for the unlabeled object and Big Data perspective, the clustering algorithm must be deal high volume, high variety and high velocity with scalability.
  • 8. Clustering Taxonomy • Partitioning based Clustering: These clustering methods divided the dataset into K partition based on the distance function. • Hierarchical based Clustering: In this approach, large data are organized in a hierarchical manner based on the medium of proximity and its detect on easily relationship between data points. • Density Based Clustering: These clustering methods divided the dataset into based on the higher density of the data space. • Grid-Based Clustering: The core idea of grid clustering algorithms is that original data space is converted into a grid format which defines the size for clustering. • Model-Based Clustering: These clustering methods divided the data set into based on models such as mathematics, and statistical distribution.
  • 9. Analysis of Clustering Algorithm for Big Data Mining • Design of clustering algorithms needs some criteria for big data mining, which is defining to Volume, Velocity, and Variety and increases the efficiency of the clustering. • Volume related criteria such as cluster is must be dealt huge size, high dimensional and noisy of the dataset. • Variety related criteria such as cluster is must be recognized as dataset categorization and clusters Shape. • Velocity related criteria define the complexity, scalability, and performance of the clustering algorithm during the execution of real dataset.
  • 10. Summarization of Clustering Algorithm based on Three-Dimensional of Big Data Clustering Categories Clustering Algorithm Volume Variety Velocity Dataset size High dimensional data Handling Noisy data Dataset type Cluster shape Scalability Time complexity Partition based Clustering K-Means Large No High Numerical Convex Medium 0 (knt) K-Medoies Small No Low Categorical Convex Low 0(k(n-k)2) PAM Small No Low Numerical Convex Low 0 (k3 * n2) CLARA Large No Low Numerical Convex High 0(ks2+k(n-k)) CLARANS Large No Low Numerical Convex Medium 0(n2)
  • 11. Summarization of Clustering Algorithm based on Three-Dimensional of Big Data(2) Clustering Categories Clustering Algorithm Volume Variety Velocity Dataset size High dimensional data Handling Noisy data Dataset type Cluster shape Scalability Time complexity Hierarchic al based Clustering BIRCH Large No Low Numerical Convex High 0(n) CURE Large Yes High Numerical Arbitrary High 0(n2logn) ROKE Small Yes Low Numerical/Ca tegorical Arbitrary Medium 0(n2logn) Chameleon Small No Low All type Data Arbitrary High 0(n2) ECHIDNA Large No Low Multivariate Convex High 0(nb(1+logbm)
  • 12. Summarization of Clustering Algorithm based on Three-Dimensional of Big Data(3) Clustering Categories Clustering Algorithm Volume Variety Velocity Dataset size High dimensional data Handling Noisy data Dataset type Cluster shape Scalability Time complexity Density based Clustering DBSCAN Large No Low Numerical Arbitrary Medium 0(nlogn) OPTICS Large No Low Numerical Arbitrary Medium 0(nlogn) Mean-shift Small No Low Numerical Arbitrary Low 0 (kernel) DENCLUE Large Yes High Numerical Arbitrary Medium 0(log |d|) GDBSCAN Large No Low Numerical Arbitrary Medium ----------------
  • 13. Summarization of Clustering Algorithm based on Three-Dimensional of Big Data(4) Clustering Categories Clustering Algorithm Volume Variety Velocity Dataset size High dimensional data Handling Noisy data Dataset type Cluster shape Scalability Time complexity Grid based Clustering STING Large Yes Small Spatial Arbitrary High 0(n) CLIQUE Small Yes Medium Numerical Convex High 0(n+k2) Wave Cluster Large No High Spatial Arbitrary Medium 0(n) OptiGrid Large Yes High Spatial Arbitrary Medium 0(nd) to 0(nd-log n) MAFIA Large No High Numerical Arbitrary High 0(cp + pn)
  • 14. Summarization of Clustering Algorithm based on Three-Dimensional of Big Data(5) Clustering Categories Clustering Algorithm Volume Variety Velocity Dataset size High dimensional data Handling Noisy data Dataset type Cluster shape Scalability Time complexity Model based Clustering COBWEB Large No Medium Numerical Arbitrary Medium 0(n2) SLINK Large No Medium Numerical Arbitrary Medium 0(n2) SOM Small Yes Low Multivariate Arbitrary Low 0(n2m) ART Large No High Multivariate Arbitrary High (type+layer) EM Large Yes Low Spatial Convex 0(knp)
  • 15. Proposed MapReduce Framework for the Clustering Algorithm • If any clustering algorithm works under huge dataset or high dimensional with scalability and heterogeneous data in the form of arbitrary shape so they suitable for big data mining. • Designing of a clustering algorithm for big data mining has a capability for parallel and distributed computing. MapReduce is one of the programming model for implementation of big data mining. • MapReduce techniques are inspired by the Map and Reduce function. • The idea of Map function is breakdown to a task into possible phases and executes these phases in parallel order without disturbing any phases. Map function also assigns appropriate key/value pairs in every data. • Reduce function collects all map results and combining all values based on the same key and given a final result of the MapReduce computational task. This concept reduces the computational time for big data mining
  • 16. Proposed MapReduce Framework for the Clustering Algorithm(2) Step 1: Big data set is transformed into <key, value> pairs because MapReduce used to HDFS with parallel and distributed computing. Step 2: Mapper function takes <key, value> pairs as input and executes on parallel order according to the existing clustering algorithm. Step 3: Combiner function combine all Map results and sort every <value> according to <key> and given to output as <key, list (value)> format. Step 4: Reduce function takes the output from Combiner function and maps to one <key, list (value)> to another <key, list (value)> according to existing clustering algorithm and calculate the final cluster result. Step 5: Reduce function given the accurate and unique number of cluster.
  • 17. Proposed MapReduce Framework for the Clustering Algorithm(3)
  • 18. Experimental
      • K-Means, BIRCH, CLARA, CURE, DBSCAN, DENCLUE, and WaveCluster are good clustering algorithms for big data mining because they fulfil the criteria for big data clustering.
      • Dataset: Power (512,320 real data points with 7 dimensions)
      • System: Intel i3 processor, 4 GB RAM, 320 GB hard disk, Windows 7.
    We show the execution time of the existing K-Means and the MapReduce-based K-Means clustering algorithms.
    Algorithm | Execution time (seconds)
    K-Means (existing) | 60
    K-Means (proposed, MapReduce-based) | 20
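Taking the ratio of the two execution times reported above gives the speedup for this single dataset and machine. The symbols S, T_existing, and T_MapReduce are introduced here only for illustration.

```latex
S = \frac{T_{\text{existing}}}{T_{\text{MapReduce}}} = \frac{60\ \text{s}}{20\ \text{s}} = 3
```

That is, the MapReduce-based K-Means ran about three times faster in this experiment. The figure describes one run configuration and should not be read as a general scaling result.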
  • 19. References [1]. Sivarajah U. and Kamal M.M.: Critical analysis of Big Data challenges and analytical methods, Journal of Business Research (Elsevier), Vol 70, pp 263-286, DOI: 10.1016/j.jbusres.2016.08.001, (2017). [2]. Wasastjerna M.C.: The role of big data and digital privacy in merger review. European Competition Journal, vol. 14, no. 2-3, pp. 417- 444, DOI: 10.1080/17441056.2018.1533364, (2018). [3]. Gandomi A., and H. M.: Beyond the hype Big data concepts methods and analytics. I.J. of Info. Man., vol. 35, no. 2, pp. 137 -144, DOI: 10.1016/j.ijinfomgt.2014.10.007, (2015). [4]. Pandey K.K.: Mining on Relationship in Big Data era Using Apriori Algorithm, Proc. Of NCDAMLS, pp. 55-60, ISBN: 978-93-5291- 457-9, (2018). [5]. Che D., P. Z., and S.M., and From Big Data to Big Data Mining Challenges Issues and Opportunities. LNCS, vol. 7827, pp. 1-12 , doi 10.1007/978-3-642-40270-8_1, (2013). [6]. Li N., Zeng L., Qing H., and Zhongzhi S.: Parallel Implementation of Apriori Algorithm Based on MapReduce. Proc of 13th IEEE ACIS International Conference on SEAIPDC, DOI: 10.1109/SNPD.2012.31, (2017). [7]. Oussous A., Benjelloun F.Z., Lahcen A.A., and Belfkih S.: Big Data technologies: A survey, Journal of King Saud University – Computer and Information Sciences, Vol-30, pp 431–448, DOI: 10.1016/j.jksuci.2017.06.001, (2018). [8]. Chen M., M.S., and L.Y.: Big Data A Survey. Mob. Netw. Appl., vol. 19, no. 2, pp. 171–209, doi 10.1007/s11036-013-0489-0, (2014). [9]. Gole S., and Tidke B.: A survey of Big Data in social media using data mining techniques. Proc. of IEEE ICACCS, doi 10.1109/ICACCS.2015.7324059, (2015). [10]. Elgendy N., and E. A.: Big Data Analytics A Literature Review Paper. LNAI, vol. 8557, pp. 214–227, doi 10.1007/978-3-319-08976- 8_16, (2014). [11]. Ozkose H., Ari E.S., and Gencer C.: Yesterday, Today and Tomorrow of Big Data, Procedia - Social and Behavioral Sciences, vol. 195, pp. 1042-1050, doi 10.1016/j.sbspro.2015.06.147, (2015).
  • 20. References [12]. Kaur P. and Kaur K., :Comparative Study of Techniquesand Issues in Data Clustering, Lecture Notes in Networks and Systems, Vol-8, pp 1-8, DOI 10.1007/978-981-10-3818-1_1,(2017). [13]. Nagpal A., Jatain A. and Gaur D.:Review based on Data Clustering Algorithms, Proc. of IEEE Conference on ICT, published by IEEE Xplore,pp 298-303, DOI: 10.1109/CICT.2013.6558109, (2013). [14]. Berkhin P.,:Survey of Clustering Data Mining Techniques, M. (eds) Grouping Multidimensional Data, pp. 25-71, doi 10.1007/3-540- 28349-8_2, (2006). [15]. Chen W.,OliverioJ.,Kim H.O, and Shen J., The Modeling and Simulation of Data Clustering Algorithms in Data Mining with Big Data, Journal of Industrial Integration and Management: Innovation and Entrepreneurship, DOI:10.1142/S2424862218500173,(2018). [16]. Xu R.,and Wunsch D. : Survey of Clustering Algorithms, IEEE TRANSACTIONS ON NEURAL NETWORKS, Vol. 16, Issue 3, pp 645-678, (2005). [17]. Xu D., and Tian Y.: A Comprehensive Survey of Clustering Algorithms, Annals of Data Science, Vol 2, Issue 2, pp 165–193,DOI: 10.1007/s40745-015-0040-1,(2015). [18]. Pandove D.and Goel S.: A Comprehensive Study on ClusteringApproaches for Big Data Mining, Proc. Of IEEE ICECS, pp 1333- 1338,(2015). [19]. Fahad A; Alshatri N, Tari Z, Alamri A, Khalil I, AND ZomayaA.Y.,:A Survey of Clustering Algorithms for BigData: Taxonomy and Empirical Analysis, IEEE Transactions on Emerging Topics in Computing, Vol 2, Issue 3,pp 267 - 279, DOI: 10.1109/TETC.2014.2330519 , (2014). [20]. Jain A. K., Murty M. N. and Flynn P. J., Data clustering: a review, ACM Computing Surveys, Vol 31,Issue 3, pp 264-323, DOI: 10.1145/331499.331504,(1999). [21]. Shirkhorshidi A.S., Aghabozorgi S, Wah T.Y. and HerawanT.:Big Data Clustering: A Review, published by Lecture Notes in Computer Science(Springer), Vol 8583, DOI: 10.1007/978-3-319-09156-3_49,(2014).
  • 21. References [22]. Berkhin P., A Survey of Clustering Data Mining Techniques, Grouping Multidimensional Data (Springer), DOI: 10.1007/3-540-28349- 8_2 (2006). [23]. Pujari A.K, Rajesh K. & Reddy D.S.: Clustering Techniques in Data Mining—A Survey, IETE Journal of Research, vol 47, Issue 1-2, pp 19-28, DOI: 10.1080/03772063.2001.11416199,(2001). [24]. Dave M., and Gianey R. : Different Clustering Algorithms for Big Data Analytics: A Review, Proc of IEEE SMART, pp 328-333,(2016). [25]. Macqueen J.: Some methods for classification and analysis of multivariate observations. Proceedings 5th Berkeley Symposium on Mathematical Statistics Probability, Vol 1,pp 281–297,(1967). [26]. Emani C.K., Cullot N. and Nicolle C: Understandable Big Data: A survey, Computer Science Review, Vol-17, pp 70-81, DOI: dx.doi.org/10.1016/j.cosrev.2015.05.002, (2015).