Dr.M.Pyingkodi
Dept of MCA
Kongu Engineering College
Erode, Tamilnadu, India
Unsupervised learning
• Unsupervised learning is a machine learning concept in which unlabelled and
unclassified data are analysed to discover hidden knowledge.
• The algorithms work on the data without any prior training.
Example: targeting movie promotions at the correct group of people.
• Earlier: the same set of movies was promoted to all visitors of the page.
• Now: based on their interests, we understand what type of movie is liked by
which segment of people.
Clustering and Association Analysis
• Cluster analysis finds the commonalities between data objects and
categorizes them according to the presence or absence of those
commonalities. Clustering helps in segmenting a set of objects into
groups of similar objects.
• Association: an association rule is an unsupervised learning
method used for finding relationships between variables/objects
in a large database (dataset).
Clustering
• Clustering is defined as an unsupervised machine learning task that
automatically divides the data into clusters, or groups of similar items.
Different types of clustering techniques
The major clustering techniques are
• Partitioning methods,
• Hierarchical methods, and
• Density-based methods.
Partitioning methods
• Partitional clustering divides data objects into nonoverlapping groups. In other words, no object can
be a member of more than one cluster, and every cluster must have at least one object.
Two of the most important algorithms for partitioning-based clustering are k-means and k-medoids.
• In the k-means algorithm, the prototype of a cluster is its centroid, which is normally the
mean of the group of points.
• Similarly, the k-medoids algorithm identifies the medoid, which is the most representative point for a
group of points.
• These algorithms are both nondeterministic, meaning they could produce different results from two
separate runs even if the runs were based on the same input.
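To make the nondeterminism concrete, here is a minimal sketch using scikit-learn's KMeans with a single random initialization per run; the toy data points and parameter choices are illustrative assumptions, not taken from the slides. On small, well-separated data the runs may still agree, but the labels and inertia are not guaranteed to be identical across seeds.

```python
# Illustrative sketch: k-means with one random initialization per run.
# Different seeds can lead to different local optima (data is an assumption).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

for seed in (0, 1, 2):
    km = KMeans(n_clusters=2, n_init=1, random_state=seed).fit(X)
    print(seed, km.labels_, round(km.inertia_, 3))
```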
Hierarchical Clustering
Hierarchical clustering determines cluster assignments by building a hierarchy. This is implemented by either a
bottom-up or a top-down approach:
• Agglomerative clustering is the bottom-up approach. It starts with each point as its own cluster
and repeatedly merges the two most similar clusters until all points have been merged into a single cluster.
• Divisive clustering is the top-down approach. It starts with all points as one cluster and splits the least
similar clusters at each step until only single data points remain.
• These methods produce a tree-based hierarchy of points called a dendrogram.
• Hierarchical clustering is a deterministic process, meaning cluster assignments won’t change when you run
the algorithm twice on the same input data.
• As we have seen, K-means clustering comes with some challenges: it needs a predetermined number of
clusters, and it always tries to create clusters of the same size. To address these two challenges, we can opt
for the hierarchical clustering algorithm, because it does not require the number of clusters to be known in advance.
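A minimal sketch of bottom-up (agglomerative) clustering and its dendrogram, assuming SciPy and Matplotlib are available; the sample points and the choice of Ward linkage are illustrative assumptions.

```python
# Illustrative sketch: agglomerative clustering and a dendrogram with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])

Z = linkage(X, method='ward')                     # merge the most similar clusters step by step
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 flat clusters
print(labels)

dendrogram(Z)                                     # the tree-based hierarchy of merges
plt.show()
```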
Density-Based Clustering
• Density-based clustering determines cluster assignments
based on the density of data points in a region.
• Clusters are assigned where there are high densities of data
points separated by low-density regions.
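DBSCAN is one widely used density-based algorithm; the sketch below uses scikit-learn's DBSCAN on made-up points, and the eps/min_samples values are assumptions chosen only to show dense regions separated by a low-density (noise) point.

```python
# Illustrative sketch: density-based clustering with DBSCAN (data is an assumption).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],   # one dense region
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],   # another dense region
              [4.0, 15.0]])                          # an isolated point

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)   # points labelled -1 fall in low-density regions (noise)
```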
Partitioning (K-means: a centroid-based technique)
• K-means segregates the unlabeled data into various groups, called
clusters, based on similar features and common patterns.
• The principle of the k-means algorithm is to assign each of the ‘n’
data points to one of the K clusters, where ‘K’ is a user-defined
parameter giving the number of clusters desired.
• The objective is to maximize the homogeneity within the clusters
and also to maximize the differences between the clusters.
• The homogeneity and differences are measured in terms of the
distance between the objects or points in the data set.
• The K-means algorithm is an iterative algorithm that
divides a set of n data points into k subgroups
/clusters based on their similarity and their mean
distance from the centroid of the particular
subgroup formed.
• K, here, is the pre-defined number of clusters to
be formed by the algorithm.
• If K=3, the number of clusters to be
formed from the dataset is 3.
Step-1: Select the value of K, to decide the number of clusters to be
formed.
Step-2: Select K random points which will act as centroids.
Step-3: Assign each data point, based on its distance from the
randomly selected points (centroids), to the nearest/closest centroid;
this forms the predefined clusters.
Step-4: Place a new centroid in each cluster (the mean of the points assigned to it).
Step-5: Repeat Step 3, reassigning each data point to the new
closest centroid of each cluster.
Step-6: If any reassignment occurred, go to Step 4; else go to Step 7.
Step-7: FINISH
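A compact from-scratch sketch of these steps in Python/NumPy; the sample points, K, and the random seed are assumptions for illustration, and edge cases such as empty clusters are not handled.

```python
# Illustrative from-scratch k-means following the steps above (data/K are assumptions).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # Step 2: random initial centroids
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once no centroid (and hence no assignment) changes
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0]])
print(kmeans(X, k=2))
```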
Note: from the reference textbook.
• Let's take the number of clusters as K=2, to identify the dataset and to put the points
into different clusters. It means that here we will try to group the dataset into two
different clusters.
• We need to choose some random K points, or centroids, to form the clusters. These
points can be either points from the dataset or any other points. So, here
we are selecting the two points below as the K points, which are not part of our
dataset.
• Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We compute this by applying the mathematics we have studied
for calculating the distance between two points. So, we will draw a median line
between both centroids.
• From the image, it is clear that the points on the left side of the line are nearer to the K1 or
blue centroid, and the points to the right of the line are closer to the yellow centroid.
Let's colour them blue and yellow for clear visualization.
• As we need to find the closest clusters, we will repeat
the process by choosing new centroids: each new centroid is the
centre (mean) of the points currently assigned to that cluster.
• Consider the data set below, which has the values of the data points.
• We can randomly choose two initial points as the centroids and from there
we can start calculating the distance of each point.
• For now we will consider that D2 and D4 are the centroids.
• To start with, we calculate the distances using the Euclidean
distance, √((x1-y1)² + (x2-y2)²), between a point (x1, x2) and a centroid (y1, y2).
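As a quick sanity check of that formula, a tiny sketch (the coordinates here are hypothetical, not the table values from the slide):

```python
# Euclidean distance between a point (x1, x2) and a centroid (y1, y2), as in the formula above.
import math

def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

print(euclidean((2, 1), (5, 5)))   # hypothetical points -> 5.0
```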
Iteration 1:
• Step 1: We calculate the distance between the initial centroid points
and the other data points. Below is shown the calculation of the distances from the
initial centroids D2 and D4 to data point D1.
• After calculating the distances of all data points, we get the values shown below.
Step 2: Next, we group the data points that are closer to each centroid. Observing the
above table, we can notice that D1 is closer to D4, as the distance is smaller. Hence we can
say that D1 belongs to D4. Similarly, D3 and D5 belong to D2. After grouping, we need
to calculate the mean of the grouped values from Table 1.
Cluster 1: (D1, D4)  Cluster 2: (D2, D3, D5)
Step 3: Now we calculate the mean values of the clusters created; these mean values become
the new centroid values, and each centroid is moved along the graph.
From the above table, we can say the new centroid for cluster 1 is (2.0, 1.0) and for
cluster 2 is (2.67, 4.67).
Iteration 2:
Step 4: Again, the Euclidean distances are calculated from the new centroids. Below is
the table of distances between the data points and the new centroids.
• We can notice now that the clusters' data points have changed. Cluster 1 now has the data objects D1,
D2 and D4. Similarly, cluster 2 has D3 and D5.
Step 5: Calculate the mean values of the newly clustered groups from Table 1, as we did in
step 3. The table below shows the mean values.
Now we have the new centroid values as follows:
cluster 1 (D1, D2, D4) – (1.67, 1.67) and cluster 2 (D3, D5) – (3.5, 5.5)
This process has to be repeated until the centroid values stop changing, and the latest
clusters are then taken as the final cluster solution.
Choosing the value of K:
• For a small data set, a rule of thumb that is sometimes followed is K ≈ √(n/2),
where n is the number of data points.
• Unfortunately, this rule of thumb does not work well for large data sets.
There are several statistical methods to arrive at a suitable number of clusters.
• To find the number of clusters in the data, we need to run the K-means
clustering algorithm for different values of K and compare the results.
• We should choose the optimal value of K that gives us the best performance.
There are different techniques available to find the optimal value of K.
• The most common technique is the elbow method, which is described below.
• Another effective approach is to employ the hierarchical clustering technique on
sample points from the data set and then arrive at the sample K clusters.
How to choose the value of "K number of clusters" in K-means clustering?
• The performance of the K-means clustering algorithm depends upon the
highly efficient clusters that it forms.
• Choosing the optimal number of clusters is a big task.
Elbow Method:
• The Elbow method is one of the most popular ways to find the optimal
number of clusters.
• This method uses the concept of the WCSS value.
• WCSS stands for Within-Cluster Sum of Squares, which defines the total
variation within the clusters.
The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²
In the above formula of WCSS,
• ∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squared distances
between each data point in Cluster 1 and its centroid, and the same holds for
the other two terms.
• To measure the distance between the data points and the centroid, we can use any
method such as Euclidean distance or Manhattan distance.
To find the optimal number of clusters, the elbow method follows the steps below:
• It executes K-means clustering on a given dataset for different K values
(e.g. ranging from 1 to 10).
• For each value of K, it calculates the WCSS value.
• It plots a curve between the calculated WCSS values and the number of clusters K.
• The sharp point of bend, where the plot looks like an arm, is
considered the best value of K.
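A minimal sketch of those steps with scikit-learn, where the fitted model's inertia_ attribute plays the role of WCSS; the synthetic data and the K range 1–10 are assumptions.

```python
# Illustrative elbow-method sketch: run k-means for K = 1..10 and plot WCSS.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.show()   # the bend ("elbow") in this curve suggests the best K
```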
• Between-Clusters Sum of Squares (BCSS) measures the squared average
distance between all centroids. To calculate BCSS, you find the Euclidean distance
from a given cluster centroid to all other cluster centroids. You then repeat this
process for all of the clusters and sum all of the values together. This value is the
BCSS. You can divide by the number of clusters to calculate the average BCSS.
• Essentially, BCSS measures the variation between all clusters. A large value can
indicate clusters that are spread out, while a small value can indicate clusters that
are close to each other.
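A small sketch that follows the slide's description of BCSS literally (squared distances from each centroid to every other centroid, summed, then averaged over the clusters); the centroid coordinates are hypothetical.

```python
# Sketch of BCSS as described above; the centroids below are hypothetical values.
import numpy as np

centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

bcss = 0.0
for i, ci in enumerate(centroids):
    for j, cj in enumerate(centroids):
        if i != j:
            bcss += np.sum((ci - cj) ** 2)   # squared distance to every other centroid

print(bcss, bcss / len(centroids))           # total BCSS and the average per cluster
```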
• We iterate k from 1 to 9 and calculate the distortion and inertia for each value of k in the
given range.
• Distortion is the average of the squared distances
from the points to the centroids of their respective clusters. Typically, the
Euclidean distance metric is used.
• Inertia is the sum of squared distances of samples to their
closest cluster centre.
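A sketch contrasting the two measures for a single fitted model, using SciPy's cdist and scikit-learn's KMeans; the data points are assumptions.

```python
# Illustrative sketch: distortion vs. inertia for one value of k (data is an assumption).
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

closest = cdist(X, km.cluster_centers_).min(axis=1)   # distance of each point to its closest centroid
distortion = np.mean(closest ** 2)                     # average squared distance
inertia = np.sum(closest ** 2)                         # sum of squared distances (matches km.inertia_)
print(distortion, inertia, km.inertia_)
```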
• The quality of the clustering is measured using the SSE technique.
Calculating SSE
K-Medoids: a representative object-based technique
• The k-means algorithm is sensitive to outliers in the data
set.
• Consider the values 1, 2, 3, 5, 9, 10, 11, and 25.
• Point 25 is the outlier, and it affects the cluster
formation negatively when the mean of the points is
taken as the centroid.
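A quick check with those eight values: the outlier pulls the mean, while a representative value chosen from the data (the idea behind a medoid, illustrated here with the median) is barely affected.

```python
# The outlier 25 shifts the mean away from where most of the values lie.
import numpy as np

values = np.array([1, 2, 3, 5, 9, 10, 11, 25])
print(values.mean())                  # 8.25  -- pulled toward the outlier
print(values[values != 25].mean())    # ~5.86 -- mean without the outlier
print(np.median(values))              # 7.0   -- much less affected by the outlier
```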
• Because the SSE of the second clustering is lower, k-means tends to
put point 9 in the same cluster as 1, 2, 3, and 5, though the point is
logically nearer to points 10 and 11.
• This skewedness is introduced by the outlier point 25, which
shifts the mean away from the centre of the cluster.
K-medoids provides a solution to this problem.
• Instead of considering the mean of the data points in the cluster,
k-medoids considers k representative data points from the existing
points in the data set as the centres of the clusters.
• Note that the medoids in this case are actual data points or objects
from the data set, and not imaginary points as in the case when the
mean of the data points within a cluster is used as the centroid in the k-
means technique. The SSE is calculated as
SSE = ∑i=1..k ∑p in Ci dist(p, oi)²
where oi is the representative point or object (medoid) of cluster Ci.
• One practical implementation of the k-medoids principle is the
Partitioning Around Medoids (PAM) algorithm.
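A simplified sketch of the k-medoids idea, using the eight values above: medoids must be actual data points, chosen here by an exhaustive search that minimizes the SSE. This brute-force search is an assumption for illustration only; PAM itself uses an iterative build-and-swap procedure.

```python
# Simplified k-medoids-style sketch: medoids are actual data points (1-D example).
import numpy as np
from itertools import combinations

X = np.array([1.0, 2.0, 3.0, 5.0, 9.0, 10.0, 11.0, 25.0])
k = 2

def sse(medoids):
    idx = list(medoids)
    d = np.abs(X[:, None] - X[idx][None, :])   # |point - medoid| for every point/medoid pair
    return (d.min(axis=1) ** 2).sum()          # each point joins its nearest medoid

# Exhaustively try every pair of actual data points as the medoids (feasible for tiny data).
best = min(combinations(range(len(X)), k), key=sse)
print("medoids:", X[list(best)], "SSE:", sse(best))
```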
