What is Data Science
• Also known as data-driven science
• It is an interdisciplinary field about scientific methods, processes and systems used to extract knowledge or insights from data in various forms, whether structured or unstructured
What is Machine Learning
• Machine Learning is a concept that allows a machine to learn from examples and experience, without being explicitly programmed. Instead of writing the code yourself, you feed data to a generic algorithm, and the algorithm/machine builds the logic from the given data (see the sketch below).
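A minimal sketch of this idea (hypothetical pass/fail data): rather than hand-coding rules, we hand labelled examples to a generic algorithm from scikit-learn and let it build the decision logic.

from sklearn.tree import DecisionTreeClassifier

# Toy examples: [hours_studied, hours_slept] -> pass (1) / fail (0)
X = [[8, 7], [7, 8], [6, 7], [2, 4], [1, 6], [3, 5]]
y = [1, 1, 1, 0, 0, 0]

model = DecisionTreeClassifier()   # the "generic algorithm"
model.fit(X, y)                    # the machine builds the logic from the data

print(model.predict([[5, 6]]))     # prediction for an unseen example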
Features of Machine Learning
• It uses data to detect patterns in a dataset and adjusts program actions accordingly
• It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data
• It enables the computer to find hidden insights using iterative algorithms, without being explicitly programmed
• It is a method of data analysis that automates analytical model building
How ML Model Works
Application
Stages of Machine Learning
Machine Learning Algorithms
• Supervised Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest, KNN, SVM, Naïve Bayes
• Unsupervised Learning: Clustering (Hierarchical, K-Means, DB-Scan), PCA
Clustering
• It refers to the grouping of records or observations into classes of similar objects
• A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters
• There is no target variable in clustering (see the sketch below)
• It segments the entire dataset into homogeneous subgroups, where similarity within a cluster is maximized and similarity between clusters is minimized
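A minimal sketch of this point (hypothetical 2-D records): only the records themselves are passed to the algorithm, with no labels or target variable.

import numpy as np
from sklearn.cluster import KMeans

# Unlabelled records, forming two loose groups by construction
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# fit_predict returns the homogeneous subgroup each record belongs to
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1]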
Hierarchical Clustering
• Two approaches: Agglomerative (bottom-up merging) and Divisive (top-down splitting)
Linkage Function
There are several linkage functions available for hierarchical clustering. We will focus on three commonly used methods (see the SciPy sketch below):
• Single linkage
  - Nearest Neighbour approach
  - Based on the minimum distance between any two records in the two clusters
• Complete linkage
  - Farthest Neighbour approach
  - Based on the maximum distance between any two records in the two clusters
• Average linkage
  - Based on the average distance between all pairs of records in the two clusters
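A short sketch of these three linkage methods using SciPy's hierarchical clustering routines, reusing the small 2-D dataset that appears later in the k-means example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 3], [3, 3], [4, 3], [5, 3],
              [1, 2], [4, 2], [1, 1], [2, 1]], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # merge history (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)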
Example of Hierarchical Clustering
Consider A, B, C, D, E as cases with the following similarities:

        A    B    C    D    E
   A    -    2    7    9    4
   B    2    -    9   11   18
   C    7    9    -    4    8
   D    9    6    4    -    2
   E    4   18    8    2    -
Example Contd.
So let's cluster E and B (the pair with the highest similarity, 18). We now have a structure in which B and E form the cluster BE, while A, C and D remain separate.
Example Contd.
• Now we update the case-to-case matrix

        A   BE    C    D
   A    -    4    7    9
   BE   4    -    9   11
   C    7    9    -    4
   D    9    6    4    -

Note: to compute the similarities involving the new cluster BE, e.g. SC(A, BE):
SC(A, B) = 2, SC(A, E) = 4
SC(A, BE) = 4 if we are using single linkage
SC(A, BE) = 2 if we are using complete linkage
SC(A, BE) = 3 if we are using group average
So let's cluster BE and C, giving the cluster BCE.
Now we update the case-to-case matrix.
        A  BCE    D
   A    -    7    9
   BCE  7    -    2
   D    9    6    -

To compute SC(A, BCE):
SC(A, BE) = 2, SC(A, C) = 7, so SC(A, BCE) = 2
To compute SC(D, BCE):
SC(D, BE) = 2, SC(D, C) = 4, so SC(D, BCE) = 2
SC(D, A) = 9, which is greater than SC(A, BCE) or SC(D, BCE), so we now cluster A and D.
• So let's cluster A and D. At this point, there are only two nodes that have not been clustered, AD and BCE. We now cluster them.
Now we have clustered everything.
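A small sketch (a hypothetical helper, not part of the slides) of the update rule used above when two cases merge: the similarity of any remaining case to the new cluster is derived from its similarities to the merged members, according to the chosen linkage. Note that with similarities, the nearest-neighbour (single) rule keeps the larger value.

def merged_similarity(s_to_first, s_to_second, linkage="single"):
    """Similarity of a case to a newly merged cluster, given its similarity to each member."""
    if linkage == "single":        # nearest neighbour: keep the higher similarity
        return max(s_to_first, s_to_second)
    if linkage == "complete":      # farthest neighbour: keep the lower similarity
        return min(s_to_first, s_to_second)
    if linkage == "average":       # group average
        return (s_to_first + s_to_second) / 2
    raise ValueError("unknown linkage")

# Reproducing SC(A, BE) from the example, where SC(A, B) = 2 and SC(A, E) = 4
print(merged_similarity(2, 4, "single"))    # 4
print(merged_similarity(2, 4, "complete"))  # 2
print(merged_similarity(2, 4, "average"))   # 3.0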
Advantage & Disadvantage
Advantages of hierarchical clustering:
• Easy to understand
• Often efficient in clustering
Disadvantages of hierarchical clustering:
• Not very scalable
• The choice of distance measure is far from a trivial task
• Not applicable to datasets with missing values
• It does not work well for huge datasets
• Due to its heuristic, greedy search, it may produce an unclear cluster hierarchy
K-Means Clustering
• It is an algorithm that groups objects into K groups based on their attributes/features
• Steps for K-Means (a minimal code sketch follows these steps):
Step 1: Begin with a decision on the value of k = number of clusters
Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
• Take the first k training samples as single-element clusters
• Assign each of the remaining (N - k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster
Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch it to that cluster and update the centroids of the cluster gaining the sample and the cluster losing it
Step 4: Repeat Step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments
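A minimal NumPy sketch of this procedure, under the simplifying assumption of a batch update (all points are reassigned, then the centroids are recomputed), whereas Step 3 above describes a sample-by-sample variant; the stopping rule is the same.

import numpy as np

def kmeans(X, k, init_idx, max_iter=100):
    centroids = X[init_idx].astype(float)           # Steps 1-2: choose k initial centroids
    for _ in range(max_iter):
        # Step 3: distance of every sample to every centroid; assign to the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid of each cluster (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when a full pass changes nothing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Starting this sketch from points g and h of the dataset below as the initial centroids (consistent with the zero distances in the first-pass table) reproduces the cluster memberships shown in the three passes that follow.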
Example of K-Means Clustering
• Let us consider a simple dataset:

   Data Point   Coordinates
   a            (1, 3)
   b            (3, 3)
   c            (4, 3)
   d            (5, 3)
   e            (1, 2)
   f            (4, 2)
   g            (1, 1)
   h            (2, 1)
• Step 1: Randomly assign any two individuals as the two initial centroids. Here g (1, 1) and h (2, 1) serve as the initial centroids m1 and m2 (they are at distance 0 from m1 and m2 respectively in the first pass below).
• Finding the nearest cluster centre for each record (first pass)

   Point   Distance from m1   Distance from m2   Cluster Membership
   a             2.00               2.24                c1
   b             2.83               2.24                c2
   c             3.61               2.83                c2
   d             4.47               3.61                c2
   e             1.00               1.41                c1
   f             3.16               2.24                c2
   g             0.00               1.00                c1
   h             1.00               0.00                c2
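These first-pass distances can be checked with a few lines of NumPy, again assuming the initial centroids m1 = g = (1, 1) and m2 = h = (2, 1):

import numpy as np

points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
m1, m2 = np.array([1, 1]), np.array([2, 1])

for name, p in points.items():
    d1 = np.linalg.norm(np.array(p) - m1)   # distance to centroid m1
    d2 = np.linalg.norm(np.array(p) - m2)   # distance to centroid m2
    print(f"{name}: {d1:.2f}  {d2:.2f}  -> {'c1' if d1 <= d2 else 'c2'}")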
• Finding the nearest cluster centre for each record (second pass)

   Point   Distance from m1   Distance from m2   Cluster Membership
   a             1.00               2.67                c1
   b             2.24               0.85                c2
   c             3.16               0.72                c2
   d             4.12               1.52                c2
   e             0.00               2.63                c1
   f             3.00               0.57                c2
   g             1.00               2.95                c1
   h             1.41               2.13                c1
• Finding the nearest cluster centre for each record (third pass)

   Point   Distance from m1   Distance from m2   Cluster Membership
   a             1.27               3.01                c1
   b             2.15               1.03                c2
   c             3.02               0.25                c2
   d             3.95               1.03                c2
   e             0.35               3.09                c1
   f             2.76               0.75                c2
   g             0.79               3.47                c1
   h             1.06               2.66                c1

As no records have shifted cluster membership, the cluster centroids also remain unchanged and the algorithm terminates.
Advantage & Disadvantage
Advantages
• Very simple algorithm
• Always converges (though possibly only to a local optimum)
• Quite fast, and interpretation of the clusters is quite easy
Issues
• Greatly affected by extreme values (outliers)
• Performs poorly for irregularly shaped clusters (e.g. geographic data such as longitude and latitude)
• Each run may produce different results, depending on the initial centroids
• Cannot handle categorical data
How to Find Cluster Goodness
• Silhouette Score
• For each data value i,
  ai = distance between the data value and its own cluster centre
  bi = distance between the data value and the next closest cluster centre
  Si = (bi - ai) / max(ai, bi)
Cluster characteristics:
• If Si > 0.5, good evidence of the reality of the clusters
• If 0.25 < Si < 0.5, some evidence of the reality of the clusters
• If Si < 0.25, weak evidence of the reality of the clusters
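A short sketch of computing the silhouette score with scikit-learn on the small dataset from the k-means example (note that scikit-learn uses the standard definition based on average pairwise distances, rather than the simplified centre-based version above):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 3], [3, 3], [4, 3], [5, 3],
              [1, 2], [4, 2], [1, 1], [2, 1]], dtype=float)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Compare the overall score against the thresholds above (0.25 and 0.5)
print(silhouette_score(X, labels))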
