MODULE 4
SHIWANI GUPTA
SUPERVISED LEARNING - CLASSIFICATION
Evaluation Metric
Logistic Regression
k Nearest Neighbor
Linear SVM
Kernel
DT
Issue in DT learning
Ensemble- Bagging
RF
Ensemble – Boosting
Adaboost
Use case
2
Performance
◦ Null Hypothesis: commonly accepted fact that you wish to test eg. data scientist salary on an av. is 113,000 dollars.
◦ Alternative Hypothesis: everything else eg. mean data scientist salary is not 113,000 dollars.
◦ Type I error (FP): Rejecting a true null hypothesis
◦ Type II error (FN): Accepting a false null hypothesis
◦ Confusion Matrix
◦ Accuracy = (TP+TN)/(TP+FN+FP+TN)
◦ Precision = TP/(TP+FP) eg. what portion of patients diagnosed as having cancer actually had it
◦ Recall/Sensitivity = TP/(TP+FN) eg. what portion of patients that actually had cancer were diagnosed by the model as having it
◦ Specificity = TN/(TN+FP) eg. Benign patients predicted benign
◦ F-score = (2*P*R)/(P+R)
Predicted \ Actual   Positive   Negative
Positive             TP         FP
Negative             FN         TN
https://www.khanacademy.org/math/ap-statistics/tests-significance-ap/error-probabilities-power/v/introduction-to-type-i-
and-type-ii-errors 3
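A minimal Python sketch of these metrics (the confusion-matrix counts below are made up for illustration):

```python
# Hypothetical confusion-matrix counts, for illustration only
TP, FP, FN, TN = 40, 10, 5, 45

accuracy    = (TP + TN) / (TP + FN + FP + TN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)                      # sensitivity
specificity = TN / (TN + FP)
f_score     = (2 * precision * recall) / (precision + recall)
print(accuracy, precision, recall, specificity, f_score)
```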
Logistic Regression
Specialized case of Generalized Linear Model
◦ Just like Linear Regression (LR), Logistic Regression (LoR) can work with both continuous data, e.g. weight, and discrete data, e.g. gender.
◦ A statistical model predicting the likelihood / probability.
◦ Uses logistic / sigmoid function to model binary/dichotomous/categorical dependent variable.
• It is a mathematical function used to map the predicted values to probabilities. It forms a "S" curve.
• In logistic regression, we use the concept of a threshold value, such that values above the threshold tend to 1 and values below the threshold tend to 0. Thus any real value is mapped to another value within the range of 0 to 1.
◦ Assumes no / very little multicollinearity between predictor / independent variables.
https://www.youtube.com/watch?v=yIYKR4sgzI8&list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe 4
Mathematics
◦ Null Hypothesis H0: no relationship exists between the predictor and response variable (i.e. the coefficient is zero)
◦ prob of success p = 0.8, prob of failure q = 1-p = 0.2 range [0,1]
◦ Odds(odds ratio) = success/failure = p/(1-p)
◦ Odds of success=p/q=4 range = [0,∞]
◦ log(odds) OR logit(p) = log(p/(1-p)) = z, range = [-∞, ∞], as in Linear Regression
◦ p = e^log(odds) / (1 + e^log(odds))
https://www.youtube.com/watch?v=vN5cNN2-HWE&list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe&index=25
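A quick numeric check of the odds / logit / sigmoid relationships above:

```python
import math

p = 0.8                                    # probability of success (from the slide)
odds = p / (1 - p)                         # = 4, range [0, ∞)
z = math.log(odds)                         # logit(p), range (-∞, ∞)
p_back = math.exp(z) / (1 + math.exp(z))   # p = e^logit / (1 + e^logit)
print(odds, z, p_back)                     # 4.0  1.386...  0.8
```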
Mathematics
Linear Regression
6
Loan Defaulter
Savings (Lakhs)        0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50
Loan Defaulter / Not   0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1
Fitted Value           0.0347 0.0497 0.0708 0.1000 0.1393 0.1908 0.1908 0.2556 0.3335 0.4216 0.5149 0.6073 0.6925 0.7664 0.8744 0.9102 0.9366 0.9556 0.9690 0.9851
Prediction             0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
Coefficients: b0 = -4.0778, b1 = 1.5046
prob = 1 / (1 + e^-(b0 + b1*saving)) = 1 / (1 + e^-(-4.0778 + 1.5046*saving))
7
savings 0.5 0.75 1 1.25 1.5 1.75 1.75 2 2.25 2.5 2.75 3 3.25 3.5 4 4.25 4.5 4.75 5 5.5
y 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1
prob = fitted value 0.034707 0.049767 0.070883 0.10002 0.139326 0.190811 0.19081 0.255669 0.333488 0.421578 0.514958 0.607305 0.692567 0.766437 0.874418 0.910255 0.936606 0.955598 0.96909 0.98519
prediction 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
odds 0.035955 0.052374 0.076291 0.11113 0.16188 0.235805 0.23581 0.343489 0.500349 0.728841 1.061677 1.546509 2.252746 3.281498 6.962927 10.14266 14.77446 21.52145 31.3496 66.51982
logit -3.3255 -2.94935 -2.5732 -2.1971 -1.8209 -1.44475 -1.4448 -1.0686 -0.69245 -0.3163 0.05985 0.436 0.81215 1.1883 1.9406 2.31675 2.6929 3.06905 3.4452 4.1975
8
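The fitted values, odds and logit values in the tables above can be reproduced directly from the coefficients b0 and b1; a minimal sketch:

```python
import math

b0, b1 = -4.0778, 1.5046     # coefficients from the slide
savings = [0.5, 0.75, 1, 1.25, 1.5, 1.75, 1.75, 2, 2.25, 2.5,
           2.75, 3, 3.25, 3.5, 4, 4.25, 4.5, 4.75, 5, 5.5]

for s in savings:
    z = b0 + b1 * s                      # logit
    p = 1 / (1 + math.exp(-z))           # fitted value (probability)
    odds = p / (1 - p)
    pred = 1 if p >= 0.5 else 0          # threshold at 0.5
    print(f"{s:4.2f}  p={p:.4f}  odds={odds:.4f}  logit={z:.4f}  pred={pred}")
```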
Maximum Likelihood Estimation
• Probabilistic framework for estimating the parameters of a model whose target follows a Bernoulli distribution.
• Log likelihood
• The negative of this function is used because, when we train, we maximize the probability by minimizing the loss function.
• Decreasing the cost increases the maximum likelihood, assuming that samples are drawn from an independent and identical distribution (i.i.d.).
• When the model is a poor fit, the log likelihood is a relatively large negative value; when the model is a good fit, the log likelihood is close to zero.
9
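A minimal sketch of the Bernoulli (negative) log likelihood described above; the labels and predicted probabilities are made up to contrast a good fit with a poor fit:

```python
import numpy as np

def neg_log_likelihood(y_true, y_prob, eps=1e-12):
    """Bernoulli negative log likelihood: the loss minimized during training."""
    y_prob = np.clip(y_prob, eps, 1 - eps)        # avoid log(0)
    return -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y = np.array([0, 0, 1, 1])
good_fit = np.array([0.1, 0.2, 0.8, 0.9])   # log likelihood close to zero
poor_fit = np.array([0.7, 0.8, 0.2, 0.1])   # large negative log likelihood
print(-neg_log_likelihood(y, good_fit), -neg_log_likelihood(y, poor_fit))
```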
Cost Function
10
Gradient Descent
‘a’ represents hypothesis
11
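A minimal sketch of the cross-entropy cost and the gradient-descent update for logistic regression ('a' is the hypothesis, as noted above); the toy data and learning rate are made up:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta) for logistic regression."""
    a = sigmoid(X @ theta)                     # 'a' is the hypothesis h_theta(x)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def gradient_descent(X, y, lr=0.1, epochs=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        a = sigmoid(X @ theta)
        grad = X.T @ (a - y) / len(y)          # gradient of the mean cross-entropy
        theta -= lr * grad                     # update step
    return theta

# Toy data: column of ones (intercept) plus one feature
X = np.array([[1, 0.5], [1, 1.0], [1, 3.0], [1, 4.5]])
y = np.array([0, 0, 1, 1])
theta = gradient_descent(X, y)
print(theta, cost(theta, X, y))
```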
Types
◦ Binary Eg. 0/1, pass/fail, spam/not spam
◦ Multinomial: cat/dog/sheep, Veg/NonVeg/Vegan
◦ Ordinal: low/medium/high, movie rating 1-5
12
Use Cases
◦ Email spam
◦ Credit card fraud
◦ Cancer benign/ malignant
◦ Predict if a user will invest in term deposit
◦ Loan defaulter
13
ADVANTAGES
• It is simple to implement
• Works well for linearly separable data
• Gives a measure of how relevant an
independent variable is through coefficient
• Tells us about the direction of the relationship
(positive or negative)
DISADVANTAGES
• Fails to predict continuous outcome
• Linearity assumption
• Not accurate for small sample size
14
PRACTICE QUESTIONS
◦ A team scored 285 runs in a cricket match. Assuming regression coefficients to be 0.3548 and 0.00089 respectively, calculate
its probability of winning the match.
◦ You are applying for a home loan and your credit score is 720. Assuming logistic regression coefficient to be 9.346 and 0.0146
respectively, calculate probability of home loan application getting approved.
15
K Nearest Neighbor
◦ non-parametric: it does not make any underlying assumptions
about the distribution of data
◦ Intuition: given an unclassified point, we can assign it to a group
by observing what group its nearest neighbors belong to
• K-NN algorithm can be used for Regression as well as for
Classification but mostly it is used for the Classification
problems
• It is also called a lazy learner algorithm because it does not
learn from the training set; instead it stores the dataset during the
training phase and, at the time of classification, performs an
action on the dataset.
• Also, the accuracy of such a classifier generally increases as we increase
the number of data points in the training set.
16
Algorithm
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each category.
Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.
Step-6: Our model is ready.
K can be kept as an odd number so that we can obtain a clear majority in the case where only two groups are
possible (e.g. Red/Blue). The most preferred value is 5. A very low value can be noisy and lead to the effects of outliers in the
model. With increasing K, we get smoother, more defined boundaries across different classifications.
Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but we want to know whether it is a
cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN
model will find the features of the new image most similar to the cat and dog images and, based on the most similar
features, will put it in either the cat or the dog category.
17
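A hedged sketch of these steps using scikit-learn's KNeighborsClassifier (the toy data and choice of K = 5 are for illustration only):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (made up): two features, two classes
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 7], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # Step 1: choose K
knn.fit(X_train, y_train)                                      # lazy learner: just stores the data
print(knn.predict([[2, 2], [6, 5]]))                           # Steps 2-5 happen at prediction time
```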
Distance metric
◦ Minkowski Distance
◦ Euclidean Distance if input variables similar in type eg. width, height
◦ Manhattan Distance / City block distance if grid like path
◦ Hamming Distance between binary vectors
◦ Others: Jaccard, Mahalanobis, cosine similarity, Tanimoto, etc.
18
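For illustration, the common metrics are available in scipy.spatial.distance (the vectors below are made up):

```python
from scipy.spatial import distance

a, b = [0, 3, 4, 5], [7, 6, 3, -1]

print(distance.minkowski(a, b, p=3))   # Minkowski (general form)
print(distance.euclidean(a, b))        # Minkowski with p = 2
print(distance.cityblock(a, b))        # Manhattan / city block, p = 1
print(distance.cosine(a, b))           # cosine distance = 1 - cosine similarity
print(distance.hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # fraction of differing positions
```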
Numerical Example
x1=acid durability (sec) x2=strength (kg/m2) y=class Squared Euclidean Distance to query (3, 7)
7 7 Bad 16
7 4 Bad 25
3 4 Good 9
1 4 Good 13
Factory produces a new paper tissue that passes lab test with x1=3, x2=7. Classify this tissue.
1. k? k=3
2. Compute distance
3. Sort dist. and determine nearest neighbor based on kth min. dist.
4. Gather category y of nearest neighbors
5. Use simple majority as prediction of query instance
19
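A quick check of this numerical example in Python, using the squared Euclidean distances from the table:

```python
# Squared Euclidean distances from the query (x1=3, x2=7) to each training sample
train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)

dists = sorted(((x1 - query[0])**2 + (x2 - query[1])**2, label)
               for (x1, x2), label in train)
k = 3
neighbors = [label for _, label in dists[:k]]   # distances 9, 13, 16 -> Good, Good, Bad
print(max(set(neighbors), key=neighbors.count)) # majority vote -> "Good"
```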
Use Case
◦ Application
◦ pattern recognition
◦ data mining
◦ intrusion detection
◦ recommender
◦ products on Amazon
◦ articles on Medium
◦ movies on Netflix
◦ videos on YouTube
20
ADVANTAGES
• It is simple to implement.
• No hyperparameter tuning required.
• Makes no assumptions about data.
• Quite useful as in real world most data doesn’t
obey typical theoretical assumptions.
• No explicit training phase hence fast.
DISADVANTAGES
• The computation cost is high because of calculating the
distance between data points for all the training samples.
• Since all training data required for computation of
distance, algo requires large amount of memory.
• Prediction stage is slow.
• Sensitive to irrelevant features.
• Sensitive to scale of data.
21
SVM
◦ Discriminative classifier
◦ Extreme data points – support vectors (only support vectors are important whereas other training examples are ignorable)
◦ Hyperplane – best separates two classes
◦ If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane
becomes a two-dimensional plane.
◦ An unoptimized decision boundary could result in more misclassifications
◦ Maximum Margin classifier
◦ Margin = double the distance (perpendicular) between hyperplane and support vector (closest data point)
◦ Super sensitive to outliers in training data if they are considered as support vectors.
◦ In SVM, if the output of the linear function is greater than 1, we identify it with one class, and if the output is less than -1, we
identify it with the other class. The threshold values are changed to +1 and -1 in SVM, which act as the margin.
22
Implementation: https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html 23
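A minimal linear-SVM sketch with scikit-learn's SVC (toy, linearly separable data made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.support_vectors_)             # the extreme points that define the margin
print(clf.coef_, clf.intercept_)        # hyperplane w.x + b = 0
print(clf.decision_function([[4, 4]]))  # sign gives the predicted class
```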
Assumptions and Types
• Numerical Inputs: SVM assumes that your inputs are numeric. If you have categorical inputs you
may need to convert them to binary dummy variables (one variable for each category).
• Binary Classification: Basic SVM is intended for binary (two-class) classification problems.
Although, extensions have been developed for regression and multi-class classification.
• Soft margin: allows some samples to be placed on the wrong side of the margin.
• Hard margin: requires every sample to lie on the correct side of the margin (feasible only for linearly separable data).
24
Understanding Mathematics
Mathematical Eqn and Primal Dual:
https://www.youtube.com/watch?v=ptwn9wg_s48
TASK
Refer pg 13 pdf for solved numerical 10.1
25
From slide 10
C = 1/λ
C controls cost of misclassification of training data
Non Linear SVM
z=x^2+y^2
Transformation through nonlinear mapping function into linearly separable data
Kernel Types:
Linear
Polynomial
RBF/Gaussian (weighted NN), uses squared Euclidean distance, γ = 1/(2σ²)
Exponential
https://www.youtube.com/watch?v=efR1C6CvhmE
Refer pg 18 pdf for solved numerical 10.2
26
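A sketch of the nonlinear mapping idea (points inside a circle, labels defined by z = x² + y²) and the equivalent implicit mapping via an RBF kernel; the data and gamma value are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1).astype(int)    # inside a circle: not linearly separable in (x, y)

# Explicit mapping z = x^2 + y^2 makes the classes separable by a single threshold on z
z = X[:, 0]**2 + X[:, 1]**2
print(((z < 1).astype(int) == y).all())

# The RBF kernel performs a similar nonlinear mapping implicitly
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print(clf.score(X, y))
```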
SVM poses a quadratic optimization problem that maximizes the margin between both classes while
minimizing the amount of misclassification. For non-separable problems, in order to find a solution, the
misclassification constraint must be relaxed, and this is done by "regularization".
Regularization
C is the penalty parameter, which
represents misclassification or error term
i.e. how much error is bearable.
This is how you can control the trade-off
between decision boundary and
misclassification term.
A smaller value of C creates a large-margin hyperplane that is tolerant of misclassifications.
A large value of C creates a small-margin hyperplane and thus overfits, heavily penalizing misclassified points.
γ represents the spread of Kernel i.e. decision region
A lower value of Gamma will loosely fit the training dataset since
it considers only nearby points in calculating the separation line.
Higher value of gamma will exactly fit the training dataset
creating islands, which causes over-fitting since it considers all
the data points in the calculation of the separation line.
27
https://chrisalbon.com/machine_learning/support
_vector_machines/svc_parameters_using_rbf_ke
rnel/
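One common way to pick C and γ is a cross-validated grid search; a sketch with synthetic data and an illustrative parameter grid:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # trade-off between margin width and misclassification
```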
Use Case and Variants
◦ Face Recognition
◦ Intrusion detection
◦ Classification of emails, news articles and web pages
◦ Classification of genes
◦ Handwriting recognition.
◦ You can use a numerical optimization procedure as stochastic gradient descent to search for the coefficients of the hyperplane.
◦ The most popular method for fitting SVM is the Sequential Minimal Optimization (SMO) method, which is very efficient. It breaks
the Quadratic Programming problem down into sub-problems that can be solved analytically (by calculating) rather than
numerically (by searching or optimizing), through Lagrange multipliers satisfying the Karush-Kuhn-Tucker (KKT) conditions.
28
ADVANTAGES
• Effective in high dimensional space
• Applicable for both classification and regression
• Their dependence on relatively few support vectors
means that they are very compact models, and take up
very little memory.
• Once the model is trained, the prediction phase is very
fast
• Effective when no. of features > no. of samples
• Support overlapping classes
DISADVANTAGES
• Don’t provide probability estimates, these are
calculated using an expensive five-fold cross-
validation
• Requires scaling of features
• Sensitive to outliers
• Sensitive to the type of kernel used
29
PRACTICE QUESTIONS
◦ Given the following data, calculate hyperplane. Also classify (0.6,0.9) based on calculated hyperplane.
30
A1 A2 y
0.38 0.47 +
0.49 0.61 -
0.92 0.41 -
0.74 0.89 -
0.18 0.58 +
0.41 0.35 +
0.93 0.81 -
0.21 0.1 +
Multiclass / Multinomial Classification
◦ One vs One (OvO)
Eg. red, blue, green, yellow class
red vs blue, red vs green, red vs yellow, blue vs green, blue vs
yellow, green vs yellow
6 datasets i.e. c*(c-1)/2 models for c classes
Most votes for classification. argmax of sum of scores for
numerical class membership as probability
High computational complexity
31
◦ One vs Rest (OvR) One vs All (OvA)
Eg. red vs [blue, green, yellow]
blue vs [red, green, yellow]
green vs [red, blue, yellow]
yellow vs [red, blue, green]
C models for c classes
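scikit-learn wraps both strategies; a sketch on the 3-class iris data (LinearSVC as the base binary model is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                                 # 3 classes
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)     # c*(c-1)/2 = 3 binary models
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)    # c = 3 binary models
print(len(ovo.estimators_), len(ovr.estimators_))
```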
Decision Tree
◦ DT asks a question and classifies an instance based on an answer
◦ Categorical data, numeric data or ranked data. Outcome category or numeric
◦ Intuitive top down approach, follows If Then rules
◦ Interpretable and graphically representable
◦ Instances or tuples represented as attribute value pairs
◦ Performs Recursive Partitioning (greedy)
◦ Root (entire population/sample), internal node, leaf node
◦ Impure node
https://www.kdnuggets.com/2019/08/understanding-decision-trees-classification-python.html
2
Types and Comparison
        Splitting Criteria   Attribute Value                         Missing Value    Outlier       Pruning Strategy
ID3     Information Gain     Handles only categorical data           Doesn't handle   Susceptible   None
C4.5    Gain Ratio           Handles both categorical and numeric    Handles          -             Error Based
CART    Gini Index           -                                       Can handle       -             Cost Complexity
Attribute selection measures (heuristic)
◦ Entropy defines randomness/variance in data = −p·log₂(p) − q·log₂(q), i.e. how unpredictable it is
◦ If p = q, entropy = 1; if p = 1 or p = 0, entropy = 0
◦ Information Gain is the decrease in entropy after a split. Choose the attribute with the highest information gain
◦ IG = Entropy(S) − [weighted av. of the entropy of each subset produced by the feature]
◦ Gain Ratio = Gain / Split Info, where split info provides normalisation
◦ Gini Index/Impurity = 1 − p² − q²
◦ Compute for each feature; choose the lowest-impurity feature for the root
◦ Perfect split: gini impurity = 0; the higher the gini gain, the better the split
◦ Use entropy for exponential data distributions
https://www.youtube.com/watch?v=7VeUPuFGJHk&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=34
https://victorzhou.com/blog/information-gain/ https://victorzhou.com/blog/gini-impurity/
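A minimal sketch of these measures for a binary split (the 9/5 parent counts follow the classic play-tennis data; the child split below is hypothetical):

```python
import math

def entropy(p):
    """Binary entropy of a node with positive proportion p."""
    if p in (0, 1):
        return 0.0
    q = 1 - p
    return -p * math.log2(p) - q * math.log2(q)

def gini(p):
    q = 1 - p
    return 1 - p**2 - q**2

# Parent node: 9 positive, 5 negative (classic play-tennis counts)
parent = entropy(9 / 14)

# Hypothetical split into children of sizes 8 and 6 with 6 and 3 positives
children = (8 / 14) * entropy(6 / 8) + (6 / 14) * entropy(3 / 6)
info_gain = parent - children
print(parent, info_gain, gini(9 / 14))
```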
Determine the attribute that best classifies the training data
Example
Information Gain: https://www.youtube.com/watch?v=JsbaJp6VaaU
Solution
Rainy
Solved numerical with practical implementation
https://www.xoriant.com/blog/product-engineering/decision-
trees-machine-learning-algorithm.html
Solved numerical
https://medium.datadriveni
nvestor.com/decision-tree-
algorithm-with-hands-on-
example-e6c2afb40d38
Gini Index
https://www.youtube.com/watch?v=9K0M2KCyNYo
ID3 algo
1. Create a root node for the tree
2. If all examples are positive, return leaf node 'positive'
3. Else if all examples are negative, return leaf node 'negative'
4. Calculate the entropy of the current state H(S)
5. For each attribute, calculate the entropy with respect to the attribute 'x', denoted by H(S, x)
6. Select the attribute which has the maximum value of IG(S, x)
7. Remove the attribute that offers the highest IG from the set of attributes
8. Repeat until we run out of attributes, or the decision tree has all leaf nodes.
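A minimal recursive sketch of these steps (the attribute names and toy rows are made up):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    gain = entropy(labels)
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                      # steps 2-3: pure node -> leaf
        return labels[0]
    if not attrs:                                  # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # steps 4-6
    tree = {best: {}}
    for v in set(r[best] for r in rows):           # steps 7-8: branch and recurse without 'best'
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[best][v] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                            [a for a in attrs if a != best])
    return tree

# Tiny made-up example
rows = [{"outlook": "sunny", "wind": "weak"}, {"outlook": "rain", "wind": "strong"},
        {"outlook": "sunny", "wind": "strong"}, {"outlook": "overcast", "wind": "weak"}]
labels = ["no", "yes", "no", "yes"]
print(id3(rows, labels, ["outlook", "wind"]))
```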
ADVANTAGES
• Can be used with missing values
• Can handle multidimensional data
• Doesn’t require any domain knowledge
DISADVANTAGES
◦ Suffers from overfitting
◦ Handling continuous attributes
◦ Choosing appropriate attribute selection measure
◦ Handling attributes with differing costs
◦ Improving computational efficiency
SA
◦ X=(age=youth, income=medium,
student=yes, credit_rating=fair)
sr.no. age income student credit buy_computer
1 <30 High No Fair No
2 <30 High No Excellent No
3 31-40 High No Fair Yes
4 >40 Medium No Fair Yes
5 >40 Low Yes Fair Yes
6 >40 Low Yes Excellent No
7 31-40 Low Yes Excellent Yes
8 <30 Medium No Fair No
9 <30 Low Yes Fair Yes
10 >40 Medium Yes Fair Yes
11 <30 Medium Yes Excellent Yes
12 31-40 Medium No Excellent Yes
13 31-40 High Yes Fair Yes
14 >40 Medium No Excellent No
10
Issues in DT learning
◦ Determine how deeply to grow the decision tree
◦ Handling continuous attributes
◦ Choosing an appropriate attribute selection measure
◦ Handling training data with missing attribute values
◦ Handling attributes with differing costs
◦ Cost Sensitive DT
◦ Improving computational efficiency
◦ Overfitting in DT learning
◦ Pre Prune: Stop growing before it reaches a point where it perfectly classifies the data
◦ Post Prune: Grow full tree then prune
11
Ensemble Learning
I want to invest in a company XYZ. I am not sure about its performance though. So, I look for advice on whether the stock price will increase more
than 6% per annum or not? I decide to approach various experts having diverse domain experience:
1. Employee of Company XYZ: This person knows the internal functionality of the company and has the insider information about the functionality of
the firm. But he lacks a broader perspective on how competitors are innovating, how the technology is evolving and what the impact of this
evolution will be on Company XYZ’s product. In the past, he has been right 70% of times.
2. Financial Advisor of Company XYZ: This person has a broader perspective on how the company’s strategy will fare in this competitive environment.
However, he lacks a view on how the company’s internal policies are faring. In the past, he has been right 75% of times.
3. Stock Market Trader: This person has observed the company’s stock price over past 3 years. He knows the seasonality trends and how the overall
market is performing. He also has developed a strong intuition on how stocks might vary over time. In the past, he has been right 70% times.
4. Employee of a competitor: This person knows the internal functionality of the competitor firms and is aware of certain changes which are yet to be
brought. He lacks a sight of company in focus and the external factors which can relate the growth of competitor with the company of subject. In the
past, he has been right 60% of times.
5. Market Research team in same segment: This team analyzes the customer preference of company XYZ’s product over others and how is this
changing with time. Because he deals with customer side, he is unaware of the changes company XYZ will bring because of alignment to its own
goals. In the past, they have been right 75% of times.
6. Social Media Expert: This person can help us understand how company XYZ has positioned its products in the market, and how the sentiment
of customers is changing over time towards the company. He is unaware of any kind of details beyond digital marketing. In the past, he has been right
65% of times.
Given the broad spectrum of access we have, we can probably combine all the information and make an informed decision.
In a scenario when all the 6 experts/teams verify that it’s a good decision (assuming all the predictions are independent of each other), we will get a
combined accuracy rate of
1 − (30% × 25% × 30% × 40% × 25% × 35%) = 1 − 0.0007875 = 99.92125%
Variance vs Bias
◦ Bias error quantifies how much, on average, the predicted values differ from the actual values. A high bias error means we have an
under-performing model which keeps on missing important trends.
◦ Variance, on the other side, quantifies how much predictions made on the same observation differ from each other. A high variance model will over-fit on
your training population and perform badly on any observation beyond training.
Ensemble (Unity is Strength)
◦ Hypothesis: when weak models (base learners) are correctly combined we can obtain more accurate and/or robust models.
◦ Bagging: homogenous weak learners learn in parallel, then predictions are averaged
◦ Focusses on reducing variance
◦ Boosting: homogenous weak learners learn sequentially
◦ Focusses on reducing bias
◦ Stacking: heterogenous weak learners learn in parallel, combined via a meta-model
◦ Homogenous learners built using same ML model
◦ Heterogenous learners built using different models
◦ Weak Learner eg. Decision Stump (one level DT)
https://www.analyticsvidhya.com/blog/2018/06/comprehensive-
guide-for-ensemble-models/
Bagging (Bootstrap AGgreGatING)
Random sampling with replacement gives almost independent and almost representative samples
(a unit selected at random from the population is returned before the next unit is selected)
Simple average for Regression, simple majority vote for Classification (hard voting, soft voting)
Out-of-bag sample to evaluate Bagging Classifier
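A hedged sketch with scikit-learn's BaggingClassifier, using bootstrap sampling and the out-of-bag score (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=50,
                        bootstrap=True,        # sampling with replacement
                        oob_score=True,        # evaluate on out-of-bag samples
                        random_state=0)
bag.fit(X, y)
print(bag.oob_score_)
```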
UseCase
◦ Ozone Data
Random Forest
◦ Trees are very popular base models for ensemble methods.
◦ Strong learners composed of multiple trees can be called “forests”.
◦ Multiple trees allow for probabilistic classification and they are built independent of each other.
◦ Trees that compose a forest can be chosen to be either shallow or deep.
◦ Shallow trees have less variance but higher bias, and they are a better choice for sequential models, i.e. boosting.
◦ Deep trees have low bias but high variance and are relevant choices for the bagging method, which is mainly focused on
reducing variance.
◦ RF uses a trick to make the multiple fitted trees a bit less correlated with each other: when growing each tree, instead of
only sampling over the observations in the dataset to generate a bootstrap sample, we also sample over the features and
keep only a random subset of them to build the tree. This makes the decision-making process more robust to missing
data.
◦ Thus RF combines the concepts of bagging and random feature subspace selection to create more robust models.
SA4 https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&t=2s
https://victorzhou.com/blog/intro-to-random-forests/
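A sketch with scikit-learn's RandomForestClassifier showing the two ingredients named above, bootstrap sampling over observations and a random feature subset per split (synthetic data, illustrative settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=200,
                            max_features="sqrt",   # random feature subset at each split
                            bootstrap=True,        # bagging over observations
                            random_state=0)
rf.fit(X, y)
print(rf.predict_proba(X[:3]))   # multiple trees allow probabilistic classification
```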
Boosting
◦ In sequential methods the idea is to fit models iteratively such that the training of model at a given step
depends on the models fitted at the previous steps.
◦ It produces an ensemble model that is in general less biased than the weak learners that compose it.
◦ Each model in the sequence is fitted giving more importance to observations in the dataset that were badly
handled by the previous models in the sequence.
◦ Intuitively, each new model focusses its efforts on the most difficult observations to fit up to now, so that we
obtain, at the end of the process, a strong learner with lower bias (notice that boosting can also have the effect
of reducing variance).
◦ Boosting, like bagging, can be used for regression as well as for classification problems.
◦ If we want to use trees as our base models, we will most of the time choose shallow decision trees only a
few levels deep. A tree with a single split is termed a Stump.
◦ Types: Adaboost (SAMME), GradientBoost, XGBoost, GBM, LGBM, CatBoost, etc.
ADAptive BOOSTing
◦ Adaptive boosting updates the weights attached to each of the training dataset observations
◦ It trains and deploys trees in series
◦ Sensitive to noisy data and outliers
◦ Iterative optimization process
◦ Variants LogitBoost, L2Boost
◦ Usecase: face detection
https://www.youtube.com/watch?v=LsK-xG1cLYA
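A sketch with scikit-learn's AdaBoostClassifier using a decision stump as the weak learner (synthetic data, illustrative settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)       # weak learner: a decision stump
ada = AdaBoostClassifier(stump, n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X, y)                                     # reweights badly handled samples each round
print(ada.score(X, y))
```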
Stacking
◦ considers heterogeneous weak learners (different learning algorithms are combined)
◦ learns to combine the base models using a meta-model
◦ For example, for a classification problem, we can choose as weak learners a kNN classifier, a logistic
regressor and a SVM, and decide to learn a Neural Network as meta-model. Then, the neural network will
take as inputs the outputs of our three weak learners and will learn to return final predictions based on it.
◦ Variants include Multi level stacking
◦ Usecase: Classification of Cancer Microarrays
https://youtu.be/DCrcoh7cMHU
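A sketch of this exact setup with scikit-learn's StackingClassifier: kNN, logistic regression and an SVM as base learners, and a neural network as the meta-model (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
base = [("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC())]
stack = StackingClassifier(estimators=base,
                           final_estimator=MLPClassifier(max_iter=2000),  # meta-model
                           cv=5)
stack.fit(X, y)                     # meta-model learns from the base learners' outputs
print(stack.score(X, y))
```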
SA4
23
1 Explain various basic evaluation measures of supervised learning Algorithms for Classification.
2 Explain odds ratio and logit transformation.
3 Why is the Maximum Likelihood Estimation method used?
4 Justify the need of regularization in Logistic Regression
5 Differentiate Linear and Logistic regression.
6 Explain how a Radial Basis Function Network converts a nonlinearly separable problem to a linearly separable problem.
7 Explain key terminologies of SVM: hyperplane, separating hyperplane, hard margin, soft margin, support vectors.
8 Examine why SVM is more accurate than Logistic Regression.
9 Create optimal hyperplane for following points: {(1,1), (2,1), (1,-1), (2,-1), (4,0), (5,1), (6,0)}
10 For the given data, determine the entropy after classification using each attribute for classification separately and find which attribute is set as decision attribute for root by finding
information gain w.r.t. entropy of Temperature as reference attribute.
11 Create DT for attribute class using respective values:
12 What is a decision tree? How will you choose the best attribute for decision tree classifier? Give suitable examples.
13 Explain procedure to construct decision trees.
14 Discuss ensembles with the objective of resolving issues in DT learning.
15 What is the significance of the Gini Index as splitting criteria?
16 Differentiate ID3, CART and C4.5.
17 Suppose we apply DT learning to a training set. What if the training set size goes to infinity, will the learning algorithm return the correct tree. Why or why not?
18 Explain the working of the Bagging or Boosting ensemble.
19 Compare types of Boosting algorithms.
Table for Q.10: S. No. Temperature Wind Humidity
1 Hot Weak High
2 Hot Strong High
3 Mild Weak Normal
4 Cool Strong High
5 Cool Weak Normal
6 Mild Strong Normal
7 Mild Weak High
8 Hot Strong High
9 Mild Weak Normal
Table for Q.11: Eyecolor Married Sex Hairlength class
Brown Y M Long Football
Blue Y M Short Football
Brown Y M Long Football
Brown N F Long Netball
Brown N F Long Netball
Blue N Fm Long Football
Brown N F Long Netball
Brown N M Short Football
Brown Y F Short Netball
Brown N F Long Netball
More Related Content

What's hot (13)

PDF
K-means and GMM
Sanghyuk Chun
 
PPTX
2. Linear regression with one variable.pptx
Emad Nabil
 
PPTX
Curse of dimensionality
Nikhil Sharma
 
PPTX
Non Deterministic and Deterministic Problems
Scandala Tamang
 
PPTX
K-Folds Cross Validation Method
SHUBHAM GUPTA
 
PDF
Taiwanese Credit Card Client Fraud detection
Ravi Gupta
 
PDF
Parametric versus semi nonparametric parametric regression models
Nuriye Sancar
 
PDF
Customer churn prediction in banking
BU - PG Master Computing Conference
 
PDF
Seasonal ARIMA
Joud Khattab
 
PPTX
Analytical Hierarchy Process (AHP)
Rajiv Kumar
 
PDF
R Programming: Mathematical Functions In R
Rsquared Academy
 
PPTX
Daa unit 4
Abhimanyu Mishra
 
K-means and GMM
Sanghyuk Chun
 
2. Linear regression with one variable.pptx
Emad Nabil
 
Curse of dimensionality
Nikhil Sharma
 
Non Deterministic and Deterministic Problems
Scandala Tamang
 
K-Folds Cross Validation Method
SHUBHAM GUPTA
 
Taiwanese Credit Card Client Fraud detection
Ravi Gupta
 
Parametric versus semi nonparametric parametric regression models
Nuriye Sancar
 
Customer churn prediction in banking
BU - PG Master Computing Conference
 
Seasonal ARIMA
Joud Khattab
 
Analytical Hierarchy Process (AHP)
Rajiv Kumar
 
R Programming: Mathematical Functions In R
Rsquared Academy
 
Daa unit 4
Abhimanyu Mishra
 

Similar to ML MODULE 4.pdf (20)

PPTX
Lecture 11.pptxVYFYFYF UYF6 F7T7T7ITY8Y8YUO
AjayKumar773878
 
PPTX
Lecture 8 about data mining and how to use it.pptx
HedraAtif
 
PPTX
The world of loss function
홍배 김
 
PDF
Machine Learning Algorithms Introduction.pdf
Vinodh58
 
PDF
Cheatsheet supervised-learning
Steve Nouri
 
PPTX
Data Mining Lecture_10(b).pptx
Subrata Kumer Paul
 
PDF
Classification Techniques for Machine Learning
rahuljain582793
 
PDF
15-Data Analytics in IoT - Supervised Learning-04-09-2024.pdf
DharanshNeema
 
PPTX
Reuqired ppt for machine learning algirthms and part
SiddheshMhatre27
 
PPTX
Deep learning from mashine learning AI..
premkumarlive
 
PDF
Machine learning cheat sheet
Hany Sewilam Abdel Hamid
 
PDF
Data Science Cheatsheet.pdf
qawali1
 
PDF
Introduction to machine learning
Sanghamitra Deb
 
PPTX
KNN CLASSIFIER, INTRODUCTION TO K-NEAREST NEIGHBOR ALGORITHM.pptx
Nishant83346
 
PDF
working with python
bhavesh lande
 
PPTX
Introduction to Classification . pptx
Harsha Patil
 
PPTX
Coursera 1week
csl9496
 
PPTX
Machine learning
Sukhwinder Singh
 
PPTX
PREDICT 422 - Module 1.pptx
VikramKumar790542
 
PDF
Explore ml day 2
preetikumara
 
Lecture 11.pptxVYFYFYF UYF6 F7T7T7ITY8Y8YUO
AjayKumar773878
 
Lecture 8 about data mining and how to use it.pptx
HedraAtif
 
The world of loss function
홍배 김
 
Machine Learning Algorithms Introduction.pdf
Vinodh58
 
Cheatsheet supervised-learning
Steve Nouri
 
Data Mining Lecture_10(b).pptx
Subrata Kumer Paul
 
Classification Techniques for Machine Learning
rahuljain582793
 
15-Data Analytics in IoT - Supervised Learning-04-09-2024.pdf
DharanshNeema
 
Reuqired ppt for machine learning algirthms and part
SiddheshMhatre27
 
Deep learning from mashine learning AI..
premkumarlive
 
Machine learning cheat sheet
Hany Sewilam Abdel Hamid
 
Data Science Cheatsheet.pdf
qawali1
 
Introduction to machine learning
Sanghamitra Deb
 
KNN CLASSIFIER, INTRODUCTION TO K-NEAREST NEIGHBOR ALGORITHM.pptx
Nishant83346
 
working with python
bhavesh lande
 
Introduction to Classification . pptx
Harsha Patil
 
Coursera 1week
csl9496
 
Machine learning
Sukhwinder Singh
 
PREDICT 422 - Module 1.pptx
VikramKumar790542
 
Explore ml day 2
preetikumara
 
Ad

More from Shiwani Gupta (20)

PPTX
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
PDF
Generative Artificial Intelligence and Large Language Model
Shiwani Gupta
 
PDF
ML MODULE 6.pdf
Shiwani Gupta
 
PDF
ML MODULE 5.pdf
Shiwani Gupta
 
PDF
module6_stringmatchingalgorithm_2022.pdf
Shiwani Gupta
 
PDF
module5_backtrackingnbranchnbound_2022.pdf
Shiwani Gupta
 
PDF
module4_dynamic programming_2022.pdf
Shiwani Gupta
 
PDF
module3_Greedymethod_2022.pdf
Shiwani Gupta
 
PDF
module2_dIVIDEncONQUER_2022.pdf
Shiwani Gupta
 
PDF
module1_Introductiontoalgorithms_2022.pdf
Shiwani Gupta
 
PDF
ML MODULE 1_slideshare.pdf
Shiwani Gupta
 
PDF
ML MODULE 2.pdf
Shiwani Gupta
 
PDF
ML Module 3.pdf
Shiwani Gupta
 
PDF
Problem formulation
Shiwani Gupta
 
PDF
Simplex method
Shiwani Gupta
 
PDF
Functionsandpigeonholeprinciple
Shiwani Gupta
 
PDF
Relations
Shiwani Gupta
 
PDF
Logic
Shiwani Gupta
 
PDF
Set theory
Shiwani Gupta
 
PDF
Uncertain knowledge and reasoning
Shiwani Gupta
 
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
Generative Artificial Intelligence and Large Language Model
Shiwani Gupta
 
ML MODULE 6.pdf
Shiwani Gupta
 
ML MODULE 5.pdf
Shiwani Gupta
 
module6_stringmatchingalgorithm_2022.pdf
Shiwani Gupta
 
module5_backtrackingnbranchnbound_2022.pdf
Shiwani Gupta
 
module4_dynamic programming_2022.pdf
Shiwani Gupta
 
module3_Greedymethod_2022.pdf
Shiwani Gupta
 
module2_dIVIDEncONQUER_2022.pdf
Shiwani Gupta
 
module1_Introductiontoalgorithms_2022.pdf
Shiwani Gupta
 
ML MODULE 1_slideshare.pdf
Shiwani Gupta
 
ML MODULE 2.pdf
Shiwani Gupta
 
ML Module 3.pdf
Shiwani Gupta
 
Problem formulation
Shiwani Gupta
 
Simplex method
Shiwani Gupta
 
Functionsandpigeonholeprinciple
Shiwani Gupta
 
Relations
Shiwani Gupta
 
Set theory
Shiwani Gupta
 
Uncertain knowledge and reasoning
Shiwani Gupta
 
Ad

Recently uploaded (20)

PDF
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
PDF
apidays Munich 2025 - Integrate Your APIs into the New AI Marketplace, Senthi...
apidays
 
PDF
McKinsey - Global Energy Perspective 2023_11.pdf
niyudha
 
PDF
D9110.pdfdsfvsdfvsdfvsdfvfvfsvfsvffsdfvsdfvsd
minhn6673
 
PPTX
Pipeline Automatic Leak Detection for Water Distribution Systems
Sione Palu
 
PPTX
HSE WEEKLY REPORT for dummies and lazzzzy.pptx
ahmedibrahim691723
 
PDF
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
PDF
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
PDF
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
PDF
apidays Munich 2025 - The Physics of Requirement Sciences Through Application...
apidays
 
PPTX
Solution+Architecture+Review+-+Sample.pptx
manuvratsingh1
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PPTX
UVA-Ortho-PPT-Final-1.pptx Data analytics relevant to the top
chinnusindhu1
 
PPTX
7 Easy Ways to Improve Clarity in Your BI Reports
sophiegracewriter
 
PPTX
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
PPTX
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
PDF
blockchain123456789012345678901234567890
tanvikhunt1003
 
PPTX
Insurance-Analytics-Branch-Dashboard (1).pptx
trivenisapate02
 
PPT
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
PPTX
Presentation (1) (1).pptx k8hhfftuiiigff
karthikjagath2005
 
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
apidays Munich 2025 - Integrate Your APIs into the New AI Marketplace, Senthi...
apidays
 
McKinsey - Global Energy Perspective 2023_11.pdf
niyudha
 
D9110.pdfdsfvsdfvsdfvsdfvfvfsvfsvffsdfvsdfvsd
minhn6673
 
Pipeline Automatic Leak Detection for Water Distribution Systems
Sione Palu
 
HSE WEEKLY REPORT for dummies and lazzzzy.pptx
ahmedibrahim691723
 
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
Blitz Campinas - Dia 24 de maio - Piettro.pdf
fabigreek
 
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
apidays Munich 2025 - The Physics of Requirement Sciences Through Application...
apidays
 
Solution+Architecture+Review+-+Sample.pptx
manuvratsingh1
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
UVA-Ortho-PPT-Final-1.pptx Data analytics relevant to the top
chinnusindhu1
 
7 Easy Ways to Improve Clarity in Your BI Reports
sophiegracewriter
 
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
blockchain123456789012345678901234567890
tanvikhunt1003
 
Insurance-Analytics-Branch-Dashboard (1).pptx
trivenisapate02
 
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
Presentation (1) (1).pptx k8hhfftuiiigff
karthikjagath2005
 

ML MODULE 4.pdf

  • 2. SUPERVISED LEARNING - CLASSIFICATION Evaluation Metric Logistic Regression k Nearest Neighbor Linear SVM Kernel DT Issue in DT learning Ensemble- Bagging RF Ensemble – Boosting Adaboost Use case 2
  • 3. Performance ◦ Null Hypothesis: commonly accepted fact that you wish to test eg. data scientist salary on an av. is 113,000 dollars. ◦ Alternative Hypothesis: everything else eg. mean data scientist salary is not 113,000 dollars. ◦ Type I error (FP): Rejecting a true null hypothesis ◦ Type II error (FN): Accepting a false null hypothesis ◦ Confusion Matrix ◦ Accuracy = (TP+TN)/(TP+FN+FP+TN) ◦ Precision = TP/(TP+FP) eg. No. of patients diagnosed as having cancer actually had ◦ Recall/Sensitivity = TP/(TP+FN) eg. What portion of patients that actually had cancer were diagnosed by model as having ◦ Specificity = TN/(TN+FP) eg. Benign patients predicted benign ◦ F-score = (2*P*R)/(P+R) PredictedActual Positive Negative Positive TP FP Negative FN TN https://www.khanacademy.org/math/ap-statistics/tests-significance-ap/error-probabilities-power/v/introduction-to-type-i- and-type-ii-errors 3
  • 4. Logistic Regression Specialized case of Generalized Linear Model ◦ Just like LR, LoR can work with both continuous data eg. weight and discrete data eg. gender. ◦ A statistical model predicting the likelihood / probability. ◦ Uses logistic / sigmoid function to model binary/dichotomous/categorical dependent variable. • It is a mathematical function used to map the predicted values to probabilities. It forms a "S" curve. • In logistic regression, we use the concept of the threshold value, such that values above the threshold tends to 1, and a value below the threshold tends to 0. Thus any real value is mapped into another value within a range of 0 and 1. ◦ Assumes no / very little multicollinearity between predictor / independent variables. https://www.youtube.com/watch?v=yIYKR4sgzI8&list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe 4
  • 5. Mathematics ◦ Null Hypothesis H0: A relationship exists between predictor and response variable ◦ prob of success p = 0.8, prob of failure q = 1-p = 0.2 range [0,1] ◦ Odds(odds ratio) = success/failure = p/(1-p) ◦ Odds of success=p/q=4 range = [0,∞] ◦ log(odds) OR logit(p) = log(p/(1-p)) = z range=[-∞, ∞] as in Linear Regression ◦ p = elog(odds) / (1+elog(odds)) https://www.youtube.com/watch?v=vN5cNN2-HWE&list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe&index=25
  • 7. Loan Defaulter Sav ing s(L akh s) 0.5 0 0.7 5 1.0 0 1.2 5 1.5 0 1.7 5 1.7 5 2.0 0 2.2 5 2.5 0 2.7 5 3.0 0 3.2 5 3.5 0 4.0 0 4.2 5 4.5 0 4.7 5 5.0 0 5.5 0 Loa n Def ault er/ Not 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 Fitt ed Val ue 0.0 347 0.0 497 0.0 708 0.1 000 0.1 393 0.1 908 0/1 908 0.2 556 0.3 335 0.4 216 0.5 149 0.6 073 0.6 925 0.7 664 0.8 744 0.9 102 0.9 366 0.9 556 0.9 690 0.9 851 Pre dict ion 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 Coefficients b0 = -4.0778 b1 = 1.5046 prob = 1/(1+e-(-4.0778+1.5046*saving)) 7
  • 8. savings 0.5 0.75 1 1.25 1.5 1.75 1.75 2 2.25 2.5 2.75 3 3.25 3.5 4 4.25 4.5 4.75 5 5.5 y 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 prob = fitted value 0.0347070.0497670.0708830.100020.1393260.1908110.190810.2556690.3334880.4215780.5149580.6073050.6925670.7664370.8744180.9102550.9366060.9555980.96909 0.98519 prediction 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 odds 0.0359550.0523740.0762910.11113 0.16188 0.2358050.235810.3434890.5003490.7288411.0616771.5465092.2527463.2814986.96292710.1426614.7744621.5214531.349666.51982 logit -3.3255 -2.94935 -2.5732 -2.1971 -1.8209 -1.44475 -1.4448 -1.0686 -0.69245 -0.3163 0.05985 0.436 0.81215 1.1883 1.9406 2.31675 2.6929 3.06905 3.4452 4.1975 8
  • 9. Maximum Likelihood Estimation • Probabilistic framework for estimating parameters of model follows Bernoulli distribution. • Log likelihood • This negative function is because when we train, we need to maximize the probability by minimizing loss function. • Decreasing the cost will increase the maximum likelihood assuming that samples are drawn from an identically independent distribution. • When the model is a poor fit, log likelihood is relatively large negative value and when model is a good fit, log likelihood is close to zero. 9
  • 12. Types ◦ Binary Eg. 0/1, pass/fail, spam/not spam ◦ Multinomial: cat/dog/sheep, Veg/NonVeg/Vegan ◦ Ordinal: low/medium/high, movie rating 1-5 12
  • 13. Use Cases ◦ Email spam ◦ Credit card fraud ◦ Cancer benign/ malignant ◦ Predict if a user will invest in term deposit ◦ Loan defaulter 13
  • 14. ADVANTAGES • It is simple to implement • Works well for linearly separable data • Gives a measure of how relevant an independent variable is through coefficient • Tells us about the direction of the relationship (positive or negative) DISADVANTAGES • Fails to predict continuous outcome • Linearity assumption • Not accurate for small sample size 14
  • 15. PRACTICE QUESTIONS ◦ A team scored 285 runs in a cricket match. Assuming regression coefficients to be 0.3548 and 0.00089 respectively, calculate its probability of winning the match. ◦ You are applying for a home loan and your credit score is 720. Assuming logistic regression coefficient to be 9.346 and 0.0146 respectively, calculate probability of home loan application getting approved. 15
  • 16. K Nearest Neighbor ◦ non-parametric: it does not make any underlying assumptions about the distribution of data ◦ Intuition: given an unclassified point, we can assign it to a group by observing what group it’s nearest neighbors belong to • K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the Classification problems • It is also called a lazy learner algorithm because it does not learn from the training set instead it stores the dataset during training phase and at the time of classification, it performs an action on the dataset. • Also, the accuracy of the above classifier increases as we increase the number of data points in the training set. 16
  • 17. Algorithm Step-1: Select the number K of the neighbors Step-2: Calculate the Euclidean distance of K number of neighbors Step-3: Take the K nearest neighbors as per the calculated Euclidean distance. Step-4: Among these k neighbors, count the number of the data points in each category. Step-5: Assign the new data points to that category for which the number of the neighbor is maximum. Step-6: Our model is ready. K can be kept as an odd number so that we can calculate a clear majority in the case where only two groups are possible (e.g. Red/Blue). Most preferred value is 5. A very low value, can be noisy and lead to effects of outliers in model. With increasing K, we get smoother, more defined boundaries across different classifications. Example: Suppose, we have an image of a creature that looks similar to cat and dog, but we want to know either it is a cat or dog. So for this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the similar features of the new data set to the cats and dogs images and based on the most similar features it will put it in either cat or dog category. 17
  • 18. Distance metric ◦ Minkowski Distance ◦ Euclidean Distance if input variables similar in type eg. width, height ◦ Manhattan Distance / City block distance if grid like path ◦ Hamming Distance between binary vectors ◦ Others: Jaccard, Mahalanobis, cosine similarity, Tanimoto, etc. 18
  • 19. Numerical Example x1=acid durability (sec) x2=strength (kg/m2) y=class Euclidean Distance 7 7 Bad 16 7 4 Bad 25 3 4 Good 9 1 4 Good 13 Factory produces a new paper tissue that passes lab test with x1=3, x2=7. Classify this tissue. 1. k? k=3 2. Compute distance 3. Sort dist. and determine nearest neighbor based on kth min. dist. 4. Gather category y of nearest neighbors 5. Use simple majority as prediction of query instance 19
  • 20. Use Case ◦ Application ◦ pattern recognition ◦ data mining ◦ intrusion detection ◦ recommender ◦ products on Amazon ◦ articles on Medium ◦ movies on Netflix ◦ videos on YouTube 20
  • 21. ADVANTAGES • It is simple to implement. • No hyperparameter tuning required. • Makes no assumptions about data. • Quite useful as in real world most data doesn’t obey typical theoretical assumptions. • No explicit training phase hence fast. DISADVANTAGES • The computation cost is high because of calculating the distance between data points for all the training samples. • Since all training data required for computation of distance, algo requires large amount of memory. • Prediction stage is slow. • Sensitive to irrelevant features. • Sensitive to scale of data. 21
  • 22. SVM ◦ Discriminative classifier ◦ Extreme data points – support vectors (only support vectors are important whereas other training example are ignorable) ◦ Hyperplane – best separates two classes ◦ If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. ◦ Unoptimized decision boundary could result in more miss classifications ◦ Maximum Margin classifier ◦ Margin = double the distance (perpendicular) between hyperplane and support vector (closest data point) ◦ Super sensitive to outliers in training data if they are considered as support vectors. ◦ In SVM, if the output of linear function is greater than 1, we identify it with one class and if the output is -1, we identify it with another class. The threshold values are changed to 1 and -1 in SVM, which acts as margin. 22
  • 24. Assumptions and Types • Numerical Inputs: SVM assumes that your inputs are numeric. If you have categorical inputs you may need to covert them to binary dummy variables (one variable for each category). • Binary Classification: Basic SVM is intended for binary (two-class) classification problems. Although, extensions have been developed for regression and multi-class classification. • Soft margin: allows some samples to be placed on wrong side of margin. • Hard margin 24
  • 25. Understanding Mathematics Mathematical Eqn and Primal Dual: https://www.youtube.com/watch?v=ptwn9wg_s48 TASK Refer pg 13 pdf for solved numerical 10.1 25 From slide 10 C = 1/λ C controls cost of misclassification of training data
  • 26. Non Linear SVM z=x^2+y^2 Transformation through nonlinear mapping function into linearly separable data Kernel Types: Linear Polynomial RBF/Gaussian (weighted NN) squared Euclidean distance, γ = 1/(2σ2) Exponential https://www.youtube.com/watch?v=efR1C6CvhmE Refer pg 18 pdf for solved numerical 10.2 26 SVM poses a quadratic optimization problem that looks for maximizing the margin between both classes and minimizing the amount of miss-classifications. For non-separable problems, in order to find a solution, the miss-classification constraint must be relaxed, and this is done by "regularization“.
  • 27. Regularization C is the penalty parameter, which represents misclassification or error term i.e. how much error is bearable. This is how you can control the trade-off between decision boundary and misclassification term. A smaller value of C creates a large- margin hyperplane that is tolerant of miss classifications. Large value of C creates a small-margin hyperplane and thus overfits and heavily penalizes for misclassified points. γ represents the spread of Kernel i.e. decision region A lower value of Gamma will loosely fit the training dataset since it considers only nearby points in calculating the separation line. Higher value of gamma will exactly fit the training dataset creating islands, which causes over-fitting since it considers all the data points in the calculation of the separation line. 27 https://chrisalbon.com/machine_learning/support _vector_machines/svc_parameters_using_rbf_ke rnel/
  • 28. Use Case and Variants ◦ Face Recognition ◦ Intrusion detection ◦ Classification of emails, news articles and web pages ◦ Classification of genes ◦ Handwriting recognition. ◦ You can use a numerical optimization procedure as stochastic gradient descent to search for the coefficients of the hyperplane. ◦ The most popular method for fitting SVM is the Sequential Minimal Optimization (SMO) method that is very efficient. It breaks the Quadratic Programming problem down into sub-problems that can be solved analytically (by calculating) rather than numerically (by searching or optimizing) through Lagrangian Multiplier by satisfying Karush Kahun Tucker (KKT) conditions. 28
  • 29. ADVANTAGES • Effective in high dimensional space • Applicable for both classification and regression • Their dependence on relatively few support vectors means that they are very compact models, and take up very little memory. • Once the model is trained, the prediction phase is very fast • Effective when no. of features > no. of samples • Support overlapping classes DISADVANTAGES • Don’t provide probability estimates, these are calculated using an expensive five-fold cross- validation • Requires scaling of features • Sensitive to outliers • Sensitive to the type of kernel used 29
  • 30. PRACTICE QUESTIONS ◦ Given the following data, calculate hyperplane. Also classify (0.6,0.9) based on calculated hyperplane. 30 A1 A2 y 0.38 0.47 + 0.49 0.61 - 0.92 0.41 - 0.74 0.89 - 0.18 0.58 + 0.41 0.35 + 0.93 0.81 - 0.21 0.1 +
  • 31. Multiclass / Multinomial Classification ◦ One vs One (OvO) Eg. red, blue, green, yellow class red vs blue, red vs green, red vs yellow, blue vs green, blue vs yellow, green vs yellow 6 datasets i.e. c*(c-1)/2 models for c classes Most votes for classification. argmax of sum of scores for numerical class membership as probability High computational complexity 31 ◦ One vs Rest (OvR) One vs All (OvA) Eg. red vs [blue, green, yellow] blue vs [red, green, yellow] green vs [red, blue, yellow] yellow vs [red, blue, green] C models for c classes
  • 32. Decision Tree ◦ DT asks a question and classifies an instance based on an answer ◦ Categorical data, numeric data or ranked data. Outcome category or numeric ◦ Intuitive top down approach, follows If Then rules ◦ Interpretable and graphically representable ◦ Instances or tuples represented as attribute value pairs ◦ Performs Recursive Partitioning (greedy) ◦ Root (entire population/sample), internal node, leaf node ◦ Impure node https://www.kdnuggets.com/2019/08/understanding-decision-trees-classification-python.html
  • 33. 2 Splitting Criteria Attribute Value Missing Value Outlier Pruning Strategy ID3 Information Gain Handles only categorical data Doesn’t handle Susceptible None C4.5 Gain Ratio Handles both categorical and numeric Handles Error Based CART Gini Index Can handle Cost Complexity Types and Comparison
  • 34. Attribute selection measures (heuristic) ◦ Entropy defines randomness/variance in data = -plog2p - qlog2q i.e. how unpredictable it is ◦ If p=q, entropy=1; p=1/0, entropy=0 ◦ Information Gain is decrease in entropy post split. Chose attribute with highest information gain ◦ IG=Entropy(S)-[weighted av.*entropy of each feature] ◦ Gain Ratio = Gain/Split Info, where split info provides normalisation ◦ Gini Index/Impurity = 1-p2-q2 ◦ Compute for each feature, chose lowest impurity feature for root ◦ Perfect split: gini impurity=0, higher the gini gain, better the split ◦ Use entropy for exponential data distribution https://www.youtube.com/watch?v=7VeUPuFGJHk&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=34 https://victorzhou.com/blog/information-gain/ https://victorzhou.com/blog/gini-impurity/
  • 35. Determine the attribute that best classifies the training data Example Information Gain: https://www.youtube.com/watch?v=JsbaJp6VaaU
  • 37. Solved numerical with practical implementation https://www.xoriant.com/blog/product-engineering/decision- trees-machine-learning-algorithm.html Solved numerical https://medium.datadriveni nvestor.com/decision-tree- algorithm-with-hands-on- example-e6c2afb40d38
  • 39. ID3 algo 1.Create root node for the tree 2.If all examples are positive, return leaf node ‘positive’ 3.Else if all examples are negative, return leaf node ‘negative’ 4.Calculate the entropy of current state H(S) 5.For each attribute, calculate the entropy with respect to the attribute ‘x’ denoted by H(S, x) 6.Select the attribute which has maximum value of IG(S, x) 7.Remove the attribute that offers highest IG from the set of attributes 8.Repeat until we run out of all attributes, or the decision tree has all leaf nodes.
  • 40. ADVANTAGES • Can be used with missing values • Can handle multidimensional data • Doesn’t require any domain knowledge DISADVANTAGES ◦ Suffers from overfitting ◦ Handling continuous attributes ◦ Choosing appropriate attribute selection measure ◦ Handling attributes with differing costs ◦ Improving computational efficiency
  • 41. SA ◦ X=(age=youth, income=medium, student=yes, credit_rating=fair) sr.no. age income student credit buy_computer 1 <30 High No Fair No 2 <30 High No Excellent No 3 31-40 High No Fair Yes 4 >40 Medium No Fair Yes 5 >40 Low Yes Fair Yes 6 >40 Low Yes Excellent No 7 31-40 Low Yes Excellent Yes 8 <30 Medium No Fair No 9 <30 Low Yes Fair Yes 10 >40 Medium Yes Fair Yes 11 <30 Medium Yes Excellent Yes 12 31-40 Medium No Excellent Yes 13 31-40 High Yes Fair Yes 14 >40 Medium No Excellent No 10
  • 42. Issues in DT learning ◦ Determine how deeply to grow the decision tree ◦ Handling continuous attributes ◦ Choosing an appropriate attribute selection measure ◦ Handling training data with missing attribute values ◦ Handling attributes with differing costs ◦ Cost Sensitive DT ◦ Improving computational efficiency ◦ Overfitting in DT learning ◦ Pre Prune: Stop growing before it reaches a point where it perfectly classifies the data ◦ Post Prune: Grow full tree then prune 11
  • 43. Ensemble Learning I want to invest in a company XYZ. I am not sure about its performance though. So, I look for advice on whether the stock price will increase more than 6% per annum or not? I decide to approach various experts having diverse domain experience: 1. Employee of Company XYZ: This person knows the internal functionality of the company and has the insider information about the functionality of the firm. But he lacks a broader perspective on how are competitors innovating, how is the technology evolving and what will be the impact of this evolution on Company XYZ’s product. In the past, he has been right 70% times. 2. Financial Advisor of Company XYZ: This person has a broader perspective on how companies strategy will fair of in this competitive environment. However, he lacks a view on how the company’s internal policies are fairing off. In the past, he has been right 75% times. 3. Stock Market Trader: This person has observed the company’s stock price over past 3 years. He knows the seasonality trends and how the overall market is performing. He also has developed a strong intuition on how stocks might vary over time. In the past, he has been right 70% times. 4. Employee of a competitor: This person knows the internal functionality of the competitor firms and is aware of certain changes which are yet to be brought. He lacks a sight of company in focus and the external factors which can relate the growth of competitor with the company of subject. In the past, he has been right 60% of times. 5. Market Research team in same segment: This team analyzes the customer preference of company XYZ’s product over others and how is this changing with time. Because he deals with customer side, he is unaware of the changes company XYZ will bring because of alignment to its own goals. In the past, they have been right 75% of times. 6. Social Media Expert: This person can help us understand how has company XYZ positioned its products in the market. And how are the sentiment of customers changing over time towards company. He is unaware of any kind of details beyond digital marketing. In the past, he has been right 65% of times. Given the broad spectrum of access we have, we can probably combine all the information and make an informed decision. In a scenario when all the 6 experts/teams verify that it’s a good decision (assuming all the predictions are independent of each other), we will get a combined accuracy rate of 1 - 30%*25%*30%*40%*25%*35%= 1 - 0.07875 = 99.92125%
  • 44. Variance vs Bias
◦ Bias error quantifies how much, on average, the predicted values differ from the actual values. A high bias error means we have an under-performing model which keeps missing important trends.
◦ Variance, on the other hand, quantifies how different the predictions made on the same observation are from each other. A high-variance model will over-fit on your training population and perform badly on any observation beyond the training data.
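A minimal sketch of this trade-off (not from the slides; it assumes scikit-learn and a synthetic two-moons dataset): a depth-1 tree tends to underfit (high bias, similar but low train and test scores), while an unpruned tree tends to overfit (high variance, large gap between train and test scores).

```python
# Sketch: bias (underfitting) vs variance (overfitting) with decision trees.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, None):          # None grows the tree until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print("max_depth =", depth,
          "train:", round(tree.score(X_tr, y_tr), 2),
          "test:", round(tree.score(X_te, y_te), 2))
```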
  • 45. Ensemble (Unity is Strength)
◦ Hypothesis: when weak models (base learners) are correctly combined, we can obtain more accurate and/or robust models.
◦ Bagging: homogeneous weak learners learn in parallel and their predictions are averaged
◦ Focuses on reducing variance
◦ Boosting: homogeneous weak learners learn sequentially
◦ Focuses on reducing bias
◦ Stacking: heterogeneous weak learners learn in parallel and are combined via a meta-model
◦ Homogeneous learners are built using the same ML model
◦ Heterogeneous learners are built using different models
◦ Weak Learner eg. Decision Stump (one level DT) https://www.analyticsvidhya.com/blog/2018/06/comprehensive- guide-for-ensemble-models/
  • 46. Bagging (Bootstrap AGgreGatING)
Random sampling with replacement to obtain almost independent and almost representative bootstrap samples (a unit selected at random from the population is returned before the next unit is drawn)
Simple average for Regression, simple majority vote for Classification (hard voting, soft voting)
Out-of-bag sample used to evaluate the Bagging Classifier
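A minimal sketch of bagging with out-of-bag evaluation (not from the slides; it assumes scikit-learn and its breast-cancer toy dataset; the default base estimator of BaggingClassifier is a decision tree):

```python
# Sketch: bagging of decision trees on bootstrap samples, scored on OOB data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)

bag = BaggingClassifier(
    n_estimators=100,
    bootstrap=True,             # sampling with replacement
    oob_score=True,             # evaluate on observations left out of each bootstrap
    random_state=0,
).fit(X, y)

print("out-of-bag accuracy:", bag.oob_score_)
```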
  • 48. Random Forest
◦ Trees are very popular base models for ensemble methods.
◦ Strong learners composed of multiple trees can be called “forests”.
◦ Multiple trees allow for probabilistic classification, and they are built independently of each other.
◦ Trees that compose a forest can be chosen to be either shallow or deep.
◦ Shallow trees have less variance but higher bias, and they are a better choice for sequential methods, i.e. boosting.
◦ Deep trees have low bias but high variance and are relevant choices for the bagging method, which is mainly focused on reducing variance.
◦ RF uses a trick to make the multiple fitted trees a bit less correlated with each other: when growing each tree, instead of only sampling over the observations in the dataset to generate a bootstrap sample, we also sample over features and keep only a random subset of them to build the tree. This makes the decision-making process more robust to missing data.
◦ Thus RF combines the concepts of bagging and random feature subspace selection to create more robust models. SA4 https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&t=2s
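A minimal sketch of the two ingredients just described, bootstrap samples plus a random feature subset at each split (not from the slides; it assumes scikit-learn and its breast-cancer toy dataset, with illustrative hyperparameters):

```python
# Sketch: random forest = bagging + random feature subspace at each split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",        # random subset of features considered at each split
    random_state=0,
)
print("5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```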
  • 50. Boosting
◦ In sequential methods the idea is to fit models iteratively such that the training of the model at a given step depends on the models fitted at the previous steps.
◦ It produces an ensemble model that is in general less biased than the weak learners that compose it.
◦ Each model in the sequence is fitted giving more importance to observations in the dataset that were badly handled by the previous models in the sequence.
◦ Intuitively, each new model focuses its efforts on the observations that have been the most difficult to fit so far, so that at the end of the process we obtain a strong learner with lower bias (notice that boosting can also have the effect of reducing variance).
◦ Boosting, like bagging, can be used for regression as well as for classification problems.
◦ If we want to use trees as our base models, we will most of the time choose shallow decision trees of limited depth. A tree with a single split (one level) is termed a stump.
◦ Types: Adaboost (SAMME), GradientBoost, XGBoost, GBM, LGBM, CatBoost, etc.
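A minimal sketch of sequential boosting with shallow trees (not from the slides; it assumes scikit-learn's GradientBoostingClassifier, its breast-cancer toy dataset, and illustrative hyperparameters):

```python
# Sketch: gradient boosting with shallow trees; each new tree corrects the
# errors left by the ensemble built so far.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=200,
    max_depth=2,                # shallow base trees, as the slide suggests
    learning_rate=0.1,
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", gb.score(X_te, y_te))
```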
  • 51. ADAptive BOOSTing ◦ Adaptive boosting updates the weights attached to each of the training dataset observations ◦ It trains and deploys trees in series ◦ Sensitive to noisy data and outliers ◦ Iterative optimization process ◦ Variants LogitBoost, L2Boost ◦ Usecase: face detection https://www.youtube.com/watch?v=LsK-xG1cLYA
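A minimal sketch of AdaBoost (not from the slides; it assumes scikit-learn and its breast-cancer toy dataset; the default base estimator of AdaBoostClassifier is a decision stump):

```python
# Sketch: AdaBoost trains stumps in series, reweighting misclassified
# observations so the next stump pays more attention to them.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=1.0,
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", ada.score(X_te, y_te))
```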
  • 53. Stacking
◦ considers heterogeneous weak learners (different learning algorithms are combined)
◦ learns to combine the base models using a meta-model
◦ For example, for a classification problem, we can choose as weak learners a kNN classifier, a logistic regressor and an SVM, and decide to learn a Neural Network as the meta-model. The neural network will then take as inputs the outputs of our three weak learners and will learn to return the final prediction based on them.
◦ Variants include multi-level stacking
◦ Usecase: Classification of Cancer Microarrays https://youtu.be/DCrcoh7cMHU
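A minimal sketch of the stacking example just described, with kNN, logistic regression and an SVM as base learners and a small neural network (MLP) as the meta-model (not from the slides; it assumes scikit-learn, its breast-cancer toy dataset, and illustrative settings):

```python
# Sketch: heterogeneous base learners combined by an MLP meta-model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

base_learners = [
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("lr",  make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
]

stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
print("training accuracy (illustration only):", stack.fit(X, y).score(X, y))
```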
  • 54. SA4 23
1 Explain various basic evaluation measures of supervised learning algorithms for Classification.
2 Explain odds ratio and logit transformation.
3 Why is the Maximum Likelihood Estimation method used?
4 Justify the need for regularization in Logistic Regression.
5 Differentiate Linear and Logistic Regression.
6 Explain how a Radial Basis Function Network converts a nonlinearly separable problem to a linearly separable problem.
7 Explain key terminologies of SVM: hyperplane, separating hyperplane, hard margin, soft margin, support vectors.
8 Examine why SVM is more accurate than Logistic Regression.
9 Create the optimal hyperplane for the following points: {(1,1), (2,1), (1,-1), (2,-1), (4,0), (5,1), (6,0)}
10 For the data given below, determine the entropy after classification using each attribute separately and find which attribute is set as the decision attribute for the root by computing information gain w.r.t. the entropy of Temperature as the reference attribute.
11 Create a DT for the attribute "class" using the respective values given below.
12 What is a decision tree? How will you choose the best attribute for a decision tree classifier? Give suitable examples.
13 Explain the procedure to construct decision trees.
14 Discuss ensembles with the objective of resolving issues in DT learning.
15 What is the significance of the Gini Index as a splitting criterion?
16 Differentiate ID3, CART and C4.5.
17 Suppose we apply DT learning to a training set. If the training set size goes to infinity, will the learning algorithm return the correct tree? Why or why not?
18 Explain the working of the Bagging or Boosting ensemble.
19 Compare types of Boosting algorithms.
Data for Q10:
S. No.  Temperature  Wind    Humidity
1       Hot          Weak    High
2       Hot          Strong  High
3       Mild         Weak    Normal
4       Cool         Strong  High
5       Cool         Weak    Normal
6       Mild         Strong  Normal
7       Mild         Weak    High
8       Hot          Strong  High
9       Mild         Weak    Normal
Data for Q11:
Eyecolor  Married  Sex  Hairlength  class
Brown     Y        M    Long        Football
Blue      Y        M    Short       Football
Brown     Y        M    Long        Football
Brown     N        F    Long        Netball
Brown     N        F    Long        Netball
Blue      N        Fm   Long        Football
Brown     N        F    Long        Netball
Brown     N        M    Short       Football
Brown     Y        F    Short       Netball
Brown     N        F    Long        Netball