18ECE307J - APPLIED MACHINE
LEARNING
Unit 1
Introduction to Machine learning: Types of Machine Learning - Supervised
Learning, Unsupervised Learning, Reinforcement Learning, the Curse of
Dimensionality, Bias and Variance, Learning Curve, Classification Error and
Noise, Linear Regression, Support Vector Machines, Basics of Neural Networks,
Perceptrons, Linear Separability, and Introduction to Multilayer Perceptrons
Prepared by Dr. P. Vijayakumar, Associate Professor, ECE, SRM IST
Introduction to Machine learning
• Machine Learning is the science (and art) of programming computers so
they can learn from data
• Machine Learning is the field of study that gives computers the ability to
learn without being explicitly programmed.—Arthur Samuel, 1959
• A computer program is said to learn from experience E with respect to
some task T and some performance measure P, if its performance on T, as
measured by P, improves with experience E.-—Tom Mitchell, 1997
• Example: a spam filter is a Machine Learning program that can learn to flag
spam given examples of spam emails (e.g., flagged by users) and examples
of regular (non-spam, also called “ham”) emails.
More on spam filter ..
• The examples that the spam filter system uses to learn are called the
training set. Each training example is called a training instance (or
sample).
• In this case, the task T is to flag spam for new emails, the experience E
is the training data, and the performance measure P needs to be
defined; for example, you can use the ratio of correctly classified
emails.
• This particular performance measure is called accuracy and it is often
used in classification tasks.
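As a minimal illustration of this measure (not from the slides; the labels below are made up), accuracy is just the ratio of correctly classified emails:

# Accuracy = ratio of correctly classified emails (1 = spam, 0 = ham).
# The labels and predictions are invented for illustration.
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]   # classifier outputs

correct = sum(t == p for t, p in zip(y_true, y_pred))
print(correct / len(y_true))               # 8/10 = 0.8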
Traditional programming and machine learning
• Traditional approach: you end up writing a long list of complex
rules—pretty hard to maintain.
• Machine Learning approach: the program is much shorter, easier to
maintain, and most likely more accurate.
More on ML
• Machine Learning systems can automatically adapt to change.
• Machine Learning can help humans learn: for example, a trained spam filter
can be inspected to reveal the list of words and combinations of words that
it believes are the best predictors of spam.
• Applying ML techniques to dig into large amounts of data can help
discover patterns that were not immediately apparent. This is called
data mining.
When to use Machine Learning
• Problems for which existing solutions require a lot of hand-tuning or
long lists of rules: one Machine Learning algorithm can often simplify
code and perform better.
• Complex problems for which there is no good solution at all using a
traditional approach: the best Machine Learning techniques can find a
solution.
• Fluctuating environments: a Machine Learning system can adapt to
new data.
• Getting insights about complex problems and large amounts of data.
Types of Machine Learning
• Supervised learning: A training set of examples with the correct responses (targets) is provided and, based on
this training set, the algorithm generalises to respond correctly to all possible inputs. This is also called
learning from exemplars.
• Unsupervised learning: Correct responses are not provided, but instead the algorithm tries to identify
similarities between the inputs so that inputs that have something in common are categorised together. The
statistical approach to unsupervised learning is known as density estimation.
• Reinforcement learning: This is somewhere between supervised and unsupervised learning. The algorithm
gets told when the answer is wrong, but does not get told how to correct it. It has to explore and try out
different possibilities until it works out how to get the answer right. Reinforcement learning is sometimes
called learning with a critic because of this monitor that scores the answer, but does not suggest
improvements.
• Evolutionary learning: Biological evolution can be seen as a learning process: biological organisms adapt to
improve their survival rates and chance of having offspring in their environment. We’ll look at how we can
model this in a computer, using an idea of fitness, which corresponds to a score for how good the current
solution is.
Supervised Learning
• In supervised learning, the training data you
feed to the algorithm includes the desired
solutions, called labels
• A typical supervised learning task is
classification. The spam filter is a good
example of this: it is trained with many
example emails along with their class (spam
or ham), and it must learn how to classify new
emails.
• Another typical task is to predict a target
numeric value, such as the price of a car,
given a set of features (mileage, age, brand,
etc.) called predictors. This sort of task is
called regression
• To train the system, you need to give it many
examples of cars, including both their
predictors and their labels
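A minimal sketch of such a regression task (the car data below is entirely made up for illustration; Scikit-Learn's LinearRegression is used as the learner):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training set: predictors (mileage in 1000 km, age in years)
# and labels (price). All numbers are invented for illustration.
X = np.array([[20, 1], [50, 3], [80, 5], [120, 8], [150, 10]])
y = np.array([8.5, 7.0, 5.5, 3.5, 2.5])

reg = LinearRegression().fit(X, y)
print(reg.predict([[60, 4]]))   # predicted price for an unseen car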
supervised learning algorithms
• k-Nearest Neighbors
• Linear Regression
• Logistic Regression
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
Unsupervised Learning
• In unsupervised learning, as you might guess, the training
data is unlabeled
• Unsupervised learning algorithms
• Clustering
  — k-Means
  — Hierarchical Cluster Analysis (HCA)
  — Expectation Maximization
• Visualization and dimensionality reduction
  — Principal Component Analysis (PCA)
  — Kernel PCA
  — Locally Linear Embedding (LLE)
  — t-distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
  — Apriori
  — Eclat
• Clustering: the algorithm tries to detect groups of similar instances (a short k-Means sketch follows these bullets).
• Visualization: the input is a lot of complex and unlabeled data; the output is a 2D or 3D
representation of the data that can easily be plotted.
• These algorithms try to preserve as much structure as they can
(e.g., trying to keep separate clusters in the input space from
overlapping in the visualization), so you can understand how
the data is organized and perhaps identify unsuspected
patterns.
• Dimensionality reduction: the goal is to simplify the data without
losing too much information. One way to do this is to merge
several correlated features into one (feature extraction).
• Association rule learning: the goal is to dig into large amounts of
data and discover interesting relations between attributes.
• For example, suppose you own a supermarket. Running an
association rule on your sales logs may reveal that people who
purchase barbecue sauce and potato chips also tend to buy
steak. Thus, you may want to place these items close to each
other.
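A minimal clustering sketch (the 2D points are made up; k-Means from Scikit-Learn is one of the algorithms listed above):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose blobs in 2D, invented for illustration.
X = np.array([[1.0, 1.1], [0.8, 1.3], [1.2, 0.9],
              [5.0, 5.2], [5.3, 4.8], [4.9, 5.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # learned group centres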
Reinforcement learning
• The learning system, called an
agent in this context, can observe
the environment, select and
perform actions, and get rewards
in return or penalties in the form of
negative rewards.
• It must then learn by itself what is
the best strategy, called a policy, to
get the most reward over time.
• A policy defines what action the
agent should choose when it is in a
given situation.
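As a minimal, illustrative sketch of this reward-driven loop (not from the slides): an epsilon-greedy agent on a 3-armed bandit, where the action-value estimates determine the policy. The reward means and the epsilon value are assumptions made for the example.

import random

# Toy environment: 3 actions with different (hidden) average rewards.
true_means = [0.2, 0.5, 0.8]            # unknown to the agent
def step(action):                        # environment returns a noisy reward
    return true_means[action] + random.gauss(0, 0.1)

# Agent: epsilon-greedy policy over estimated action values.
q = [0.0, 0.0, 0.0]                      # estimated reward per action
counts = [0, 0, 0]
epsilon = 0.1

for t in range(1000):
    if random.random() < epsilon:        # explore
        a = random.randrange(3)
    else:                                # exploit the current best estimate
        a = q.index(max(q))
    r = step(a)                          # reward (or penalty) from the environment
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]       # incremental average update

print(q)   # estimates approach the true means; best action is index 2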
THE MACHINE LEARNING PROCESS
• Data Collection and Preparation: Machine learning algorithms need significant amounts of data, preferably
without too much noise, but with increased dataset size comes increased computational costs, and the sweet
spot at which there is enough data without excessive computational overhead is generally impossible to
predict.
• Feature Selection: This consists of identifying the features that are most useful for the problem under
examination. It invariably requires prior knowledge of the problem and the data; common sense is often
used to identify some potentially useful features and to exclude others.
• Algorithm Choice: Given the dataset, an appropriate algorithm (or algorithms) has to be chosen.
• Parameter and Model Selection: For many of the algorithms there are parameters that have to be set manually,
or that require experimentation to identify appropriate values.
• Training: Training should simply be the use of computational resources in order to build a model of the data.
• Evaluation:Before a system can be deployed it needs to be tested and evaluated for accuracy on data that it
was not trained on. This can often include a comparison with human experts in the field, and the selection of
appropriate metrics for this comparison.
The Curse of dimensionality
• The essence of the curse is the realisation that as the number of
dimensions increases, the volume of the unit hypersphere does not
increase with it (see the sketch below).
• The curse of dimensionality will apply to our machine learning algorithms
because as the number of input dimensions gets larger, we will need more
data to enable the algorithm to generalize sufficiently well.
• ML algorithms try to separate data into classes based on the features;
therefore, as the number of features increases, so does the number of
datapoints we need.
• For this reason, we will often have to be careful about what information we
give to the algorithm, meaning that we need to understand something
about the data in advance.
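A short sketch illustrating the first point: the volume of the unit hypersphere is V_d = π^(d/2) / Γ(d/2 + 1), which heads towards zero as the dimension d grows, even though the enclosing hypercube [-1, 1]^d keeps growing as 2^d.

import math

# Volume of the unit hypersphere: V_d = pi**(d/2) / Gamma(d/2 + 1).
for d in (1, 2, 3, 5, 10, 20):
    v = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    print(d, round(v, 4))   # 2.0, 3.14, 4.19, ..., then shrinking towards 0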
Bias and Variance-bulls-eye diagram
• Bias is the difference between the
average prediction of our model and the
correct value which we are trying to
predict. A model with high bias pays very
little attention to the training data and
oversimplifies the model. It always leads
to high error on training and test data.
• Variance is the variability of model
prediction for a given data point, or a
value that tells us the spread of our data.
A model with high variance pays a lot of
attention to the training data and does not
generalize on data it hasn't
seen before. As a result, such models
perform very well on training data but have
high error rates on test data.
underfitting and overfitting
In supervised learning, underfitting:
• happens when a model is unable to capture the underlying
pattern of the data.
• These models usually have high bias and low variance.
• It happens when we have too little data to
build an accurate model, or when we try to fit a linear
model to nonlinear data.
• Such models are too simple to capture the
complex patterns in the data; examples include linear and logistic regression.
In supervised learning, overfitting:
• happens when our model captures the noise along with
the underlying pattern in the data.
• It happens when we train our model a lot over a noisy
dataset.
• These models have low bias and high variance.
• These models tend to be very complex, like decision trees, which
are prone to overfitting.
Learning Curve
• Graph that compares the performance of a model on training
and testing data over a varying number of training instances
• We should generally see performance improve as the
number of training points increases
• When we separate training and testing sets and graph them
individually, we can get an idea of how well the model can
generalize to new data.
• A learning curve (or training curve) plots the optimal value of
a model's loss function on the training set against the same loss
function evaluated on a validation dataset, using the parameters
that produced the optimal training loss.
• It is a tool to find out how much a machine learning model benefits
from adding more training data, and whether the estimator
suffers more from a variance error or a bias error.
• If both the validation score and the training score converge
to a value that is too low with increasing size of the training
set, it will not benefit much from more training data
• The curve is useful for many purposes, including comparing
different algorithms, choosing model parameters during
design, adjusting optimization to improve convergence, and
determining the amount of data used for training.
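A minimal sketch of computing a learning curve with Scikit-Learn; the estimator and dataset here are illustrative choices, not prescribed by the slides.

import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Training and validation scores for increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(sizes)
print(train_scores.mean(axis=1))  # training score per training-set size
print(val_scores.mean(axis=1))    # gap to the training score hints at variance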
Underfit Learning Curves
• The model does not have a suitable capacity for
the complexity of the dataset.
• Alternatively, the model is capable of further learning and
possible further improvements, and the
training process was halted prematurely.
Overfit Learning Curves
• The plot of training loss continues
to decrease with experience.
• The plot of validation loss
decreases to a point and begins
increasing again.
• The inflection point in validation
loss may be the point at which
training could be halted as
experience after that point shows
the dynamics of overfitting.
Good Fit Learning Curves
• The plot of training loss
decreases to a point of stability.
• The plot of validation loss
decreases to a point of stability
and has a small gap with the
training loss.
• Continued training of a good fit
will likely lead to an overfit.
Classification Error and noise
• For binary classification problems,
there are two primary types of
errors.
• Type 1 errors (false positives) -
rejection of a true null hypothesis
• Type 2 errors (false negatives)- the
non-rejection of a false null
hypothesis
• a true positive is an observation
correctly put into class 1, while a
false positive is an observation
incorrectly put into class 1.
The Confusion Matrix
• The confusion matrix is a nice
simple idea: make a square
matrix that contains all the
possible classes in both the
horizontal and vertical directions;
list the classes along the top
of the table as the predicted
outputs, and then down the left-
hand side as the targets.
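A minimal sketch with Scikit-Learn's confusion_matrix, which uses the same convention (rows are the targets, columns the predicted outputs); the labels below are made up.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[3 1]   row 0: actual negatives -> 3 TN, 1 FP
#  [1 3]]  row 1: actual positives -> 1 FN, 3 TP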
Measures
(The numbers below come from an example confusion matrix with TP = 100, FN = 5, FP = 10 and TN = 50, i.e. 165 instances in total: 105 actual yes, 60 actual no, and 110 predicted yes.)
• Accuracy: Overall, how often is the classifier correct? (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate: Overall, how often is it wrong? (FP+FN)/total = (10+5)/165 = 0.09; equivalent to 1 minus Accuracy; also known as "Error Rate"
• True Positive Rate: When it's actually yes, how often does it predict yes? TP/actual yes = 100/105 = 0.95; also known as "Sensitivity" or "Recall"
• False Positive Rate: When it's actually no, how often does it predict yes? FP/actual no = 10/60 = 0.17
• True Negative Rate: When it's actually no, how often does it predict no? TN/actual no = 50/60 = 0.83; equivalent to 1 minus False Positive Rate; also known as "Specificity"
• Precision: When it predicts yes, how often is it correct? TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample? actual yes/total = 105/165 = 0.64
linear regression
• In regression we fit a line to the data; this differs from classification
problems, where we find a line that
separates out the classes, so that they can
be distinguished.
• turn classification problems into regression
problems. This can be done in two ways
• first by introducing an indicator variable, which
simply says which class each datapoint belongs
to. The problem is now to use the data to
predict the indicator variable, which is a
regression problem.
• The second approach is to do repeated
regression, once for each class, with the
indicator value being 1 for examples in the class
and 0 for all of the others.
We are making a prediction about an unknown value y
(such as the indicator variable for classes or a
future value of some data) by computing some
function of known values xi. With a straight-line
model, the output y is a sum of the xi
values, each multiplied by a constant parameter: y = Σi βi xi.
Defining the line
• try to minimize the distance between each datapoint and the line that we fit.
• We can measure the distance between a point and a line by defining another line that
goes through the point and hits the line.
• Now, we can try to minimize an error function that measures the sum of all these
distances. Minimize the sum-of-squares of the errors-least-squares optimization.
• Choosing the parameters β in order to minimize the squared difference between the
prediction and the actual data value, summed over all of the datapoints: given the
matrix of input values Z (one row per datapoint), the prediction is Zβ.
(1) Prediction for the input matrix Z: ŷ = Zβ
(2) Sum-of-squares error: E(β) = (t − Zβ)ᵀ(t − Zβ), where t is the vector of targets
(3) Differentiating (2) with respect to β and equating to 0: Zᵀ(t − Zβ) = 0
(4) Solving for the parameters: β = (ZᵀZ)⁻¹Zᵀt
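A minimal NumPy sketch of equation (4), fitting a straight line to made-up data via the normal equation:

import numpy as np

# Least-squares fit via the normal equation: beta = (Z^T Z)^-1 Z^T t.
# Made-up 1D data; Z gets a column of ones so beta[0] is the intercept.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

Z = np.column_stack([np.ones_like(x), x])
beta = np.linalg.inv(Z.T @ Z) @ Z.T @ t   # in practice prefer np.linalg.lstsq
print(beta)                               # roughly [1.1, 2.0]: intercept, slope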
Support Vector Machines
• A Support Vector Machine (SVM) is a very powerful and
versatile Machine Learning model, capable of performing
linear or nonlinear classification, regression, and even outlier
detection.
• SVMs are particularly well suited for classification of complex
but small- or medium-sized datasets.
• The two classes can clearly be separated easily with a straight
line (they are linearly separable).
• The model whose decision boundary is represented by the
dashed line is so bad that it does not even separate the
classes properly. The other two models work perfectly on this
training set, but their decision boundaries come so close to
the instances that these models will probably not perform as
well on new instances.
• the decision boundary of an SVM classifier; this line not only
separates the two classes but also stays as far away from the
closest training instances as possible.
• You can think of an SVM classifier as fitting the widest possible
street (represented by the parallel dashed lines) between the
classes.
• This is called large margin classification.
SENSITIVE TO THE FEATURE SCALES AND SOFT MARGIN
CLASSIFICATION
• on the left plot, the vertical scale is much larger than the
horizontal scale, so the widest possible street is close to
horizontal.
• After feature scaling (e.g., using Scikit-Learn’s StandardScaler),
the decision boundary looks much better (on the right plot).
• If we strictly impose that all instances be off the street and on
the right side, this is called hard margin classification.
• There are two main issues with hard margin classification.
• First, it only works if the data is linearly separable, and second
it is quite sensitive to outliers.
• To avoid these issues it is preferable to use a more flexible
model. The objective is to find a good balance between
keeping the street as large as possible and limiting the margin
violations (i.e., instances that end up in the middle of the
street or even on the wrong side).
• This is called soft margin classification
Decision Function and Predictions
• The linear SVM classifier model predicts the class of
a new instance x by simply computing the decision
function wᵀ·x + b = w1 x1 + ⋯ + wn xn + b: if the
result is positive, the predicted class ŷ is the
positive class (1), or else it is the negative class (0).
• Here, the decision function is a two-dimensional plane,
since this dataset has two features (petal width and
petal length).
• The decision boundary is the set of points where the
decision function is equal to 0: it is the intersection of
two planes, which is a straight line (represented
by the thick solid line).
• The dashed lines represent the points where the decision function
is equal to 1 or –1:
they are parallel and at equal distance to the decision boundary,
forming a margin around it.
• Training a linear SVM classifier means finding the values of w and b
that make this margin as wide as possible while avoiding margin
violations (hard margin) or limiting them (soft margin).
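A tiny sketch of this decision rule; w, b and x below are made-up values, not the result of any actual training:

import numpy as np

# Linear SVM decision rule with illustrative parameters.
w = np.array([1.3, 3.1])
b = -3.8
x = np.array([2.0, 0.5])    # new instance (e.g., petal length, petal width)

score = w @ x + b           # decision function w^T x + b
y_hat = int(score >= 0)     # positive class (1) if the score is non-negative
print(score, y_hat)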
Training Objective
• Consider the slope of the decision function: it is
equal to the norm of the weight vector, ∥ w ∥.
• If we divide this slope by 2, the points where the
decision function is equal to ±1 are going to be
twice as far away from the decision boundary. In
other words, dividing the slope by 2 will multiply
the margin by 2.
• The smaller the weight vector w, the larger the
margin
• So we want to minimize ∥ w ∥ to get a large margin.
• However, if we also want to avoid any margin
violation (hard margin), then we need the decision
function to be greater than 1 for all positive training
instances, and lower than –1 for negative training
instances.
• If we define t(i) = –1 for negative instances (if y(i) =
0) and t(i) = 1 for positive instances (if y(i) = 1), then
we can express this constraint as t(i)(wT ・ x(i) + b)
≥ 1 for all instances.
• We can therefore express the hard margin linear
SVM classifier objective as the constrained
optimization problem: minimize ½ wᵀ·w (over w and b)
subject to t(i)(wᵀ·x(i) + b) ≥ 1 for all instances i.
soft margin objective and optimization problem
• To get the soft margin objective, we need to introduce a slack variable ζ(i) ≥ 0 for each instance: ζ(i) measures
how much the ith instance is allowed to violate the margin.
• We now have two conflicting objectives: making the slack variables as small as possible to reduce the margin
violations, and making ½ .wT ・ w as small as possible to increase the margin.
• This is where the C hyperparameter comes in: it allows us to define the trade-off between these two
objectives. This gives us the constrained optimization problem: minimize ½ wᵀ·w + C Σi ζ(i) (over w, b and ζ)
subject to t(i)(wᵀ·x(i) + b) ≥ 1 – ζ(i) and ζ(i) ≥ 0 for all instances i.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Soft margin linear SVM classifier; C trades margin width against violations.
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)   # X, y: training data; scaling is handled inside the pipeline
Nonlinear SVM Classification
• Although linear SVM classifiers are efficient and work
surprisingly well in many cases, many datasets are not
even close to being linearly separable.
• One approach to handling nonlinear datasets is to add
more features, such as polynomial features; in some
cases this can result in a linearly separable dataset.
• Consider the left plot in the figure: it represents a simple
dataset with just one feature x1.
• This dataset is not linearly separable, as you can see.
But if you add a second feature x2 = (x1)², the resulting
2D dataset is perfectly linearly separable.
• Adding polynomial features is simple to implement
and can work great with all sorts of Machine Learning
algorithms (not just SVMs),
• but at a low polynomial degree it cannot deal with
very complex datasets, and with a high polynomial
degree it creates a huge number of features, making
the model too slow.
kernel trick
• The kernel trick makes it possible to get the same result as if you
added many polynomial features, even with very high-degree
polynomials, without actually having to add them.
• So there is no combinatorial explosion of the number of features,
since you don't actually add any features.
• On the right is another SVM classifier using a 10th-degree
polynomial kernel.
• If your model is overfitting, you need to reduce the polynomial
degree. Conversely, if it is underfitting, you can try increasing it.
• The hyperparameter coef0 controls how much the model is
influenced by high-degree polynomials versus low-degree
polynomials.
• A common approach to find the right hyperparameter values is to
use grid search.
• It is often faster to first do a very coarse grid search, then a finer
grid search around the best values found.
• Having a good sense of what each hyperparameter actually does
can also help you search in the right part of the hyperparameter
space
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Explicit polynomial features followed by a linear SVM classifier.
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_svm_clf.fit(X, y)   # X, y: a nonlinear training set
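The kernel-trick version described above can be sketched with Scikit-Learn's SVC and a polynomial kernel; the degree, coef0 and C values here are illustrative choices.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Polynomial kernel instead of explicit polynomial features.
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)   # X, y: the same nonlinear training set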
Adding Similarity Features
• Another technique to tackle nonlinear problems is to add features computed using a
similarity function that measures how much each instance resembles a particular
landmark.
• For example, let’s take the one-dimensional dataset discussed earlier and add two
landmarks to it at x1 = –2 and x1 = 1 (see the left plot in Figure ).
• Let's define the similarity function to be the Gaussian Radial Basis Function (RBF)
with γ = 0.3: ϕγ(x, ℓ) = exp(–γ ∥x – ℓ∥²).
• It is a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at
the landmark). Now we are ready to compute the new features.
• For example, let's look at the instance x1 = –1: it is located at a distance of 1 from the
first landmark, and 2 from the second landmark.
• Therefore its new features are x2 = exp(–0.3 × 1²) ≈ 0.74 and x3 = exp(–0.3 × 2²) ≈
0.30. The plot on the right of the figure shows the transformed dataset (dropping the
original features).
• As you can see, it is now linearly separable.
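A quick NumPy check of these numbers:

import numpy as np

# Similarity features for the instance x1 = -1 with landmarks at -2 and 1.
gamma = 0.3
x1 = -1.0
landmarks = np.array([-2.0, 1.0])

features = np.exp(-gamma * (x1 - landmarks) ** 2)
print(features)   # [0.74..., 0.30...] -> x2 and x3 on the slide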
Gaussian RBF Kernel
• Just like the polynomial features
method, the similarity features
method can be useful with any
Machine Learning algorithm, but it
may be computationally expensive
to compute all the additional
features,
• especially on large training sets.
However, once again the kernel
trick does its SVM magic:
• it makes it possible to obtain a
similar result as if you had added
many similarity features, without
actually having to add them.
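A minimal sketch of an RBF-kernel classifier with Scikit-Learn; the gamma and C values are illustrative, and X, y are assumed to be the nonlinear training set from before.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Gaussian RBF kernel classifier: increasing gamma narrows the bell curves,
# making the decision boundary more irregular (risk of overfitting).
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)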
SVM Regression
• As we mentioned earlier, the SVM algorithm is
quite versatile: not only does it support linear and
nonlinear classification, but it also supports linear
and nonlinear regression.
• The trick is to reverse the objective: instead of
trying to fit the largest possible street between
two classes while limiting margin violations, SVM
Regression tries to fit as many instances as
possible on the street while limiting margin
violations (i.e., instances off the street).
• The width of the street is controlled by a
hyperparameter ϵ. Figure shows two linear SVM
Regression models trained on some random
linear data, one with a large margin (ϵ = 1.5) and
the other with a small margin (ϵ = 0.5).
• You can use Scikit-Learn’s LinearSVR class to
perform linear SVM Regression
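A minimal sketch with made-up, roughly linear data (ϵ = 1.5 as in the large-margin example above):

import numpy as np
from sklearn.svm import LinearSVR

# Linear SVM Regression on random linear data (y ≈ 4 + 3x + noise).
np.random.seed(42)
X = 2 * np.random.rand(50, 1)
y = (4 + 3 * X + np.random.randn(50, 1)).ravel()

svm_reg = LinearSVR(epsilon=1.5)    # epsilon controls the street width
svm_reg.fit(X, y)
print(svm_reg.predict([[1.5]]))     # roughly 4 + 3 * 1.5 = 8.5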
basics of neural network
Perceptrons
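Since the slide content is a figure, here is a minimal, illustrative perceptron sketch (not from the slides): a single threshold neuron trained with the perceptron learning rule on the AND function, which is linearly separable. The learning rate and epoch count are assumptions for the example.

import numpy as np

# Single perceptron trained with the perceptron rule on AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])                 # AND targets (linearly separable)

w = np.zeros(2)
b = 0.0
eta = 0.1                                  # learning rate

for epoch in range(10):
    for xi, ti in zip(X, t):
        y = int(w @ xi + b > 0)            # threshold activation
        w += eta * (ti - y) * xi           # perceptron weight update
        b += eta * (ti - y)

print([int(w @ xi + b > 0) for xi in X])   # [0, 0, 0, 1]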
LINEAR SEPARABILITY
Introduction to Multilayer Perceptrons