18ECE307J - APPLIED MACHINE
LEARNING
Unit 1
Introduction to Machine learning: Types of Machine Learning - Supervised
Learning, Unsupervised Learning, Reinforcement Learning, the Curse of
Dimensionality, Bias and Variance, Learning Curve, Classification Error and
Noise, Linear Regression, Support Vector Machines, Basics of Neural Networks,
Perceptrons, Linear Separability, and Introduction to Multilayer Perceptrons
Prepared by Dr. P. Vijayakumar, Associate Professor, ECE, SRM IST
Introduction to Machine learning
• Machine Learning is the science (and art) of programming computers so
they can learn from data
• Machine Learning is the field of study that gives computers the ability to
learn without being explicitly programmed.—Arthur Samuel, 1959
• A computer program is said to learn from experience E with respect to
some task T and some performance measure P, if its performance on T, as
measured by P, improves with experience E.-—Tom Mitchell, 1997
• Example: a spam filter is a Machine Learning program that can learn to flag
spam given examples of spam emails (e.g., flagged by users) and examples
of regular (non-spam, also called “ham”) emails.
More on spam filter ..
• The examples that the spam filter system uses to learn are called the
training set. Each training example is called a training instance (or
sample).
• In this case, the task T is to flag spam for new emails, the experience E
is the training data, and the performance measure P needs to be
defined; for example, you can use the ratio of correctly classified
emails.
• This particular performance measure is called accuracy and it is often
used in classification tasks.
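As a minimal illustration of this measure (not from the slides; the labels below are made up), accuracy is just the ratio of correctly classified emails:

# Accuracy = ratio of correctly classified emails (1 = spam, 0 = ham).
# The labels and predictions are invented for illustration.
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]   # classifier outputs

correct = sum(t == p for t, p in zip(y_true, y_pred))
print(correct / len(y_true))               # 8/10 = 0.8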
Traditional programming and machine learning
• Traditional approach: you end up writing a long list of complex
rules—pretty hard to maintain.
• Machine Learning approach: the program is much shorter, easier to
maintain, and most likely more accurate.
More on ML
• Machine Learning systems can automatically adapt to change.
• Machine Learning can help humans learn: for example, a trained spam filter
can be inspected to reveal the list of words and combinations of words that
it believes are the best predictors of spam.
• Applying ML techniques to dig into large amounts of data can help
discover patterns that were not immediately apparent. This is called
data mining.
When to use Machine Learning
• Problems for which existing solutions require a lot of hand-tuning or
long lists of rules: one Machine Learning algorithm can often simplify
code and perform better.
• Complex problems for which there is no good solution at all using a
traditional approach: the best Machine Learning techniques can find a
solution.
• Fluctuating environments: a Machine Learning system can adapt to
new data.
• Getting insights about complex problems and large amounts of data.
Types of Machine Learning
• Supervised learning: A training set of examples with the correct responses (targets) is provided and, based on
this training set, the algorithm generalises to respond correctly to all possible inputs. This is also called
learning from exemplars.
• Unsupervised learning: Correct responses are not provided, but instead the algorithm tries to identify
similarities between the inputs so that inputs that have something in common are categorised together. The
statistical approach to unsupervised learning is known as density estimation.
• Reinforcement learning: This is somewhere between supervised and unsupervised learning. The algorithm
gets told when the answer is wrong, but does not get told how to correct it. It has to explore and try out
different possibilities until it works out how to get the answer right. Reinforcement learning is sometimes
called learning with a critic because of this monitor that scores the answer, but does not suggest
improvements.
• Evolutionary learning: Biological evolution can be seen as a learning process: biological organisms adapt to
improve their survival rates and chance of having offspring in their environment. We’ll look at how we can
model this in a computer, using an idea of fitness, which corresponds to a score for how good the current
solution is.
Supervised Learning
• In supervised learning, the training data you
feed to the algorithm includes the desired
solutions, called labels
• A typical supervised learning task is
classification. The spam filter is a good
example of this: it is trained with many
example emails along with their class (spam
or ham), and it must learn how to classify new
emails.
• Another typical task is to predict a target
numeric value, such as the price of a car,
given a set of features (mileage, age, brand,
etc.) called predictors. This sort of task is
called regression
• To train the system, you need to give it many
examples of cars, including both their
predictors and their labels
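A minimal sketch of such a regression task (the car data below is entirely made up for illustration; Scikit-Learn's LinearRegression is used as the learner):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training set: predictors (mileage in 1000 km, age in years)
# and labels (price). All numbers are invented for illustration.
X = np.array([[20, 1], [50, 3], [80, 5], [120, 8], [150, 10]])
y = np.array([8.5, 7.0, 5.5, 3.5, 2.5])

reg = LinearRegression().fit(X, y)
print(reg.predict([[60, 4]]))   # predicted price for an unseen car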
supervised learning algorithms
• k-Nearest Neighbors
• Linear Regression
• Logistic Regression
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
Unsupervised Learning
• In unsupervised learning, as you might guess, the training
data is unlabeled
• Unsupervised learning algorithms
• Clustering
  — k-Means
  — Hierarchical Cluster Analysis (HCA)
  — Expectation Maximization
• Visualization and dimensionality reduction
  — Principal Component Analysis (PCA)
  — Kernel PCA
  — Locally Linear Embedding (LLE)
  — t-distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
  — Apriori
  — Eclat
• Clustering: the algorithm tries to detect groups of similar instances (a short k-Means sketch follows these bullets).
• Visualization: the input is a lot of complex and unlabeled data; the output is a 2D or 3D
representation of the data that can easily be plotted.
• These algorithms try to preserve as much structure as they can
(e.g., trying to keep separate clusters in the input space from
overlapping in the visualization), so you can understand how
the data is organized and perhaps identify unsuspected
patterns.
• Dimensionality reduction: the goal is to simplify the data without
losing too much information. One way to do this is to merge
several correlated features into one (feature extraction).
• Association rule learning: the goal is to dig into large amounts of
data and discover interesting relations between attributes.
• For example, suppose you own a supermarket. Running an
association rule on your sales logs may reveal that people who
purchase barbecue sauce and potato chips also tend to buy
steak. Thus, you may want to place these items close to each
other.
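A minimal clustering sketch (the 2D points are made up; k-Means from Scikit-Learn is one of the algorithms listed above):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose blobs in 2D, invented for illustration.
X = np.array([[1.0, 1.1], [0.8, 1.3], [1.2, 0.9],
              [5.0, 5.2], [5.3, 4.8], [4.9, 5.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # learned group centres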
Reinforcement learning
• The learning system, called an
agent in this context, can observe
the environment, select and
perform actions, and get rewards
in return or penalties in the form of
negative rewards.
• It must then learn by itself what is
the best strategy, called a policy, to
get the most reward over time.
• A policy defines what action the
agent should choose when it is in a
given situation.
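As a minimal, illustrative sketch of this reward-driven loop (not from the slides): an epsilon-greedy agent on a 3-armed bandit, where the action-value estimates determine the policy. The reward means and the epsilon value are assumptions made for the example.

import random

# Toy environment: 3 actions with different (hidden) average rewards.
true_means = [0.2, 0.5, 0.8]            # unknown to the agent
def step(action):                        # environment returns a noisy reward
    return true_means[action] + random.gauss(0, 0.1)

# Agent: epsilon-greedy policy over estimated action values.
q = [0.0, 0.0, 0.0]                      # estimated reward per action
counts = [0, 0, 0]
epsilon = 0.1

for t in range(1000):
    if random.random() < epsilon:        # explore
        a = random.randrange(3)
    else:                                # exploit the current best estimate
        a = q.index(max(q))
    r = step(a)                          # reward (or penalty) from the environment
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]       # incremental average update

print(q)   # estimates approach the true means; best action is index 2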
THE MACHINE LEARNING PROCESS
• Data Collection and Preparation: Machine learning algorithms need significant amounts of data, preferably
without too much noise, but with increased dataset size comes increased computational costs, and the sweet
spot at which there is enough data without excessive computational overhead is generally impossible to
predict.
• Feature Selection: This consists of identifying the features that are most useful for the problem under
examination. It invariably requires prior knowledge of the problem and the data; common sense is often
used to identify some potentially useful features and to exclude others.
• Algorithm Choice: Given the dataset, an appropriate algorithm (or algorithms) has to be chosen.
• Parameter and Model Selection: For many of the algorithms there are parameters that have to be set manually,
or that require experimentation to identify appropriate values.
• Training: Training should simply be the use of computational resources in order to build a model of the data.
• Evaluation:Before a system can be deployed it needs to be tested and evaluated for accuracy on data that it
was not trained on. This can often include a comparison with human experts in the field, and the selection of
appropriate metrics for this comparison.
The Curse of dimensionality
• The essence of the curse is the realisation that as the number of
dimensions increases, the volume of the unit hypersphere does not
increase with it (see the sketch below).
• The curse of dimensionality will apply to our machine learning algorithms
because as the number of input dimensions gets larger, we will need more
data to enable the algorithm to generalize sufficiently well.
• ML algorithms try to separate data into classes based on the features;
therefore, as the number of features increases, so does the number of
datapoints we need.
• For this reason, we will often have to be careful about what information we
give to the algorithm, meaning that we need to understand something
about the data in advance.
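A short sketch illustrating the first point: the volume of the unit hypersphere is V_d = π^(d/2) / Γ(d/2 + 1), which heads towards zero as the dimension d grows, even though the enclosing hypercube [-1, 1]^d keeps growing as 2^d.

import math

# Volume of the unit hypersphere: V_d = pi**(d/2) / Gamma(d/2 + 1).
for d in (1, 2, 3, 5, 10, 20):
    v = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    print(d, round(v, 4))   # 2.0, 3.14, 4.19, ..., then shrinking towards 0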
Bias and Variance-bulls-eye diagram
• Bias is the difference between the
average prediction of our model and the
correct value which we are trying to
predict. A model with high bias pays very
little attention to the training data and
oversimplifies the model. It always leads
to high error on training and test data.
• Variance is the variability of model
prediction for a given data point, or a
value that tells us the spread of our data.
A model with high variance pays a lot of
attention to the training data and does not
generalize on data it hasn't
seen before. As a result, such models
perform very well on training data but have
high error rates on test data.
underfitting and overfitting
In supervised learning, underfitting:
• happens when a model is unable to capture the underlying
pattern of the data.
• These models usually have high bias and low variance.
• It happens when we have too little data to
build an accurate model, or when we try to fit a linear
model to nonlinear data.
• Such models are too simple to capture the
complex patterns in the data; examples include linear and logistic regression.
In supervised learning, overfitting:
• happens when our model captures the noise along with
the underlying pattern in the data.
• It happens when we train our model a lot over a noisy
dataset.
• These models have low bias and high variance.
• These models tend to be very complex, like decision trees, which
are prone to overfitting.
Learning Curve
• Graph that compares the performance of a model on training
and testing data over a varying number of training instances
• We should generally see performance improve as the
number of training points increases
• When we separate training and testing sets and graph them
individually, we can get an idea of how well the model can
generalize to new data.
• A learning curve (or training curve) plots the optimal value of
a model's loss function on the training set against the same loss
function evaluated on a validation dataset, using the parameters
that produced the optimal training loss.
• It is a tool to find out how much a machine learning model benefits
from adding more training data, and whether the estimator
suffers more from a variance error or a bias error.
• If both the validation score and the training score converge
to a value that is too low with increasing size of the training
set, it will not benefit much from more training data
• The curve is useful for many purposes, including comparing
different algorithms, choosing model parameters during
design, adjusting optimization to improve convergence, and
determining the amount of data used for training.
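A minimal sketch of computing a learning curve with Scikit-Learn; the estimator and dataset here are illustrative choices, not prescribed by the slides.

import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Training and validation scores for increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(sizes)
print(train_scores.mean(axis=1))  # training score per training-set size
print(val_scores.mean(axis=1))    # gap to the training score hints at variance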
Underfit Learning Curves
• The model does not have a suitable capacity for
the complexity of the dataset.
• Alternatively, the model is capable of further learning and
possible further improvements, and the
training process was halted prematurely.
Overfit Learning Curves
• The plot of training loss continues
to decrease with experience.
• The plot of validation loss
decreases to a point and begins
increasing again.
• The inflection point in validation
loss may be the point at which
training could be halted as
experience after that point shows
the dynamics of overfitting.
Good Fit Learning Curves
• The plot of training loss
decreases to a point of stability.
• The plot of validation loss
decreases to a point of stability
and has a small gap with the
training loss.
• Continued training of a good fit
will likely lead to an overfit.
Classification Error and noise
• For binary classification problems,
there are two primary types of
errors.
• Type 1 errors (false positives) -
rejection of a true null hypothesis
• Type 2 errors (false negatives)- the
non-rejection of a false null
hypothesis
• a true positive is an observation
correctly put into class 1, while a
false positive is an observation
incorrectly put into class 1.
The Confusion Matrix
• The confusion matrix is a nice
simple idea: make a square
matrix that contains all the
possible classes in both the
horizontal and vertical directions;
list the classes along the top
of the table as the predicted
outputs, and then down the left-
hand side as the targets.
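A minimal sketch with Scikit-Learn's confusion_matrix, which uses the same convention (rows are the targets, columns the predicted outputs); the labels below are made up.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[3 1]   row 0: actual negatives -> 3 TN, 1 FP
#  [1 3]]  row 1: actual positives -> 1 FN, 3 TP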
Measures
(The numbers below come from an example confusion matrix with TP = 100, FN = 5, FP = 10 and TN = 50, i.e. 165 instances in total: 105 actual yes, 60 actual no, and 110 predicted yes.)
• Accuracy: Overall, how often is the classifier correct? (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate: Overall, how often is it wrong? (FP+FN)/total = (10+5)/165 = 0.09; equivalent to 1 minus Accuracy; also known as "Error Rate"
• True Positive Rate: When it's actually yes, how often does it predict yes? TP/actual yes = 100/105 = 0.95; also known as "Sensitivity" or "Recall"
• False Positive Rate: When it's actually no, how often does it predict yes? FP/actual no = 10/60 = 0.17
• True Negative Rate: When it's actually no, how often does it predict no? TN/actual no = 50/60 = 0.83; equivalent to 1 minus False Positive Rate; also known as "Specificity"
• Precision: When it predicts yes, how often is it correct? TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample? actual yes/total = 105/165 = 0.64
linear regression
• In regression we fit a line to the data; this differs from classification
problems, where we find a line that
separates out the classes, so that they can
be distinguished.
• turn classification problems into regression
problems. This can be done in two ways
• first by introducing an indicator variable, which
simply says which class each datapoint belongs
to. The problem is now to use the data to
predict the indicator variable, which is a
regression problem.
• The second approach is to do repeated
regression, once for each class, with the
indicator value being 1 for examples in the class
and 0 for all of the others.
We are making a prediction about an unknown value y
(such as the indicator variable for classes or a
future value of some data) by computing some
function of known values xi. With a straight-line
model, the output y is a sum of the xi
values, each multiplied by a constant parameter: y = Σi βi xi.
Defining the line
• try to minimize the distance between each datapoint and the line that we fit.
• We can measure the distance between a point and a line by defining another line that
goes through the point and hits the line.
• Now, we can try to minimize an error function that measures the sum of all these
distances. Minimize the sum-of-squares of the errors-least-squares optimization.
• Choosing the parameters β in order to minimize the squared difference between the
prediction and the actual data value, summed over all of the datapoints: given the
matrix of input values Z (one row per datapoint), the prediction is Zβ.
(1) Prediction for the input matrix Z: ŷ = Zβ
(2) Sum-of-squares error: E(β) = (t − Zβ)ᵀ(t − Zβ), where t is the vector of targets
(3) Differentiating (2) with respect to β and equating to 0: Zᵀ(t − Zβ) = 0
(4) Solving for the parameters: β = (ZᵀZ)⁻¹Zᵀt
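A minimal NumPy sketch of equation (4), fitting a straight line to made-up data via the normal equation:

import numpy as np

# Least-squares fit via the normal equation: beta = (Z^T Z)^-1 Z^T t.
# Made-up 1D data; Z gets a column of ones so beta[0] is the intercept.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

Z = np.column_stack([np.ones_like(x), x])
beta = np.linalg.inv(Z.T @ Z) @ Z.T @ t   # in practice prefer np.linalg.lstsq
print(beta)                               # roughly [1.1, 2.0]: intercept, slope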
Support Vector Machines
• A Support Vector Machine (SVM) is a very powerful and
versatile Machine Learning model, capable of performing
linear or nonlinear classification, regression, and even outlier
detection.
• SVMs are particularly well suited for classification of complex
but small- or medium-sized datasets.
• The two classes can clearly be separated easily with a straight
line (they are linearly separable).
• The model whose decision boundary is represented by the
dashed line is so bad that it does not even separate the
classes properly. The other two models work perfectly on this
training set, but their decision boundaries come so close to
the instances that these models will probably not perform as
well on new instances.
• the decision boundary of an SVM classifier; this line not only
separates the two classes but also stays as far away from the
closest training instances as possible.
• You can think of an SVM classifier as fitting the widest possible
street (represented by the parallel dashed lines) between the
classes.
• This is called large margin classification.
SENSITIVE TO THE FEATURE SCALES AND SOFT MARGIN
CLASSIFICATION
• on the left plot, the vertical scale is much larger than the
horizontal scale, so the widest possible street is close to
horizontal.
• After feature scaling (e.g., using Scikit-Learn’s StandardScaler),
the decision boundary looks much better (on the right plot).
• If we strictly impose that all instances be off the street and on
the right side, this is called hard margin classification.
• There are two main issues with hard margin classification.
• First, it only works if the data is linearly separable, and second
it is quite sensitive to outliers.
• To avoid these issues it is preferable to use a more flexible
model. The objective is to find a good balance between
keeping the street as large as possible and limiting the margin
violations (i.e., instances that end up in the middle of the
street or even on the wrong side).
• This is called soft margin classification
Decision Function and Predictions
• The linear SVM classifier model predicts the class of
a new instance x by simply computing the decision
function wᵀ·x + b = w1 x1 + ⋯ + wn xn + b: if the
result is positive, the predicted class ŷ is the
positive class (1), or else it is the negative class (0).
• Here, the decision function is a two-dimensional plane,
since this dataset has two features (petal width and
petal length).
• The decision boundary is the set of points where the
decision function is equal to 0: it is the intersection of
two planes, which is a straight line (represented
by the thick solid line).
• The dashed lines represent the points where the decision function
is equal to 1 or –1:
they are parallel and at equal distance to the decision boundary,
forming a margin around it.
• Training a linear SVM classifier means finding the values of w and b
that make this margin as wide as possible while avoiding margin
violations (hard margin) or limiting them (soft margin).
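A tiny sketch of this decision rule; w, b and x below are made-up values, not the result of any actual training:

import numpy as np

# Linear SVM decision rule with illustrative parameters.
w = np.array([1.3, 3.1])
b = -3.8
x = np.array([2.0, 0.5])    # new instance (e.g., petal length, petal width)

score = w @ x + b           # decision function w^T x + b
y_hat = int(score >= 0)     # positive class (1) if the score is non-negative
print(score, y_hat)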
Training Objective
• Consider the slope of the decision function: it is
equal to the norm of the weight vector, ∥ w ∥.
• If we divide this slope by 2, the points where the
decision function is equal to ±1 are going to be
twice as far away from the decision boundary. In
other words, dividing the slope by 2 will multiply
the margin by 2.
• The smaller the weight vector w, the larger the
margin
• So we want to minimize ∥ w ∥ to get a large margin.
• However, if we also want to avoid any margin
violation (hard margin), then we need the decision
function to be greater than 1 for all positive training
instances, and lower than –1 for negative training
instances.
• If we define t(i) = –1 for negative instances (if y(i) =
0) and t(i) = 1 for positive instances (if y(i) = 1), then
we can express this constraint as t(i)(wT ・ x(i) + b)
≥ 1 for all instances.
• We can therefore express the hard margin linear
SVM classifier objective as the constrained
optimization problem: minimize ½ wᵀ·w (over w and b)
subject to t(i)(wᵀ·x(i) + b) ≥ 1 for all instances i.
soft margin objective and optimization problem
• To get the soft margin objective, we need to introduce a slack variable ζ(i) ≥ 0 for each instance: ζ(i) measures
how much the ith instance is allowed to violate the margin.
• We now have two conflicting objectives: making the slack variables as small as possible to reduce the margin
violations, and making ½ .wT ・ w as small as possible to increase the margin.
• This is where the C hyperparameter comes in: it allows us to define the trade-off between these two
objectives. This gives us the constrained optimization problem: minimize ½ wᵀ·w + C Σi ζ(i) (over w, b and ζ)
subject to t(i)(wᵀ·x(i) + b) ≥ 1 – ζ(i) and ζ(i) ≥ 0 for all instances i.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Soft margin linear SVM classifier; C trades margin width against violations.
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)   # X, y: training data; scaling is handled inside the pipeline
Nonlinear SVM Classification
• Although linear SVM classifiers are efficient and work
surprisingly well in many cases, many datasets are not
even close to being linearly separable.
• One approach to handling nonlinear datasets is to add
more features, such as polynomial features; in some
cases this can result in a linearly separable dataset.
• Consider the left plot in the figure: it represents a simple
dataset with just one feature x1.
• This dataset is not linearly separable, as you can see.
But if you add a second feature x2 = (x1)², the resulting
2D dataset is perfectly linearly separable.
• Adding polynomial features is simple to implement
and can work great with all sorts of Machine Learning
algorithms (not just SVMs),
• but at a low polynomial degree it cannot deal with
very complex datasets, and with a high polynomial
degree it creates a huge number of features, making
the model too slow.
kernel trick
• The kernel trick makes it possible to get the same result as if you
added many polynomial features, even with very high-degree
polynomials, without actually having to add them.
• So there is no combinatorial explosion of the number of features,
since you don't actually add any features.
• On the right is another SVM classifier using a 10th-degree
polynomial kernel.
• If your model is overfitting, you need to reduce the polynomial
degree. Conversely, if it is underfitting, you can try increasing it.
• The hyperparameter coef0 controls how much the model is
influenced by high-degree polynomials versus low-degree
polynomials.
• A common approach to find the right hyperparameter values is to
use grid search.
• It is often faster to first do a very coarse grid search, then a finer
grid search around the best values found.
• Having a good sense of what each hyperparameter actually does
can also help you search in the right part of the hyperparameter
space
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Explicit polynomial features followed by a linear SVM classifier.
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_svm_clf.fit(X, y)   # X, y: a nonlinear training set
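The kernel-trick version described above can be sketched with Scikit-Learn's SVC and a polynomial kernel; the degree, coef0 and C values here are illustrative choices.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Polynomial kernel instead of explicit polynomial features.
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)   # X, y: the same nonlinear training set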
Adding Similarity Features
• Another technique to tackle nonlinear problems is to add features computed using a
similarity function that measures how much each instance resembles a particular
landmark.
• For example, let’s take the one-dimensional dataset discussed earlier and add two
landmarks to it at x1 = –2 and x1 = 1 (see the left plot in Figure ).
• Let's define the similarity function to be the Gaussian Radial Basis Function (RBF)
with γ = 0.3: ϕγ(x, ℓ) = exp(–γ ∥x – ℓ∥²).
• It is a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at
the landmark). Now we are ready to compute the new features.
• For example, let's look at the instance x1 = –1: it is located at a distance of 1 from the
first landmark, and 2 from the second landmark.
• Therefore its new features are x2 = exp(–0.3 × 1²) ≈ 0.74 and x3 = exp(–0.3 × 2²) ≈
0.30. The plot on the right of the figure shows the transformed dataset (dropping the
original features).
• As you can see, it is now linearly separable.
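A quick NumPy check of these numbers:

import numpy as np

# Similarity features for the instance x1 = -1 with landmarks at -2 and 1.
gamma = 0.3
x1 = -1.0
landmarks = np.array([-2.0, 1.0])

features = np.exp(-gamma * (x1 - landmarks) ** 2)
print(features)   # [0.74..., 0.30...] -> x2 and x3 on the slide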
Gaussian RBF Kernel
• Just like the polynomial features
method, the similarity features
method can be useful with any
Machine Learning algorithm, but it
may be computationally expensive
to compute all the additional
features,
• especially on large training sets.
However, once again the kernel
trick does its SVM magic:
• it makes it possible to obtain a
similar result as if you had added
many similarity features, without
actually having to add them.
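A minimal sketch of an RBF-kernel classifier with Scikit-Learn; the gamma and C values are illustrative, and X, y are assumed to be the nonlinear training set from before.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Gaussian RBF kernel classifier: increasing gamma narrows the bell curves,
# making the decision boundary more irregular (risk of overfitting).
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)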
SVM Regression
• As we mentioned earlier, the SVM algorithm is
quite versatile: not only does it support linear and
nonlinear classification, but it also supports linear
and nonlinear regression.
• The trick is to reverse the objective: instead of
trying to fit the largest possible street between
two classes while limiting margin violations, SVM
Regression tries to fit as many instances as
possible on the street while limiting margin
violations (i.e., instances off the street).
• The width of the street is controlled by a
hyperparameter ϵ. Figure shows two linear SVM
Regression models trained on some random
linear data, one with a large margin (ϵ = 1.5) and
the other with a small margin (ϵ = 0.5).
• You can use Scikit-Learn’s LinearSVR class to
perform linear SVM Regression
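A minimal sketch with made-up, roughly linear data (ϵ = 1.5 as in the large-margin example above):

import numpy as np
from sklearn.svm import LinearSVR

# Linear SVM Regression on random linear data (y ≈ 4 + 3x + noise).
np.random.seed(42)
X = 2 * np.random.rand(50, 1)
y = (4 + 3 * X + np.random.randn(50, 1)).ravel()

svm_reg = LinearSVR(epsilon=1.5)    # epsilon controls the street width
svm_reg.fit(X, y)
print(svm_reg.predict([[1.5]]))     # roughly 4 + 3 * 1.5 = 8.5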
basics of neural network
Perceptrons
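Since the slide content is a figure, here is a minimal, illustrative perceptron sketch (not from the slides): a single threshold neuron trained with the perceptron learning rule on the AND function, which is linearly separable. The learning rate and epoch count are assumptions for the example.

import numpy as np

# Single perceptron trained with the perceptron rule on AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])                 # AND targets (linearly separable)

w = np.zeros(2)
b = 0.0
eta = 0.1                                  # learning rate

for epoch in range(10):
    for xi, ti in zip(X, t):
        y = int(w @ xi + b > 0)            # threshold activation
        w += eta * (ti - y) * xi           # perceptron weight update
        b += eta * (ti - y)

print([int(w @ xi + b > 0) for xi in X])   # [0, 0, 0, 1]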
LINEAR SEPARABILITY
Introduction to Multilayer Perceptrons