Regression
Machine Learning Perspective
BY:
Suyash pratap Singh and Vineet raj Parashar
JUET,GUNA
A good understanding of how regression works can help you:
• Choose the appropriate model
• Choose the right training algorithm to use
• Find a good set of hyperparameters for your task
• Debug issues and perform error analysis more efficiently
• Lastly, most of the topics discussed will be essential in understanding,
building, and training neural networks
Agenda
• Linear Regression model
• Closed-form equation
• Normal Equation
• Iterative optimization approach
• Gradient Descent
• Variants
Polynomial Regression
• More complex model that can fit nonlinear datasets.
• It has more parameters than Linear Regression so it is more prone to
overfitting the training data
• How to detect overfitting
• Learning curves
Regularization and Logistic Regression
• Regularization methods
• Techniques that can reduce the risk of overfitting the training set
• Finally, we will look at two more models that are commonly used for
classification tasks:
• Logistic Regression
• Softmax Regression.
Prerequisite: Linear algebra and calculus
• Vectors
• Matrices
• Operations
• Transpose
• Dot product
• Matrix inverse
• Partial derivatives
Linear Regression
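• Simple regression model of life satisfaction:
$$\text{life\_satisfaction} = \theta_0 + \theta_1 \times \text{GDP\_per\_capita}$$
• GDP_per_capita is the input feature
• This model is just a linear function of the input feature
• θ0 and θ1 are the model’s parameters
• More generally, a linear model makes a prediction by simply computing a
weighted sum of the input features, plus a constant called the bias term
Linear Regression model prediction
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
• ŷ is the predicted value, n is the number of features, xi is the ith feature value, and θj is the jth model parameter (θ0 is the bias term)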
Linear Regression: Vectorized Form
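• This can be written much more concisely using a vectorized form, as
shown below:
$$\hat{y} = h_\theta(x) = \theta \cdot x$$
• θ is the parameter vector (θ0 to θn), x is the instance's feature vector (x0 to xn, with x0 always equal to 1), and hθ is the hypothesis function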
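As a small illustration of the vectorized form, the sketch below computes one prediction both as an explicit weighted sum and as a single dot product. The parameter and feature values are made up for the example and are not taken from the slides.

```python
import numpy as np

# Hypothetical parameter vector: bias theta_0 followed by two feature weights.
theta = np.array([4.0, 3.0, -1.0])

# One instance with two features; x_0 = 1 is prepended for the bias term.
x = np.array([1.0, 2.0, 5.0])

# Weighted sum written out term by term.
y_hat_sum = sum(theta_j * x_j for theta_j, x_j in zip(theta, x))

# The same prediction as a single dot product (the vectorized form).
y_hat_dot = theta @ x

print(y_hat_sum, y_hat_dot)
```

Both expressions give the same number; the dot product is simply shorter and lets NumPy do the loop.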
Performance Measures
• The equation above is the Linear Regression model
• The question now is: how do we train this model?
• Training a model means finding the values of its parameters that make the model best fit
the training set.
• What does "best" mean here?
• To answer that, we first need a measure of how well (or poorly) the model fits the
training data
• The most common performance measure of a regression model is the Root Mean Square
Error (RMSE)
• Therefore, to train a Linear Regression model, you need to find the value of θ that
minimizes the RMSE.
• RMSE is more sensitive to outliers than the Mean Absolute Error (MAE), because errors are squared before they are averaged
RMSE
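• RMSE is the square root of the Mean Squared Error (MSE), so computing it requires the MSE first
• The MSE of a Linear Regression hypothesis hθ on a training set X is
calculated using
$$\mathrm{MSE}(X, h_\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(\theta^T x^{(i)} - y^{(i)}\right)^2$$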
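Mean Absolute Error
$$\mathrm{MAE}(X, h_\theta) = \frac{1}{m} \sum_{i=1}^{m} \left| h_\theta(x^{(i)}) - y^{(i)} \right|$$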
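To make the difference between the two measures concrete, here is a minimal NumPy sketch. The target values and the two sets of predictions are invented for the example: both have the same MAE, but one concentrates all the error in a single outlier.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return np.mean((y_pred - y_true) ** 2)

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean of the absolute prediction errors."""
    return np.mean(np.abs(y_pred - y_true))

y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred_even = y_true + np.array([2, -2, 2, -2])    # every error has size 2
y_pred_outlier = y_true + np.array([0, 0, 0, 8])   # one single large error

print("even errors:", mae(y_true, y_pred_even), rmse(y_true, y_pred_even))
print("one outlier:", mae(y_true, y_pred_outlier), rmse(y_true, y_pred_outlier))
```

The MAE is identical in both cases, while the RMSE is larger when the error sits in one outlier.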
Notations
Age   Pclass   Gender   Fare
67    1        1        1200
54    2        1        1000
8     3        0        600
18    3        1        500
2     1        0        650
• m is the number of instances in the dataset we are measuring the RMSE on
• x(i) is a vector of all the feature values (excluding the label) of the ith instance in the dataset, and
y(i) is its label (the desired output value for that instance)
• For instance, the first passenger in the dataset, “Alan”, is 67 years old, travelled in 1st class, has gender code 1, and paid a fare of 1200 dollars
$$x^{(1)} = \begin{pmatrix} 67 \\ 1 \\ 1 \end{pmatrix}$$
• y(1) = 1200
Vectorized Dataset
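$$X = \begin{pmatrix} 67 & 1 & 1 \\ 54 & 2 & 1 \\ 8 & 3 & 0 \\ 18 & 3 & 1 \\ 2 & 1 & 0 \end{pmatrix}, \qquad y = \begin{pmatrix} 1200 \\ 1000 \\ 600 \\ 500 \\ 650 \end{pmatrix}$$
• Each row of X holds the feature values (age, passenger class, gender) of one instance, and y stacks the corresponding labels (fares)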
Prediction error
• For example, if your system predicts that the fare for the first
passenger is $1110, then ŷ(1) = h(x(1)) = $1110.
• The prediction error for this passenger is:
ŷ(1) – y(1) = $1110 – $1200 = –$90
The Normal Equation
• To find the value of θ that minimizes the cost function, there is a
closed-form solution
• In other words, a mathematical equation that gives the result directly.
This is called the Normal Equation:
$$\hat{\theta} = (X^T X)^{-1} X^T y$$
• θ̂ is the value of θ that minimizes the cost function.
• y is the vector of target values containing y(1) to y(m).
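The Normal Equation can be sketched in a few lines of NumPy. The synthetic data below (y = 4 + 3x plus Gaussian noise) is an assumption borrowed from the implementation slides later in the deck; `np.linalg.lstsq` is used only as an independent cross-check.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100                                   # number of training instances
X = 2 * rng.random((m, 1))                # one feature, uniform in [0, 2)
y = 4 + 3 * X + rng.normal(size=(m, 1))   # linear target plus Gaussian noise

X_b = np.c_[np.ones((m, 1)), X]           # prepend x0 = 1 to every instance

# Normal Equation: theta_hat = (X^T X)^-1 X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Cross-check with NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_hat.ravel(), theta_lstsq.ravel())
```

Because of the noise, the recovered parameters are close to, but not exactly, 4 and 3.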
Training Complexity
• The Normal Equation computes the inverse of XᵀX, which is an (n +
1) × (n + 1) matrix (where n is the number of features). The
computational complexity of inverting such a matrix is typically about
O(n^2.4) to O(n^3) (depending on the implementation).
• In other words, if you double the number of features, you multiply
the computation time by roughly 2^2.4 ≈ 5.3 to 2^3 = 8.
• The SVD approach used by Scikit-Learn’s LinearRegression class is
about O(n^2).
• If you double the number of features, you multiply the computation
time by roughly 4.
When to use Normal Equation?
• Both the Normal Equation and the SVD approach get very slow when
the number of features grows large (e.g., 100,000).
• On the positive side, both are linear with regards to the number of
instances in the training set (they are O(m)), so they handle large
training sets efficiently, provided they can fit in memory.
Prediction Complexity
• Also, once you have trained your Linear Regression model (using the
Normal Equation or any other algorithm), predictions are very fast
• The computational complexity is linear with regards to both the
number of instances you want to make predictions on and the
number of features.
• In other words, making predictions on twice as many instances (or
twice as many features) will just take roughly twice as much time.
When not to use Normal Equation:
• Cases where there are a large number of features
• Too many training instances to fit in memory
• Now we have to look at very different ways to train a Linear
Regression model
Gradient Descent
• Gradient Descent is a very generic optimization algorithm capable of
finding optimal solutions to a wide range of problems.
• The general idea of Gradient Descent is to tweak parameters
iteratively in order to minimize a cost function.
• How does it work?
• It measures the local gradient of the error function with regards to the
parameter vector θ, and it goes in the direction of descending gradient. Once
the gradient is zero, you have reached a minimum!
• Concretely, you start by filling θ with random values (this is called random initialization), and
then you improve it gradually, taking one baby step at a time, each step attempting to
decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum
Cost Function
Learning Rate: Too small
Learning Rate: Too large
GD Pitfalls
• Finally, not all cost functions look like nice regular bowls. There may
be holes, ridges, plateaus, and all sorts of irregular terrains, making
convergence to the minimum very difficult.
Global minimum
• Fortunately, the MSE cost function for a Linear Regression model
happens to be a convex function, which means that if you pick any
two points on the curve, the line segment joining them never lies below
the curve.
• This implies that there are no local minima, just one global minimum.
It is also a continuous function with a slope that never changes
abruptly.
• These two facts have a great consequence: Gradient Descent is
guaranteed to approach arbitrarily close to the global minimum (if you
wait long enough and if the learning rate is not too high).
Importance of Feature Scaling
• The cost function has the shape of a bowl, but it can be an elongated bowl if the features have very
different scales.
• The figure shows Gradient Descent on a training set where features 1 and 2 have the same scale (on the left),
and on a training set where feature 1 has much smaller values than feature 2 (on the right).
• When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using
Scikit-Learn’s StandardScaler class), or else it will take much longer to converge.
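A minimal NumPy sketch of standardization, the transformation StandardScaler applies: subtract each column's mean, then divide by its standard deviation. It is applied here to the age and passenger-class columns of the small table from the Notations slide; choosing these two columns is only an illustration.

```python
import numpy as np

# Age and passenger class of the five passengers: two very different scales.
X = np.array([[67, 1],
              [54, 2],
              [8, 3],
              [18, 3],
              [2, 1]], dtype=float)

mean = X.mean(axis=0)          # per-feature mean
std = X.std(axis=0)            # per-feature (population) standard deviation
X_scaled = (X - mean) / std    # every column now has mean 0 and std 1

print(X_scaled.round(2))
```

In a real project the mean and standard deviation should be computed on the training set only and then reused for validation and test data.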
Model’s parameter space
• The diagram above also illustrates the fact that training a model means
searching for a combination of model parameters that minimizes a
cost function (over the training set).
• It is a search in the model’s parameter space: the more parameters a
model has, the more dimensions this space has, and the harder the
search is: searching for a needle in a 300-dimensional haystack is
much trickier than in three dimensions.
• Fortunately, since the cost function is convex in the case of Linear
Regression, the needle is simply at the bottom of the bowl.
Batch Gradient Descent
• To implement Gradient Descent, you need to compute the gradient of
the cost function with regards to each model parameter ϴj.
• In other words, you need to calculate how much the cost function will
change if you change ϴj just a little bit.
• This is called a partial derivative.
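• The following equation computes the partial derivative of the cost
function with regard to parameter θj
$$\frac{\partial}{\partial \theta_j} \mathrm{MSE}(\theta) = \frac{2}{m} \sum_{i=1}^{m} \left(\theta^T x^{(i)} - y^{(i)}\right) x_j^{(i)}$$
Gradient Vector of a Cost Function
$$\nabla_\theta \mathrm{MSE}(\theta) = \frac{2}{m} X^T (X\theta - y)$$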
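One way to convince yourself that the vectorized gradient (2/m) Xᵀ(Xθ − y) really collects all the partial derivatives is to compare it with a finite-difference estimate, which changes each θj "just a little bit" and measures the change in the MSE. The sketch below does this on random data; the helper names and the data are assumptions made for the example.

```python
import numpy as np

def mse(theta, X_b, y):
    """MSE cost of a linear model with parameter vector theta."""
    errors = X_b @ theta - y
    return float(np.mean(errors ** 2))

def mse_gradient(theta, X_b, y):
    """Analytic gradient vector: (2/m) * X^T (X theta - y)."""
    m = len(X_b)
    return 2 / m * X_b.T @ (X_b @ theta - y)

def numerical_gradient(theta, X_b, y, eps=1e-6):
    """Central finite differences: nudge one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (mse(theta + step, X_b, y) - mse(theta - step, X_b, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X_b = np.c_[np.ones((50, 1)), rng.random((50, 2))]   # bias column + 2 features
y = rng.random((50, 1))
theta = rng.normal(size=(3, 1))

analytic = mse_gradient(theta, X_b, y)
numeric = numerical_gradient(theta, X_b, y)
print(np.c_[analytic, numeric])
```

The two columns printed at the end should agree to several decimal places.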
Complexity of Gradient Descent
• Notice that this formula involves calculations over the full training set
X, at each Gradient Descent step!
• This is why the algorithm is called Batch Gradient Descent: it uses the
whole batch of training data at every step
• As a result it is terribly slow on very large training sets
• However, Gradient Descent scales well with the number of features;
training a Linear Regression model when there are hundreds of
thousands of features is much faster using Gradient Descent than
using the Normal Equation or SVD decomposition
Learning Rate
• Once you have the gradient vector, which points uphill, just go in the
opposite direction to go downhill.
• This means subtracting ∇θMSE(θ) from θ
• This is where the learning rate η comes into play: multiply the
gradient vector by η to determine the size of the downhill step, as
shown in the equation below:
$$\theta^{(\text{next step})} = \theta - \eta \, \nabla_\theta \mathrm{MSE}(\theta)$$
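Batch Gradient Implementation:
import numpy as np

eta = 0.1                                  # learning rate
np.random.seed(0)
n_iter = 1000                              # number of Gradient Descent steps
m = 100                                    # number of training instances
x = 2 * np.random.rand(m, 1)               # one feature, uniform in [0, 2)
y = 4 + 3 * x + np.random.randn(m, 1)      # y = 4 + 3x + Gaussian noise
x_b = np.c_[np.ones((m, 1)), x]            # add x0 = 1 to each instance
theta = np.random.rand(2, 1)               # random initialization
print(theta)
for i in range(n_iter):
    # gradient of the MSE, computed over the full training set
    gradient = 2 / m * x_b.T.dot(x_b.dot(theta) - y)
    # step downhill, scaled by the learning rate
    theta = theta - eta * gradient
print(theta)                               # should end up close to [[4], [3]]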
Using different learning rate eta
Learning rate selection and stopping criteria
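• To find a good learning rate, you can use grid search
• However, you may want to limit the number of iterations so that grid
search can eliminate models that take too long to converge
• How many iterations should you run?
• Too many: you waste time while the model parameters no longer change
• Too few: you are still far from the optimal solution when the algorithm stops
• A simple solution: set a very large number of iterations, but stop as soon as the norm of the gradient vector becomes smaller than a tiny tolerance ϵ (gradient vector norm <= tolerance)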
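The tolerance-based stopping rule can be sketched as follows. This is an illustration, not the deck's implementation: the function name, the default values, and the zero initialization (instead of random initialization) are assumptions made to keep the example short and deterministic.

```python
import numpy as np

def batch_gradient_descent(X_b, y, eta=0.1, max_iter=10_000, tol=1e-6):
    """Batch GD that stops once the gradient norm drops to the tolerance."""
    m, n_params = X_b.shape
    theta = np.zeros((n_params, 1))   # assumption: start from zeros
    n_updates = 0
    for _ in range(max_iter):
        gradient = 2 / m * X_b.T @ (X_b @ theta - y)
        if np.linalg.norm(gradient) <= tol:
            break                     # close enough to the minimum
        theta = theta - eta * gradient
        n_updates += 1
    return theta, n_updates

rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]

theta, n_updates = batch_gradient_descent(X_b, y)
print(theta.ravel(), "after", n_updates, "updates")
```

With a generous `max_iter`, the tolerance decides when training stops; lowering `tol` gives a more precise solution at the price of more updates.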
Convergence Rate
• When the cost function is convex and its slope does not change
abruptly (as is the case for the MSE cost function), Batch Gradient
Descent with a fixed learning rate will eventually converge to the
optimal solution, but you may have to wait a while:
• It can take O(1/ϵ) iterations to reach the optimum within a range of ϵ
depending on the shape of the cost function.
• If you divide the tolerance by 10 to have a more precise solution, then
the algorithm may have to run about 10 times longer.
Feature Scaling
• When using Gradient Descent, you should ensure that all features
have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or
else it will take much longer to converge.
Stochastic GD
• The main problem with Batch Gradient Descent is the fact that it uses the whole training set to
compute the gradients at every step, which makes it very slow when the training set is large.
• At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training
set at every step and computes the gradients based only on that single instance.
• Obviously this makes the algorithm much faster since it has very little data to manipulate at every
iteration.
• It also makes it possible to train on huge training sets, since only one instance needs to be in
memory at each iteration (SGD can be implemented as an out-of-core algorithm)
• On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular
than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost
function will bounce up and down, decreasing only on average.
• Over time it will end up very close to the minimum, but once it gets there it will continue to
bounce around, never settling down
• So once the algorithm stops, the final parameter values are good, but not optimal.
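SGD Implementation:
import numpy as np

np.random.seed(0)
n_epochs = 50
m = 100
x = 2 * np.random.rand(m, 1)
y = 4 + 3 * x + np.random.randn(m, 1)
x_b = np.c_[np.ones((m, 1)), x]            # add x0 = 1 to each instance
t0, t1 = 5, 50                             # learning schedule hyperparameters

def learning_schedule(t):
    # the learning rate shrinks as the step counter t grows
    return t0 / (t + t1)

theta = np.random.randn(2, 1)              # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        # pick one random instance and compute the gradient on it alone
        random_index = np.random.randint(m)
        xi = x_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
print(theta)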
Linear Regression Algorithms: Comparison
Graph
Problem of Overfitting & Learning Curves
Regularization
• A good way to reduce overfitting is to
regularize the model (i.e., to constrain it): the fewer degrees of
freedom it has, the harder it will be for it to overfit the data
• For example, a simple way to regularize a polynomial model is to
reduce the number of polynomial degrees.
• For a linear model, regularization is typically achieved by constraining
the weights of the model. Three different ways to constrain the weights:
• Ridge Regression, Lasso Regression, and Elastic Net
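Ridge cost function
$$J(\theta) = \mathrm{MSE}(\theta) + \frac{\alpha}{2} \sum_{i=1}^{n} \theta_i^2$$
Ridge gradient calculation
$$\nabla_\theta \mathrm{MSE}(\theta) + \alpha \, \theta$$
Ridge closed-form equation (obtained by setting the Ridge gradient to zero)
$$\hat{\theta} = \left(X^T X + \frac{m \alpha}{2} A\right)^{-1} X^T y$$
• A is the (n + 1) × (n + 1) identity matrix with a 0 in the top-left cell, so the bias term θ0 is not regularized; many texts absorb the factor m/2 into α
Lasso cost function
$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} |\theta_i|$$
Lasso gradient calculation (a subgradient, since |θi| is not differentiable at 0)
$$\nabla_\theta \mathrm{MSE}(\theta) + \alpha \, \mathrm{sign}(\theta)$$
Elastic-Net cost function
$$J(\theta) = \mathrm{MSE}(\theta) + r \, \alpha \sum_{i=1}^{n} |\theta_i| + \frac{1 - r}{2} \, \alpha \sum_{i=1}^{n} \theta_i^2$$
Elastic-Net gradient calculation:
$$\nabla_\theta \mathrm{MSE}(\theta) + r \, \alpha \, \mathrm{sign}(\theta) + (1 - r) \, \alpha \, \theta$$
• α controls the regularization strength and r is the mix ratio between the Lasso penalty (r = 1) and the Ridge penalty (r = 0); in the penalty terms of the gradients, the θ0 component is set to 0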
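The Ridge gradient above, ∇MSE(θ) + αθ, can be checked against a closed-form solution in NumPy. The sketch assumes the cost MSE(θ) + (α/2) Σ θi² with an unpenalized bias term, which is the cost whose gradient has that form; with the MSE averaged over m instances, setting the gradient to zero gives (XᵀX + (mα/2)A) θ = Xᵀy. The function names and the data are made up for the example.

```python
import numpy as np

def ridge_closed_form(X_b, y, alpha):
    """Solve (X^T X + (m * alpha / 2) * A) theta = X^T y, bias unpenalized."""
    m, n_params = X_b.shape
    A = np.eye(n_params)
    A[0, 0] = 0.0                         # do not regularize the bias term
    return np.linalg.solve(X_b.T @ X_b + (m * alpha / 2) * A, X_b.T @ y)

def ridge_gradient(theta, X_b, y, alpha):
    """Gradient of MSE(theta) + (alpha / 2) * sum(theta_i^2), i >= 1."""
    m = len(X_b)
    penalty = alpha * theta
    penalty[0] = 0.0                      # no penalty on the bias term
    return 2 / m * X_b.T @ (X_b @ theta - y) + penalty

rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]

alpha = 0.5
theta_ridge = ridge_closed_form(X_b, y, alpha)
theta_ols = ridge_closed_form(X_b, y, 0.0)   # alpha = 0 is plain least squares

print("ridge:", theta_ridge.ravel(), "unregularized:", theta_ols.ravel())
print("gradient at ridge solution:", ridge_gradient(theta_ridge, X_b, y, alpha).ravel())
```

The gradient at the closed-form solution is numerically zero, and the Ridge slope is smaller in magnitude than the unregularized one.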
Sklearn (ridge, lasso, elasticnet)
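A short sketch of the three regularized models with Scikit-Learn's `Ridge`, `Lasso`, and `ElasticNet` classes. The data and the `alpha` and `l1_ratio` values are arbitrary choices for illustration, and Scikit-Learn scales its penalties slightly differently from the cost functions written on the slides, so the same α does not give identical results.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)   # 1-D target vector

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # l1_ratio is the mix ratio r
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = model.predict([[1.5]])[0]   # noiseless target is 8.5
    print(name, model.intercept_, model.coef_, predictions[name])
```

All three models learn a slope somewhat below 3 because the penalty shrinks the weight toward zero.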
Sklearn – Linear, GD and Polynomial regression
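The three approaches named in this slide title can be sketched with Scikit-Learn as follows: `LinearRegression` (closed-form, SVD-based), `SGDRegressor` (Stochastic Gradient Descent), and `PolynomialFeatures` combined with `LinearRegression` for polynomial regression. The data sets and hyperparameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
m = 100

# Linear data: y = 4 + 3x + noise
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=m)

lin_reg = LinearRegression().fit(X, y)
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.1, random_state=0).fit(X, y)

lin_pred = lin_reg.predict([[1.0]])[0]   # noiseless target is 7
sgd_pred = sgd_reg.predict([[1.0]])[0]

# Quadratic data: y = 0.5x^2 + x + 2 + noise
X_quad = 6 * rng.random((m, 1)) - 3
y_quad = 0.5 * X_quad[:, 0] ** 2 + X_quad[:, 0] + 2 + rng.normal(size=m)

# Polynomial regression = add x^2 as a feature, then fit a linear model.
poly_reg = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X_quad, y_quad)
poly_pred = poly_reg.predict([[2.0]])[0]  # noiseless target is 6

print(lin_pred, sgd_pred, poly_pred)
```

The SGD result is close to, but not identical to, the closed-form one, which matches the "good, but not optimal" behaviour described for Stochastic GD.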
Logistic Regression
• Logistic Regression (also called Logit Regression) is commonly used to
estimate the probability that an instance belongs to a particular class
• If the estimated probability is greater than 50%, the model
predicts that the instance belongs to the positive class (labeled 1);
otherwise it predicts the negative class (labeled 0)
• This makes it a binary classifier
• Instead of outputting the result directly like the Linear Regression
model does, it outputs the logistic of this result
Logistic Regression
• Logistic Regression model estimated probability (vectorized form):
$$p = h_\theta(x) = \sigma(x^T \theta)$$
• The logistic, noted σ(·), is a sigmoid function (i.e., S-shaped) that
outputs a number between 0 and 1.
• It is defined by the equation below and plotted in the figure (matplotlib)
$$\sigma(t) = \frac{1}{1 + \exp(-t)}$$
Sigmoid Function
• Notice that σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a
Logistic Regression model predicts 1 if xᵀθ is positive, and 0 if it is
negative.
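A minimal sketch of the sigmoid function and of the resulting prediction rule. The parameter vector and the three instances are hypothetical values chosen so that the scores xᵀθ are negative, zero, and positive.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def predict(X_b, theta, threshold=0.5):
    """Predict class 1 when the estimated probability reaches the threshold."""
    proba = sigmoid(X_b @ theta)
    return (proba >= threshold).astype(int)

theta = np.array([-1.0, 2.0])            # hypothetical bias and weight
X_b = np.array([[1.0, 0.0],              # score x^T theta = -1
                [1.0, 0.5],              # score x^T theta =  0
                [1.0, 2.0]])             # score x^T theta =  3

print(sigmoid(X_b @ theta))
print(predict(X_b, theta))
```

The instance with a score of exactly 0 gets probability 0.5 and is therefore assigned to the positive class.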
Prediction
• Once the Logistic Regression model has estimated the probability p =
hθ(x) that an instance x belongs to the positive class, it can make its
prediction ŷ easily:
$$\hat{y} = \begin{cases} 0 & \text{if } p < 0.5 \\ 1 & \text{if } p \ge 0.5 \end{cases}$$
Cost function
• Good, now you know how a Logistic Regression model estimates
probabilities and makes predictions.
• But how is it trained?
• The objective of training is to set the parameter vector θ so that the
model estimates high probabilities for positive instances (y =1) and
low probabilities for negative instances (y = 0).
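• This idea is captured by the cost function for a single training instance, shown in the equation below
$$c(\theta) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{if } y = 0 \end{cases}$$
Cost function (continued):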
• This cost function makes sense because – log(t) grows very large
when t approaches 0, so the cost will be large if the model estimates
a probability close to 0 for a positive instance, and it will also be very
large if the model estimates a probability close to 1 for a negative
instance.
• On the other hand, – log(t) is close to 0 when t is close to 1, so the
cost will be close to 0 if the estimated probability is close to 0 for a
negative instance or close to 1 for a positive instance, which is
precisely what we want.
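The behaviour described above is easy to check numerically. The sketch below evaluates the per-instance cost, −log(p) for a positive instance and −log(1 − p) for a negative one, on four illustrative probability and label pairs.

```python
import numpy as np

def instance_cost(p, y):
    """Per-instance cost: -log(p) if y == 1, -log(1 - p) if y == 0."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

p = np.array([0.99, 0.01, 0.99, 0.01])   # estimated probabilities
y = np.array([1, 1, 0, 0])               # true labels

costs = instance_cost(p, y)
for p_i, y_i, c_i in zip(p, y, costs):
    print(f"p = {p_i:.2f}, y = {y_i}: cost = {c_i:.3f}")
```

Confident and correct estimates cost almost nothing, while confident and wrong estimates are penalized heavily.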
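Logistic cost function partial derivatives
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(\sigma(\theta^T x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
• J(θ) is the average of the per-instance cost over the m training instances (the log loss)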
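With these partial derivatives, a Logistic Regression model can be trained by Batch Gradient Descent in much the same way as Linear Regression. The sketch below is one possible implementation on synthetic one-feature data; the data generation, learning rate, and number of steps are assumptions made for the example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss_gradient(theta, X_b, y):
    """Vectorized partial derivatives: (1/m) * X^T (sigmoid(X theta) - y)."""
    m = len(X_b)
    return X_b.T @ (sigmoid(X_b @ theta) - y) / m

rng = np.random.default_rng(0)
m = 200
x = rng.normal(size=(m, 1))
# Noisy labels: positive class when x plus some noise is above 0.
y = (x + 0.5 * rng.normal(size=(m, 1)) > 0).astype(float)
X_b = np.c_[np.ones((m, 1)), x]          # add x0 = 1 to each instance

theta = np.zeros((2, 1))                 # assumption: start from zeros
eta = 0.5                                # learning rate
for _ in range(2000):
    theta = theta - eta * log_loss_gradient(theta, X_b, y)

predicted = sigmoid(X_b @ theta) >= 0.5
accuracy = np.mean(predicted == (y == 1))
print(theta.ravel(), accuracy)
```

The learned weight on x is positive, as expected, and the training accuracy is well above chance but below 100% because the labels are noisy.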
Comment:
• It is often the case that a learning algorithm will try to optimize a
different function than the performance measure used to evaluate the
final model. This is generally because that function is easier to compute,
because it has useful differentiation properties that the performance
measure lacks, or because we want to constrain the model during training,
as is done with regularization.
Knowledge:
• Gradient Descent (GD) gradually tweaks the model parameters
to minimize the cost function over the training set, eventually
converging to the same set of parameters as the closed-form solution (the Normal Equation).
• Variants of Gradient Descent that will be used in neural networks:
Batch GD, Mini-batch GD, and Stochastic GD.
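Batch GD and Stochastic GD are implemented earlier in the deck; Mini-batch GD sits between the two by computing the gradient on a small random subset of instances at each step. The sketch below is one possible implementation on the same kind of synthetic data; the batch size, constant learning rate, and number of epochs are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, batch_size, n_epochs, eta = 100, 20, 200, 0.05

X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(size=(m, 1))
X_b = np.c_[np.ones((m, 1)), X]          # add x0 = 1 to each instance

theta = rng.normal(size=(2, 1))          # random initialization
for epoch in range(n_epochs):
    indices = rng.permutation(m)         # reshuffle the training set each epoch
    for start in range(0, m, batch_size):
        batch = indices[start:start + batch_size]
        xb, yb = X_b[batch], y[batch]
        # MSE gradient computed on the mini-batch only
        gradient = 2 / len(batch) * xb.T @ (xb @ theta - yb)
        theta = theta - eta * gradient

print(theta.ravel())
```

With a constant learning rate the parameters keep bouncing slightly around the minimum, so the result is close to the closed-form solution without matching it exactly.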
Thank You!!!
"Success is not an accident, it is hard work,
perseverance, learning, sacrifice and most of all
determination towards what you are doing."

Regression ppt

  • 1. Regression Machine Learning Perspective BY: Suyash pratap Singh and Vineet raj Parashar JUET,GUNA
  • 2. Good understanding of how regression works can help: • Having a the appropriate model • Right training algorithm to use • Good set of hyperparameters for your task. • Debug issues and perform error analysis more efficiently • Lastly, most of the topics discussed will be essential in understanding, building, and training neural networks
  • 3. Agenda • Linear Regression model • Closed-form equation • Normal Equation • Iterative optimization approach • Gradient Descent • Variants
  • 4. Polynomial Regression • More complex model that can fit nonlinear datasets. • It has more parameters than Linear Regression so it is more prone to overfitting the training data • How to detect overfitting • Learning curves
  • 5. Regularization and Logistic Regression • Regularization methods • Techniques that can reduce the risk of overfitting the training set • Finally, we will look at two more models that are commonly used for classification tasks: • Logistic Regression • Softmax Regression.
  • 6. Prerequisite: Linear algebra and calculus • Vectors • Matrices • Operations • Transpose • Dot product • Matrix inverse • Partial derivatives
  • 7. Linear Regression • Simple regression model of life satisfaction: life_satisfaction = • GDP_per_capita is the input feature • This model is just a linear function of the input feature • ϴ0 andϴ1 are the model’s parameters • More generally, a linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term
  • 9. Linear Regression: Vectorized Form • This can be written much more concisely using a vectorized form, as shown below:
  • 10. Performance Measures • Above given equation is the linear regression model • Now question is : how to train this model? • Training a model is to find values its parameters so that the model best fits the training set. • What is meaning of best here? • For this purpose, we first need a measure of how well (or poorly) the model fits the training data • Most common performance measure of a regression model is the Root Mean Square Error (RMSE) • Therefore, to train a Linear Regression model, you need to find the value of ϴ that minimizes the RMSE. • RMSE is more sensitive to outliers
  • 11. RMSE • RMSE requires MSE • The MSE of a Linear Regression hypothesis hϴ on a training set X is calculated using
  • 13. Notations Age PClaass Gender Fare 67 1 1 1200 54 2 1 1000 8 3 0 600 18 3 1 500 2 1 0 650 • m is the number of instances in the dataset we are measuring the RMSE on • x(i) is a vector of all the feature values (excluding the label) of the ith instance in the dataset, and y(i) is its label (the desired output value for that instance) • For instance : First passenger in the dataset with name “Alan” his age is 67 years and travelled in 1st class and he/she has to pay 1200 dollars • x (1)= 67 1 1 • y(1) = 1200
  • 14. Vectorized Dataset 67 1 1 54 2 1 8 3 0 18 3 1 2 1 0 𝑦 = 1200 1000 600 500 650
  • 15. Prediction error • For example, if your system predicts that the fare for the first passenger is $1110, then ŷ(1) = h(x(1)) = $1110. • The prediction error for this passenger is:  ŷ(1) – y(1) = $90
  • 16. The Normal Equation • To find the value of θ that minimizes the cost function, there is a closed-form solution • in other words, a mathematical equation that gives the result directly. This is called the Normal Equation • θ is the value of θ that minimizes the cost function. • y is the vector of target values containing y(1) to y(m).
  • 17. Training Complexity • The Normal Equation computes the inverse of XT X, which is an (n + 1) × (n + 1) matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about O(n2.4) to O(n3) (depending on the implementation). • In other words, if you double the number of features, you multiply the computation time by roughly 22.4 = 5.3 to 23 = 8. • The SVD approach used by Scikit-Learn’s LinearRegression class is about O(n2). • If you double the number of features, you multiply the computation time by roughly 4.
  • 18. When to use Normal Equation? • Both the Normal Equation and the SVD approach get very slow when the number of features grows large (e.g., 100,000). • On the positive side, both are linear with regards to the number of instances in the training set (they are O(m)), so they handle large training sets efficiently, provided they can fit in memory.
  • 19. Prediction Complexity • Also, once you have trained your Linear Regression model (using the Normal Equation or any other algorithm), predictions are very fast • The computational complexity is linear with regards to both the number of instances you want to make predictions on and the number of features. • In other words, making predictions on twice as many instances (or twice as many features) will just take roughly twice as much time.
  • 20. When not to use Normal Equation: • Cases where there are a large number of features • Too many training instances to fit in memory • Now we have to look at very different ways to train a Linear Regression model
  • 21. Gradient Descent • Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. • The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function. • How it works? • It measures the local gradient of the error function with regards to the parameter vector θ, and it goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum! • Concretely, you start by filling θ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum
  • 25. GD Pitfalls • Finally, not all cost functions look like nice regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrains, making convergence to the minimum very difficult.
  • 26. Global minimum • Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. • This implies that there are no local minima, just one global minimum. It is also a continuous function with a slope that never changes abruptly. • These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close the global minimum (if you wait long enough and if the learning rate is not too high).
  • 27. Importance of Feature Scaling • The cost function has the shape of a bowl, but it can be an elongated bowl if the features have very different scales. • Figure shows Gradient Descent on a training set where features 1 and 2 have the same scale (on the left), and on a training set where feature 1 has such smaller values than feature 2 (on the right). • When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit- Learn’s StandardScaler class), or else it will take much longer to converge.
  • 28. Model’s parameter space • Above diagram also illustrates the fact that training a model means searching for a combination of model parameters that minimizes a cost function (over the training set). • It is a search in the model’s parameter space: the more parameters a model has, the more dimensions this space has, and the harder the search is: searching for a needle in a 300-dimensional haystack is much trickier than in three dimensions. • Fortunately, since the cost function is convex in the case of Linear Regression, the needle is simply at the bottom of the bowl.
  • 29. Batch Gradient Descent • To implement Gradient Descent, you need to compute the gradient of the cost function with regards to each model parameter ϴj. • In other words, you need to calculate how much the cost function will change if you change ϴj just a little bit. • This is called a partial derivative. • Following equation computes the partial derivative of the cost function with regards to parameter ϴj
  • 30. Gradient Vector of a Cost Function
  • 31. Complexity of Gradient Descent • Notice that this formula involves calculations over the full training set X, at each Gradient Descent step! • This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step • As a result it is terribly slow on very large training sets • However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation or SVD decomposition
  • 32. Learning Rate • Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. • This means subtracting ∇θMSE(θ) from θ • This is where the learning rate η comes into play: multiply the gradient vector by η to determine the size of the downhill step as shown in below Equation:
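  • 33. Batch Gradient Implementation:
      import numpy as np

      np.random.seed(0)
      eta = 0.1       # learning rate
      n_iter = 1000   # number of Gradient Descent steps
      m = 100         # number of training instances

      # Noisy linear data: y = 4 + 3x + Gaussian noise
      x = 2 * np.random.rand(m, 1)
      y = 4 + 3 * x + np.random.randn(m, 1)
      x_b = np.c_[np.ones((m, 1)), x]   # add the bias feature x0 = 1 to each instance

      theta = np.random.rand(2, 1)      # random initialization
      print(theta)
      for i in range(n_iter):
          # Gradient of the MSE, computed over the full training set
          gradient = 2 / m * x_b.T.dot(x_b.dot(theta) - y)
          theta = theta - eta * gradient
      print(theta)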
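  • 35. Learning rate selection and stopping criteria • To find a good learning rate, you can use grid search. • However, you may want to limit the number of iterations so that grid search can eliminate models that take too long to converge. • How many iterations should you allow? • Too large: you waste time after the parameters have stopped changing. • Too small: the algorithm stops while it is still far from the optimum. • A simple solution is to set a very large number of iterations and stop as soon as the norm of the gradient vector drops to a small tolerance ϵ or below (gradient norm ≤ tolerance).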
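The tolerance-based stopping rule can be sketched as follows. The function name batch_gd and its default values are assumptions for illustration; the result is compared with NumPy's least-squares solver as a reference.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

def batch_gd(eta=0.1, max_iter=100_000, tol=1e-6):
    """Batch Gradient Descent that stops once the gradient norm is <= tol."""
    theta = np.zeros((2, 1))
    for iteration in range(max_iter):
        gradient = 2 / m * X_b.T @ (X_b @ theta - y)
        if np.linalg.norm(gradient) <= tol:
            break                         # close enough to the bottom of the bowl
        theta = theta - eta * gradient
    return theta, iteration

theta, n_used = batch_gd()
theta_exact = np.linalg.lstsq(X_b, y, rcond=None)[0]   # least-squares reference
print("iterations used:", n_used)
print("gradient descent:", theta.ravel())
print("least squares:   ", theta_exact.ravel())
```

Calling batch_gd with a smaller tol shows the trade-off from the next slide: a tighter tolerance needs more iterations.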
  • 36. Convergence Rate • When the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), Batch Gradient Descent with a fixed learning rate will eventually converge to the optimal solution, but you may have to wait a while: • It can take O(1/ϵ) iterations to reach the optimum within a range of ϵ depending on the shape of the cost function. • If you divide the tolerance by 10 to have a more precise solution, then the algorithm may have to run about 10 times longer.
  • 37. Feature Scaling • When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge.
  • 38. Stochastic GD • The main problem with Batch Gradient Descent is that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. • At the opposite extreme, Stochastic Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that single instance. • This makes the algorithm much faster, since it has very little data to manipulate at every iteration. • It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm). • On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. • Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. • So once the algorithm stops, the final parameter values are good, but not optimal.
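  • 39. SGD Implementation:
      import numpy as np

      np.random.seed(0)
      n_epochs = 50
      m = 100         # number of training instances

      # Noisy linear data: y = 4 + 3x + Gaussian noise
      x = 2 * np.random.rand(m, 1)
      y = 4 + 3 * x + np.random.randn(m, 1)
      x_b = np.c_[np.ones((m, 1)), x]   # add the bias feature x0 = 1 to each instance

      t0, t1 = 5, 50  # learning schedule hyperparameters

      def learning_schedule(t):
          # The learning rate shrinks as the step count t grows
          return t0 / (t + t1)

      theta = np.random.randn(2, 1)     # random initialization
      for epoch in range(n_epochs):
          for i in range(m):
              # Pick one random instance and compute the gradient on it alone
              random_index = np.random.randint(m)
              xi = x_b[random_index:random_index + 1]
              yi = y[random_index:random_index + 1]
              gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
              eta = learning_schedule(epoch * m + i)
              theta = theta - eta * gradients
      print(theta)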
  • 43. Regularization • As we saw in Chapters 1 and 2, a good way to reduce overfitting is to regularize the model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be for it to overfit the data. • For example, a simple way to regularize a polynomial model is to reduce the number of polynomial degrees. • For a linear model, regularization is typically achieved by constraining the weights of the model. • Three different ways to constrain the weights: Ridge Regression, Lasso Regression, and Elastic Net.
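  • 44. Regularized cost functions and gradients (α is the regularization strength and r the Elastic Net mix ratio; the bias term θ0 is not regularized, so the sums run over i = 1, …, n and the penalty contributes nothing to the θ0 component of each gradient) • Ridge cost function: $J(\theta) = \mathrm{MSE}(\theta) + \frac{\alpha}{2}\sum_{i=1}^{n}\theta_i^2$ • Ridge gradient calculation: $\nabla_{\theta}\mathrm{MSE}(\theta) + \alpha\,\theta$ • Ridge closed-form equation: setting the Ridge gradient to zero, with $\nabla_{\theta}\mathrm{MSE}(\theta) = \frac{2}{m}X^{T}(X\theta - y)$, gives $\hat{\theta} = \left(X^{T}X + \frac{m\alpha}{2}A\right)^{-1}X^{T}y$, where A is the identity matrix with a 0 in the top-left corner (the bias position) • Lasso cost function: $J(\theta) = \mathrm{MSE}(\theta) + \alpha\sum_{i=1}^{n}|\theta_i|$ • Lasso (sub)gradient calculation: $\nabla_{\theta}\mathrm{MSE}(\theta) + \alpha\,\mathrm{sign}(\theta)$ • Elastic-Net cost function: $J(\theta) = \mathrm{MSE}(\theta) + r\alpha\sum_{i=1}^{n}|\theta_i| + \frac{1-r}{2}\,\alpha\sum_{i=1}^{n}\theta_i^2$ • Elastic-Net gradient calculation: $\nabla_{\theta}\mathrm{MSE}(\theta) + r\alpha\,\mathrm{sign}(\theta) + (1-r)\,\alpha\,\theta$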
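A minimal NumPy sketch of Ridge Regression in closed form, assuming the cost J(θ) = MSE(θ) + (α/2) Σ θi² with the bias term left unpenalized. Under that assumption, setting the gradient to zero gives θ = (XᵀX + (mα/2)A)⁻¹ Xᵀy, where A is the identity matrix with a 0 in the bias position. The function names and the synthetic data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

def ridge_closed_form(X_b, y, alpha):
    """Minimize MSE(theta) + (alpha / 2) * sum(theta_i ** 2 for i >= 1)."""
    m, n_params = X_b.shape
    A = np.identity(n_params)
    A[0, 0] = 0.0                          # leave the bias term unpenalized
    return np.linalg.solve(X_b.T @ X_b + 0.5 * m * alpha * A, X_b.T @ y)

def ridge_gradient(theta, X_b, y, alpha):
    """Gradient of the same cost: MSE gradient plus alpha * theta (bias excluded)."""
    m = X_b.shape[0]
    penalty = alpha * theta
    penalty[0] = 0.0
    return 2 / m * X_b.T @ (X_b @ theta - y) + penalty

theta_ols = ridge_closed_form(X_b, y, alpha=0.0)     # plain least squares
theta_ridge = ridge_closed_form(X_b, y, alpha=1.0)
print("alpha=0:", theta_ols.ravel())
print("alpha=1:", theta_ridge.ravel())
print("gradient at the ridge solution:", ridge_gradient(theta_ridge, X_b, y, 1.0).ravel())
```

The slope shrinks toward zero as α grows, and the gradient of the regularized cost vanishes at the closed-form solution.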
  • 45. Sklearn (ridge, lasso, elasticnet)
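The code for this slide is not in the transcript. The following is a sketch of how the three regularized models are typically fitted with scikit-learn; the synthetic data and the alpha and l1_ratio values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(m)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # l1_ratio is the mix ratio r
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = float(model.predict([[1.5]])[0])
    print(name, model.intercept_, model.coef_, predictions[name])
```

Note that scikit-learn scales the data-fit and penalty terms somewhat differently from the cost functions on the previous slide, so the same numerical alpha does not correspond to the same amount of regularization.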
  • 46. Sklearn – Linear, GD and Polynomial regression
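The code for this slide is also missing from the transcript. Below is a sketch, under assumed synthetic data and hyperparameters, of the three scikit-learn approaches the title names: LinearRegression, Gradient Descent via SGDRegressor, and Polynomial Regression via PolynomialFeatures in a pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
m = 200

# Linear data: y = 4 + 3x + noise
X_lin = 2 * rng.random((m, 1))
y_lin = 4 + 3 * X_lin[:, 0] + rng.standard_normal(m)

# Quadratic data: y = 0.5 x^2 + x + 2 + noise
X_quad = 6 * rng.random((m, 1)) - 3
y_quad = 0.5 * X_quad[:, 0] ** 2 + X_quad[:, 0] + 2 + rng.standard_normal(m)

# 1. Linear Regression (direct least-squares solution)
lin_reg = LinearRegression().fit(X_lin, y_lin)

# 2. Stochastic Gradient Descent without regularization
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1, random_state=0)
sgd_reg.fit(X_lin, y_lin)

# 3. Polynomial Regression: add x^2 as a feature, then fit a linear model
poly_reg = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly_reg.fit(X_quad, y_quad)
poly_coef = poly_reg.named_steps["linearregression"].coef_

print("LinearRegression:", lin_reg.intercept_, lin_reg.coef_)
print("SGDRegressor:    ", sgd_reg.intercept_, sgd_reg.coef_)
print("Polynomial coefficients [x, x^2]:", poly_coef)
```

SGDRegressor lands near the LinearRegression parameters rather than exactly on them, which matches the earlier remark that Stochastic GD ends up close to the minimum without settling on it.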
  • 47. Logistic Regression • Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class. • If the estimated probability is greater than 50%, the model predicts that the instance belongs to the positive class; otherwise, it predicts the negative class. • This makes it a binary classifier. • Instead of outputting the weighted sum of the input features directly like the Linear Regression model does, it outputs the logistic of this result.
  • 48. Logistic Regression • Logistic Regression model estimated probability (vectorized form): $\hat{p} = h_{\theta}(x) = \sigma(x^{T}\theta)$ • The logistic, noted σ(·), is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1. • It is defined as $\sigma(t) = \frac{1}{1 + \exp(-t)}$ and plotted in the figure (matplotlib).
  • 49. Sigmoid Function • Notice that σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a Logistic Regression model predicts 1 if $x^{T}\theta$ is positive, and 0 if it is negative.
  • 50. Prediction • Once the Logistic Regression model has estimated the probability $\hat{p} = h_{\theta}(x)$ that an instance x belongs to the positive class, it can make its prediction ŷ easily: ŷ = 0 if $\hat{p} < 0.5$, and ŷ = 1 if $\hat{p} \geq 0.5$.
  • 51. Cost function • Good, now you know how a Logistic Regression model estimates probabilities and makes predictions. • But how is it trained? • The objective of training is to set the parameter vector θ so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0). • This idea is captured by the cost function of a single training instance: $c(\theta) = -\log(\hat{p})$ if y = 1, and $c(\theta) = -\log(1 - \hat{p})$ if y = 0.
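The probability estimate and the 0.5 threshold can be sketched in a few lines of NumPy. The parameter vector theta and the three instances are hypothetical values picked so that the scores are −1, 0 and 3.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(X_b, theta):
    """Estimated probability of the positive class for each instance."""
    return sigmoid(X_b @ theta)

def predict(X_b, theta, threshold=0.5):
    """Predict 1 when the estimated probability reaches the threshold, else 0."""
    return (predict_proba(X_b, theta) >= threshold).astype(int)

theta = np.array([-1.0, 2.0])      # hypothetical parameters: bias and one weight
X_b = np.array([[1.0, 0.0],        # score = -1 (negative side)
                [1.0, 0.5],        # score =  0 (exactly on the boundary)
                [1.0, 2.0]])       # score =  3 (positive side)

probabilities = predict_proba(X_b, theta)
labels = predict(X_b, theta)
print(probabilities)
print(labels)
```

The instance whose score is exactly 0 gets probability 0.5 and is assigned to the positive class, matching the σ(t) ≥ 0.5 when t ≥ 0 rule.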
  • 52. Continue: • This cost function makes sense because – log(t) grows very large when t approaches 0, so the cost will be large if the model estimates a probability close to 0 for a positive instance, and it will also be very large if the model estimates a probability close to 1 for a negative instance. • On the other hand, – log(t) is close to 0 when t is close to 1, so the cost will be close to 0 if the estimated probability is close to 0 for a negative instance or close to 1 for a positive instance, which is precisely what we want.
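Averaging the per-instance cost over the training set gives the log loss, J(θ) = −(1/m) Σ [y log(p̂) + (1 − y) log(1 − p̂)]. The sketch below evaluates it on hypothetical probabilities to show the behaviour described above; the eps clipping is an added safeguard against log(0), not part of the formula.

```python
import numpy as np

def log_loss(y_true, p_hat, eps=1e-15):
    """Average logistic cost; probabilities are clipped to avoid log(0)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    per_instance = y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat)
    return float(-np.mean(per_instance))

y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.99, 0.01, 0.95, 0.05])   # high p for y=1, low p for y=0
confident_wrong = np.array([0.01, 0.99, 0.05, 0.95])   # the exact opposite

loss_right = log_loss(y_true, confident_right)
loss_wrong = log_loss(y_true, confident_wrong)
print("confident and right:", loss_right)
print("confident and wrong:", loss_wrong)
```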
  • 53. Logistic cost function partial derivatives
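The equation for this slide is missing from the transcript. For the log loss, the standard partial derivative is ∂J/∂θj = (1/m) Σ (σ(θᵀx(i)) − y(i)) xj(i), which in vectorized form is (1/m) Xᵀ(σ(Xθ) − y). The sketch below uses that gradient in plain Batch Gradient Descent; the synthetic data, learning rate and iteration count are assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
m = 200
x = rng.uniform(-3, 3, size=(m, 1))
# Noisy labels: mostly 1 when x > 0, with some overlap near the boundary.
y = (x[:, 0] + 0.5 * rng.standard_normal(m) > 0).astype(float)
X_b = np.c_[np.ones((m, 1)), x]            # add the bias feature x0 = 1

theta = np.zeros(2)
eta = 0.5                                  # learning rate
for _ in range(2000):
    # Vectorized partial derivatives of the log loss
    gradient = X_b.T @ (sigmoid(X_b @ theta) - y) / m
    theta -= eta * gradient

accuracy = float(np.mean((sigmoid(X_b @ theta) >= 0.5) == (y == 1)))
print("theta:", theta)
print("training accuracy:", accuracy)
```

The gradient has the same form as the Linear Regression one, with the prediction Xθ replaced by σ(Xθ), so the Gradient Descent variants from the earlier slides apply unchanged.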
  • 54. Comment: • It is often the case that a learning algorithm will try to optimize a different function than the performance measure used to evaluate the final model. • This is generally because that function is easier to compute, because it has useful differentiation properties that the performance measure lacks, or because we want to constrain the model during training, as we will see when we discuss regularization.
  • 55. Knowledge: • Gradient Descent (GD) is an iterative optimization approach that gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the closed-form solution (the Normal Equation). • Variants of Gradient Descent that will be used in neural networks: Batch GD, Mini-batch GD, and Stochastic GD.
  • 56. Thank You!!! "Success is not an accident, it is hard work, perseverance, learning, sacrifice and most of all determination towards what you are doing."