CS 179: LECTURE 13
INTRO TO MACHINE LEARNING
GOALS OF WEEKS 5-6
 What is machine learning (ML) and when is it useful?
 Intro to major techniques and applications
 Give examples
 How can CUDA help?
 Departure from usual pattern: we will give the
application first, and the CUDA later
HOW TO FOLLOW THIS LECTURE
 This lecture and the next one will have a lot of math!
 Don’t worry about keeping up with the derivations 100%
 Important equations will be boxed
 Key terms to understand: loss/objective function, linear
regression, gradient descent, linear classifier
 The theory lectures will probably be boring for those of
you who have done some machine learning (CS156/155)
already
WHAT IS ML GOOD FOR?
 Handwriting recognition
 Spam detection
WHAT IS ML GOOD FOR?
 Teaching a robot how to do a backflip
 https://youtu.be/fRj34o4hN4I
 Predicting the performance of a stock portfolio
 The list goes on!
WHAT IS ML?
 What do these problems have in common?
 Some pattern we want to learn
 No good closed-form model for it
 LOTS of data
 What can we do?
 Use data to learn a statistical model for
the pattern we are interested in
DATA REPRESENTATION
 One data point is a vector 𝑥 in ℝ𝑑
 A 30 × 30 pixel image is a 900-dimensional
vector (one component per pixel intensity)
 If we are classifying an email as spam or not
spam, set 𝑑 = number of words in dictionary
 Count the number of times 𝑛𝑖 that a word 𝑖
appears in an email and set 𝑥𝑖 = 𝑛𝑖
 The possibilities are endless 
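 Illustrative aside (not from the original slides): a minimal NumPy sketch of the bag-of-words representation described above, using a tiny hypothetical dictionary.

import numpy as np

# Hypothetical toy dictionary; in practice d = size of the full vocabulary.
dictionary = ["cheap", "meeting", "offer", "project", "winner"]
word_index = {word: i for i, word in enumerate(dictionary)}

def email_to_vector(email_text):
    """Bag-of-words vector: x_i = number of times word i appears in the email."""
    x = np.zeros(len(dictionary))
    for word in email_text.lower().split():
        if word in word_index:
            x[word_index[word]] += 1
    return x

x = email_to_vector("cheap offer cheap winner")   # x == [2., 0., 1., 0., 1.]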
WHAT ARE WE TRYING TO DO?
 Given an input 𝑥 ∈ ℝ^𝑑, produce an output 𝑦
 What is 𝑦?
 Could be a real number, e.g. predicted return
of a given stock portfolio
 Could be 0 or 1, e.g. spam or not spam
 Could be a vector in ℝ^𝑚, e.g. telling a robot how to move each of its 𝑚 joints
 Just like 𝑥, 𝑦 can be almost anything 
EXAMPLE OF (𝑥, 𝑦) PAIRS
 ,
0
0
0
0
0
1
0
0
0
0
, ,
1
0
0
0
0
0
0
0
0
0
, ,
0
1
0
0
0
0
0
0
0
0
, ,
0
0
0
1
0
0
0
0
0
0
, etc.
NOTATION
𝑥′ = (1, 𝑥) ∈ ℝ^(𝑑+1)
𝐗 = [𝑥^(1), …, 𝑥^(𝑁)] ∈ ℝ^(𝑑×𝑁)
𝐗′ = [𝑥^(1)′, …, 𝑥^(𝑁)′] ∈ ℝ^((𝑑+1)×𝑁)
𝐘 = [𝑦^(1), …, 𝑦^(𝑁)]ᵀ ∈ ℝ^(𝑁×𝑚)
𝕀[𝑝] = 1 if 𝑝 is true, 0 otherwise
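 Illustrative aside (assumed NumPy layout, not from the original slides): building 𝐗′ by prepending a row of 1s, and building the one-hot rows of 𝐘.

import numpy as np

N, d, m = 5, 3, 4                          # illustrative sizes
X = np.random.randn(d, N)                  # X: one column per data point, shape (d, N)
labels = np.random.randint(0, m, size=N)   # class index of each data point

X_prime = np.vstack([np.ones((1, N)), X])  # X': prepend a constant 1, shape (d+1, N)

Y = np.zeros((N, m))                       # Y: one row per data point, one-hot in R^m
Y[np.arange(N), labels] = 1.0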
STATISTICAL MODELS
 Given (𝐗, 𝐘) (𝑁 pairs of (𝑥^(𝑖), 𝑦^(𝑖)) data), how do we accurately predict an output 𝑦 given an input 𝑥?
 One solution: a model 𝑓(𝑥) parametrized by a vector (or matrix) 𝑤, denoted as 𝑓(𝑥; 𝑤)
 The task is finding a set of optimal
parameters 𝑤
FITTING A MODEL
 So what does optimal mean?
 Under some measure of closeness, we want
𝑓(𝑥; 𝑤) to be as close as possible to the true
solution 𝑦 for any input 𝑥
 This measure of closeness is called a loss
function or objective function and is
denoted 𝐽(𝑤; 𝐗, 𝐘) -- it depends on our data
set (𝐗, 𝐘)!
 To fit a model, we try to find parameters 𝑤∗
that minimize 𝐽(𝑤; 𝐗, 𝐘), i.e. an optimal 𝑤
FITTING A MODEL
 What characterizes a good loss function?
 Represents the magnitude of our model’s
error on the data we are given
 Penalizes large errors more than small ones
 Continuous and differentiable in 𝑤
 Bonus points if it is also convex in 𝑤
 Continuity, differentiability, and convexity are
to make minimization easier
LINEAR REGRESSION
 𝑓(𝑥; 𝑤) = 𝑤₀ + Σᵢ₌₁ᵈ 𝑤ᵢ𝑥ᵢ = 𝑤ᵀ𝑥′
 Below: 𝑑 = 1. 𝑤ᵀ𝑥′ is graphed.
LINEAR REGRESSION
 What should we use as a loss function?
 Each 𝑦^(𝑖) is a real number
 Mean-squared error is a good choice  (sketched in code below)
 𝐽(𝑤; 𝐗, 𝐘) = (1/𝑁) Σᵢ₌₁ᴺ (𝑓(𝑥^(𝑖); 𝑤) − 𝑦^(𝑖))²
  = (1/𝑁) Σᵢ₌₁ᴺ (𝑤ᵀ𝑥^(𝑖)′ − 𝑦^(𝑖))²
  = (1/𝑁) (𝐗′ᵀ𝑤 − 𝐘)ᵀ(𝐗′ᵀ𝑤 − 𝐘)
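 Illustrative aside (assuming the NumPy layout above and scalar targets y of shape (N,)): evaluating this mean-squared-error loss.

import numpy as np

def mse_loss(w, X_prime, y):
    """J(w; X, Y) = (1/N) * sum_i (w^T x'_i - y_i)^2."""
    residuals = X_prime.T @ w - y      # shape (N,): prediction error per data point
    return np.mean(residuals ** 2)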
GRADIENT DESCENT
 How do we find 𝑤∗ = argmin_{𝑤 ∈ ℝ^(𝑑+1)} 𝐽(𝑤; 𝐗, 𝐘)?
 A function’s gradient points in the direction
of steepest ascent, and its negative in the
direction of steepest descent
 Following the gradient downhill will cause us
to converge to a local minimum!
GRADIENT DESCENT
 [Figures: gradient descent steps illustrated on contour plots of the loss function (two slides)]
GRADIENT DESCENT
 Fix some constant learning rate 𝜂 (0.03 is usually a
good place to start)
 Initialize 𝑤 randomly
 Typically select each component of 𝑤 independently
from some standard distribution (uniform, normal, etc.)
 While 𝑤 is still changing (hasn’t converged)
 Update 𝑤 ← 𝑤 − 𝜂∇𝐽(𝑤; 𝐗, 𝐘) (see the code sketch below)
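 Illustrative aside (a minimal sketch, not the course's reference code): the loop above in NumPy, assuming a user-supplied grad_J(w) that returns ∇𝐽(𝑤; 𝐗, 𝐘).

import numpy as np

def gradient_descent(grad_J, w_init, eta=0.03, tol=1e-6, max_iters=10000):
    """Step downhill until w (effectively) stops changing."""
    w = w_init.copy()
    for _ in range(max_iters):
        step = eta * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < tol:   # convergence check
            break
    return w

# Typical initialization: w_init = np.random.randn(d + 1)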
GRADIENT DESCENT
 For mean squared error loss in linear regression,
∇𝐽(𝑤; 𝐗, 𝐘) = (2/𝑁)(𝐗′𝐗′ᵀ𝑤 − 𝐗′𝐘)
 This is just linear algebra! GPUs are good at this kind of
thing 
 Why do we care?
 𝑓(𝑥; 𝑤∗) = 𝑤∗ᵀ𝑥′ is the model with the lowest possible mean-squared error on our training dataset (𝐗, 𝐘)! (A code sketch of this gradient follows below.)
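 Illustrative aside (same assumed layout as before): this gradient in NumPy, suitable for passing to the gradient_descent sketch above.

def mse_grad(w, X_prime, y):
    """grad J(w; X, Y) = (2/N) * (X' X'^T w - X' y)."""
    N = X_prime.shape[1]
    return (2.0 / N) * (X_prime @ (X_prime.T @ w) - X_prime @ y)

# Fit linear regression (hypothetical usage):
# w_star = gradient_descent(lambda w: mse_grad(w, X_prime, y),
#                           w_init=np.random.randn(X_prime.shape[0]))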
STOCHASTIC GRADIENT DESCENT
 The previous algorithm computes the gradient over the
entire data set before stepping.
 Called batch gradient descent
 What if we just picked a single data point (𝑥^(𝑖), 𝑦^(𝑖)) at random, computed the gradient for that point, and updated the parameters?
 Called stochastic gradient descent
STOCHASTIC GRADIENT DESCENT
 Advantages of SGD
 Easier to implement for large datasets
 Works better for non-convex loss functions
 Sometimes faster
 Often use SGD on a “mini-batch” of 𝑘 examples rather
than just one at a time
 Allows higher throughput and more parallelization
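 Illustrative aside (a minimal sketch under the same assumed layout): one epoch of mini-batch SGD, where grad_J_batch computes the gradient on just the sampled columns (e.g. mse_grad above).

import numpy as np

def sgd_epoch(w, X_prime, y, grad_J_batch, eta=0.03, batch_size=32):
    """One random pass over the data in mini-batches of size k = batch_size."""
    N = X_prime.shape[1]
    order = np.random.permutation(N)
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        w = w - eta * grad_J_batch(w, X_prime[:, idx], y[idx])
    return w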
BINARY LINEAR CLASSIFICATION
 𝑓(𝑥; 𝑤) = 𝕀[𝑤ᵀ𝑥′ > 0]
 Divides ℝ^𝑑 into two half-spaces
 𝑤ᵀ𝑥′ = 0 is a hyperplane
 A line in 2D, a plane in 3D, and so on
 Known as the decision boundary
 Everything on one side of the hyperplane is class 0 and everything on the other side is class 1 (see the code sketch below)
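 Illustrative aside (assumed NumPy layout): the binary linear classifier applied to every column of 𝐗′.

def predict_binary(w, X_prime):
    """f(x; w) = I[w^T x' > 0] for each data point (column of X')."""
    return (X_prime.T @ w > 0).astype(int)   # shape (N,), entries 0 or 1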
BINARY LINEAR CLASSIFICATION
 Below: 𝑑 = 2. The black line is the decision boundary 𝑤ᵀ𝑥′ = 0.
MULTI-CLASS GENERALIZATION
 We want to classify 𝑥 into one of 𝑚 classes
 For each input 𝑥, 𝑦 is a vector in ℝ^𝑚 with 𝑦ₖ = 1 if class(𝑥) = 𝑘 and 𝑦ⱼ = 0 otherwise (i.e. 𝑦ₖ = 𝕀[class(𝑥) = 𝑘])
 Known as a one-hot vector
 Our model 𝑓(𝑥; 𝐖) is parametrized by an 𝑚 × (𝑑 + 1) matrix 𝐖 = [𝑤^(1), …, 𝑤^(𝑚)]
 The model returns an 𝑚-dimensional vector (like 𝑦) with 𝑓ₖ(𝑥; 𝐖) = 𝕀[argmaxᵢ 𝑤^(𝑖)ᵀ𝑥′ = 𝑘] (sketched in code below)
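 Illustrative aside (a sketch, with 𝐖 stored as an m × (d+1) NumPy array whose rows are the 𝑤^(𝑘)): the argmax rule above, plus conversion to one-hot outputs.

import numpy as np

def predict_multiclass(W, X_prime):
    """Return, for each data point, the class k maximizing w^(k)T x'."""
    scores = W @ X_prime              # shape (m, N): scores[k, i] = w^(k)T x'_i
    return np.argmax(scores, axis=0)  # shape (N,)

def to_one_hot(classes, m):
    """Turn predicted class indices into one-hot vectors, like f(x; W)."""
    F = np.zeros((len(classes), m))
    F[np.arange(len(classes)), classes] = 1.0
    return F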
MULTI-CLASS GENERALIZATION
 𝑤^(𝑗)ᵀ𝑥′ = 𝑤^(𝑘)ᵀ𝑥′ describes the intersection of 2 hyperplanes in ℝ^(𝑑+1) (where 𝑥 ∈ ℝ^𝑑)
 Divides ℝ^𝑑 into half-spaces; 𝑤^(𝑗)ᵀ𝑥′ > 𝑤^(𝑘)ᵀ𝑥′ on one side, and vice versa on the other side
 If 𝑤^(𝑗)ᵀ𝑥′ = 𝑤^(𝑘)ᵀ𝑥′ = maxᵢ 𝑤^(𝑖)ᵀ𝑥′, this is a decision boundary!
 Illustrative figures follow
MULTI-CLASS GENERALIZATION
 Below: 𝑑 = 1, 𝑚 = 4. maxᵢ 𝑤^(𝑖)ᵀ𝑥′ is graphed.
MULTI-CLASS GENERALIZATION
 Below: 𝑑 = 2, 𝑚 = 3. Lines are the decision boundaries 𝑤^(𝑗)ᵀ𝑥 = 𝑤^(𝑘)ᵀ𝑥 = maxᵢ 𝑤^(𝑖)ᵀ𝑥.
MULTI-CLASS GENERALIZATION
 For 𝑚 = 2 (binary classification), we get the scalar version by setting 𝑤 = 𝑤^(1) − 𝑤^(0)
 𝑓₁(𝑥; 𝐖) = 𝕀[argmaxᵢ 𝑤^(𝑖)ᵀ𝑥′ = 1] = 𝕀[𝑤^(1)ᵀ𝑥′ > 𝑤^(0)ᵀ𝑥′] = 𝕀[(𝑤^(1) − 𝑤^(0))ᵀ𝑥′ > 0]
FITTING A LINEAR CLASSIFIER
 𝑓(𝑥; 𝑤) = 𝕀[𝑤ᵀ𝑥′ > 0]
 How do we turn this into something continuous and differentiable?
 We really want to replace the indicator function 𝕀 with a smooth function indicating the probability that 𝑦 is 0 or 1, based on the value of 𝑤ᵀ𝑥′
PROBABILISTIC INTERPRETATION
 Interpreting 𝑤ᵀ𝑥′:
 𝑤ᵀ𝑥′ large and positive: ℙ[𝑦 = 0] ≪ ℙ[𝑦 = 1]
 𝑤ᵀ𝑥′ large and negative: ℙ[𝑦 = 0] ≫ ℙ[𝑦 = 1]
 𝑤ᵀ𝑥′ small: ℙ[𝑦 = 0] ≈ ℙ[𝑦 = 1]
PROBABILISTIC INTERPRETATION
PROBABILISTIC INTERPRETATION
 We therefore use the probability functions
 𝑝₀(𝑥; 𝑤) = ℙ[𝑦 = 0] = 1 / (1 + exp(𝑤ᵀ𝑥′))
 𝑝₁(𝑥; 𝑤) = ℙ[𝑦 = 1] = exp(𝑤ᵀ𝑥′) / (1 + exp(𝑤ᵀ𝑥′))
 If 𝑤 = 𝑤^(1) − 𝑤^(0) as before, this is just
𝑝ₖ(𝑥; 𝑤) = ℙ[𝑦 = 𝑘] = exp(𝑤^(𝑘)ᵀ𝑥′) / (exp(𝑤^(0)ᵀ𝑥′) + exp(𝑤^(1)ᵀ𝑥′))
PROBABILISTIC INTERPRETATION
 In the more general 𝑚-class case, we have
𝑝ₖ(𝑥; 𝐖) = ℙ[𝑦ₖ = 1] = exp(𝑤^(𝑘)ᵀ𝑥′) / Σᵢ₌₁ᵐ exp(𝑤^(𝑖)ᵀ𝑥′)
 This is called the softmax activation and will be used to define our loss function (sketched in code below)
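 Illustrative aside (a sketch under the same assumed layout): the softmax activation, with the standard subtract-the-max trick for numerical stability (it does not change the result).

import numpy as np

def softmax_probs(W, X_prime):
    """p_k(x; W) = exp(w^(k)T x') / sum_i exp(w^(i)T x'), per column of X'."""
    scores = W @ X_prime                                  # shape (m, N)
    scores = scores - scores.max(axis=0, keepdims=True)   # stability only
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=0, keepdims=True)   # shape (m, N)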
THE CROSS-ENTROPY LOSS
 We want to heavily penalize cases where 𝑦ₖ = 1 but 𝑝ₖ(𝑥; 𝐖) ≪ 1
 This leads us to define the cross-entropy loss as follows:
𝐽(𝐖; 𝐗, 𝐘) = −(1/𝑁) Σᵢ₌₁ᴺ Σₖ₌₁ᵐ 𝑦ₖ^(𝑖) ln 𝑝ₖ(𝑥^(𝑖); 𝐖)
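 Illustrative aside (a sketch reusing softmax_probs above; 𝐘 holds one-hot rows as in the notation slide; the small eps guards against log(0)): evaluating the cross-entropy loss.

import numpy as np

def cross_entropy_loss(W, X_prime, Y, eps=1e-12):
    """J(W; X, Y) = -(1/N) * sum_i sum_k y_k^(i) * ln p_k(x^(i); W)."""
    P = softmax_probs(W, X_prime)      # shape (m, N)
    N = X_prime.shape[1]
    return -np.sum(Y.T * np.log(P + eps)) / N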
MINIMIZING CROSS-ENTROPY
 As with mean-squared error, the cross-entropy loss is
convex and differentiable 
 That means that we can use gradient descent to
converge to a global minimum!
 This global minimum defines the best possible linear
classifier with respect to the cross-entropy loss and the
data set given
SUMMARY
 Basic process of constructing a machine learning model
 Choose an analytically well-behaved loss function that
represents some notion of error for your task
 Use gradient descent to choose model parameters that
minimize that loss function for your data set
 Examples: linear regression and mean squared error,
linear classification and cross-entropy
NEXT TIME
 Gradient of the cross-entropy loss
 Neural networks
 Backpropagation algorithm for gradient
descent
Editor's Notes
  • #8: Matt Wilson (MIT) – things ML is not good at
  • #18: Distance per step is proportional to the size of the gradient at that step
  • #19: Note: frequency of contour lines = magnitude of gradient
  • #25: Consider mentioning Johnson-Lindenstrauss Theorem + picture (one slide) – to
  • #26: One-hot vectors: say we have 𝑚 = 4 classes. Then class(𝑥) = 1 corresponds to 𝑦 = (1, 0, 0, 0), class(𝑥) = 2 has 𝑦 = (0, 1, 0, 0), class(𝑥) = 3 has 𝑦 = (0, 0, 1, 0), and class(𝑥) = 4 has 𝑦 = (0, 0, 0, 1).
  • #34: If you don’t see why the last step is true, just multiply the numerator and denominator of each probability function by exp(𝑤^(0)ᵀ𝑥′).
  • #36: For each (𝑥^(𝑖), 𝑦^(𝑖)) pair, multiplying by 𝑦ₖ^(𝑖) in the inner sum extracts ln 𝑝ₖ(𝑥^(𝑖); 𝐖) only for the TRUE class 𝑘. Thus, we ONLY penalize cases where 𝑦ₖ = 1 and 𝑝ₖ(𝑥; 𝐖) ≪ 1, and not cases where 𝑦ⱼ = 0 and 1 − 𝑝ⱼ(𝑥; 𝐖) ≪ 1. This is still okay because 𝑝(𝑥; 𝐖) defines a probability distribution; if 1 − 𝑝ⱼ(𝑥; 𝐖) ≪ 1 for any 𝑗, then we MUST have 𝑝ᵢ(𝑥; 𝐖) ≪ 1 for all 𝑖 ≠ 𝑗, including the 𝑖 for which 𝑦ᵢ = 1. This is just an intuitive treatment of the cross-entropy! It has formal foundations in information theory, but that’s beyond the scope of this class.