SIMS 290-2:
Applied Natural Language Processing

Barbara Rosario
October 4, 2004

1
Today
Algorithms for Classification
Binary classification

Perceptron
Winnow
Support Vector Machines (SVM)
Kernel Methods

Multi-Class classification
Decision Trees
Naïve Bayes
K nearest neighbor

2
Binary Classification: examples
Spam filtering (spam, not spam)
Customer service message classification (urgent vs.
not urgent)
Information retrieval (relevant, not relevant)
Sentiment classification (positive, negative)
Sometimes it can be convenient to treat a multi-way
problem as a binary one: one class versus all the
others, for each class (a sketch of this one-versus-rest
reduction follows this slide)

3
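As a rough illustration of the one-versus-rest reduction just mentioned (not part of the original slides), here is a minimal Python sketch; `train_binary` is a hypothetical stand-in for any of the binary learners discussed in the rest of the lecture.

```python
# Minimal one-versus-rest sketch. `train_binary(X, y)` is assumed to return a
# scoring function f(x) -> real number (any binary learner would do).

def train_one_vs_rest(X, labels, classes, train_binary):
    scorers = {}
    for c in classes:
        # Relabel: +1 for class c, -1 for every other class.
        y = [+1 if lab == c else -1 for lab in labels]
        scorers[c] = train_binary(X, y)
    return scorers

def predict_one_vs_rest(scorers, x):
    # Pick the class whose binary scorer is most confident about x.
    return max(scorers, key=lambda c: scorers[c](x))
```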
Binary Classification
Given: some data items that belong to a positive (+1)
or a negative (-1) class
Task: Train the classifier and predict the class for a
new data item
Geometrically: find a separator

4
Linear versus Non Linear
algorithms
Linearly separable data: if all the data points can
be correctly classified by a linear (hyperplanar)
decision boundary

5
Linearly separable data

Linear Decision boundary

Class1
Class2
6
Non linearly separable data

Class1
Class2
7
Non linearly separable data

Non Linear Classifier

Class1
Class2
8
Linear versus Non Linear
algorithms
Linearly or non-linearly separable data?
We can find out only empirically

Linear algorithms (algorithms that find a linear decision
boundary)
When we think the data is linearly separable
Advantages
– Simpler, less parameters

Disadvantages
– High-dimensional data (as in NLP) is usually not linearly separable

Examples: Perceptron, Winnow, SVM
Note: we can use linear algorithms also for non linear problems
(see Kernel methods)

9
Linear versus Non Linear
algorithms
Non Linear
When the data is non linearly separable
Advantages
– More accurate

Disadvantages
– More complicated, more parameters

Example: Kernel methods

Note: the distinction between linear and non linear
applies also for multi-class classification (we’ll see
this later)
10
Simple linear algorithms
Perceptron and Winnow algorithm
Linear
Binary classification
Online (process data sequentially, one data point at a
time)
Mistake driven
Simple single layer Neural Networks

11
Linear binary classification
Data: {(xi, yi)}, i = 1…n
x in R^d (x is a vector in d-dimensional space) → feature vector
y in {-1, +1} → label (class, category)

Question:
Design a linear decision boundary wx + b = 0 (the equation of a hyperplane) such
that the classification rule associated with it has minimal probability of error
classification rule :
– y = sign(w x + b) which means:
– if wx + b > 0 then y = +1
– if wx + b < 0 then y = -1

From Gert Lanckriet, Statistical Learning Theory Tutorial

12
Linear binary classification
Find a good hyperplane
(w, b) in R^(d+1)
that correctly classifies data
points as much as possible
In online fashion: one data
point at a time, update
weights as necessary

wx + b = 0
Classification Rule:
y = sign(wx + b)

From Gert Lanckriet, Statistical Learning Theory Tutorial

13
Perceptron algorithm
Initialize: w1 = 0
Updating rule, for each data point xi:
If class(xi) != decision(xi, wk)
then
  wk+1 ← wk + yi xi
  k ← k + 1
else
  wk+1 ← wk

Function decision(x, w)
If wx + b > 0 return +1
Else return -1

[Figure: a mistake on xi moves the separating hyperplane from wk·x + b = 0 to wk+1·x + b = 0; the regions on either side are labelled +1 and -1]

From Gert Lanckriet, Statistical Learning Theory Tutorial

14
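A minimal Python sketch of the mistake-driven update above, added here as an illustration rather than taken from the slides; the epoch loop and the explicit bias update are assumptions beyond what the slide shows.

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Mistake-driven perceptron: additive update w <- w + y_i * x_i on errors.

    X: (n, d) array of feature vectors; y: array of labels in {-1, +1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # decision(x, w) disagrees with class(x)
                w = w + y_i * x_i                 # additive update
                b = b + y_i
    return w, b

def perceptron_predict(w, b, x):
    return 1 if np.dot(w, x) + b > 0 else -1
```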
Perceptron algorithm
Online: can adjust to changing target, over time
Advantages
Simple and computationally efficient
Guaranteed to learn a linearly separable problem
(convergence, global optimum)

Limitations
Only linear separations
Only converges for linearly separable data
Not really “efficient with many features”

From Gert Lanckriet, Statistical Learning Theory Tutorial

15
Winnow algorithm
Another online algorithm for learning perceptron
weights:
f(x) = sign(wx + b)
Linear, binary classification
Update-rule: again error-driven, but multiplicative
(instead of additive)

From Gert Lanckriet, Statistical Learning Theory Tutorial

16
Winnow algorithm
Initialize: w1 = 0
Updating rule, for each data point xi:
If class(xi) != decision(xi, wk)
then
  wk+1 ← wk + yi xi          ← Perceptron (additive)
  wk+1 ← wk · exp(yi xi)     ← Winnow (multiplicative)
  k ← k + 1
else
  wk+1 ← wk

Function decision(x, w)
If wx + b > 0 return +1
Else return -1

[Figure: as with the perceptron, a mistake moves the hyperplane from wk·x + b = 0 to wk+1·x + b = 0]

From Gert Lanckriet, Statistical Learning Theory Tutorial

17
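For comparison, a sketch of the multiplicative (Winnow-style) update. Two assumptions go beyond the slide: the weights start at 1 rather than 0, since a multiplicative update cannot move a zero weight, and a learning rate `eta` is added.

```python
import numpy as np

def winnow_train(X, y, eta=0.5, epochs=10):
    """Multiplicative update w <- w * exp(eta * y_i * x_i) on mistakes."""
    n, d = X.shape
    w = np.ones(d)      # start at 1: a multiplicative update cannot move a zero weight
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:        # mistake
                w = w * np.exp(eta * y_i * x_i)        # multiplicative update
                b = b + eta * y_i                      # bias kept additive in this sketch
    return w, b
```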
Perceptron vs. Winnow
Assume
N available features
only K relevant features, with K << N

Perceptron: number of mistakes: O(K·N)
Winnow: number of mistakes: O(K·log N)
Winnow is more robust to high-dimensional feature spaces

From Gert Lanckriet, Statistical Learning Theory Tutorial

18
Perceptron vs. Winnow
Perceptron
Online: can adjust to changing
target, over time
Advantages

Simple and computationally
efficient
Guaranteed to learn a linearly
separable problem

Limitations

only linear separations
only converges for linearly
separable data
not really “efficient with many
features”

Winnow
Online: can adjust to changing
target, over time
Advantages

Simple and computationally
efficient
Guaranteed to learn a linearly
separable problem
Suitable for problems with
many irrelevant attributes

Limitations

only linear separations
only converges for linearly
separable data
not really “efficient with many
features”

Used in NLP

From Gert Lanckriet, Statistical Learning Theory Tutorial

19
Weka
Winnow in Weka

20
Large margin classifier
Another family of linear
algorithms
Intuition (Vapnik, 1965)
If the classes are linearly separable:
Separate the data
Place hyper-plane “far” from the
data: large margin
Statistical results guarantee
good generalization
[Figure: a separating hyperplane that passes close to the data points: BAD]

From Gert Lanckriet, Statistical Learning Theory Tutorial

21
Large margin classifier
Intuition (Vapnik, 1965) if linearly
separable:
Separate the data
Place hyperplane “far” from the
data: large margin
Statistical results guarantee
good generalization
[Figure: a separating hyperplane placed far from both classes, with a large margin: GOOD]

⇒ Maximal Margin Classifier

From Gert Lanckriet, Statistical Learning Theory Tutorial

22
Large margin classifier
If not linearly separable
Allow some errors
Still, try to place hyperplane
“far” from each class

From Gert Lanckriet, Statistical Learning Theory Tutorial

23
Large Margin Classifiers
Advantages
Theoretically better (better error bounds)

Limitations
Computationally more expensive, large quadratic
programming

24
Support Vector Machine (SVM)
Large Margin Classifier
Linearly separable case
Goal: find the hyperplane that maximizes the margin M

[Figure: separating hyperplane w^T x + b = 0 with margin boundaries w^T xa + b = 1 and w^T xb + b = -1; the training points lying on these boundaries are the support vectors]

From Gert Lanckriet, Statistical Learning Theory Tutorial

25
Support Vector Machine (SVM)
Text classification
Hand-writing recognition
Computational biology (e.g., micro-array data)
Face detection
Face expression recognition
Time series prediction

From Gert Lanckriet, Statistical Learning Theory Tutorial

26
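As a rough end-to-end illustration of a linear SVM on text (the course demos use Weka; this sketch uses scikit-learn instead, and the tiny corpus and labels are invented for the example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus and labels, invented purely for illustration.
train_docs = ["shr 34 cts vs 28 cts",
              "net profit rose this quarter",
              "tennis final goes to five sets"]
train_labels = ["earn", "earn", "sports"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # documents -> sparse tf-idf feature vectors

clf = LinearSVC()                                # large-margin linear classifier
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["quarterly cts per shr"])
print(clf.predict(X_test))                       # expected: ['earn']
```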
Non Linear problem

27
Non Linear problem

28
Non Linear problem
Kernel methods
A family of non-linear algorithms
Transform the non-linear problem into a linear one (in a
different feature space)
Use linear algorithms to solve the linear problem in
the new space

From Gert Lanckriet, Statistical Learning Theory Tutorial

29
Main intuition of Kernel methods
(Copy here from blackboard)

30
Basic principle kernel methods
Φ : R^d → R^D   (D >> d)

x = [x z]
Φ(x) = [x^2 z^2 xz]

Linear separator in the mapped space: w^T Φ(x) + b = 0
f(x) = sign(w1 x^2 + w2 z^2 + w3 xz + b)
From Gert Lanckriet, Statistical Learning Theory Tutorial

31
Basic principle kernel methods
Linear separability: more likely in high dimensions
Mapping: Φ maps the input into a high-dimensional
feature space
Classifier: construct a linear classifier in the high-dimensional feature space
Motivation: an appropriate choice of Φ leads to linear
separability
We can do this efficiently!

From Gert Lanckriet, Statistical Learning Theory Tutorial

32
Basic principle kernel methods
We can use the linear algorithms seen before
(Perceptron, SVM) for classification in the higher
dimensional space

33
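A small sketch of the explicit feature map from the earlier slide, Φ(x, z) = (x^2, z^2, xz), followed by an ordinary linear learner in the mapped space; scikit-learn's Perceptron is used for convenience and the toy points are invented for the example.

```python
import numpy as np
from sklearn.linear_model import Perceptron

def phi(p):
    """Explicit feature map from the slide: (x, z) -> (x^2, z^2, x*z)."""
    x, z = p
    return np.array([x * x, z * z, x * z])

# Points near the origin (+1) versus points far from it (-1): not linearly
# separable in 2-D, but linearly separable after the quadratic map.
points = np.array([[0.1, 0.2], [0.2, -0.1], [-0.1, -0.2],
                   [1.5, 1.0], [-1.2, 1.3], [1.1, -1.4]])
labels = np.array([+1, +1, +1, -1, -1, -1])

X_mapped = np.array([phi(p) for p in points])
clf = Perceptron().fit(X_mapped, labels)     # any linear algorithm works in the new space
print(clf.score(X_mapped, labels))           # expected: 1.0 (separable in the mapped space)
```

With a kernel, the mapping Φ never has to be computed explicitly; the learner only needs inner products in the mapped space, which is what makes this efficient.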
Multi-class classification
Given: some data items that belong to one of M
possible classes
Task: Train the classifier and predict the class for a
new data item
Geometrically: harder problem, no more simple
geometry

34
Multi-class classification

35
Multi-class classification: Examples
Author identification
Language identification
Text categorization (topics)

36
(Some) Algorithms for Multi-class
classification
Linear
Parallel class separators: Decision Trees
Non parallel class separators: Naïve Bayes

Non Linear
K-nearest neighbors

37
Linear, parallel class separators
(ex: Decision Trees)

38
Linear, NON parallel class separators
(ex: Naïve Bayes)

39
Non Linear (ex: k Nearest Neighbor)

40
Decision Trees
Decision tree is a classifier in the form of a tree
structure, where each node is either:
Leaf node - indicates the value of the target attribute (class)
of examples, or
Decision node - specifies some test to be carried out on a
single attribute-value, with one branch and sub-tree for each
possible outcome of the test.

A decision tree can be used to classify an example
by starting at the root of the tree and moving through
it until a leaf node is reached; the leaf provides the
classification of the instance.
http://dms.irb.hr/tutorial/tut_dtrees.php

41
Training Examples
Goal: learn when we can play Tennis and when we cannot
Day   Outlook   Temp.  Humidity  Wind    Play Tennis
D1    Sunny     Hot    High      Weak    No
D2    Sunny     Hot    High      Strong  No
D3    Overcast  Hot    High      Weak    Yes
D4    Rain      Mild   High      Weak    Yes
D5    Rain      Cool   Normal    Weak    Yes
D6    Rain      Cool   Normal    Strong  No
D7    Overcast  Cool   Normal    Weak    Yes
D8    Sunny     Mild   High      Weak    No
D9    Sunny     Cold   Normal    Weak    Yes
D10   Rain      Mild   Normal    Strong  Yes
D11   Sunny     Mild   Normal    Strong  Yes
D12   Overcast  Mild   High      Strong  Yes
D13   Overcast  Hot    Normal    Weak    Yes
D14   Rain      Mild   High      Strong  No
42
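A minimal sketch of fitting a tree to this table with scikit-learn (the course itself demonstrates Weka). `DictVectorizer` one-hot encodes the categorical attributes and `export_text` prints the learned tree; `get_feature_names_out` assumes a reasonably recent scikit-learn version.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# The PlayTennis table from the slide, as (attributes, label) pairs.
data = [
    ({"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Temp": "Hot",  "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temp": "Mild", "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temp": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temp": "Cool", "Humidity": "Normal", "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Temp": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Sunny",    "Temp": "Mild", "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Temp": "Cold", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temp": "Mild", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Sunny",    "Temp": "Mild", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Overcast", "Temp": "Mild", "Humidity": "High",   "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Overcast", "Temp": "Hot",  "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temp": "Mild", "Humidity": "High",   "Wind": "Strong"}, "No"),
]

vec = DictVectorizer(sparse=False)                 # one-hot encode the categorical attributes
X = vec.fit_transform([features for features, _ in data])
y = [label for _, label in data]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)   # entropy-based splits
print(export_text(tree, feature_names=vec.get_feature_names_out().tolist()))
```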
Decision Tree for PlayTennis
Outlook:
  Sunny → Humidity:
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind:
    Strong → No
    Weak → Yes

www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
43
Decision Tree for PlayTennis
[Decision tree as on the previous slide]

Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification

www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
44
Decision Tree for PlayTennis
Example query: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak
PlayTennis? No (Sunny → Humidity test → High → No)

[Decision tree as on the previous slides]

www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp

45
Decision Tree for Reuter classification

Foundations of Statistical Natural Language Processing,
Manning and Schuetze

46
Decision Tree for Reuter classification

Foundations of Statistical Natural Language Processing,
Manning and Schuetze

47
Building Decision Trees
Given training data, how do we construct them?
The central focus of the decision tree growing
algorithm is selecting which attribute to test at each
node in the tree. The goal is to select the attribute
that is most useful for classifying examples.
Top-down, greedy search through the space of
possible decision trees.
That is, it picks the best attribute and never looks back to
reconsider earlier choices.

48
Building Decision Trees
Splitting criterion
Finding the features and the values to split on
– for example, why test first “cts” and not “vs”?
– Why test on “cts < 2” and not “cts < 5” ?

Split that gives us the maximum information gain (or the
maximum reduction of uncertainty)

Stopping criterion
When all the elements at one node have the same class,
no need to split further

In practice, one first builds a large tree and then one prunes it
back (to avoid overfitting)
See Foundations of Statistical Natural Language Processing ,
Manning and Schuetze for a good introduction

49
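The splitting criterion can be made concrete with a short entropy / information-gain sketch (pure Python, written for this note rather than taken from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_c p(c) log2 p(c) over the class labels at a node."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on `attribute`.

    rows: list of dicts mapping attribute -> value; labels: parallel list of classes.
    """
    total = len(labels)
    gain = entropy(labels)
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# On the PlayTennis data, splitting on Outlook gives the highest gain,
# which is why it sits at the root of the tree shown earlier.
```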
Decision Trees: Strengths
Decision trees are able to generate understandable
rules.
Decision trees perform classification without requiring
much computation.
Decision trees are able to handle both continuous
and categorical variables.
Decision trees provide a clear indication of which
features are most important for prediction or
classification.

http://dms.irb.hr/tutorial/tut_dtrees.php

50
Decision Trees: weaknesses
Decision trees are prone to errors in classification
problems with many classes and relatively small
number of training examples.
Decision tree can be computationally expensive to
train.
Need to compare all possible splits
Pruning is also expensive

Most decision-tree algorithms only examine a single
field at a time. This leads to rectangular classification
boxes that may not correspond well with the actual
distribution of records in the decision space.

http://dms.irb.hr/tutorial/tut_dtrees.php

51
Decision Trees
Decision Trees in Weka

52
Naïve Bayes
More powerful than Decision Trees

[Figure: Decision Trees vs. Naïve Bayes]

53
Naïve Bayes Models
Graphical Models:
graph theory plus
probability theory
Nodes are variables
Edges are conditional
probabilities

[Example: a model with edges A → B and A → C, parameterized by P(A), P(B|A), P(C|A)]
54
Naïve Bayes Models
Graphical Models:
graph theory plus
probability theory
Nodes are variables
Edges are conditional
probabilities
Absence of an edge
between nodes implies
independence between
the variables of the
nodes

[Example: edges A → B and A → C but no edge between B and C, so P(C | A, B) = P(C | A)]
55
Naïve Bayes for text classification

Foundations of Statistical Natural Language Processing,
Manning and Schuetze

56
Naïve Bayes for text classification

[Example: topic node "earn" with word nodes Shr, 34, cts, vs, per, shr]
57
Naïve Bayes for text classification
[Graphical model: a Topic node with word nodes w1, w2, w3, w4, …, wn-1, wn]

The words depend on the topic: P(wi | Topic)
P(cts | earn) > P(tennis | earn)

Naïve Bayes assumption: all words are independent given the topic
From the training set we learn the probabilities P(wi | Topic) for each word
and for each topic
58
Naïve Bayes for text classification
[Graphical model: a Topic node with word nodes w1, w2, …, wn]

To classify a new example:
Calculate P(Topic | w1, w2, …, wn) for each topic
Bayes decision rule:
Choose the topic T’ for which
P(T’ | w1, w2, …, wn) > P(T | w1, w2, …, wn) for each T ≠ T’
59
Naïve Bayes: Math
Naïve Bayes defines a joint probability distribution:
P(Topic, w1, w2, …, wn) = P(Topic) ∏i P(wi | Topic)
We learn P(Topic) and P(wi | Topic) in training
Test: we need P(Topic | w1, w2, …, wn)
P(Topic | w1, w2, …, wn) = P(Topic, w1, w2, …, wn) / P(w1, w2, …, wn)

60
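A minimal sketch of this training and classification recipe in Python; the add-one (Laplace) smoothing and log-space arithmetic are additions not discussed on the slide, included so that unseen word-topic pairs do not zero out the product.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, topics):
    """Estimate P(Topic) and the word counts needed for P(w | Topic)."""
    topic_counts = Counter(topics)
    word_counts = defaultdict(Counter)            # topic -> word -> count
    vocab = set()
    for words, t in zip(docs, topics):
        word_counts[t].update(words)
        vocab.update(words)
    priors = {t: c / len(topics) for t, c in topic_counts.items()}
    return priors, word_counts, vocab

def classify(words, priors, word_counts, vocab, alpha=1.0):
    """Bayes decision rule: argmax_T log P(T) + sum_i log P(w_i | T)."""
    best, best_score = None, float("-inf")
    for t, prior in priors.items():
        total = sum(word_counts[t].values())
        score = math.log(prior)
        for w in words:
            # Add-one smoothed estimate of P(w | t).
            score += math.log((word_counts[t][w] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best, best_score = t, score
    return best

docs = [["shr", "34", "cts", "vs", "per", "shr"], ["tennis", "final", "set"]]
topics = ["earn", "sports"]
priors, counts, vocab = train_naive_bayes(docs, topics)
print(classify(["cts", "per", "shr"], priors, counts, vocab))   # -> 'earn'
```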
Naïve Bayes: Strengths
Very simple model
Easy to understand
Very easy to implement

Very efficient, fast training and classification
Modest space storage
Widely used because it works really well for text
categorization
Linear, but non parallel decision boundaries

61
Naïve Bayes: weaknesses
Naïve Bayes independence assumption has two consequences:
The linear ordering of words is ignored (bag of words model)
The words are independent of each other given the class: False
– President is more likely to occur in a context that contains election than
in a context that contains poet

Naïve Bayes assumption is inappropriate if there are strong
conditional dependencies between the variables
(But even if the model is not “right”, Naïve Bayes models do well
in a surprisingly large number of cases because often we are
interested in classification accuracy and not in accurate
probability estimations)

62
Naïve Bayes
Naïve Bayes in Weka

63
k Nearest Neighbor Classification
Nearest Neighbor classification rule: to classify a new
object, find the object in the training set that is most
similar. Then assign the category of this nearest
neighbor
K Nearest Neighbor (KNN): consult k nearest
neighbors. Decision based on the majority category
of these neighbors. More robust than k = 1
An example of a similarity measure often used in NLP is cosine
similarity

64
1-Nearest Neighbor

65
1-Nearest Neighbor

66
3-Nearest Neighbor

67
3-Nearest Neighbor
But this is closer…
We can weight neighbors
according to their similarity

Assign the category of the majority of the neighbors
68
k Nearest Neighbor Classification
Strengths
Robust
Conceptually simple
Often works well
Powerful (arbitrary decision boundaries)

Weaknesses
Performance is very dependent on the similarity measure
used (and to a lesser extent on the number of neighbors k
used)
Finding a good similarity measure can be difficult
Computationally expensive
69
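A short k-nearest-neighbour sketch using cosine similarity, the measure mentioned above; the toy term-count vectors are invented for the example.

```python
import numpy as np
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors (with a tiny epsilon to avoid 0/0)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def knn_classify(x, X_train, y_train, k=3):
    """Assign the majority class among the k most cosine-similar training points."""
    sims = [cosine(x, x_i) for x_i in X_train]
    top_k = np.argsort(sims)[-k:]                  # indices of the k nearest neighbours
    votes = Counter(y_train[i] for i in top_k)
    return votes.most_common(1)[0][0]

# Toy term-count vectors over the vocabulary [cts, shr, tennis, set].
X_train = np.array([[3, 2, 0, 0], [2, 1, 0, 0], [0, 0, 4, 1], [0, 0, 1, 2]])
y_train = ["earn", "earn", "sports", "sports"]
print(knn_classify(np.array([1, 1, 0, 0]), X_train, y_train, k=3))   # -> 'earn'
```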
Summary
Algorithms for Classification
Linear versus non linear classification
Binary classification
Perceptron
Winnow
Support Vector Machines (SVM)
Kernel Methods

Multi-Class classification
Decision Trees
Naïve Bayes
K nearest neighbor

On Wednesday: Weka

70
