Machine Learning for Language
Technology
Lecture 2: Basic Concepts
Marina Santini
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2014
Acknowledgement: Thanks to Prof. Joakim Nivre for course design and material
Outline
• Definition of Machine Learning
• Types of Machine Learning:
– Classification
– Regression
– Supervised Learning
– Unsupervised Learning
– Reinforcement Learning
• Supervised Learning:
– Supervised Classification
– Training set
– Hypothesis class
– Empirical error
– Margin
– Noise
– Inductive bias
– Generalization
– Model assessment
– Cross-Validation
– Classification in NLP
– Types of Classification
Lecture 2: Basic Concepts 2
What is Machine Learning
• Machine learning is programming computers to
optimize a performance criterion for some task
using example data or past experience
• Why learning?
– No known exact method – vision, speech recognition,
robotics, spam filters, etc.
– Exact method too expensive – statistical physics
– Task evolves over time – network routing
• Compare:
– No need to use machine learning for computing
payroll… we just need an algorithm
Lecture 2: Basic Concepts 3
Machine Learning – Data Mining –
Artificial Intelligence – Statistics
• Machine Learning: creation of a model that uses training data or
past experience
• Data Mining: application of learning methods to large datasets (ex.
physics, astronomy, biology, etc.)
– Text mining = machine learning applied to unstructured textual data
(ex. sentiment analysis, social media monitoring, etc.; see Text Mining,
Wikipedia)
• Artificial intelligence: a model that can adapt to a changing
environment.
• Statistics: Machine learning uses the theory of statistics in building
mathematical models, because the core task is making inference from a
sample.
Lecture 2: Basic Concepts 4
The bio-cognitive analogy
• Imagine a learning algorithm as a single neuron.
• This neuron receives input from other neurons, one
for each input feature.
• The strengths of these inputs are the feature values.
• Each input has a weight and the neuron simply sums
up all the weighted inputs.
• Based on this sum, the neuron decides whether to
“fire” or not. Firing is interpreted as a positive
example and not firing as a negative example
(see the minimal sketch below).
Lecture 2: Basic Concepts 5
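To make the analogy concrete, here is a minimal Python sketch of such a neuron (not part of the original slides); the feature values, weights, and threshold are invented for illustration.

```python
# A single "neuron" as a classifier: weighted sum of the inputs,
# then a threshold decision ("fire" = positive, "don't fire" = negative).

def neuron_predict(features, weights, threshold=0.0):
    """Return 1 (fire / positive example) if the weighted sum of the
    inputs exceeds the threshold, otherwise 0 (negative example)."""
    activation = sum(f * w for f, w in zip(features, weights))
    return 1 if activation > threshold else 0

# Hypothetical example: two input features with hand-picked weights.
print(neuron_predict([0.8, 0.3], weights=[1.5, -0.5]))  # -> 1 (fires)
print(neuron_predict([0.1, 0.9], weights=[1.5, -0.5]))  # -> 0 (does not fire)
```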
Elements of Machine Learning
1. Generalization:
– Generalize from specific examples
– Based on statistical inference
2. Data:
– Training data: specific examples to learn from
– Test data: (new) specific examples to assess performance
3. Models:
– Theoretical assumptions about the task/domain
– Parameters that can be inferred from data
4. Algorithms:
– Learning algorithm: infer model (parameters) from data
– Inference algorithm: infer predictions from model
Lecture 2: Basic Concepts 6
Types of Machine Learning
• Association
• Supervised Learning
– Classification
– Regression
• Unsupervised Learning
• Reinforcement Learning
Lecture 2: Basic Concepts 7
Learning Associations
• Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y,
where X and Y are products/services
Example: P(chips | beer) = 0.7
Lecture 2: Basic Concepts 8
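A small illustration of how such a conditional probability could be estimated from transaction data; the baskets below are invented, not from the slides.

```python
# Estimate P(Y | X): the share of baskets containing X that also contain Y.
baskets = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "diapers"},
    {"chips", "salsa"},
    {"beer", "chips"},
]

def conditional_prob(y, x, baskets):
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(y in b for b in with_x) / len(with_x)

print(conditional_prob("chips", "beer", baskets))  # 3 of 4 beer baskets contain chips -> 0.75
```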
Classification
Lecture 2: Basic Concepts 9
• Example: Credit
scoring
• Differentiating
between low-risk and
high-risk customers
from their income and
savings
Discriminant: IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk
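The discriminant above is just a pair of thresholds; a minimal sketch in Python, with hypothetical values for θ1 and θ2:

```python
# Discriminant from the slide: IF income > theta1 AND savings > theta2
# THEN low-risk ELSE high-risk.  Threshold values are invented.
THETA1 = 30_000   # income threshold
THETA2 = 5_000    # savings threshold

def credit_risk(income, savings):
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(credit_risk(income=45_000, savings=8_000))   # low-risk
print(credit_risk(income=45_000, savings=2_000))   # high-risk
```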
Classification in NLP
• Binary classification:
– Spam filtering (spam vs. non-spam)
– Spelling error detection (error vs. no error)
• Multiclass classification:
– Text categorization (news, economy, culture, sport, ...)
– Named entity classification (person, location,
organization, ...)
• Structured prediction:
– Part-of-speech tagging (classes = tag sequences)
– Syntactic parsing (classes = parse trees)
Lecture 2: Basic Concepts 10
Regression
• Example: price of a used car
• x: car attributes
y: price
$y = g(x \mid \theta)$, where $g(\cdot)$ is the model and $\theta$ are the parameters
Linear model: $y = wx + w_0$
Lecture 2: Basic Concepts 11
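A minimal sketch of fitting the linear model $y = wx + w_0$ by least squares with NumPy; the car-price data points are invented.

```python
import numpy as np

# Hypothetical data: x = age of car (years), y = price (in thousands).
x = np.array([1, 2, 3, 5, 8], dtype=float)
y = np.array([18.0, 15.5, 13.0, 10.0, 6.0])

# Fit y = w*x + w0 by ordinary least squares (degree-1 polynomial).
w, w0 = np.polyfit(x, y, deg=1)
print(f"model: y = {w:.2f} * x + {w0:.2f}")
print("prediction for a 4-year-old car:", w * 4 + w0)
```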
Uses of Supervised Learning
• Prediction of future cases:
– Use the rule to predict the output for future inputs
• Knowledge extraction:
– The rule is easy to understand
• Compression:
– The rule is simpler than the data it explains
• Outlier detection:
– Exceptions that are not covered by the rule, e.g., fraud
Lecture 2: Basic Concepts 12
Unsupervised Learning
• Finding regularities in data
• No mapping to outputs
• Clustering:
– Grouping similar instances
• Example applications:
– Customer segmentation in CRM
– Image compression: Color quantization
– NLP: Unsupervised text categorization
Lecture 2: Basic Concepts 13
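As an illustration of clustering (not from the slides), a minimal k-means sketch; it assumes scikit-learn is available, and the two-dimensional points are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D instances: two loose groups of points.
X = np.array([[1.0, 1.1], [0.9, 0.8], [1.2, 1.0],
              [8.0, 8.2], [7.8, 8.1], [8.3, 7.9]])

# Group similar instances into 2 clusters -- no output labels are used.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster ids are arbitrary)
```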
Reinforcement Learning
• Learning a policy = sequence of
outputs/actions
• No supervised output but delayed reward
• Example applications:
– Game playing
– Robot in a maze
– NLP: Dialogue systems
Lecture 2: Basic Concepts 14
Supervised Classification
• Learning the class C of a “family car” from
examples
– Prediction: Is car x a family car?
– Knowledge extraction: What do people expect from a
family car?
• Output (labels):
Positive (+) and negative (–) examples
• Input representation (features):
x1: price, x2 : engine power
Lecture 2: Basic Concepts 15
Training set X
$$X = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

$$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is positive} \\ 0 & \text{if } \mathbf{x} \text{ is negative} \end{cases}$$

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

Lecture 2: Basic Concepts 16
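In code, such a training set is simply a collection of (feature vector, label) pairs; a sketch with invented price and engine-power values:

```python
# Training set X = {(x^t, r^t)}: each instance is a feature vector
# x = (price, engine_power) with a label r (1 = family car, 0 = not).
X_train = [
    ((15_000, 110), 1),   # positive example
    ((14_000, 100), 1),   # positive example
    ((45_000, 300), 0),   # negative example (sports car)
    ((7_000,  60),  0),   # negative example (cheap, underpowered)
]

for (price, power), r in X_train:
    print(f"price={price}, engine power={power}, label={r}")
```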
Hypothesis class H
$$(p_1 \leq \text{price} \leq p_2) \text{ AND } (e_1 \leq \text{engine power} \leq e_2)$$
Lecture 2: Basic Concepts 17
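A hypothesis in this class is an axis-aligned rectangle in the (price, engine power) plane; a minimal sketch, with hypothetical bounds p1, p2, e1, e2:

```python
# One hypothesis h from the class H: an axis-aligned rectangle.
def h(price, engine_power, p1=10_000, p2=20_000, e1=80, e2=160):
    """Return 1 if (price, engine_power) falls inside the rectangle."""
    return 1 if (p1 <= price <= p2 and e1 <= engine_power <= e2) else 0

print(h(15_000, 110))  # 1: inside the rectangle -> predicted family car
print(h(45_000, 300))  # 0: outside -> not a family car
```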
Empirical (training) error
$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ says } \mathbf{x} \text{ is positive} \\ 0 & \text{if } h \text{ says } \mathbf{x} \text{ is negative} \end{cases}$$

Empirical error of h on X:

$$E(h \mid X) = \sum_{t=1}^{N} \mathbb{1}\left(h(\mathbf{x}^t) \neq r^t\right)$$

Lecture 2: Basic Concepts 18
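The empirical error simply counts the training instances on which h disagrees with the given label; a self-contained sketch with a toy hypothesis and invented data:

```python
# Empirical error E(h | X): the number of training instances on which
# the hypothesis h disagrees with the given label r.
def empirical_error(h, training_set):
    return sum(1 for x, r in training_set if h(x) != r)

# Toy rectangle hypothesis and data (invented values, as in the earlier sketches).
h = lambda x: 1 if (10_000 <= x[0] <= 20_000 and 80 <= x[1] <= 160) else 0
training_set = [((15_000, 110), 1), ((45_000, 300), 0), ((14_000, 100), 0)]

print(empirical_error(h, training_set))  # -> 1 (the last instance is misclassified)
```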
S, G, and the Version Space
Lecture 2: Basic Concepts 19
most specific hypothesis, S
most general hypothesis, G
h  H, between S and G is
consistent [E( h | X) = 0] and
make up the version space
Margin
• Choose h with largest margin
Lecture 2: Basic Concepts 20
Noise
Unwanted anomaly in data
• Imprecision in input attributes
• Errors in labeling data points
• Hidden attributes (relative to H)
Consequence:
• No h in H may be consistent!
Lecture 2: Basic Concepts 21
Noise and Model Complexity
Arguments for a simpler model (Occam’s razor principle):
1. Easier to make predictions
2. Easier to train (fewer parameters)
3. Easier to understand
4. Generalizes better (if data is noisy)
Lecture 2: Basic Concepts 22
Inductive Bias
• Learning is an ill-posed problem
– Training data is never sufficient to find a unique
solution
– There are always infinitely many consistent
hypotheses
• We need an inductive bias:
– Assumptions that entail a unique h for a training set X
1. Hypothesis class H – axis-aligned rectangles
2. Learning algorithm – find consistent hypothesis with max-
margin
3. Hyperparameters – trade-off between training error and
margin
Lecture 2: Basic Concepts 23
Model Selection and Generalization
• Generalization – how well a model performs
on new data
– Overfitting: H more complex than C
– Underfitting: H less complex than C
Lecture 2: Basic Concepts 24
Triple Trade-Off
• Trade-off between three factors:
1. Complexity of H, c(H)
2. Training set size N
3. Generalization error E on new data
• Dependencies:
– As N increases, E decreases
– As c(H) increases, E first decreases and then increases
Lecture 2: Basic Concepts 25
Model Selection and Generalization Error
• To estimate generalization error, we need data unseen
during training:
• Given models (hypotheses) h1, ..., hk induced from the
training set X, we can use E(hi | V ) to select the
model hi with the smallest generalization error
Lecture 2: Basic Concepts 26
$$\hat{E} = E(h \mid V) = \sum_{t=1}^{M} \mathbb{1}\left(h(\mathbf{x}^t) \neq r^t\right)$$

$$V = \{\mathbf{x}^t, r^t\}_{t=1}^{M}, \quad V \cap X = \emptyset$$
Model Assessment
• To estimate the generalization error of the best
model hi, we need data unseen during training
and model selection
• Standard setup:
1. Training set X (50–80%)
2. Validation (development) set V (10–25%)
3. Test (publication) set T (10–25%)
• Note:
– Validation data can be added to training set before
testing
– Resampling methods can be used if data is limited
Lecture 2: Basic Concepts 27
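A sketch of the standard three-way split; the proportions follow the slide (here 60/20/20), and the "data" is just a range of indices standing in for labelled instances.

```python
import random

data = list(range(100))          # stand-in for 100 labelled instances
random.seed(0)
random.shuffle(data)

n = len(data)
train = data[: int(0.6 * n)]             # training set X   (~60%)
val   = data[int(0.6 * n): int(0.8 * n)] # validation set V (~20%)
test  = data[int(0.8 * n):]              # test set T       (~20%)

print(len(train), len(val), len(test))   # 60 20 20
```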
Cross-Validation
• K-fold cross-validation: divide X into X1, ..., XK

$$\begin{aligned}
V_1 &= X_1, & T_1 &= X_2 \cup X_3 \cup \cdots \cup X_K \\
V_2 &= X_2, & T_2 &= X_1 \cup X_3 \cup \cdots \cup X_K \\
&\;\;\vdots \\
V_K &= X_K, & T_K &= X_1 \cup X_2 \cup \cdots \cup X_{K-1}
\end{aligned}$$

• Note:
– Generalization error is estimated as the mean across the K folds
– Training sets for different folds share K–2 parts
– A separate test set must be maintained for model assessment
Lecture 2: Basic Concepts 28
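A minimal sketch of building the K folds by hand (no ML library assumed); each fold serves once as validation data Vi while the remaining K−1 folds form the training data Ti.

```python
def k_fold_splits(instances, K):
    """Yield (train, validation) pairs for K-fold cross-validation."""
    folds = [instances[i::K] for i in range(K)]   # K roughly equal parts
    for i in range(K):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

data = list(range(10))
for train, val in k_fold_splits(data, K=5):
    print("val:", val, "train size:", len(train))
```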
Bootstrapping
• Generate new training sets of size N from X by random
sampling with replacement
• Use the original training set as validation set (V = X)
• Probability that we do not pick an instance after N draws:

$$\left(1 - \frac{1}{N}\right)^{N} \approx e^{-1} \approx 0.368$$

that is, only 36.8% of instances are new!
Lecture 2: Basic Concepts 29
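A quick simulation of the 36.8% figure (an added illustration, assuming nothing beyond the formula above): draw N instances with replacement and measure how many original instances were never picked.

```python
import random

random.seed(0)
N = 10_000
# Bootstrap sample: N draws with replacement from {0, ..., N-1}.
sample = [random.randrange(N) for _ in range(N)]
never_picked = N - len(set(sample))

print(never_picked / N)   # ~0.368, matching (1 - 1/N)^N ≈ e^-1
```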
Measuring Error
• Error rate = # of errors / # of instances = (FP+FN) / N
• Accuracy = # of correct / # of instances = (TP+TN) / N
• Recall = # of found positives / # of positives = TP / (TP+FN)
• Precision = # of found positives / # of found = TP / (TP+FP)
Lecture 2: Basic Concepts 30
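The four measures computed from a toy confusion matrix; the TP/FP/TN/FN counts are invented.

```python
# Toy confusion-matrix counts: true/false positives and negatives.
TP, FP, TN, FN = 40, 10, 45, 5
N = TP + FP + TN + FN

error_rate = (FP + FN) / N          # 0.15
accuracy   = (TP + TN) / N          # 0.85
recall     = TP / (TP + FN)         # 40/45 ≈ 0.889
precision  = TP / (TP + FP)         # 40/50 = 0.80

print(error_rate, accuracy, round(recall, 3), precision)
```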
Statistical Inference
• Interval estimation to quantify the precision of
our measurements
• Hypothesis testing to assess whether
differences between models are statistically
significant
Lecture 2: Basic Concepts 31
Interval estimation:

$$m \pm 1.96 \frac{\sigma}{\sqrt{N}}$$

Hypothesis testing (McNemar’s test):

$$\frac{\left(|e_{01} - e_{10}| - 1\right)^2}{e_{01} + e_{10}} \sim \chi^2_1$$
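A sketch of both ideas in code: a 95% confidence interval for a measured mean, and McNemar's test statistic for comparing two classifiers; the mean, standard deviation, and disagreement counts e01, e10 are all invented.

```python
import math

# 95% confidence interval for a mean accuracy measured over N test runs.
m, sigma, N = 0.85, 0.04, 25          # hypothetical mean, std dev, sample size
half_width = 1.96 * sigma / math.sqrt(N)
print(f"95% CI: [{m - half_width:.3f}, {m + half_width:.3f}]")

# McNemar's test: e01 = instances model 1 gets wrong but model 2 gets right,
# e10 = the reverse.  Under H0 the statistic follows a chi-square with 1 df.
e01, e10 = 30, 14                      # hypothetical disagreement counts
statistic = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
print(f"McNemar statistic: {statistic:.2f} (reject H0 at the 0.05 level if > 3.84)")
```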
Supervised Learning – Summary
• Training data + learner → hypothesis
– Learner incorporates inductive bias
• Test data + hypothesis → estimated generalization
– Test data must be unseen
Lecture 2: Basic Concepts 32
Anatomy of a Supervised Learner
(Dimensions of a supervised machine learning algorithm)
• Model: $g(\mathbf{x} \mid \theta)$
• Loss function: $E(\theta \mid X) = \sum_t L\left(r^t, g(\mathbf{x}^t \mid \theta)\right)$
• Optimization procedure: $\theta^* = \arg\min_{\theta} E(\theta \mid X)$
Lecture 2: Basic Concepts 33
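The three ingredients side by side in a tiny added sketch: a linear model g(x | θ), a squared-error loss, and a brute-force grid search over θ standing in for the optimization procedure (all data and grid values are invented; real learners optimize far more cleverly).

```python
import numpy as np

# Toy data for a 1-D regression problem.
xs = np.array([0.0, 1.0, 2.0, 3.0])
rs = np.array([1.0, 3.1, 4.9, 7.2])

def g(x, theta):                       # model g(x | theta), here theta = (w, w0)
    w, w0 = theta
    return w * x + w0

def E(theta, xs, rs):                  # loss: sum of squared errors over the data
    return float(np.sum((rs - g(xs, theta)) ** 2))

# "Optimization procedure": pick the theta with minimum loss from a coarse grid.
grid = [(w, w0) for w in np.linspace(0, 4, 41) for w0 in np.linspace(-2, 2, 41)]
theta_star = min(grid, key=lambda th: E(th, xs, rs))
print("theta* =", theta_star, "E =", round(E(theta_star, xs, rs), 3))
```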
Supervised Classification: Extension
• Divide instances into (two or more) classes
– Instance (feature vector): $\mathbf{x} = (x_1, \ldots, x_m)$
• Features may be categorical or numerical
– Class (label): $y$
– Training data: $X = \{\mathbf{x}^t, y^t\}_{t=1}^{N}$
• Classification in Language Technology
– Spam filtering (spam vs. non-spam)
– Spelling error detection (error vs. no error)
– Text categorization (news, economy, culture, sport, ...)
– Named entity classification (person, location, organization, ...)
Lec 2: Decision Trees - Nearest Neighbors 34
NLP: Classification (i)
NLP: Classification (ii)
NLP: Classification (iii)
Types of Classification (i)
Types of Classification (ii)
Reading
• Alpaydin (2010): chs 1-2; 19
• Daumé III (2012): ch. 4, only 4.5-4.6
Lecture 2: Basic Concepts 40
End of Lecture 2
Lecture 2: Basic Concepts 41