N-gram
Language
Modeling
Introduction to N-gram
Language Models
Predicting words
The water of Walden Pond is beautifully ...
*refrigerator
*that
blue
green
clear
Language Models
Systems that can predict upcoming words
• Can assign a probability to each potential next word
• Can assign a probability to a whole sentence
Why word prediction?
It's a helpful part of language tasks
• Grammar or spell checking
Their are two midterms → There are two midterms
Everything has improve → Everything has improved
• Speech recognition
Why word prediction?
It's how large language models (LLMs) work!
LLMs are trained to predict words
• Left-to-right (autoregressive) LMs learn to predict next word
LLMs generate text by predicting words
• By predicting the next word over and over again
Language Modeling (LM) more formally
Goal: compute the probability of a sentence or
sequence of words W:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4) or P(wn|w1,w2…wn-1)
An LM computes either of these:
P(W) or P(wn|w1,w2…wn-1)
How to estimate these probabilities
Could we just count and divide?
No! Too many possible sentences!
We’ll never see enough data for estimating these
P(blue | The water of Walden Pond is so beautifully)    (3.1)
One way to estimate this probability is directly from relative frequency counts: take a
very large corpus, count the number of times we see The water of Walden Pond
is so beautifully, and count the number of times this is followed by blue. This
would be answering the question "Out of the times we saw the history h, how many
times was it followed by the word w", as follows:
P(blue | The water of Walden Pond is so beautifully) =
C(The water of Walden Pond is so beautifully blue) / C(The water of Walden Pond is so beautifully)    (3.2)
If we had a large enough corpus, we could compute these two counts and estimate
the probability from Eq. 3.2. But even the entire web isn't big enough to give us
good estimates for counts of entire sentences. This is because language is creative;
new sentences are invented all the time, and we can't expect to get accurate counts
for such large objects as entire sentences. For this reason, we'll need cleverer ways
of estimating these probabilities.
How to compute P(W) or P(wn|w1, …wn-1)
How to compute the joint probability P(W):
P(The, water, of, Walden, Pond, is, so, beautifully, blue)
Intuition: let’s rely on the Chain Rule of Probability
Reminder: The Chain Rule
Recall the definition of conditional probabilities
P(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A) P(B|A)
More variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint
probability of words in sentence
P(ā€œThe water of Walden Pondā€) =
P(The) Ɨ P(water|The) Ɨ P(of|The water)
Ɨ P(Walden|The water of) Ɨ P(Pond|The water of Walden)
Applying the chain rule to words, we get
P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) ... P(wn|w1:nāˆ’1)
        = āˆ_{k=1..n} P(wk|w1:kāˆ’1)
The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words. It suggests
that we could estimate the joint probability of an entire sequence of words by
multiplying together a number of conditional probabilities. But the chain rule doesn't
really seem to help us! We don't know any way to compute the exact probability of a
word given a long sequence of preceding words.
Markov Assumption
Simplifying assumption:
Andrei Markov
The intuition of the n-gram model is that instead of computing the probability of a
word given its entire history, we can approximate the history by just the last few
words.
The bigram model, for example, approximates the probability of a word given all the
previous words P(wn|w1:nāˆ’1) by using only the conditional probability of the preceding
word P(wn|wnāˆ’1). In other words, instead of computing the probability
P(blue | The water of Walden Pond is so beautifully)
we approximate it with the probability
P(blue | beautifully)
When we use a bigram model to predict the conditional probability of the next word,
we are making the following approximation:
P(wn|w1:nāˆ’1) ā‰ˆ P(wn|wnāˆ’1)
Photo: Wikimedia Commons
Bigram Markov Assumption
Instead of:  P(wn|w1:nāˆ’1)
More generally, we approximate each component in the product for a complete word
sequence by substituting the bigram assumption into the chain rule:
P(w1:n) ā‰ˆ āˆ_{k=1..n} P(wk|wkāˆ’1)
We can generalize the bigram (which looks one word into the past) to the trigram
(which looks two words into the past) and thus to the n-gram (which looks Nāˆ’1 words
into the past). Using N for the n-gram size (N = 2 means bigrams, N = 3 means
trigrams), the general equation for this n-gram approximation to the conditional
probability of a word given its entire context is:
P(wn|w1:nāˆ’1) ā‰ˆ P(wn|wnāˆ’N+1:nāˆ’1)
Simplest case: Unigram model
To him swallowed confess hear both . Which . Of save on trail
for are ay device and rote life have
Hill he late speaks ; or ! a more to leg less first you enter
Months the my and issue of year foreign new exchange’s September
were recession exchange new endorsed a acquire to six executives
Some automatically generated sentences from two different unigram models
P(w1w2…wn) ā‰ˆ āˆ_i P(wi)
Bigram model
Why dost stand forth thy canopy, forsooth; he is this palpable hit
the King Henry. Live king. Follow.
What means, sir. I confess she? then all sorts, he is trim, captain.
Last December through the way to preserve the Hudson corporation N.
B. E. C. Taylor would seem to complete the major central planners
one point five percent of U. S. E. has already old M. X.
corporation of living
on information such as more frequently fishing to keep her
P(wi | w1w2…wiāˆ’1) ā‰ˆ P(wi | wiāˆ’1)
Some automatically generated sentences from two different bigram models
Problems with N-gram models
• N-grams can't handle long-distance dependencies:
"The soups that I made from that new cookbook I
bought yesterday were amazingly delicious."
• N-grams don't do well at modeling new sequences
with similar meanings
The solution: Large language models
• can handle much longer contexts
• because of using embedding spaces, can model
synonymy better, and generate better novel strings
Why N-gram models?
A nice clear paradigm that lets us introduce many of
the important issues for large language models
• training and test sets
• the perplexity metric
• sampling to generate sentences
• ideas like interpolation and backoff
N-gram
Language
Modeling
Introduction to N-grams
N-gram
Language
Modeling
Estimating N-gram
Probabilities
Estimating bigram probabilities
The Maximum Likelihood Estimate
To compute a particular bigram probability of a word wn given a previous word wnāˆ’1,
we compute the count of the bigram C(wnāˆ’1wn) and normalize by the sum of all the
bigrams that share the same first word wnāˆ’1:
P(wn|wnāˆ’1) = C(wnāˆ’1wn) / Ī£_w C(wnāˆ’1w)
We can simplify this equation, since the sum of all bigram counts that start with wnāˆ’1
must be equal to the unigram count for that word wnāˆ’1 (take a moment to be convinced
of this):
P(wn|wnāˆ’1) = C(wnāˆ’1wn) / C(wnāˆ’1)
This estimate always falls between 0 and 1. Let's work through an example using a
mini-corpus of three sentences.
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(wi | wiāˆ’1) = c(wiāˆ’1, wi) / c(wiāˆ’1)
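To make the estimate concrete, here is a minimal Python sketch (not from the original slides; the function and variable names are ours) that computes these MLE bigram probabilities from the three-sentence mini-corpus above:

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def p_bigram(w, prev):
    # P(w | prev) = c(prev, w) / c(prev), the maximum likelihood estimate
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_bigram("I", "<s>"))   # 2/3: two of the three sentences start with I
print(p_bigram("Sam", "am"))  # 1/2
print(p_bigram("do", "I"))    # 1/3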
More examples:
Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
tell me about chez panisse
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9332 sentences, |V| = 1446; showing counts for only 8 of the
words
Raw bigram probabilities
Normalize by unigrams:
Result:
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
P(I|<s>)
Ɨ P(want|I)
Ɨ P(english|want)
Ɨ P(food|english)
Ɨ P(</s>|food)
= .000031
What kinds of knowledge do N-grams represent?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P (i | <s>) = .25
Dealing with scale in large n-grams
LM probabilities are stored and computed in
log format, i.e. log probabilities
This avoids underflow from multiplying many
small numbers
log(p1 Ɨ p2 Ɨ p3 Ɨ p4 ) = log p1 + log p2 + log p3 + log p4
3.1.3 Dealing with scale in large n-gram models
In practice, language models can be very large, leading to practical issues.
Log probabilities: Language model probabilities are always stored and computed in log
format, i.e., as log probabilities. This is because probabilities are (by definition)
less than or equal to 1, and so the more probabilities we multiply together, the
smaller the product becomes. Multiplying enough n-grams together would result in
numerical underflow. Adding in log space is equivalent to multiplying in linear space,
so we combine log probabilities by adding them. By adding log probabilities instead of
multiplying probabilities, we get results that are not as small. We do all computation
and storage in log space, and just convert back into probabilities if we need to
report probabilities at the end by taking the exp of the logprob:
p1 Ɨ p2 Ɨ p3 Ɨ p4 = exp(log p1 + log p2 + log p3 + log p4)
If we need probabilities we can do one exp at the end
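A small illustration of why log space matters (our example, not from the slides): multiplying many small probabilities underflows to zero, while summing their logs stays easily representable.

import math

probs = [1e-5] * 100                      # 100 hypothetical n-gram probabilities

product = 1.0
for p in probs:
    product *= p
print(product)                            # 0.0 -- numerical underflow

logprob = sum(math.log(p) for p in probs)
print(logprob)                            # about -1151.3, perfectly representable
# Only convert back with exp if a raw probability really has to be reported.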
Larger n-grams
4-grams, 5-grams
Large datasets of large n-grams have been released
• N-grams from the Corpus of Contemporary American English (COCA),
1 billion words (Davies 2020)
• Google Web 5-grams (Franz and Brants 2006), 1 trillion words
Newest model: infini-grams (āˆž-grams) (Liu et al. 2024)
• No precomputing! Instead, store 5 trillion words of web text in
suffix arrays. Can compute n-gram probabilities with any n!
• Efficiency: quantize probabilities to 4-8 bits instead of 8-byte
float
N-gram LM Toolkits
SRILM
ā—¦ http://www.speech.sri.com/projects/srilm/
KenLM
ā—¦ https://kheafield.com/code/kenlm/
Language
Modeling
Evaluation and Perplexity
How to evaluate N-gram models
"Extrinsic (in-vivo) Evaluation"
To compare models A and B
1. Put each model in a real task
• Machine Translation, speech recognition, etc.
2. Run the task, get a score for A and for B
• How many words translated correctly
3. Compare accuracy for A and B
Intrinsic (in-vitro) evaluation
Extrinsic evaluation not always possible
• Expensive, time-consuming
• Doesn't always generalize to other applications
Intrinsic evaluation: perplexity
• Directly measures language model performance at
predicting words.
• Doesn't necessarily correspond with real application
performance
• But gives us a single general metric for language models
• Useful for large language models (LLMs) as well as n-grams
Training sets and test sets
We train parameters of our model on a training set.
We test the model’s performance on data we
haven’t seen.
ā—¦ A test set is an unseen dataset; different from training set.
ā—¦ Intuition: we want to measure generalization to unseen data
ā—¦ An evaluation metric (like perplexity) tells us how well
our model does on the test set.
Choosing training and test sets
• If we're building an LM for a specific task
• The test set should reflect the task language we
want to use the model for
• If we're building a general-purpose model
• We'll need lots of different kinds of training
data
• We don't want the training set or the test set to
be just from one domain or author or language.
Training on the test set
We can’t allow test sentences into the training set
• Or else the LM will assign that sentence an artificially
high probability when we see it in the test set
• And hence assign the whole test set a falsely high
probability.
• Making the LM look better than it really is
This is called ā€œTraining on the test setā€
Bad science!
Dev sets
•If we test on the test set many times we might
implicitly tune to its characteristics
•Noticing which changes make the model better.
•So we run on the test set only once, or a few times
•That means we need a third dataset:
• A development test set or, devset.
• We test our LM on the devset until the very end
• And then test our LM on the test set once
Intuition of perplexity as evaluation metric:
How good is our language model?
Intuition: A good LM prefers "real" sentences
• Assigns higher probability to "real" or "frequently
observed" sentences
• Assigns lower probability to "word salad" or
"rarely observed" sentences
Intuition of perplexity 2:
Predicting upcoming words
The Shannon Game: How well can we
predict the next word?
• Once upon a ____
• That is a picture of a ____
• For breakfast I ate my usual ____
Unigrams are terrible at this game (Why?)
time 0.9
dream 0.03
midnight 0.02
…
and 1e-100
Picture credit: Historiska bildsamlingen
https://creativecommons.org/licenses/by/2.0/
Claude Shannon
A good LM is one that assigns a higher probability
to the next word that actually occurs
Intuition of perplexity 3: The best language model
is one that best predicts the entire unseen test set
• We said: a good LM is one that assigns a higher
probability to the next word that actually occurs.
• Let's generalize to all the words!
• The best LM assigns high probability to the entire test
set.
• When comparing two LMs, A and B
• We compute PA(test set) and PB(test set)
• The better LM will give a higher probability to (=be less
surprised by) the test set than the other LM.
• Probability depends on size of test set
• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
• Perplexity is the inverse probability of the test set,
normalized by the number of words
Intuition of perplexity 4: Use perplexity instead of
raw probability
PP(W) = P(w1w2...wN)^(āˆ’1/N) = (1 / P(w1w2...wN))^(1/N)
Perplexity is the inverse probability of the test set,
normalized by the number of words
Probability range is [0,1], perplexity range is [1,āˆž]
Minimizing perplexity is the same as maximizing probability
Intuition of perplexity 5: the inverse
PP(W) = P(w1w2...wN)^(āˆ’1/N) = (1 / P(w1w2...wN))^(1/N)
Intuition of perplexity 6: N-grams
PP(W) = P(w1w2...wN)^(āˆ’1/N) = (1 / P(w1w2...wN))^(1/N)
Chain rule: PP(W) = ( āˆ_{i=1..N} 1 / P(wi|w1...wiāˆ’1) )^(1/N)
Bigrams:    PP(W) = ( āˆ_{i=1..N} 1 / P(wi|wiāˆ’1) )^(1/N)
Intuition of perplexity 7:
Weighted average branching factor
Perplexity is also the weighted average branching factor of a language.
Branching factor: number of possible next words that can follow any word
Example: Deterministic language L = {red,blue, green}
Branching factor = 3 (any word can be followed by red, blue, green)
Now assume LM A where each word follows any other word with equal probability ā…“
Given a test set T = "red red red red blue"
PerplexityA(T) = PA(red red red red blue)^(āˆ’1/5) = ((ā…“)^5)^(āˆ’1/5) = (ā…“)^(āˆ’1) = 3
But now suppose red was very likely in the training set, such that for LM B:
ā—¦ P(red) = .8   P(green) = .1   P(blue) = .1
We would expect the probability to be higher, and hence the perplexity to be smaller:
PerplexityB(T) = PB(red red red red blue)^(āˆ’1/5) = (.8 Ɨ .8 Ɨ .8 Ɨ .8 Ɨ .1)^(āˆ’1/5) = .04096^(āˆ’1/5) = .527^(āˆ’1) = 1.89
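A quick sketch (ours) that reproduces this arithmetic, treating perplexity as the inverse probability of the test sequence normalized by its length:

test = ["red", "red", "red", "red", "blue"]

def perplexity(words, prob):
    # PP(W) = P(w1...wN)^(-1/N), with P given by a per-word probability table
    p = 1.0
    for w in words:
        p *= prob[w]
    return p ** (-1 / len(words))

lm_a = {"red": 1/3, "green": 1/3, "blue": 1/3}
lm_b = {"red": 0.8, "green": 0.1, "blue": 0.1}

print(perplexity(test, lm_a))   # 3.0
print(perplexity(test, lm_b))   # about 1.89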
Holding test set constant:
Lower perplexity = better language model
Training 38 million words, test 1.5 million words, WSJ
N-gram order:  Unigram   Bigram   Trigram
Perplexity:    962       170      109
Language
Modeling
Sampling and Generalization
The Shannon (1948) Visualization Method
Sample words from an LM
Unigram:
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME
CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO
OF TO EXPERT GRAY COME TO FURNISHES THE LINE
MESSAGE HAD BE THESE.
Bigram:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER
THAT THE CHARACTER OF THIS POINT IS THEREFORE
ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO
EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
Claude Shannon
How Shannon sampled those words in 1948
"Open a book at random and select a letter at random on the page.
This letter is recorded. The book is then opened to another page
and one reads until this letter is encountered. The succeeding
letter is then recorded. Turning to another page this second letter
is searched for and the succeeding letter recorded, etc."
Sampling a word from a distribution
[Figure: the vocabulary laid out along the interval from 0 to 1, each word owning a
segment whose width is its unigram probability (the .06, of .03, a .02, to .02, in .02,
…, however .0003, …, polyphonic .0000018) and whose right edge is the cumulative
probability (.06, .09, .11, .13, .15, …, .66, …, .99). To sample, pick a random point
between 0 and 1 and choose the word whose segment covers it.]
Visualizing Bigrams the Shannon Way
Choose a random bigram (<s>, w)
according to its probability p(w|<s>)
Now choose a random bigram (w, x)
according to its probability p(x|w)
And so on until we choose </s>
Then string the words together
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
I want to eat Chinese food
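A minimal sketch of this generation loop (ours; the tiny corpus below is made up for illustration): repeatedly sample the next word from p(w | previous word) until </s> is drawn.

import random
from collections import Counter, defaultdict

corpus = [
    "<s> I want to eat Chinese food </s>",
    "<s> I want English food </s>",
    "<s> I want to eat </s>",
]

successors = defaultdict(Counter)         # successors[prev][w] = bigram count
for sentence in corpus:
    words = sentence.split()
    for prev, w in zip(words, words[1:]):
        successors[prev][w] += 1

def sample_next(prev):
    # Draw the next word with probability count(prev, w) / count(prev)
    words = list(successors[prev])
    weights = [successors[prev][w] for w in words]
    return random.choices(words, weights=weights)[0]

def generate():
    word, out = "<s>", []
    while True:
        word = sample_next(word)
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(generate())                         # e.g. "I want to eat Chinese food"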
Approximating Shakespeare
We can use the sampling method from the prior section to visualize both of
these facts! To give an intuition for the increasing power of higher-order n-grams,
Fig. 3.4 shows random sentences generated from unigram, bigram, trigram, and 4-
gram models trained on Shakespeare’s works.
1-gram: - To him swallowed confess hear both. Which. Of save on trail for are ay device and
          rote life have
        - Hill he late speaks; or! a more to leg less first you enter
2-gram: - Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live
          king. Follow.
        - What means, sir. I confess she? then all sorts, he is trim, captain.
3-gram: - Fly, and will rid me these news of price. Therefore the sadness of parting, as they say,
          'tis done.
        - This shall forbid it should be branded, if renown made it empty.
4-gram: - King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A
          great banquet serv'd in;
        - It cannot be but so.
Figure 3.4  Eight sentences randomly generated from four n-gram models computed from Shakespeare's
works. All characters were mapped to lower-case and punctuation marks were treated as words. Output
was hand-corrected for capitalization to improve readability.
Shakespeare as corpus
N=884,647 tokens, V=29,066
Shakespeare produced 300,000 bigram types out of
V² = 844 million possible bigrams.
ā—¦ So 99.96% of the possible bigrams were never seen (have
zero entries in the table)
ā—¦ That sparsity is even worse for 4-grams, explaining why
our sampling generated actual Shakespeare.
The Wall Street Journal is not Shakespeare
1-gram: Months the my and issue of year foreign new exchange's september
        were recession exchange new endorsed a acquire to six executives
2-gram: Last December through the way to preserve the Hudson corporation N.
        B. E. C. Taylor would seem to complete the major central planners one
        point five percent of U. S. E. has already old M. X. corporation of living
        on information such as more frequently fishing to keep her
3-gram: They also point to ninety nine point six billion dollars from two hundred
        four oh six three percent of the rates of interest stores as Mexico and
        Brazil on market conditions
Figure 3.5  Three sentences randomly generated from three n-gram models computed from
40 million words of the Wall Street Journal, lower-casing all characters and treating
punctuation as words.
Can you guess the author? These 3-gram sentences
are sampled from an LM trained on who?
1) They also point to ninety nine point
six billion dollars from two hundred four
oh six three percent of the rates of
interest stores as Mexico and Brazil on
market conditions
2) This shall forbid it should be branded,
if renown made it empty.
3) ā€œYou are uniformly charming!ā€ cried he,
with a smile of associating and now and
then I bowed and they perceived a chaise
and four to wish for.
Choosing training data
If task-specific, use a training corpus that has a similar
genre to your task.
• If legal or medical, need lots of special-purpose documents
Make sure to cover different kinds of dialects and
speakers/authors.
• Example: African-American Vernacular English (AAVE)
• One of many varieties that can be used by African Americans and others
• Can include the auxiliary verb finna that marks immediate future tense:
• "My phone finna die"
The perils of overfitting
N-grams only work well for word prediction if the
test corpus looks like the training corpus
• But even when we try to pick a good training
corpus, the test set will surprise us!
• We need to train robust models that generalize!
One kind of generalization: Zeros
• Things that don’t ever occur in the training set
• But occur in the test set
Zeros
Training set:
… ate lunch
… ate dinner
… ate a
… ate the
P(ā€œbreakfastā€ | ate) = 0
• Test set
… ate lunch
… ate breakfast
Zero probability bigrams
Bigrams with zero probability
ā—¦ Will hurt our performance for texts where those words
appear!
ā—¦ And mean that we will assign 0 probability to the test set!
And hence we cannot compute perplexity (can’t
divide by 0)!
N-gram
Language
Modeling
Smoothing, Interpolation,
and Backoff
The intuition of smoothing (from Dan Klein)
When we have sparse statistics:
Steal probability mass to generalize better
P(w | denied the)
3 allegations
2 reports
1 claims
1 request
7 total
P(w | denied the)
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
7 total
[Figure: bar charts of the two distributions above, over the words allegations,
reports, claims, attack, request, man, outcome, …]
Add-one estimation
Also called Laplace smoothing
Pretend we saw each word one more time than we did
Just add one to all the counts!
MLE estimate:
PMLE(wn|wnāˆ’1) = C(wnāˆ’1wn) / C(wnāˆ’1)
Add-1 estimate:
PLaplace(wn|wnāˆ’1) = (C(wnāˆ’1wn) + 1) / Ī£_w (C(wnāˆ’1w) + 1) = (C(wnāˆ’1wn) + 1) / (C(wnāˆ’1) + V)
Normal bigram probabilities are computed by normalizing each bigram count by the
unigram count of the preceding word. For add-one smoothed bigram counts, we augment
the unigram count in the denominator by V, the number of word types in the vocabulary.
(Figure 3.6 in the text shows the add-one smoothed bigram counts for eight of the words
in the Berkeley Restaurant Project corpus of 9332 sentences; previously-zero counts
become 1. Figure 3.7 shows the corresponding add-one smoothed probabilities.)
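A minimal sketch (ours) contrasting the MLE and add-one smoothed bigram estimates on the earlier mini-corpus; note how a previously-zero bigram now gets a small nonzero probability.

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

V = len(unigram_counts)                   # number of word types (12 here)

def p_mle(w, prev):
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def p_laplace(w, prev):
    # Add 1 to every bigram count; add V to the denominator to keep a distribution
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_mle("Sam", "am"), p_laplace("Sam", "am"))   # 0.5 vs 2/14 = 0.14
print(p_mle("ham", "am"), p_laplace("ham", "am"))   # 0.0 vs 1/14 = 0.07 -- no more zeros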
Berkeley Restaurant Corpus: Laplace
smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Compare with raw bigram counts
Add-1 estimation is a blunt instrument
So add-1 isn’t used for N-grams:
ā—¦ Generally we use interpolation or backoff instead
But add-1 is used to smooth other NLP models
ā—¦ For text classification
ā—¦ In domains where the number of zeros isn’t so huge.
Backoff and Interpolation
Sometimes it helps to use less context
ā—¦ Condition on less context for contexts you know less about
Backoff:
ā—¦ use trigram if you have good evidence,
ā—¦ otherwise bigram, otherwise unigram
Interpolation:
ā—¦ mix unigram, bigram, trigram
Interpolation works better
Linear Interpolation
Simple interpolation
In simple linear interpolation, we combine different order n-grams by linearly
interpolating all the models. Thus, we estimate the trigram probability P(wn|wnāˆ’2wnāˆ’1)
by mixing together the unigram, bigram, and trigram probabilities, each weighted by a Ī»:
P̂(wn|wnāˆ’2wnāˆ’1) = Ī»1 P(wn|wnāˆ’2wnāˆ’1) + Ī»2 P(wn|wnāˆ’1) + Ī»3 P(wn)
such that the Ī»s sum to 1: Ī£_i Ī»i = 1
In a slightly more sophisticated version of linear interpolation, each Ī» weight is
computed by conditioning on the context. This way, if we have particularly accurate
counts for a particular bigram, we assume that the counts of the trigrams based on
this bigram will be more trustworthy, so we can make the Ī»s for those trigrams higher
and thus give that trigram more weight in the interpolation.
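A minimal sketch (ours) of simple linear interpolation; the component probability functions and the Ī» values below are hypothetical placeholders, not estimates from any real corpus.

def interpolated_prob(w, prev2, prev1, p_uni, p_bi, p_tri, lambdas=(0.6, 0.3, 0.1)):
    # P_hat(w | prev2 prev1) = l1*P(w|prev2,prev1) + l2*P(w|prev1) + l3*P(w), with l1+l2+l3 = 1
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)

# Toy usage with hypothetical component estimates:
p_uni = lambda w: {"food": 0.01}.get(w, 0.001)
p_bi  = lambda w, p1: 0.5 if (p1, w) == ("english", "food") else 0.0
p_tri = lambda w, p2, p1: 0.7 if (p2, p1, w) == ("want", "english", "food") else 0.0

print(interpolated_prob("food", "want", "english", p_uni, p_bi, p_tri))
# 0.6*0.7 + 0.3*0.5 + 0.1*0.01 = 0.571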
How to set λs for interpolation?
Use a held-out corpus
Choose λs to maximize probability of held-out data:
ā—¦ Fix the N-gram probabilities (on the training data)
◦ Then search for λs that give largest probability to held-
out set
Training Data
Held-Out
Data
Test
Data
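A rough sketch (ours) of that procedure as a grid search over Ī» triples summing to 1, keeping whichever assigns the held-out data the highest log probability; the component n-gram models are passed in as already-trained functions.

import math
from itertools import product

def best_lambdas(heldout, p_uni, p_bi, p_tri, step=0.1):
    # Grid-search (l1, l2, l3) with l1 + l2 + l3 = 1, scoring held-out trigram
    # tuples (prev2, prev1, w) by their interpolated log probability.
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue
        l3 = max(l3, 0.0)
        ll = 0.0
        for prev2, prev1, w in heldout:
            p = l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)
            ll += math.log(p) if p > 0 else float("-inf")
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best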
Backoff
Suppose you want:
P(pancakes| delicious soufflƩ)
If the trigram probability is 0, use the bigram
P(pancakes| soufflƩ)
If the bigram probability is 0, use the unigram
P(pancakes)
Complication: need to discount the higher-order n-gram so
probabilities don't sum higher than 1 (e.g., Katz backoff)
Stupid Backoff
Backoff without discounting (not a true probability)
S(wi | wiāˆ’k+1:iāˆ’1) = count(wiāˆ’k+1:i) / count(wiāˆ’k+1:iāˆ’1)   if count(wiāˆ’k+1:i) > 0
                     0.4 Ā· S(wi | wiāˆ’k+2:iāˆ’1)              otherwise
S(wi) = count(wi) / N
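A minimal sketch (ours) of stupid backoff on a toy token stream; the 0.4 factor follows the formula above, and the resulting scores are not normalized probabilities.

from collections import Counter

corpus = "<s> i want to eat chinese food </s> <s> i want english food </s>".split()
N = len(corpus)

counts = Counter()                        # unigram, bigram, and trigram counts
for n in (1, 2, 3):
    counts.update(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))

def stupid_backoff(word, context, alpha=0.4):
    # S(word | context); context is a tuple of up to two preceding words
    if not context:
        return counts[(word,)] / N
    if counts[context + (word,)] > 0:
        return counts[context + (word,)] / counts[context]
    return alpha * stupid_backoff(word, context[1:], alpha)

print(stupid_backoff("food", ("want", "english")))   # trigram seen: 1/1 = 1.0
print(stupid_backoff("food", ("want", "chinese")))   # unseen trigram: 0.4 * S(food | chinese) = 0.4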
N-gram
Language
Modeling
Interpolation and Backoff