N-gram
Language
Modeling
Introduction to N-gram
Language Models
Predicting words
The water of Walden Pond is beautifully ...
*refrigerator
*that
blue
green
clear
Language Models
Systems that can predict upcoming words
• Can assign a probability to each potential next word
• Can assign a probability to a whole sentence
Why word prediction?
It's a helpful part of language tasks
• Grammar or spell checking
Their are two midterms → There are two midterms
Everything has improve → Everything has improved
• Speech recognition
Why word prediction?
It's how large language models (LLMs) work!
LLMs are trained to predict words
• Left-to-right (autoregressive) LMs learn to predict next word
LLMs generate text by predicting words
• By predicting the next word over and over again
Language Modeling (LM) more formally
Goal: compute the probability of a sentence or
sequence of words W:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4) or P(wn|w1,w2…wn-1)
An LM computes either of these:
P(W) or P(wn|w1,w2…wn-1)
How to estimate these probabilities
Could we just count and divide?
No! Too many possible sentences!
We’ll never see enough data for estimating these
P(blue | The water of Walden Pond is so beautifully)    (3.1)
One way to estimate this probability is directly from relative frequency counts: take a
very large corpus, count the number of times we see The water of Walden Pond
is so beautifully, and count the number of times this is followed by blue. This
would be answering the question "Out of the times we saw the history h, how many
times was it followed by the word w", as follows:
P(blue | The water of Walden Pond is so beautifully) =
C(The water of Walden Pond is so beautifully blue) / C(The water of Walden Pond is so beautifully)    (3.2)
If we had a large enough corpus, we could compute these two counts and estimate
the probability from Eq. 3.2. But even the entire web isn't big enough to give us
good estimates for counts of entire sentences. This is because language is creative;
new sentences are invented all the time, and we can't expect to get accurate counts
for such large objects as entire sentences. For this reason, we'll need cleverer ways
of estimating these probabilities.
How to compute P(W) or P(wn|w1, …wn-1)
How to compute the joint probability P(W):
P(The, water, of, Walden, Pond, is, so, beautifully, blue)
Intuition: let’s rely on the Chain Rule of Probability
Reminder: The Chain Rule
Recall the definition of conditional probabilities
P(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A) P(B|A)
More variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint
probability of words in sentence
P(ā€œThe water of Walden Pondā€) =
P(The) Ɨ P(water|The) Ɨ P(of|The water)
Ɨ P(Walden|The water of) Ɨ P(Pond|The water of Walden)
Applying the chain rule to words, we get
P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) ... P(wn|w1:nāˆ’1)
        = āˆ_{k=1..n} P(wk|w1:kāˆ’1)
The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words. It suggests
that we could estimate the joint probability of an entire sequence of words by
multiplying together a number of conditional probabilities. But the chain rule doesn't
really seem to help us! We don't know any way to compute the exact probability of a
word given a long sequence of preceding words.
Markov Assumption
Simplifying assumption:
Andrei Markov
The intuition of the n-gram model is that instead of computing the probability of a
word given its entire history, we can approximate the history by just the last few
words.
The bigram model, for example, approximates the probability of a word given all the
previous words P(wn|w1:nāˆ’1) by using only the conditional probability of the preceding
word P(wn|wnāˆ’1). In other words, instead of computing the probability
P(blue | The water of Walden Pond is so beautifully)
we approximate it with the probability
P(blue | beautifully)
When we use a bigram model to predict the conditional probability of the next word,
we are making the following approximation:
P(wn|w1:nāˆ’1) ā‰ˆ P(wn|wnāˆ’1)
Photo: Wikimedia Commons
Bigram Markov Assumption
Instead of:  P(wn|w1:nāˆ’1)
More generally, we approximate each component in the product for a complete word
sequence by substituting the bigram assumption into the chain rule:
P(w1:n) ā‰ˆ āˆ_{k=1..n} P(wk|wkāˆ’1)
We can generalize the bigram (which looks one word into the past) to the trigram
(which looks two words into the past) and thus to the n-gram (which looks Nāˆ’1 words
into the past). Using N for the n-gram size (N = 2 means bigrams, N = 3 means
trigrams), the general equation for this n-gram approximation to the conditional
probability of a word given its entire context is:
P(wn|w1:nāˆ’1) ā‰ˆ P(wn|wnāˆ’N+1:nāˆ’1)
Simplest case: Unigram model
To him swallowed confess hear both . Which . Of save on trail
for are ay device and rote life have
Hill he late speaks ; or ! a more to leg less first you enter
Months the my and issue of year foreign new exchange’s September
were recession exchange new endorsed a acquire to six executives
Some automatically generated sentences from two different unigram models
P(w1w2…wn) ā‰ˆ āˆ_i P(wi)
Bigram model
Why dost stand forth thy canopy, forsooth; he is this palpable hit
the King Henry. Live king. Follow.
What means, sir. I confess she? then all sorts, he is trim, captain.
Last December through the way to preserve the Hudson corporation N.
B. E. C. Taylor would seem to complete the major central planners
one point five percent of U. S. E. has already old M. X.
corporation of living
on information such as more frequently fishing to keep her
P(wi | w1w2…wiāˆ’1) ā‰ˆ P(wi | wiāˆ’1)
Some automatically generated sentences from two different bigram models
Problems with N-gram models
• N-grams can't handle long-distance dependencies:
"The soups that I made from that new cookbook I
bought yesterday were amazingly delicious."
• N-grams don't do well at modeling new sequences
with similar meanings
The solution: Large language models
• can handle much longer contexts
• because of using embedding spaces, can model
synonymy better, and generate better novel strings
Why N-gram models?
A nice clear paradigm that lets us introduce many of
the important issues for large language models
• training and test sets
• the perplexity metric
• sampling to generate sentences
• ideas like interpolation and backoff
N-gram
Language
Modeling
Introduction to N-grams
N-gram
Language
Modeling
Estimating N-gram
Probabilities
Estimating bigram probabilities
The Maximum Likelihood Estimate
To compute a particular bigram probability of a word wn given a previous word wnāˆ’1,
we compute the count of the bigram C(wnāˆ’1wn) and normalize by the sum of all the
bigrams that share the same first word wnāˆ’1:
P(wn|wnāˆ’1) = C(wnāˆ’1wn) / Ī£_w C(wnāˆ’1w)
We can simplify this equation, since the sum of all bigram counts that start with wnāˆ’1
must be equal to the unigram count for that word wnāˆ’1 (take a moment to be convinced
of this):
P(wn|wnāˆ’1) = C(wnāˆ’1wn) / C(wnāˆ’1)
This estimate always falls between 0 and 1. Let's work through an example using a
mini-corpus of three sentences.
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(wi | wiāˆ’1) = c(wiāˆ’1, wi) / c(wiāˆ’1)
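To make the estimate concrete, here is a minimal Python sketch (not from the original slides; the function and variable names are ours) that computes these MLE bigram probabilities from the three-sentence mini-corpus above:

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def p_bigram(w, prev):
    # P(w | prev) = c(prev, w) / c(prev), the maximum likelihood estimate
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_bigram("I", "<s>"))   # 2/3: two of the three sentences start with I
print(p_bigram("Sam", "am"))  # 1/2
print(p_bigram("do", "I"))    # 1/3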
More examples:
Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
tell me about chez panisse
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9332 sentences, |V| = 1446; showing counts for only 8 of the
words
Raw bigram probabilities
Normalize by unigrams:
Result:
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
P(I|<s>)
Ɨ P(want|I)
Ɨ P(english|want)
Ɨ P(food|english)
Ɨ P(</s>|food)
= .000031
What kinds of knowledge do N-grams represent?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P (i | <s>) = .25
Dealing with scale in large n-grams
LM probabilities are stored and computed in
log format, i.e. log probabilities
This avoids underflow from multiplying many
small numbers
log(p1 Ɨ p2 Ɨ p3 Ɨ p4 ) = log p1 + log p2 + log p3 + log p4
3.1.3 Dealing with scale in large n-gram models
In practice, language models can be very large, leading to practical issues.
Log probabilities: Language model probabilities are always stored and computed in log
format, i.e., as log probabilities. This is because probabilities are (by definition)
less than or equal to 1, and so the more probabilities we multiply together, the
smaller the product becomes. Multiplying enough n-grams together would result in
numerical underflow. Adding in log space is equivalent to multiplying in linear space,
so we combine log probabilities by adding them. By adding log probabilities instead of
multiplying probabilities, we get results that are not as small. We do all computation
and storage in log space, and just convert back into probabilities if we need to
report probabilities at the end by taking the exp of the logprob:
p1 Ɨ p2 Ɨ p3 Ɨ p4 = exp(log p1 + log p2 + log p3 + log p4)
If we need probabilities we can do one exp at the end
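A small illustration of why log space matters (our example, not from the slides): multiplying many small probabilities underflows to zero, while summing their logs stays easily representable.

import math

probs = [1e-5] * 100                      # 100 hypothetical n-gram probabilities

product = 1.0
for p in probs:
    product *= p
print(product)                            # 0.0 -- numerical underflow

logprob = sum(math.log(p) for p in probs)
print(logprob)                            # about -1151.3, perfectly representable
# Only convert back with exp if a raw probability really has to be reported.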
Larger n-grams
4-grams, 5-grams
Large datasets of large n-grams have been released
• N-grams from the Corpus of Contemporary American English (COCA),
1 billion words (Davies 2020)
• Google Web 5-grams (Franz and Brants 2006), 1 trillion words
Newest model: infini-grams (āˆž-grams) (Liu et al. 2024)
• No precomputing! Instead, store 5 trillion words of web text in
suffix arrays. Can compute n-gram probabilities with any n!
• Efficiency: quantize probabilities to 4-8 bits instead of 8-byte
float
N-gram LM Toolkits
SRILM
ā—¦ http://www.speech.sri.com/projects/srilm/
KenLM
ā—¦ https://kheafield.com/code/kenlm/
Language
Modeling
Evaluation and Perplexity
How to evaluate N-gram models
"Extrinsic (in-vivo) Evaluation"
To compare models A and B
1. Put each model in a real task
• Machine Translation, speech recognition, etc.
2. Run the task, get a score for A and for B
• How many words translated correctly
3. Compare accuracy for A and B
Intrinsic (in-vitro) evaluation
Extrinsic evaluation not always possible
• Expensive, time-consuming
• Doesn't always generalize to other applications
Intrinsic evaluation: perplexity
• Directly measures language model performance at
predicting words.
• Doesn't necessarily correspond with real application
performance
• But gives us a single general metric for language models
• Useful for large language models (LLMs) as well as n-grams
Training sets and test sets
We train parameters of our model on a training set.
We test the model’s performance on data we
haven’t seen.
ā—¦ A test set is an unseen dataset; different from training set.
ā—¦ Intuition: we want to measure generalization to unseen data
ā—¦ An evaluation metric (like perplexity) tells us how well
our model does on the test set.
Choosing training and test sets
• If we're building an LM for a specific task
• The test set should reflect the task language we
want to use the model for
• If we're building a general-purpose model
• We'll need lots of different kinds of training
data
• We don't want the training set or the test set to
be just from one domain or author or language.
Training on the test set
We can’t allow test sentences into the training set
• Or else the LM will assign that sentence an artificially
high probability when we see it in the test set
• And hence assign the whole test set a falsely high
probability.
• Making the LM look better than it really is
This is called ā€œTraining on the test setā€
Bad science!
Dev sets
•If we test on the test set many times we might
implicitly tune to its characteristics
•Noticing which changes make the model better.
•So we run on the test set only once, or a few times
•That means we need a third dataset:
• A development test set or, devset.
• We test our LM on the devset until the very end
• And then test our LM on the test set once
Intuition of perplexity as evaluation metric:
How good is our language model?
Intuition: A good LM prefers "real" sentences
• Assigns higher probability to "real" or "frequently
observed" sentences
• Assigns lower probability to "word salad" or
"rarely observed" sentences
Intuition of perplexity 2:
Predicting upcoming words
The Shannon Game: How well can we
predict the next word?
• Once upon a ____
• That is a picture of a ____
• For breakfast I ate my usual ____
Unigrams are terrible at this game (Why?)
time 0.9
dream 0.03
midnight 0.02
…
and 1e-100
Picture credit: Historiska bildsamlingen
https://creativecommons.org/licenses/by/2.0/
Claude Shannon
A good LM is one that assigns a higher probability
to the next word that actually occurs
Intuition of perplexity 3: The best language model
is one that best predicts the entire unseen test set
• We said: a good LM is one that assigns a higher
probability to the next word that actually occurs.
• Let's generalize to all the words!
• The best LM assigns high probability to the entire test
set.
• When comparing two LMs, A and B
• We compute PA(test set) and PB(test set)
• The better LM will give a higher probability to (=be less
surprised by) the test set than the other LM.
• Probability depends on size of test set
• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
• Perplexity is the inverse probability of the test set,
normalized by the number of words
Intuition of perplexity 4: Use perplexity instead of
raw probability
PP(W) = P(w1w2...wN)^(āˆ’1/N) = (1 / P(w1w2...wN))^(1/N)
Perplexity is the inverse probability of the test set,
normalized by the number of words
Probability range is [0,1], perplexity range is [1,āˆž]
Minimizing perplexity is the same as maximizing probability
Intuition of perplexity 5: the inverse
PP(W) = P(w1w2...wN)^(āˆ’1/N) = (1 / P(w1w2...wN))^(1/N)
Intuition of perplexity 6: N-grams
PP(W) = P(w1w2...wN)^(āˆ’1/N) = (1 / P(w1w2...wN))^(1/N)
Chain rule: PP(W) = ( āˆ_{i=1..N} 1 / P(wi|w1...wiāˆ’1) )^(1/N)
Bigrams:    PP(W) = ( āˆ_{i=1..N} 1 / P(wi|wiāˆ’1) )^(1/N)
Intuition of perplexity 7:
Weighted average branching factor
Perplexity is also the weighted average branching factor of a language.
Branching factor: number of possible next words that can follow any word
Example: Deterministic language L = {red,blue, green}
Branching factor = 3 (any word can be followed by red, blue, green)
Now assume LM A where each word follows any other word with equal probability ā…“
Given a test set T = "red red red red blue"
PerplexityA(T) = PA(red red red red blue)^(āˆ’1/5) = ((ā…“)^5)^(āˆ’1/5) = (ā…“)^(āˆ’1) = 3
But now suppose red was very likely in the training set, such that for LM B:
ā—¦ P(red) = .8   P(green) = .1   P(blue) = .1
We would expect the probability to be higher, and hence the perplexity to be smaller:
PerplexityB(T) = PB(red red red red blue)^(āˆ’1/5) = (.8 Ɨ .8 Ɨ .8 Ɨ .8 Ɨ .1)^(āˆ’1/5) = .04096^(āˆ’1/5) = .527^(āˆ’1) = 1.89
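A quick sketch (ours) that reproduces this arithmetic, treating perplexity as the inverse probability of the test sequence normalized by its length:

test = ["red", "red", "red", "red", "blue"]

def perplexity(words, prob):
    # PP(W) = P(w1...wN)^(-1/N), with P given by a per-word probability table
    p = 1.0
    for w in words:
        p *= prob[w]
    return p ** (-1 / len(words))

lm_a = {"red": 1/3, "green": 1/3, "blue": 1/3}
lm_b = {"red": 0.8, "green": 0.1, "blue": 0.1}

print(perplexity(test, lm_a))   # 3.0
print(perplexity(test, lm_b))   # about 1.89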
Holding test set constant:
Lower perplexity = better language model
Training 38 million words, test 1.5 million words, WSJ
N-gram order:  Unigram   Bigram   Trigram
Perplexity:    962       170      109
Language
Modeling
Sampling and Generalization
The Shannon (1948) Visualization Method
Sample words from an LM
Unigram:
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME
CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO
OF TO EXPERT GRAY COME TO FURNISHES THE LINE
MESSAGE HAD BE THESE.
Bigram:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER
THAT THE CHARACTER OF THIS POINT IS THEREFORE
ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO
EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
Claude Shannon
How Shannon sampled those words in 1948
"Open a book at random and select a letter at random on the page.
This letter is recorded. The book is then opened to another page
and one reads until this letter is encountered. The succeeding
letter is then recorded. Turning to another page this second letter
is searched for and the succeeding letter recorded, etc."
Sampling a word from a distribution
[Figure: the vocabulary laid out along the interval from 0 to 1, each word owning a
segment whose width is its unigram probability (the .06, of .03, a .02, to .02, in .02,
…, however .0003, …, polyphonic .0000018) and whose right edge is the cumulative
probability (.06, .09, .11, .13, .15, …, .66, …, .99). To sample, pick a random point
between 0 and 1 and choose the word whose segment covers it.]
Visualizing Bigrams the Shannon Way
Choose a random bigram (<s>, w)
according to its probability p(w|<s>)
Now choose a random bigram (w, x)
according to its probability p(x|w)
And so on until we choose </s>
Then string the words together
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
I want to eat Chinese food
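A minimal sketch of this generation loop (ours; the tiny corpus below is made up for illustration): repeatedly sample the next word from p(w | previous word) until </s> is drawn.

import random
from collections import Counter, defaultdict

corpus = [
    "<s> I want to eat Chinese food </s>",
    "<s> I want English food </s>",
    "<s> I want to eat </s>",
]

successors = defaultdict(Counter)         # successors[prev][w] = bigram count
for sentence in corpus:
    words = sentence.split()
    for prev, w in zip(words, words[1:]):
        successors[prev][w] += 1

def sample_next(prev):
    # Draw the next word with probability count(prev, w) / count(prev)
    words = list(successors[prev])
    weights = [successors[prev][w] for w in words]
    return random.choices(words, weights=weights)[0]

def generate():
    word, out = "<s>", []
    while True:
        word = sample_next(word)
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(generate())                         # e.g. "I want to eat Chinese food"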
Approximating Shakespeare
We can use the sampling method from the prior section to visualize both of
these facts! To give an intuition for the increasing power of higher-order n-grams,
Fig. 3.4 shows random sentences generated from unigram, bigram, trigram, and 4-
gram models trained on Shakespeare’s works.
1-gram: - To him swallowed confess hear both. Which. Of save on trail for are ay device and
          rote life have
        - Hill he late speaks; or! a more to leg less first you enter
2-gram: - Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live
          king. Follow.
        - What means, sir. I confess she? then all sorts, he is trim, captain.
3-gram: - Fly, and will rid me these news of price. Therefore the sadness of parting, as they say,
          'tis done.
        - This shall forbid it should be branded, if renown made it empty.
4-gram: - King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A
          great banquet serv'd in;
        - It cannot be but so.
Figure 3.4  Eight sentences randomly generated from four n-gram models computed from Shakespeare's
works. All characters were mapped to lower-case and punctuation marks were treated as words. Output
was hand-corrected for capitalization to improve readability.
Shakespeare as corpus
N=884,647 tokens, V=29,066
Shakespeare produced 300,000 bigram types out of
V² = 844 million possible bigrams.
ā—¦ So 99.96% of the possible bigrams were never seen (have
zero entries in the table)
ā—¦ That sparsity is even worse for 4-grams, explaining why
our sampling generated actual Shakespeare.
The Wall Street Journal is not Shakespeare
1-gram: Months the my and issue of year foreign new exchange's september
        were recession exchange new endorsed a acquire to six executives
2-gram: Last December through the way to preserve the Hudson corporation N.
        B. E. C. Taylor would seem to complete the major central planners one
        point five percent of U. S. E. has already old M. X. corporation of living
        on information such as more frequently fishing to keep her
3-gram: They also point to ninety nine point six billion dollars from two hundred
        four oh six three percent of the rates of interest stores as Mexico and
        Brazil on market conditions
Figure 3.5  Three sentences randomly generated from three n-gram models computed from
40 million words of the Wall Street Journal, lower-casing all characters and treating
punctuation as words.
Can you guess the author? These 3-gram sentences
are sampled from an LM trained on who?
1) They also point to ninety nine point
six billion dollars from two hundred four
oh six three percent of the rates of
interest stores as Mexico and Brazil on
market conditions
2) This shall forbid it should be branded,
if renown made it empty.
3) ā€œYou are uniformly charming!ā€ cried he,
with a smile of associating and now and
then I bowed and they perceived a chaise
and four to wish for.
Choosing training data
If task-specific, use a training corpus that has a similar
genre to your task.
• If legal or medical, need lots of special-purpose documents
Make sure to cover different kinds of dialects and
speakers/authors.
• Example: African-American Vernacular English (AAVE)
• One of many varieties that can be used by African Americans and others
• Can include the auxiliary verb finna that marks immediate future tense:
• "My phone finna die"
The perils of overfitting
N-grams only work well for word prediction if the
test corpus looks like the training corpus
• But even when we try to pick a good training
corpus, the test set will surprise us!
• We need to train robust models that generalize!
One kind of generalization: Zeros
• Things that don’t ever occur in the training set
• But occur in the test set
Zeros
Training set:
… ate lunch
… ate dinner
… ate a
… ate the
P(ā€œbreakfastā€ | ate) = 0
• Test set
… ate lunch
… ate breakfast
Zero probability bigrams
Bigrams with zero probability
ā—¦ Will hurt our performance for texts where those words
appear!
ā—¦ And mean that we will assign 0 probability to the test set!
And hence we cannot compute perplexity (can’t
divide by 0)!
N-gram
Language
Modeling
Smoothing, Interpolation,
and Backoff
The intuition of smoothing (from Dan Klein)
When we have sparse statistics:
Steal probability mass to generalize better
P(w | denied the)
3 allegations
2 reports
1 claims
1 request
7 total
P(w | denied the)
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
7 total
[Figure: bar charts of the two distributions above, over the words allegations,
reports, claims, attack, request, man, outcome, …]
Add-one estimation
Also called Laplace smoothing
Pretend we saw each word one more time than we did
Just add one to all the counts!
MLE estimate:
PMLE(wn|wnāˆ’1) = C(wnāˆ’1wn) / C(wnāˆ’1)
Add-1 estimate:
PLaplace(wn|wnāˆ’1) = (C(wnāˆ’1wn) + 1) / Ī£_w (C(wnāˆ’1w) + 1) = (C(wnāˆ’1wn) + 1) / (C(wnāˆ’1) + V)
Normal bigram probabilities are computed by normalizing each bigram count by the
unigram count of the preceding word. For add-one smoothed bigram counts, we augment
the unigram count in the denominator by V, the number of word types in the vocabulary.
(Figure 3.6 in the text shows the add-one smoothed bigram counts for eight of the words
in the Berkeley Restaurant Project corpus of 9332 sentences; previously-zero counts
become 1. Figure 3.7 shows the corresponding add-one smoothed probabilities.)
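A minimal sketch (ours) contrasting the MLE and add-one smoothed bigram estimates on the earlier mini-corpus; note how a previously-zero bigram now gets a small nonzero probability.

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

V = len(unigram_counts)                   # number of word types (12 here)

def p_mle(w, prev):
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def p_laplace(w, prev):
    # Add 1 to every bigram count; add V to the denominator to keep a distribution
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_mle("Sam", "am"), p_laplace("Sam", "am"))   # 0.5 vs 2/14 = 0.14
print(p_mle("ham", "am"), p_laplace("ham", "am"))   # 0.0 vs 1/14 = 0.07 -- no more zeros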
Berkeley Restaurant Corpus: Laplace
smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Compare with raw bigram counts
Add-1 estimation is a blunt instrument
So add-1 isn’t used for N-grams:
ā—¦ Generally we use interpolation or backoff instead
But add-1 is used to smooth other NLP models
ā—¦ For text classification
ā—¦ In domains where the number of zeros isn’t so huge.
Backoff and Interpolation
Sometimes it helps to use less context
ā—¦ Condition on less context for contexts you know less about
Backoff:
ā—¦ use trigram if you have good evidence,
ā—¦ otherwise bigram, otherwise unigram
Interpolation:
ā—¦ mix unigram, bigram, trigram
Interpolation works better
Linear Interpolation
Simple interpolation
In simple linear interpolation, we combine different order n-grams by linearly
interpolating all the models. Thus, we estimate the trigram probability P(wn|wnāˆ’2wnāˆ’1)
by mixing together the unigram, bigram, and trigram probabilities, each weighted by a Ī»:
P̂(wn|wnāˆ’2wnāˆ’1) = Ī»1 P(wn|wnāˆ’2wnāˆ’1) + Ī»2 P(wn|wnāˆ’1) + Ī»3 P(wn)
such that the Ī»s sum to 1: Ī£_i Ī»i = 1
In a slightly more sophisticated version of linear interpolation, each Ī» weight is
computed by conditioning on the context. This way, if we have particularly accurate
counts for a particular bigram, we assume that the counts of the trigrams based on
this bigram will be more trustworthy, so we can make the Ī»s for those trigrams higher
and thus give that trigram more weight in the interpolation.
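A minimal sketch (ours) of simple linear interpolation; the component probability functions and the Ī» values below are hypothetical placeholders, not estimates from any real corpus.

def interpolated_prob(w, prev2, prev1, p_uni, p_bi, p_tri, lambdas=(0.6, 0.3, 0.1)):
    # P_hat(w | prev2 prev1) = l1*P(w|prev2,prev1) + l2*P(w|prev1) + l3*P(w), with l1+l2+l3 = 1
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)

# Toy usage with hypothetical component estimates:
p_uni = lambda w: {"food": 0.01}.get(w, 0.001)
p_bi  = lambda w, p1: 0.5 if (p1, w) == ("english", "food") else 0.0
p_tri = lambda w, p2, p1: 0.7 if (p2, p1, w) == ("want", "english", "food") else 0.0

print(interpolated_prob("food", "want", "english", p_uni, p_bi, p_tri))
# 0.6*0.7 + 0.3*0.5 + 0.1*0.01 = 0.571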
How to set λs for interpolation?
Use a held-out corpus
Choose λs to maximize probability of held-out data:
ā—¦ Fix the N-gram probabilities (on the training data)
◦ Then search for λs that give largest probability to held-
out set
Training Data
Held-Out
Data
Test
Data
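A rough sketch (ours) of that procedure as a grid search over Ī» triples summing to 1, keeping whichever assigns the held-out data the highest log probability; the component n-gram models are passed in as already-trained functions.

import math
from itertools import product

def best_lambdas(heldout, p_uni, p_bi, p_tri, step=0.1):
    # Grid-search (l1, l2, l3) with l1 + l2 + l3 = 1, scoring held-out trigram
    # tuples (prev2, prev1, w) by their interpolated log probability.
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue
        l3 = max(l3, 0.0)
        ll = 0.0
        for prev2, prev1, w in heldout:
            p = l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)
            ll += math.log(p) if p > 0 else float("-inf")
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best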
Backoff
Suppose you want:
P(pancakes| delicious soufflƩ)
If the trigram probability is 0, use the bigram
P(pancakes| soufflƩ)
If the bigram probability is 0, use the unigram
P(pancakes)
Complication: need to discount the higher-order n-gram so
probabilities don't sum higher than 1 (e.g., Katz backoff)
Stupid Backoff
Backoff without discounting (not a true probability)
S(wi | wiāˆ’k+1:iāˆ’1) = count(wiāˆ’k+1:i) / count(wiāˆ’k+1:iāˆ’1)   if count(wiāˆ’k+1:i) > 0
                     0.4 Ā· S(wi | wiāˆ’k+2:iāˆ’1)              otherwise
S(wi) = count(wi) / N
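A minimal sketch (ours) of stupid backoff on a toy token stream; the 0.4 factor follows the formula above, and the resulting scores are not normalized probabilities.

from collections import Counter

corpus = "<s> i want to eat chinese food </s> <s> i want english food </s>".split()
N = len(corpus)

counts = Counter()                        # unigram, bigram, and trigram counts
for n in (1, 2, 3):
    counts.update(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))

def stupid_backoff(word, context, alpha=0.4):
    # S(word | context); context is a tuple of up to two preceding words
    if not context:
        return counts[(word,)] / N
    if counts[context + (word,)] > 0:
        return counts[context + (word,)] / counts[context]
    return alpha * stupid_backoff(word, context[1:], alpha)

print(stupid_backoff("food", ("want", "english")))   # trigram seen: 1/1 = 1.0
print(stupid_backoff("food", ("want", "chinese")))   # unseen trigram: 0.4 * S(food | chinese) = 0.4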
N-gram
Language
Modeling
Interpolation and Backoff