CS 388:
Natural Language Processing:
N-Gram Language Models
Raymond J. Mooney
University of Texas at Austin
Language Models
• Formal grammars (e.g., regular, context-free) give a hard
“binary” model of the legal sentences in a language.
• For NLP, a probabilistic model of a
language that gives a probability that a
string is a member of a language is more
useful.
• To specify a correct probability distribution,
the probability of all sentences in a
language must sum to 1.
Uses of Language Models
• Speech recognition
– “I ate a cherry” is a more likely sentence than “Eye eight
uh Jerry”
• OCR & Handwriting recognition
– More probable sentences are more likely correct readings.
• Machine translation
– More likely sentences are probably better translations.
• Generation
– More likely sentences are probably better NL generations.
• Context-sensitive spelling correction
– “Their are problems wit this sentence.”
Completion Prediction
• A language model also supports predicting
the completion of a sentence.
– Please turn off your cell _____
– Your program does not ______
• Predictive text input systems can guess what
you are typing and give choices on how to
complete it.
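As a minimal illustration of this idea (not from the original slides), the sketch below picks the most probable completions from a hypothetical table of next-word probabilities; a real predictive-text system would estimate such a table from a large corpus, as the following slides describe.

# Hypothetical conditional probabilities P(next word | previous word);
# a real system would estimate these from a large training corpus.
next_word_prob = {
    "cell": {"phone": 0.6, "membrane": 0.3, "wall": 0.1},
    "not": {"compile": 0.5, "work": 0.4, "terminate": 0.1},
}

def predict_completions(prev_word, table, k=2):
    """Return the k most probable next words given the previous word."""
    candidates = table.get(prev_word, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:k]

print(predict_completions("cell", next_word_prob))  # ['phone', 'membrane']
print(predict_completions("not", next_word_prob))   # ['compile', 'work']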
N-Gram Models
• Estimate probability of each word given prior context.
– P(phone | Please turn off your cell)
• Number of parameters required grows exponentially with
the number of words of prior context.
• An N-gram model uses only N−1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
• The Markov assumption is the presumption that the future
behavior of a dynamical system depends only on its recent
history. In particular, in a kth-order Markov model, the
next state depends only on the k most recent states;
therefore, an N-gram model is an (N−1)th-order Markov model.
N-Gram Model Formulas
• Word sequences: $w_1^n = w_1 \ldots w_n$
• Chain rule of probability:
  $P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
• Bigram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
• N-gram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$
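To make the bigram approximation concrete, here is a minimal Python sketch (not part of the original slides) that scores a sentence as a product of bigram conditional probabilities, accumulated in log space. The probability table reuses the bigram estimates from the textbook example a few slides below.

import math

# Bigram probabilities P(w_k | w_{k-1}) taken from the textbook example below;
# in practice they would be estimated from a training corpus (next slide).
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_prob(words, probs):
    """P(w_1..w_n) under the bigram approximation, computed in log space."""
    padded = ["<s>"] + words + ["</s>"]
    log_p = sum(math.log(probs[(prev, cur)])
                for prev, cur in zip(padded, padded[1:]))
    return math.exp(log_p)

print(sentence_prob(["i", "want", "english", "food"], bigram_prob))  # ~0.000031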
Estimating Probabilities
• N-gram conditional probabilities can be estimated
from raw text based on the relative frequency of
word sequences.
• To have a consistent probabilistic model, add a unique start
symbol (<s>) at the beginning and a unique end symbol (</s>) at
the end of every sentence and treat these as additional words.
Bigram: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1}\, w_n)}{C(w_{n-1})}$

N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
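A minimal sketch of these relative-frequency estimates for the bigram case, using a made-up two-sentence corpus: count bigrams and unigrams, then divide.

from collections import Counter

# Toy training corpus (invented for illustration); each sentence gets
# padded with the start and end symbols described above.
corpus = [
    ["i", "want", "english", "food"],
    ["i", "want", "chinese", "food"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    padded = ["<s>"] + sentence + ["</s>"]
    unigram_counts.update(padded)
    bigram_counts.update(zip(padded, padded[1:]))

def bigram_mle(prev, word):
    """Relative-frequency estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_mle("i", "want"))        # 1.0 in this toy corpus
print(bigram_mle("want", "english"))  # 0.5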
Generative Model & MLE
• An N-gram model can be seen as a probabilistic
automaton for generating sentences.
• Relative frequency estimates can be proven to be
maximum likelihood estimates (MLE), since they
maximize the probability that the model M will
generate the training corpus T.
Initialize sentence with N−1 <s> symbols
Until </s> is generated do:
Stochastically pick the next word based on the conditional
probability of each word given the previous N−1 words.
$\hat{\theta} = \operatorname*{argmax}_{\theta}\, P(T \mid M(\theta))$
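A minimal sketch of the generation loop above for a bigram model (N = 2), with made-up conditional probabilities: start from <s> and stochastically pick each next word until </s> is produced.

import random

# Hypothetical bigram conditional distributions P(next | previous).
model = {
    "<s>": {"i": 0.6, "the": 0.4},
    "i": {"want": 0.7, "am": 0.3},
    "the": {"food": 1.0},
    "want": {"food": 1.0},
    "am": {"hungry": 1.0},
    "food": {"</s>": 1.0},
    "hungry": {"</s>": 1.0},
}

def generate(model):
    """Generate one sentence by sampling from P(next | previous) until </s>."""
    prev, words = "<s>", []
    while True:
        choices = list(model[prev].keys())
        weights = list(model[prev].values())
        prev = random.choices(choices, weights=weights)[0]
        if prev == "</s>":
            return " ".join(words)
        words.append(prev)

print(generate(model))  # e.g. "i want food"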
Example from Textbook
• P(<s> i want english food </s>)
= P(i | <s>) P(want | i) P(english | want)
P(food | english) P(</s> | food)
= .25 x .33 x .0011 x .5 x .68 = .000031
• P(<s> i want chinese food </s>)
= P(i | <s>) P(want | i) P(chinese | want)
P(food | chinese) P(</s> | food)
= .25 x .33 x .0065 x .52 x .68 = .00019
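The two figures are just the chain-rule products of the bigram estimates; a quick arithmetic check:

# Multiplying out the bigram probabilities from the textbook example.
p_english = 0.25 * 0.33 * 0.0011 * 0.50 * 0.68
p_chinese = 0.25 * 0.33 * 0.0065 * 0.52 * 0.68
print(round(p_english, 6))  # 0.000031
print(round(p_chinese, 5))  # 0.00019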
Train and Test Corpora
• A language model must be trained on a large
corpus of text to estimate good parameter values.
• Model can be evaluated based on its ability to
predict a high probability for a disjoint (held-out)
test corpus (testing on the training corpus would
give an optimistically biased estimate).
• Ideally, the training (and test) corpus should be
representative of the actual application data.
• May need to adapt a general model to a small
amount of new (in-domain) data by adding the small
corpus, highly weighted, to the original training data.
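A minimal sketch of held-out evaluation under these assumptions: sum the model's log probabilities over the disjoint test sentences (in practice the model must be smoothed first, since any unseen bigram would make the probability zero).

import math

def test_log_prob(test_corpus, bigram_prob):
    """Total log probability a bigram model assigns to a held-out corpus.
    bigram_prob maps (prev, word) -> P(word | prev); unseen bigrams would
    need smoothing (here they simply raise a KeyError)."""
    total = 0.0
    for sentence in test_corpus:
        padded = ["<s>"] + sentence + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            total += math.log(bigram_prob[(prev, cur)])
    return total

# Hypothetical usage with a model trained elsewhere:
# held_out = [["i", "want", "chinese", "food"]]
# print(test_log_prob(held_out, trained_bigram_prob))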
A Problem for N-Grams:
Long Distance Dependencies
• Many times local context does not provide the
most useful predictive clues, which instead are
provided by long-distance dependencies.
– Syntactic dependencies
• “The man next to the large oak tree near the grocery store on
the corner is tall.”
• “The men next to the large oak tree near the grocery store on
the corner are tall.”
– Semantic dependencies
• “The bird next to the large oak tree near the grocery store on
the corner flies rapidly.”
• “The man next to the large oak tree near the grocery store on
the corner talks rapidly.”
• More complex models of language are needed to
handle such dependencies.
Summary
• Language models assign a probability that a
sentence is a legal string in a language.
• They are useful as a component of many NLP
systems, such as ASR, OCR, and MT.
• Simple N-gram models are easy to train from raw,
unannotated corpora and can provide useful
estimates of sentence likelihood.
• MLE gives inaccurate parameters for models
trained on sparse data.
• Smoothing techniques adjust parameter estimates
to account for unseen (but not impossible) events.