Crash Course in
Natural Language Processing
Vsevolod Dyomkin
04/2016
A Bit about Me
* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer
https://vseloved.github.io
A Bit about Grammarly
The best English language writing app
Spellcheck - Grammar check - Style
improvement - Synonyms and word choice
Plagiarism check
Plan
* Overview of NLP
* Where to get Data
* Common NLP problems
and approaches
* How to develop an NLP
system
What Is NLP?
Transforming free-form text
into structured data and back
What Is NLP?
Transforming free-form text
into structured data and back
Intersection of:
* Computational Linguistics
* CompSci & AI
* Stats & Information Theory
Linguistic Basis
* Syntax (form)
* Semantics (meaning)
* Pragmatics (intent/logic)
Natural Language
* ambiguous
* noisy
* evolving
Time flies like an arrow.
Fruit flies like a banana.
I read a story about evolution in ten minutes.
I read a story about evolution in the last million years.
NLP & Data
Types of text data:
* structured
* semi-structured
* unstructured
“Data is ten times more
powerful than algorithms.”
-- Peter Norvig
The Unreasonable Effectiveness of Data.
http://youtu.be/yvDCzhbjYWs
Kinds of Data
* Dictionaries
* Databases/Ontologies
* Corpora
* User Data
Where to Get Data?
* Linguistic Data Consortium
http://www.ldc.upenn.edu/
* Common Crawl
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites &
the academic community:
Stanford, Oxford, CMU, ...
Create Your Own!
* Linguists
* Crowdsourcing
* By-product
-- Jonathan Zittrain
http://goo.gl/hs4qB
Classic NLP Problems
* Linguistically-motivated:
segmentation, tagging, parsing
* Analytical:
classification, sentiment analysis
* Transformation:
translation, correction, generation
* Conversation:
question answering, dialog
Tokenization
Example:
This is a test that isn't so simple: 1.23.
"This" "is" "a" "test" "that" "is" "n't"
"so" "simple" ":" "1.23" "."
Issues:
* Finland’s capital -
Finland Finlands Finland’s
* what’re, I’m, isn’t -
what ’re, I ’m, is n’t
* Hewlett-Packard or Hewlett Packard
* San Francisco - one token or two?
* m.p.h., PhD.
Regular Expressions
Simplest regex: [^\s]+
More advanced regex:
\w+|[!"#$%&'*+,./:;<=>?@^`~…(){}[|]⟨⟩
«»“”‘’‒–—―-]
Even more advanced regex:
[+-]?[0-9](?:[0-9,.]*[0-9])?
|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?
|["#$%&*+,/:;<=>@^`~…(){}[|]⟨⟩«»“”‘’‒–—―']
|[.!?]+
|-+
Post-processing
* concatenate abbreviations and decimals
* split contractions with regexes
2-character:
i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)
['‘’`]d$
3-character:
(?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
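A minimal tokenizer sketch in Python in the same spirit, using a simplified pattern instead of the full expressions above (the contraction rules shown are only a subset):

import re

# Simplified token pattern: numbers, words with internal apostrophes/hyphens,
# or any single non-space character.
TOKEN_RE = re.compile(r"[+-]?\d+(?:[.,]\d+)*|\w+(?:['’-]\w+)*|\S")

# A couple of contraction rules in the spirit of the regexes above.
CONTRACTIONS = [
    re.compile(r"^(\w+)(n['’]t)$", re.I),                  # isn't -> is + n't
    re.compile(r"^(\w+)(['’](?:ll|[vr]e|s|m|d))$", re.I),  # we'll -> we + 'll
]

def tokenize(text):
    tokens = []
    for tok in TOKEN_RE.findall(text):
        for pattern in CONTRACTIONS:
            m = pattern.match(tok)
            if m:
                tokens.extend([m.group(1), m.group(2)])
                break
        else:
            tokens.append(tok)
    return tokens

print(tokenize("This is a test that isn't so simple: 1.23."))
# ['This', 'is', 'a', 'test', 'that', 'is', "n't", 'so', 'simple', ':', '1.23', '.']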
Rule-based Approach
* easy to understand and
reason about
* can be arbitrarily precise
* iterative, can be used to
gather more data
Limitations:
* recall problems
* poor adaptability
Rule-based NLP tools
* SpamAssassin
* LanguageTool
* ELIZA
* GATE
Statistical Approach
“Probability theory
is nothing but
common sense
reduced to calculation.”
-- Pierre-Simon Laplace
Language Models
Question: what is the probability of a
sequence of words/sentence?
Language Models
Question: what is the probability of a
sequence of words/sentence?
Answer: Apply the chain rule
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w0 w1 w2) * …
where S = w0 w1 w2 …
Ngrams
Apply the Markov assumption: each word depends
only on the N previous words (in practice
N=1..4, which gives bigram to fivegram models,
since the current word is counted too).
If n=2:
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w1 w2) * …
According to the chain rule:
P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
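As a concrete illustration, a minimal bigram model estimated by maximum likelihood from a toy corpus (no smoothing; the corpus and function names are purely illustrative):

from collections import Counter

def train_bigram_lm(sentences):
    # Count unigrams and bigrams over sentences padded with boundary markers.
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sentence_prob(words, unigrams, bigrams):
    # P(S) = prod P(w_i | w_i-1), estimated as count(w_i-1 w_i) / count(w_i-1).
    p = 1.0
    padded = ["<s>"] + words + ["</s>"]
    for prev, cur in zip(padded, padded[1:]):
        if bigrams[(prev, cur)] == 0:
            return 0.0  # unseen bigram; a real model needs smoothing here
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

corpus = [["time", "flies", "like", "an", "arrow"],
          ["fruit", "flies", "like", "a", "banana"]]
uni, bi = train_bigram_lm(corpus)
print(sentence_prob(["time", "flies", "like", "a", "banana"], uni, bi))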
Spelling Correction
Problem: given an out-of-dictionary
word return a list of most probable
in-dictionary corrections.
http://norvig.com/spell-correct.html
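A condensed sketch along the lines of that article (assuming a word-frequency file like the big.txt it uses; only candidates within edit distance 1 are considered here):

import re
from collections import Counter

# Word counts from a large text, e.g. Norvig's big.txt.
WORDS = Counter(re.findall(r"\w+", open("big.txt").read().lower()))

def edits1(word):
    # All strings one insertion/deletion/substitution/transposition away.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    # Most frequent in-dictionary candidate, falling back to the word itself.
    candidates = [w for w in edits1(word) | {word} if w in WORDS] or [word]
    return max(candidates, key=WORDS.get)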
Edit Distance
Minimum-edit (Levenshtein) distance – the
minimum number of
insertions/deletions/substitutions needed
to transform string A into B.
Other distance metrics:
* the Damerau-Levenshtein distance adds
another operation: transposition
* the longest common subsequence (LCS)
metric allows only insertion and deletion,
not substitution
* the Hamming distance allows only
substitution, hence, it only applies to
strings of the same length
Dynamic Programming
Initialization:
D(i,0) = i
D(0,j) = j
Recurrence relation:
For each i = 1..M
For each j = 1..N
D(i,j) = D(i-1,j-1), if X(i) = Y(j)
otherwise:
min D(i-1,j) + w_del(X(i))
D(i,j-1) + w_ins(Y(j))
D(i-1,j-1) + w_subst(X(i),Y(j))
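The same recurrence in straightforward Python, with unit costs assumed:

def levenshtein(x, y):
    # d[i][j] = minimum-edit distance between x[:i] and y[:j]
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of x[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j],      # deletion
                                  d[i][j - 1],      # insertion
                                  d[i - 1][j - 1])  # substitution
    return d[m][n]

print(levenshtein("kitten", "sitting"))  # 3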
Noisy Channel Model
Given an alphabet A, let A* be the set of all finite
strings over A. Let the dictionary D of valid words be
some subset of A*.
The noisy channel is the matrix G = P(s|w) where w in D is
the intended word and s in A* is the scrambled word that
was actually received.
P(s|w) = sum(P(x(i)|y(i)))
for x(i) in s* (s aligned with w)
for y(i) in w* (w aligned with s)
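In code, decoding under this model is just picking the dictionary word that maximizes P(w) * P(s|w); a sketch, with word_prob and channel_prob assumed to be supplied (e.g. unigram counts and an error model estimated from aligned character pairs):

def best_correction(scrambled, candidates, word_prob, channel_prob):
    # Noisy-channel decoding: choose the dictionary word w that maximizes
    # P(w) * P(s|w) for the observed string s.
    return max(candidates,
               key=lambda w: word_prob(w) * channel_prob(scrambled, w))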
Machine Learning
Approach
Spam Filtering
A 2-class classification problem with a
bias towards minimizing FPs.
Default approach: rule-based (SpamAssassin)
Problems:
* scales poorly
* hard to reach arbitrary precision
* hard to rank the importance of
complex features
Bag-of-words Models
* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant
Pros:
* simple
* fast
* scalable
Limitations:
* independence assumption doesn't hold
Initial results: recall: 92%, precision: 98.84%
Improved results: recall: 99.5%, precision: 99.97%
http://www.paulgraham.com/spam.html
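As a sketch, a bag-of-words representation is nothing more than per-message word counts (the toy messages below are illustrative):

from collections import Counter

def bag_of_words(tokens):
    # Each message becomes an unordered multiset of words;
    # every distinct word is a feature, its count is the feature value.
    return Counter(tokens)

spam = bag_of_words("click here to claim your free prize now".split())
ham = bag_of_words("the meeting is moved to friday morning".split())
print(spam["free"], ham["free"])  # 1 0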
Naive Bayes
Classifier
P(Y|X) = P(Y) * P(X|Y) / P(X)
select Y = argmax P(Y|X)
Naive step (assume the features are
conditionally independent given Y):
P(Y|X) ~ P(Y) * prod(P(x|Y))
for all x in X
(P(X) is dropped because it's the
same for every Y)
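A minimal multinomial Naive Bayes over bag-of-words features, with add-one smoothing and log probabilities (toy data; the class and method names are illustrative):

import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, y in zip(docs, labels):
            self.word_counts[y].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        def log_posterior(y):
            # log P(Y) + sum log P(x|Y); P(X) is the same for every Y and is dropped
            logp = math.log(self.class_counts[y] / sum(self.class_counts.values()))
            total = sum(self.word_counts[y].values()) + len(self.vocab)
            for w in tokens:
                logp += math.log((self.word_counts[y][w] + 1) / total)  # add-one smoothing
            return logp
        return max(self.class_counts, key=log_posterior)

docs = [["free", "prize", "click"], ["meeting", "friday"], ["free", "meeting"]]
labels = ["spam", "ham", "ham"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["free", "click"]))  # spam (on this toy data)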
Dependency Parsing
nsubj(ate-2, They-1)
root(ROOT-0, ate-2)
det(pizza-4, the-3)
dobj(ate-2, pizza-4)
prep(ate-2, with-5)
pobj(with-5, anchovies-6)
https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
Shift-reduce Parsing
ML-based Parsing
The parser starts with an empty stack, and a buffer index at 0, with no
dependencies recorded. It chooses one of the valid actions, and applies it to
the state. It continues choosing actions and applying them until the stack is
empty and the buffer index is at the end of the input.
SHIFT = 0; RIGHT = 1; LEFT = 2
MOVES = [SHIFT, RIGHT, LEFT]
def parse(words, tags):
    n = len(words)
    deps = init_deps(n)
    idx = 1
    stack = [0]
    while stack or idx < n:
        features = extract_features(words, tags, idx, n, stack, deps)
        scores = score(features)
        valid_moves = get_valid_moves(idx, n, len(stack))
        # pick the highest-scoring move that is legal in the current state
        next_move = max(valid_moves, key=lambda move: scores[move])
        idx = transition(next_move, idx, stack, deps)
    return tags, deps
Averaged Perceptron
import random

def train(model, number_iter, examples):
    for i in range(number_iter):
        for features, true_tag in examples:
            guess = model.predict(features)
            if guess != true_tag:
                # promote the correct tag, demote the wrong guess
                for f in features:
                    model.weights[f][true_tag] += 1
                    model.weights[f][guess] -= 1
        # reshuffle between iterations; averaging of the weights is omitted here
        random.shuffle(examples)
Features
* Word and tag unigrams, bigrams, trigrams
* The first three words of the buffer
* The top three words of the stack
* The two leftmost children of the top of
the stack
* The two rightmost children of the top of
the stack
* The two leftmost children of the first
word in the buffer
* Distance between top of buffer and stack
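For illustration only, a tiny subset of such templates encoded as feature strings (this is not the parser's actual feature set; the signature matches the parse() sketch above):

def extract_features(words, tags, idx, n, stack, deps):
    # Each feature is a string that indexes a weight in the model.
    features = set()
    s0 = stack[-1] if stack else None   # word index at the top of the stack
    b0 = idx if idx < n else None       # word index at the front of the buffer
    if s0 is not None:
        features.add("s0_word=" + words[s0])
        features.add("s0_tag=" + tags[s0])
    if b0 is not None:
        features.add("b0_word=" + words[b0])
        features.add("b0_tag=" + tags[b0])
    if s0 is not None and b0 is not None:
        features.add("s0_tag+b0_tag=%s+%s" % (tags[s0], tags[b0]))
        features.add("dist=%d" % min(b0 - s0, 5))  # capped stack/buffer distance
    return features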
Discriminative ML
Models
Linear:
* (Averaged) Perceptron
* Maximum Entropy / LogLinear / Logistic
Regression; Conditional Random Field
* SVM
Non-linear:
* Decision Trees, Random Forests
* Other ensemble classifiers
* Neural networks
Semantics
Question: how to model relationships
between words?
Semantics
Question: how to model relationships
between words?
Answer: build a graph
Wordnet
Freebase
DBPedia
Word Similarity
Next question: now, how do we measure those
relations?
Word Similarity
Next question: now, how do we measure those
relations?
* different Wordnet similarity measures
Word Similarity
Next question: now, how do we measure those
relations?
* different Wordnet similarity measures
* PMI(x,y) = log(p(x,y) / (p(x) * p(y)))
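A toy PMI estimate from co-occurrence counts within a small window (the estimation details are illustrative):

import math
from collections import Counter

def pmi_table(sentences, window=2):
    # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for words in sentences:
        word_counts.update(words)
        total += len(words)
        for i, x in enumerate(words):
            for y in words[i + 1:i + 1 + window]:
                pair_counts[(x, y)] += 1
    total_pairs = sum(pair_counts.values())
    return {
        (x, y): math.log((c / total_pairs) /
                         ((word_counts[x] / total) * (word_counts[y] / total)))
        for (x, y), c in pair_counts.items()
    }

corpus = [["fruit", "flies", "like", "a", "banana"],
          ["time", "flies", "like", "an", "arrow"]]
print(pmi_table(corpus)[("flies", "like")])  # > 0: co-occur more often than chance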
Distributional
Semantics
Distributional hypothesis:
"You shall know a word by
the company it keeps"
--John Rupert Firth
Word representations:
* Explicit representation
Number of nonzero dimensions:
max:474234, min:3, mean:1595, median:415
* Dense representation (word2vec, GloVe)
* Hierarchical representation
(Brown clustering)
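A sketch of the explicit representation: sparse co-occurrence vectors compared with cosine similarity (window size and helper names are illustrative). Dense models like word2vec compress such vectors into a few hundred dimensions.

import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    # Each word is a sparse vector of counts of its neighbours within a window.
    vectors = defaultdict(Counter)
    for words in sentences:
        for i, w in enumerate(words):
            for c in words[max(0, i - window):i] + words[i + 1:i + 1 + window]:
                vectors[w][c] += 1
    return vectors

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0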
Steps to Develop
an NLP System
* Translate real-world requirements
into a measurable goal
* Find a suitable level and
representation
* Find initial data for experiments
* Find and utilize existing tools and
frameworks where possible
* Don't trust research results
* Set up and perform a proper
experiment (series of experiments)
Going into Prod
* NLP tasks are usually CPU-intensive
but stateless
* General-purpose NLP frameworks are
(mostly) not production-ready
* Value pre- and post-processing
* Gather user feedback
Final Words
We have discussed:
* linguistic basis of NLP
- although some people manage to do NLP
without it:
http://arxiv.org/pdf/1103.0398.pdf
* rule-based & statistical/ML approaches
* different concrete tasks
We haven't covered:
* all the different tasks, such as MT,
question answering, etc.
(but they use the same techniques)
* deep learning for NLP
* natural language understanding
(which remains an unsolved problem)