Natural Language Processing Specialization
Formula Sheet
Fady Morris Ebeid (2020)

Chapter 1
Classification and Vector Spaces

1 Logistic Regression

corpus: A language resource consisting of a large and structured set of texts.

1.1 Notation

V : Vocabulary size, the number of unique words in the entire set of sentences.
θ: Parameter vector, θ = [θ_0, θ_1, ..., θ_n].
m: Number of examples (sentences).
P(class): Probability that a sentence is in a given class, class ∈ {pos, neg}.
freq(w_i, class): Frequency of a word w_i in a specific class.

1.2 Preprocessing

1. Eliminate handles and URLs.
2. Tokenize the string w = [w_1, w_2, ..., w_n].
3. Remove stop words (and, is, are, at, has, for, a, ...) and punctuation (, . : ! " ').
4. Stemming: Convert every word to its stem (use the Porter Stemmer [Por80]).
5. Convert words to lowercase.

1.3 Feature Extraction with Frequencies

X^(m): Features vector of a sentence m. It is a row vector:

    X^{(m)} = \Big[ \underbrace{1}_{\text{bias}},\ \sum_w \mathrm{freq}(w, \mathrm{pos}),\ \sum_w \mathrm{freq}(w, \mathrm{neg}) \Big]

Then all the examples m can be represented as the matrix X:

    X = \begin{bmatrix} 1 & X_1^{(1)} & X_2^{(1)} \\ 1 & X_1^{(2)} & X_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & X_1^{(m)} & X_2^{(m)} \end{bmatrix}    (1.1)

1.4 Logistic Regression: Regression and Sigmoid

(Figure 1.1: Training Logistic Regression)

The logits z^(i) for an example i can be calculated as:

    z^{(i)} = \theta^T x^{(i)} = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n    (1.2)

The hypothesis function h (sigmoid function σ):

    h(x^{(i)}, \theta) = h(z^{(i)}) = \sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}    (1.3)

Note: All the h values are between 0 and 1.

1.5 Cost Function

The loss function for a single training example is:

    L(\theta) = -\left[ y^{(i)} \log h(z^{(i)}) + (1 - y^{(i)}) \log\big(1 - h(z^{(i)})\big) \right]

The cost function used for logistic regression is the average of the log loss across all training examples:

    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(z^{(i)}) + (1 - y^{(i)}) \log\big(1 - h(z^{(i)})\big) \right]    (1.4)

Where:
• m is the number of training examples.
• y^(i) is the actual label of the ith training example.
• h(z^(i)) is the model prediction for the ith training example.
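Below is a minimal NumPy sketch of equations (1.3) and (1.4). It assumes the feature matrix X already carries the bias column of equation (1.1); the function and variable names are illustrative rather than taken from the sheet.

    import numpy as np

    def sigmoid(z):
        # Equation (1.3): elementwise 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def cost(theta, X, y):
        # Equation (1.4): average log loss over the m training examples
        h = sigmoid(X @ theta)                       # h(z) for every example
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

    # Toy features [bias, sum of positive freqs, sum of negative freqs] for 2 sentences
    X = np.array([[1.0, 8.0, 1.0],
                  [1.0, 2.0, 9.0]])
    y = np.array([1.0, 0.0])
    print(cost(np.zeros(3), X, y))                   # log(2) = 0.693... for all-zero weights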
1.6 Gradient Descent

The gradient of the cost function J with respect to one of the weights θ_j is:

    \nabla_{\theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \big( h(z^{(i)}) - y^{(i)} \big) x_j^{(i)}    (1.5)

To update the weight θ_j using gradient descent:

    \theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)    (1.6)

Where α is the learning rate, a value that controls how big a single update will be.

1.7 Vectorized Implementation

Putting all the examples in a matrix X (Equation 1.1), the previous equations become:

    z = X\theta    (1.2)

    h(X, \theta) = h(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (1.3)

    J(\theta) = -\frac{1}{m} \left[ y^T \cdot \log(h(z)) + (1 - y)^T \cdot \log(1 - h(z)) \right]    (1.4)

    \nabla_\theta J(\theta) = \frac{1}{m} X^T \cdot (h(z) - y)    (1.5)

    \theta := \theta - \alpha \nabla_\theta J(\theta)    (1.6)

1.8 Testing Logistic Regression

m^(val): Total number of examples (sentences) in the validation set.
y_i^(val): Ground truth label for an example i ∈ {1, ..., m^(val)} in the validation set; 1 for positive sentiment, 0 for negative sentiment.
ŷ_i^(val): Predicted label (sentiment) for the ith example in the validation set.

1. Perform testing on unseen validation data X^(val), y^(val) using the trained weights θ.
2. Calculate h(X^(val), θ) = h(z).
3. Predict ŷ_i^(val) for each example as follows:

    \hat{y}_i^{(val)} = \begin{cases} 1, & \text{if } h(z)_i \geq 0.5 \\ 0, & \text{otherwise} \end{cases}

4. Calculate the accuracy score over all examples in the validation set:

    \mathrm{accuracy} = \frac{1}{m^{(val)}} \sum_{i=1}^{m^{(val)}} \big( \hat{y}_i^{(val)} == y_i^{(val)} \big)
                      = 1 - \underbrace{\frac{1}{m^{(val)}} \sum_{i=1}^{m^{(val)}} \big| \hat{y}_i^{(val)} - y_i^{(val)} \big|}_{\text{error}}
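A hedged NumPy sketch of the vectorized updates of Section 1.7 and the evaluation of Section 1.8; the learning rate, iteration count, and toy data are arbitrary choices, not values from the sheet.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic_regression(X, y, alpha=0.1, num_iters=1000):
        # Gradient descent on the vectorized forms of equations (1.5) and (1.6)
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(num_iters):
            h = sigmoid(X @ theta)                  # h(z) = sigma(X theta)
            grad = (X.T @ (h - y)) / m              # (1/m) X^T (h(z) - y)
            theta -= alpha * grad                   # theta := theta - alpha * grad
        return theta

    def accuracy(theta, X_val, y_val):
        # Section 1.8: threshold h(z) at 0.5, then compare with the labels
        y_hat = (sigmoid(X_val @ theta) >= 0.5).astype(float)
        return np.mean(y_hat == y_val)

    # Toy data: [bias, positive-frequency sum, negative-frequency sum]
    X = np.array([[1.0, 8.0, 1.0], [1.0, 7.0, 2.0], [1.0, 1.0, 6.0], [1.0, 2.0, 9.0]])
    y = np.array([1.0, 1.0, 0.0, 0.0])
    theta = train_logistic_regression(X, y)
    print(accuracy(theta, X, y))   # 1.0 on this separable toy set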
2 Naïve Bayes

2.1 Conditional Probability and Bayes Rule

Conditional probability:

    P(A|B) = \frac{P(A \cap B)}{P(B)}    (1.7)

Bayes rule:

    P(A|B) = \frac{P(B|A) P(A)}{P(B)}    (1.8)

2.2 Naïve Bayes Assumptions

• Independence of events, P(A ∩ B) = P(A)P(B). It assumes that the words in a piece of text are independent of one another, which is not true in reality, but it works well.
• Relative frequency in corpus: it relies on the distribution of the training data sets. A good data set would contain the same proportion of positive and negative tweets as a random sample would. However, most available annotated corpora are artificially balanced; in reality positive sentences occur more frequently than negative ones.

2.3 Notation

class ∈ {pos, neg}.
w: A unique word in the vocabulary.
ratio(w_i): Ratio of the probability of the word w_i being positive to being negative.
N_class: The total number of words in a class.
N: Total number of words in the corpus.

2.4 Naïve Bayes Introduction

    N_{class} = \sum_{i=1}^{V} \mathrm{freq}(w_i, class)    (1.9)

    P(class) = \frac{N_{class}}{N}    (1.10)

    N = N_{pos} + N_{neg}

    P(neg) = 1 - P(pos)

    P(w|class) = \frac{\mathrm{freq}(w, class)}{N_{class}} \approx \frac{\mathrm{freq}(w, class) + 1}{N_{class} + V} \quad \text{(Laplacian smoothing)}    (1.11)

    \sum_{i=1}^{V} P(w_i|class) = 1

The Naive Bayes inference condition rule for binary classification (of a sentence):

    \prod_{i=1}^{n} \frac{P(w_i|pos)}{P(w_i|neg)}

Where n is the number of words in a sentence.

Likelihood:

    \mathrm{ratio}(w) = \frac{P(w|pos)}{P(w|neg)} \stackrel{(1.11)}{\approx} \frac{P(w|pos) + 1}{P(w|neg) + 1} \quad \text{(Laplacian smoothing)}    (1.12)

    ratio(w) in 0 to 1: negative sentiment.
    ratio(w) = 1: neutral sentiment.
    ratio(w) in 1 to ∞: positive sentiment.

    P(class|w_i) = \frac{P(class) P(w_i|class)}{P(w_i)}    (1.13)

    \frac{P(pos|w_i)}{P(neg|w_i)} \stackrel{(1.13)}{=} \frac{P(pos) P(w_i|pos)}{P(neg) P(w_i|neg)}    (1.14)

    \frac{P(pos|sentence)}{P(neg|sentence)} \stackrel{(1.14)}{=} \frac{P(pos)}{P(neg)} \prod_{i=1}^{n} \frac{P(w_i|pos)}{P(w_i|neg)}    (1.15)

    = \frac{P(pos)}{P(neg)} \prod_{i=1}^{n} \mathrm{ratio}(w_i)

    \stackrel{(1.12)}{\approx} \underbrace{\frac{P(pos)}{P(neg)}}_{\text{prior}} \ \underbrace{\prod_{i=1}^{n} \frac{P(w_i|pos) + 1}{P(w_i|neg) + 1}}_{\text{likelihood (ratio)}}    (1.16)

Where n is the number of words in a sentence.

Log Likelihood Score

Carrying out the repeated multiplications in (1.16) can result in numerical underflow. This problem is solved by taking the log of both sides of the equation to calculate the log likelihood score of a sentence:

    \log \frac{P(pos|sentence)}{P(neg|sentence)} \stackrel{(1.16)}{=} \log \left[ \frac{P(pos)}{P(neg)} \prod_{i=1}^{n} \mathrm{ratio}(w_i) \right]

    = \log \frac{P(pos)}{P(neg)} + \sum_{i=1}^{n} \log(\mathrm{ratio}(w_i))

    = \underbrace{\log \frac{P(pos)}{P(neg)}}_{\text{logprior}} + \underbrace{\sum_{i=1}^{n} \log \frac{P(w_i|pos) + 1}{P(w_i|neg) + 1}}_{\text{log likelihood}}

    = \underbrace{\log \frac{P(pos)}{P(neg)}}_{\text{logprior}} + \underbrace{\sum_{i=1}^{n} \lambda(w_i)}_{\text{log likelihood}}    (1.17)

Where

    \lambda(w_i) = \log(\mathrm{ratio}(w_i)) \stackrel{(1.12)}{=} \log \frac{P(w_i|pos) + 1}{P(w_i|neg) + 1}    (1.18)

    λ(w_i) < 0: negative word.
    λ(w_i) = 0: neutral word.
    λ(w_i) > 0: positive word.

If the log likelihood score is > 0, the sentence is positive. If it is < 0, the sentence is negative.
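As a sketch of how equations (1.11) and (1.18) combine in practice, the Python snippet below smooths the two conditional probabilities first and then takes their log ratio. The names freqs, n_pos, and n_neg (a (word, class) count dictionary and the class word totals) are assumed inputs, not names defined in the sheet.

    import math

    def lambda_score(word, freqs, vocab_size, n_pos, n_neg):
        # Smoothed conditional probabilities, equation (1.11)
        p_w_pos = (freqs.get((word, "pos"), 0) + 1) / (n_pos + vocab_size)
        p_w_neg = (freqs.get((word, "neg"), 0) + 1) / (n_neg + vocab_size)
        # Word-level log ratio, in the spirit of equation (1.18)
        return math.log(p_w_pos / p_w_neg)

    # Toy counts: "happy" shows up mostly in positive sentences
    freqs = {("happy", "pos"): 20, ("happy", "neg"): 2}
    print(lambda_score("happy", freqs, vocab_size=100, n_pos=200, n_neg=200))  # > 0: a positive word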
2.5 Training Naïve Bayes

1. Collect and annotate the corpus. Preprocess the text:
   • Lowercase.
   • Remove punctuation, URLs, names.
   • Remove stop words.
   • Stemming [Por80].
   • Tokenize sentences w = [w_1, w_2, ..., w_n].
2. Word count:
   (a) Compute freq(w, class) for every word in the vocabulary.
   (b) Compute N_class [equation 1.9].
3. Compute the conditional probabilities P(w|pos), P(w|neg) [equation 1.11].
4. Calculate the lambda score λ(w) for each word [equation 1.18].
5. Get the logprior:

    \log \frac{P(pos)}{P(neg)} \stackrel{(1.10)}{=} \log \frac{N_{pos}}{N_{neg}}

   If you are working with a balanced dataset (N_pos = N_neg), then logprior = 0.

2.6 Testing Naïve Bayes

m^(val): Total number of examples (sentences) in the validation set.
y_i^(val): Ground truth label for an example i ∈ {1, ..., m^(val)} in the validation set; 1 for positive sentiment, 0 for negative sentiment.
ŷ_i^(val): Predicted label (sentiment) for the ith example in the validation set.

1. Perform testing on unseen validation data X^(val), y^(val).
2. First, calculate the log likelihood score for each sentence in the examples [equation 1.17].
3. Predict ŷ_i^(val) for each example as follows:

    \hat{y}_i^{(val)} = \begin{cases} 1, & \text{if log likelihood score} > 0 \\ 0, & \text{otherwise} \end{cases}

4. Calculate the accuracy score over all examples in the validation set:

    \mathrm{accuracy} = \frac{1}{m^{(val)}} \sum_{i=1}^{m^{(val)}} \big( \hat{y}_i^{(val)} == y_i^{(val)} \big)
                      = 1 - \underbrace{\frac{1}{m^{(val)}} \sum_{i=1}^{m^{(val)}} \big| \hat{y}_i^{(val)} - y_i^{(val)} \big|}_{\text{error}}

A word not in the corpus is treated as neutral (λ(w) = 0).
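A hedged Python sketch of the scoring and testing steps of Sections 2.5 and 2.6, assuming logprior and a lambdas dictionary have already been computed (for instance as in the previous sketch); as noted above, words outside the vocabulary contribute 0.

    def naive_bayes_score(sentence_words, logprior, lambdas):
        # Equation (1.17): logprior plus the sum of lambda(w) for the words we know
        return logprior + sum(lambdas.get(w, 0.0) for w in sentence_words)

    def naive_bayes_accuracy(sentences, labels, logprior, lambdas):
        # Section 2.6: predict 1 when the log likelihood score is positive
        correct = 0
        for words, label in zip(sentences, labels):
            y_hat = 1 if naive_bayes_score(words, logprior, lambdas) > 0 else 0
            correct += (y_hat == label)
        return correct / len(labels)

    # Toy model: balanced corpus (logprior = 0) and two scored words
    lambdas = {"happy": 1.2, "sad": -1.5}
    sentences = [["happy", "today"], ["sad", "news"]]
    print(naive_bayes_accuracy(sentences, [1, 0], 0.0, lambdas))  # 1.0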
3 Vector Space Models

• Represent words and documents as vectors.
• Representation that captures relative meaning.

3.1 Word by Word and Word by Doc.

Word by Word Design (W/W)

Counts the co-occurrence of two different words, which is the number of times they occur together within a certain distance k. With the word by word design you get a representation matrix with n × n entries, where n equals the vocabulary size V.

Word by Document Design (W/D)

Counts the number of times a word occurs within a certain category. Represented by a matrix with n × c entries, where c is the number of categories.

3.2 Euclidean Distance

The Euclidean distance between two n-dimensional vectors:

    d(\vec{v}, \vec{w}) = d(\vec{w}, \vec{v}) = \|\vec{v} - \vec{w}\|
    = \sqrt{(v_1 - w_1)^2 + (v_2 - w_2)^2 + \dots + (v_n - w_n)^2}
    = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2}

Where
• n is the number of elements in the vector.
• The more similar the words, the more likely the Euclidean distance will be close to 0.

3.3 Cosine Similarity

The main advantage of this metric over the Euclidean distance is that it isn't biased by the size difference between the representations.

Vector norm:

    \|\vec{v}\| = \sqrt{\sum_{i=1}^{n} v_i^2}

Dot product:

    \vec{v} \cdot \vec{w} = \sum_{i=1}^{n} v_i \cdot w_i

Cosine similarity:

    \cos(\theta) = \frac{\vec{v} \cdot \vec{w}}{\|\vec{v}\| \, \|\vec{w}\|}

Cosine similarity gives values between −1 and 1:

    cos(θ) = 1: parallel and in the same direction.
    cos(θ) = 0: orthogonal (perpendicular).
    cos(θ) = −1: point exactly in opposite directions.

• Numbers in the range [0, 1] indicate a similarity score.
• Numbers in the range [−1, 0] indicate a dissimilarity score.
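A small NumPy sketch of the two metrics above; the toy vectors are chosen so that the two measures disagree in the way the text describes (far apart in distance, yet perfectly aligned in direction).

    import numpy as np

    def euclidean_distance(v, w):
        # Section 3.2: the norm of the difference vector
        return np.linalg.norm(v - w)

    def cosine_similarity(v, w):
        # Section 3.3: dot product divided by the product of the norms
        return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    v = np.array([1.0, 2.0, 3.0])
    w = np.array([2.0, 4.0, 6.0])
    print(euclidean_distance(v, w))   # about 3.74: not close by distance
    print(cosine_similarity(v, w))    # 1.0: parallel, same direction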
3.4 Manipulating Words in Vector Spaces

Reference: [Mik+13]

3.5 Visualization and PCA

PCA is used to visualize the embeddings on a k-dimensional subspace of the original n-dimensional space of the word embeddings.

Eigenvector: Uncorrelated features for your data.
Eigenvalue: The amount of information retained by each feature.

Perform PCA on a data matrix X = [x_1 | x_2 | ... | x_n]^T ∈ R^{m×n}, where m is the number of examples and n is the dimension (length) of a word embedding.

Steps of PCA:

1. Mean-normalize the data and obtain the normalized data matrix X̄:

    \mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma = \sqrt{\frac{1}{m} \sum_{i=1}^{m} x_i^2 - \mu^2}

    \bar{x}_i = \frac{x_i - \mu}{\sigma}, \qquad \bar{X} = [\bar{x}_1 | \bar{x}_2 | \dots | \bar{x}_n]^T

2. Get the n × n covariance matrix Σ:

    \Sigma = \frac{1}{m} \bar{X}^T \bar{X}

3. Perform a singular value decomposition to get the eigenvectors U ∈ R^{n×n} and the diagonal matrix of eigenvalues S ∈ R^{n×n}:

    U, S = \mathrm{SVD}(\Sigma)

4. Project the data onto the k-dimensional principal subspace: multiply the normalized data by the first k eigenvectors, those associated with the k largest eigenvalues, to compute the projection X' ∈ R^{m×k}:

    B = (U_{ij})_{1 \leq i \leq n,\ 1 \leq j \leq k}, \qquad X' = \bar{X} B

The percentage of retained variance can be calculated as the ratio of the kept eigenvalues to all the eigenvalues:

    \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{j=1}^{n} S_{jj}}
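A NumPy sketch of the four PCA steps above, assuming the rows of X are word embeddings; np.linalg.svd stands in for the SVD step, and k = 2 is just a visualization choice.

    import numpy as np

    def pca_project(X, k=2):
        # Step 1: mean-normalize each feature (column) of the data
        X_bar = (X - X.mean(axis=0)) / X.std(axis=0)
        # Step 2: covariance matrix of the normalized data
        m = X_bar.shape[0]
        sigma = (X_bar.T @ X_bar) / m
        # Step 3: SVD of the covariance matrix gives eigenvectors U, eigenvalues S
        U, S, _ = np.linalg.svd(sigma)
        # Step 4: project onto the first k eigenvectors
        B = U[:, :k]
        X_proj = X_bar @ B
        retained = S[:k].sum() / S.sum()   # fraction of variance kept
        return X_proj, retained

    X = np.random.default_rng(0).normal(size=(100, 10))   # 100 fake 10-d embeddings
    X_proj, retained = pca_project(X, k=2)
    print(X_proj.shape, round(retained, 3))               # (100, 2) and the kept variance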
4 Machine Translation and Document Search

4.1 Machine Translation

Transforming Word Vectors

Assume that we have a subset of a source-language dataset of word embeddings X = [x_1 | x_2 | ... | x_m]^T and a translation subset of the destination-language dataset Y = [y_1 | y_2 | ... | y_m]^T. We want to find a transformation matrix R such that:

    XR \approx Y

Cost function:

    J = \frac{1}{m} \|XR - Y\|_F^2

where:
• m is the number of examples.
• ||A||_F is the Frobenius norm,

    \|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2}

• The reason for taking the square is that it is easier to compute the gradient of the squared Frobenius norm.

The gradient of the cost function with respect to the transformation matrix:

    \frac{\partial J}{\partial R} = \frac{\partial}{\partial R} \frac{1}{m} \|XR - Y\|_F^2 = \frac{2}{m} X^T (XR - Y)

Then we use gradient descent to optimize the transformation matrix:

    R := R - \alpha \frac{\partial J}{\partial R}

The predictions can be obtained using the trained R matrix:

    \hat{Y} = XR

The translation of a word i can be found using the k-nearest neighbors of ŷ_i from Y with k = 1.
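A hedged NumPy sketch of the gradient-descent loop for R described above; the learning rate, step count, and synthetic data are illustrative assumptions, not values from the sheet.

    import numpy as np

    def train_translation_matrix(X, Y, alpha=0.05, num_steps=300):
        # Gradient descent on J = (1/m) ||XR - Y||_F^2
        m, n = X.shape
        R = np.zeros((n, Y.shape[1]))              # start from the zero matrix
        for _ in range(num_steps):
            grad = (2.0 / m) * X.T @ (X @ R - Y)   # dJ/dR from Section 4.1
            R -= alpha * grad
        return R

    # Toy data: Y is a fixed linear map of X plus a little noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))
    R_true = rng.normal(size=(4, 4))
    Y = X @ R_true + 0.01 * rng.normal(size=(50, 4))
    R = train_translation_matrix(X, Y)
    print(np.linalg.norm(X @ R - Y))               # residual is small once R has been fit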
4.2 Document Search

Document Representation

1. Bag-of-words (BOW) document models
   Text documents are sequences of words, and the ordering of the words makes a difference.
2. Document embeddings
   A document can be represented as a document vector by summing up the word embeddings of every word in the document. If we don't know the embedding of a word, we can ignore that word.

Locality Sensitive Hashing

A more efficient version of k-nearest neighbors can be implemented using locality sensitive hashing. Instead of searching the whole vector space, we search only in a subspace for the nearest neighboring vectors.

Assume we have a plane (hyperplane) π that divides the vector space and has a normal vector p. Then for any point with a position vector v:

    p \cdot v \begin{cases} > 0, & \text{the point is above the plane.} \\ = 0, & \text{the point is on the plane.} \\ < 0, & \text{the point is below the plane.} \end{cases}

Multiplanes Hash Functions

• Multiplanes hash functions are based on the idea of numbering every single region that is formed by the intersection of n planes.
• We can divide the vector space into 2^n parts (hash buckets).

The hash value for the position of a vector v with respect to a plane p_i is:

    h_i = \begin{cases} 1, & \text{if } \mathrm{sign}(p_i \cdot v) \geq 0 \\ 0, & \text{if } \mathrm{sign}(p_i \cdot v) < 0 \end{cases}

where i ∈ {1, ..., n}.

The combined hash bucket number for a vector (over all planes):

    \mathrm{hash} = \sum_{i=1}^{n} 2^{i-1} \times h_i
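A small NumPy sketch of the multiplane hash above; the three random planes are an arbitrary choice, giving 2^3 = 8 buckets.

    import numpy as np

    def hash_value(v, planes):
        # planes holds one normal vector per row; the sign of p . v says which side v falls on
        signs = planes @ v >= 0                       # h_i = 1 if sign(p_i . v) >= 0, else 0
        # Combined bucket number: sum over i of 2^(i-1) * h_i (enumerate is 0-based here)
        return int(sum(2**i * h for i, h in enumerate(signs)))

    rng = np.random.default_rng(0)
    planes = rng.normal(size=(3, 2))                  # 3 random planes in 2-d
    print(hash_value(np.array([1.0, 2.0]), planes))   # bucket id in {0, ..., 7}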
Chapter 2
Probabilistic Models

1 Autocorrect and Minimum Edit Distance

1.1 Autocorrect

Reference: [Nor07]

How it works:

1. Identify a misspelled word.
   Words not in the dictionary are misspelled words.
2. Find strings n edit distance away.
   Edit: an operation performed on a string to change it.
   Examples (for a string with n letters):

   Operation   Description                   Output Count
   Insert      Add a letter                  26(n+1)
   Delete      Remove a letter               n
   Replace     Change 1 letter to another    25n
   Switch      Swap 2 adjacent letters       n−1

3. Filter candidates.
   Given a vocabulary, filter the edit list for candidate words found in the vocabulary.
4. Calculate word probabilities:

    P(w) = \frac{\mathrm{count}(w)}{M}

   Where:
   • P(w): probability of a word.
   • count(w): number of times the word appears.
   • M: total number of words in the corpus.

   Then select the word with the highest probability as your autocorrect replacement.

Algorithm 1: Autocorrect

    def autocorrect(word, n):
    Data:
        probs: a dictionary that maps each word w in the corpus to its probability
               P(w) = count(w)/M (0 for any word not in the corpus).
        vocab: a set containing all the vocabulary.
    Result:
        n-best: a set of tuples with the n most probable corrected words and their probabilities.

    suggestions = ∅
    n-best = ∅
    if word ∈ vocab:
        suggestions = suggestions ∪ {word}
    else:
        one-edit-set = one-edit-distance(word) ∩ vocab
        if one-edit-set ≠ ∅:
            suggestions = suggestions ∪ one-edit-set
        else:
            two-edit-set = two-edit-distance(word) ∩ vocab
            if two-edit-set ≠ ∅:
                suggestions = suggestions ∪ two-edit-set
            else:
                suggestions = suggestions ∪ {word}
    best-words = {(w, probs(w)) | w ∈ suggestions}
    n-best = the top n words from best-words, sorted by probability
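A hedged Python sketch of steps 2-4 and the spirit of Algorithm 1, restricted to one edit distance for brevity; the toy vocabulary and probabilities are made up.

    import string

    def one_edit_distance(word):
        # Step 2: all strings one edit away (insert, delete, replace, switch)
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        inserts  = {L + c + R for L, R in splits for c in letters}
        deletes  = {L + R[1:] for L, R in splits if R}
        replaces = {L + c + R[1:] for L, R in splits if R for c in letters}
        switches = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
        return inserts | deletes | replaces | switches

    def autocorrect(word, probs, vocab, n=1):
        # Steps 3-4: keep candidates that are real words, rank by corpus probability
        if word in vocab:
            suggestions = {word}
        else:
            suggestions = (one_edit_distance(word) & vocab) or {word}
        return sorted(suggestions, key=lambda w: probs.get(w, 0.0), reverse=True)[:n]

    vocab = {"days", "dads", "baby"}
    probs = {"days": 0.0004, "dads": 0.0001, "baby": 0.0009}
    print(autocorrect("dats", probs, vocab, n=2))   # ['days', 'dads']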
1.2 Minimum Edit Distance

Reference: [Jur12]

Minimum edit distance is the sum of the costs of the edits needed to transform one string into the other. It evaluates the similarity between two strings.

It is used in spelling correction, document similarity, machine translation, DNA sequencing and more.

Edits (operations) are:

   Operation   Description                   Cost
   Insert      Add a letter                  1
   Delete      Remove a letter               1
   Replace     Change 1 letter to another    2

Minimum Edit Distance Algorithm

Minimum edit distance can be calculated using dynamic programming. It breaks a problem down into subproblems which can be combined to form the final solution. To do this efficiently, we use a table (see Figure 2.1) to maintain the previously computed substrings and use those to calculate larger substrings.

Initialization:

    D[0, 0] = 0    (2.1)
    D[i, 0] = D[i-1, 0] + \mathrm{del\_cost}(source[i])    (2.2)
    D[0, j] = D[0, j-1] + \mathrm{ins\_cost}(target[j])    (2.3)

Per-cell operations:

    D[i, j] = \min \begin{cases} D[i-1, j] + \mathrm{del\_cost} \\ D[i, j-1] + \mathrm{ins\_cost} \\ D[i-1, j-1] + \begin{cases} \mathrm{rep\_cost}, & \text{if } source[i] \neq target[j] \\ 0, & \text{if } source[i] = target[j] \end{cases} \end{cases}    (2.4)

    \text{Minimum edit distance} = D[m, n]    (2.5)

Figure 2.1: Minimum Edit Distance Table (rows indexed by the source prefixes #, S_0, ..., S_{m-1}; columns by the target prefixes #, T_0, ..., T_{n-1}; cell (i, j) holds D[i, j]).

Algorithm 2: Minimum Edit Distance

    def min-edit-distance(source, target):
    Data:
        source: a string corresponding to the string you are starting with.
        target: a string corresponding to the string you want to end with.
    Result:
        D: a matrix of size (m+1) × (n+1) containing minimum edit distances (see Figure 2.1).
        med: the minimum edit distance required to convert the source string to the target.

    D[0, 0] = 0                                        (2.1)
    for i ∈ {1, 2, ..., m}:
        D[i, 0] = D[i-1, 0] + del_cost                 (2.2)
    for j ∈ {1, 2, ..., n}:
        D[0, j] = D[0, j-1] + ins_cost                 (2.3)
    for i ∈ {1, 2, ..., m}:
        for j ∈ {1, 2, ..., n}:
            if source[i-1] = target[j-1]:
                r_cost = 0
            else:
                r_cost = 2
            D[i, j] = min{D[i-1, j] + del_cost,
                          D[i, j-1] + ins_cost,
                          D[i-1, j-1] + r_cost}        (2.4)
    med = D[m, n]                                      (2.5)

Example:

    source → target
    "play" → "stay"

    D[i, j] = source[:i] → target[:j]
    D[2, 3] = "pl" → "sta"
    D[0, 0] = # → #
    D[m, n] = source → target

where #: the empty string.

Figure 2.2: Minimum Edit Distance of "play" → "stay":

        #   s   t   a   y
    #   0   1   2   3   4
    p   1   2   3   4   5
    l   2   3   4   5   6
    a   3   4   5   4   5
    y   4   5   6   5   4
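A Python sketch of equations (2.1)-(2.5) as implemented by Algorithm 2; the default costs match the table in Section 1.2.

    def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
        # Dynamic-programming table D of size (m+1) x (n+1), equations (2.1)-(2.5)
        m, n = len(source), len(target)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):                 # first column: deletions only
            D[i][0] = D[i - 1][0] + del_cost
        for j in range(1, n + 1):                 # first row: insertions only
            D[0][j] = D[0][j - 1] + ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                r_cost = 0 if source[i - 1] == target[j - 1] else rep_cost
                D[i][j] = min(D[i - 1][j] + del_cost,      # delete
                              D[i][j - 1] + ins_cost,      # insert
                              D[i - 1][j - 1] + r_cost)    # replace (or keep)
        return D[m][n]

    print(min_edit_distance("play", "stay"))   # 4, matching Figure 2.2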
2 Part of Speech Tagging and Hidden Markov Models

Reference: [JM19, Chapter 8]

2.1 Part of Speech Tagging

Part of speech (POS) tagging is the process of assigning tags that represent categories of parts of speech to the words of a corpus.

Applications of POS tagging:

• Identifying named entities.
  Eiffel tower is located in Paris.
• Co-reference resolution.
  The Eiffel tower is located in Paris, it is 324 meters high.
• Speech recognition.

    lexical term    tag    example
    noun            NN     something, nothing
    verb            VB     learn, study
    determiner      DT     the, a
    w-adverb        WRB    why, where
    ...             ...    ...

    No.  Tag    Description
    1.   CC     Coordinating conjunction
    2.   CD     Cardinal number
    3.   DT     Determiner
    4.   EX     Existential there
    5.   FW     Foreign word
    6.   IN     Preposition or subordinating conjunction
    7.   JJ     Adjective
    8.   JJR    Adjective, comparative
    9.   JJS    Adjective, superlative
    10.  LS     List item marker
    11.  MD     Modal
    12.  NN     Noun, singular or mass
    13.  NNS    Noun, plural
    14.  NNP    Proper noun, singular
    15.  NNPS   Proper noun, plural
    16.  PDT    Predeterminer
    17.  POS    Possessive ending
    18.  PRP    Personal pronoun
    19.  PRP$   Possessive pronoun
    20.  RB     Adverb
    21.  RBR    Adverb, comparative
    22.  RBS    Adverb, superlative
    23.  RP     Particle
    24.  SYM    Symbol
    25.  TO     to
    26.  UH     Interjection
    27.  VB     Verb, base form
    28.  VBD    Verb, past tense
    29.  VBG    Verb, gerund or present participle
    30.  VBN    Verb, past participle
    31.  VBP    Verb, non-3rd person singular present
    32.  VBZ    Verb, 3rd person singular present
    33.  WDT    Wh-determiner
    34.  WP     Wh-pronoun
    35.  WP$    Possessive wh-pronoun
    36.  WRB    Wh-adverb

Table 2.1: Part-of-Speech Tags
Source: [San90] and [LJP03]

2.2 Markov Chains

States:

    S = \{s_1, s_2, \dots, s_N\}

Markov property: the probability of the next event only depends on the current event.

Initial probability vector:

    \pi = [\pi_1, \pi_2, \dots, \pi_N]

Example:

                  NN    VB    O
    π (initial)   0.4   0.1   0.5
The Transition Matrix

The transition matrix has dimension (N × N):

    A = \begin{bmatrix} a_{1,1} & a_{1,2} & \dots & a_{1,N} \\ a_{2,1} & a_{2,2} & \dots & a_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N,1} & a_{N,2} & \dots & a_{N,N} \end{bmatrix}
      = \begin{bmatrix} P(s_1|s_1) & P(s_2|s_1) & \dots & P(s_N|s_1) \\ P(s_1|s_2) & P(s_2|s_2) & \dots & P(s_N|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1|s_N) & P(s_2|s_N) & \dots & P(s_N|s_N) \end{bmatrix}    (2.6)

For all the outgoing transition probabilities:

    \sum_{j=1}^{N} a_{ij} = 1

Example:

                  NN    VB    O
    NN (noun)     0.2   0.2   0.6
    VB (verb)     0.4   0.3   0.3
    O (other)     0.2   0.3   0.5

2.3 Hidden Markov Models

Hidden states: parts of speech. States that are hidden and not directly observable from the text data.

Emission probabilities: the probability of a visible observation when we are in a particular state. Emission probabilities describe the probabilities of going from the hidden states S = {s_1, s_2, ..., s_N} (parts of speech) of the hidden Markov model to the observables or emissions (words of the corpus) O = {o_1, o_2, ..., o_V}.

Figure 2.3: Hidden Markov Model (hidden states s_1, s_2, s_3 with initial distribution π, each emitting the observables o_1, o_2, o_3).

Emission Matrix

    B = \begin{bmatrix} b_{11} & b_{12} & \dots & b_{1V} \\ b_{21} & b_{22} & \dots & b_{2V} \\ \vdots & \vdots & \ddots & \vdots \\ b_{N1} & b_{N2} & \dots & b_{NV} \end{bmatrix}
      = \begin{bmatrix} P(o_1|s_1) & P(o_2|s_1) & \dots & P(o_V|s_1) \\ P(o_1|s_2) & P(o_2|s_2) & \dots & P(o_V|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(o_1|s_N) & P(o_2|s_N) & \dots & P(o_V|s_N) \end{bmatrix}    (2.7)

The emission matrix B has dimension (N × V), where N is the number of hidden states (parts of speech tags) and V is the number of observables (words in the corpus).

For every row:

    \sum_{j=1}^{V} b_{ij} = 1

Example:

                  going   to    eat   ...
    NN (noun)     0.5     0.1   0.02
    VB (verb)     0.3     0.1   0.5
    O (other)     0.3     0.5   0.68

A word can have different parts of speech assigned depending on the context in which it appears:

• "He lay on his back"  (back: NN)
• "I will be back"      (back: RB)

2.4 Calculating Transition Probabilities

1. Count the occurrences of tag pairs (the number of times each tag t_i ∈ S at time step i happened next to another tag t_{i-1} ∈ S at time step i − 1):

    C(t_{i-1}, t_i)

2. Calculate the probabilities by dividing the counts by the row sum to normalize them. The probability of a tag at position i given the tag at position i − 1 becomes:

    P(t_i|t_{i-1}) \stackrel{(2.6)}{=} a_{\mathrm{rindex}(t_{i-1}),\, \mathrm{cindex}(t_i)} = \frac{C(t_{i-1}, t_i)}{\sum_{j=1}^{N} C(t_{i-1}, t_j)}    (2.8)

Where
• N is the total number of tags.

Smoothing

To avoid division by zero and zero probabilities, apply smoothing to equation (2.8):

    P(t_i|t_{i-1}) = \frac{C(t_{i-1}, t_i) + \varepsilon}{\sum_{j=1}^{N} C(t_{i-1}, t_j) + N \cdot \varepsilon} = \frac{C(t_{i-1}, t_i) + \varepsilon}{C(t_{i-1}) + N \cdot \varepsilon}

Where
• C(t_{i-1}) is the count of the previous POS tag's occurrences in the corpus.
• ε is a smoothing parameter.

2.5 Calculating Emission Probabilities

Calculate the number of times a (tag, word) pair showed up in the training set:

    C(t_i, w_i)

Compute the probability of a word given its tag:

    P(w_i|t_i) \stackrel{(2.7)}{=} b_{\mathrm{rindex}(t_i),\, \mathrm{cindex}(w_i)} = \frac{C(t_i, w_i) + \varepsilon}{\sum_{j=1}^{V} C(t_i, w_j) + V \cdot \varepsilon} = \frac{C(t_i, w_i) + \varepsilon}{C(t_i) + V \cdot \varepsilon}

Where
• w is a word (observable) in the corpus.
• C(t_i) is the number of times the tag has occurred in the corpus.
• V is the number of words in the vocabulary.
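A hedged NumPy sketch of the smoothed transition probabilities of Section 2.4; the tag set and pair counts are toy values. The emission matrix of Section 2.5 is built the same way, using (tag, word) counts and V in place of N.

    import numpy as np

    def transition_matrix(tag_pair_counts, tags, epsilon=0.001):
        # Smoothed P(t_i | t_{i-1}) from Section 2.4, one row per previous tag
        N = len(tags)
        A = np.zeros((N, N))
        for r, prev in enumerate(tags):
            row_total = sum(tag_pair_counts.get((prev, t), 0) for t in tags)
            for c, cur in enumerate(tags):
                A[r, c] = (tag_pair_counts.get((prev, cur), 0) + epsilon) / (row_total + N * epsilon)
        return A

    tags = ["NN", "VB", "O"]
    tag_pair_counts = {("NN", "VB"): 8, ("NN", "O"): 2, ("VB", "NN"): 5,
                       ("VB", "O"): 5, ("O", "NN"): 6, ("O", "VB"): 4}
    A = transition_matrix(tag_pair_counts, tags)
    print(A.round(3), A.sum(axis=1))   # each row sums to 1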
2.6 The Viterbi Algorithm

The Viterbi algorithm computes the most likely sequence of parts of speech tags for a given sentence (sequence of observations):

    w = [w_1, w_2, \dots, w_K]

The joint probability (combined probability) of observing a word is calculated by multiplying the transition probability with the emission probability. The total probability is calculated by multiplying the joint probabilities of all the steps of the sequence:

    P(t_1|\pi) P(w_1|t_1) \times P(t_2|t_1) P(w_2|t_2) \times \dots \times P(t_K|t_{K-1}) P(w_K|t_K) = \text{total probability}

(Diagram: the tag chain π → t_1 → t_2 → ... → t_K, where each tag t_i emits the word w_i.)
Auxiliary Matrices

Given your transition and emission probabilities, you first populate and then use the auxiliary matrices C and D.

The matrix C ∈ R^{N×K} holds the intermediate optimal probabilities.

The matrix D ∈ R^{N×K} holds the indices of the visited states (the best paths, i.e. the different states you traverse when finding the most likely sequence of parts of speech tags for the given sequence of words).

Both matrices have one row per tag t_1, ..., t_N and one column per word w_1, ..., w_K.

Vectorized:

    c_j = \max\left( c_{j-1} \, a'_i \, b_{i,\, \mathrm{cindex}(w_j)} \right)    (2.13)

    d_j = \arg\max\left( c_{j-1} \, a'_i \, b_{i,\, \mathrm{cindex}(w_j)} \right)    (2.14)

where A^T = [a'_1 | a'_2 | \dots | a'_N].

3. Backward pass:

The probability at c_{i,K} is the probability of the most likely sequence of hidden states generating the given sequence of words. Get the index of the entry c_{i,K} with the highest probability; it gives the last tag of the sequence, and the remaining tags are recovered by walking backwards through D.

Algorithm 3: Viterbi Algorithm

    Data:
    • O = {o_1, o_2, ..., o_V}, the observation space.
    • S = {s_1, s_2, ..., s_N}, the state space, where s_n is a tag.
    • π = [π_1, π_2, ..., π_N], an array of initial probabilities, where π_i = P(x_1 = s_i).
    • w = [w_1, w_2, ..., w_K], a sequence of observations, w_i ∈ O.
    • A ∈ R^{N×N}, the transition matrix.
    • B ∈ R^{N×V}, the emission matrix, where B_ij = P(o_j|s_i).
    Result:
    • t = [t_1, t_2, ..., t_K], the most likely hidden state sequence of parts of speech tags.

    def VITERBI(O, S, π, w, A, B):
        /* Initialization */
        for each state i = 1, 2, ..., N:
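The pseudocode of Algorithm 3 breaks off above, so the following NumPy sketch is only a hedged reconstruction of the computation the section describes: initialization of C, the forward pass of equations (2.13)-(2.14), and the backward pass through D. The initialization C[:, 0] = π · B[:, first word] is the usual choice and is assumed here rather than taken from the sheet.

    import numpy as np

    def viterbi(pi, A, B, word_indices):
        # C[i, j]: best probability of any tag path ending in tag i at position j
        # D[i, j]: index of the previous tag on that best path (for the backward pass)
        N, K = A.shape[0], len(word_indices)
        C = np.zeros((N, K))
        D = np.zeros((N, K), dtype=int)
        C[:, 0] = pi * B[:, word_indices[0]]           # assumed initialization step
        for j in range(1, K):                          # forward pass, eqs (2.13)-(2.14)
            for i in range(N):
                scores = C[:, j - 1] * A[:, i] * B[i, word_indices[j]]
                C[i, j] = scores.max()
                D[i, j] = scores.argmax()
        # Backward pass: start from the best final tag and follow D to the left
        path = [int(C[:, K - 1].argmax())]
        for j in range(K - 1, 0, -1):
            path.append(int(D[path[-1], j]))
        return path[::-1]

    # Toy model reusing the example matrices from Sections 2.2 and 2.3
    tags = ["NN", "VB", "O"]
    pi = np.array([0.4, 0.1, 0.5])
    A = np.array([[0.2, 0.2, 0.6], [0.4, 0.3, 0.3], [0.2, 0.3, 0.5]])
    B = np.array([[0.5, 0.1, 0.02], [0.3, 0.1, 0.5], [0.3, 0.5, 0.68]])  # columns: going, to, eat
    print([tags[i] for i in viterbi(pi, A, B, [0, 1, 2])])               # ['NN', 'O', 'O']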