Natural Language Processing Specialization - Formula Sheet - by Fady Morris Ebeid (2020)

Chapter 1
Classification and Vector Spaces

1 Logistic Regression

corpus: a language resource consisting of a large and structured set of texts.

1.1 Notation

V : vocabulary size, the number of unique words in the entire set of sentences.
θ : parameter vector, θ = [θ_0, θ_1, ..., θ_n].
m : number of examples (sentences).
P(class) : probability that a sentence is in a given class, class ∈ {pos, neg}.
freq(w_i, class) : frequency of a word w_i in a specific class.

1.2 Preprocessing

1. Eliminate handles and URLs.
2. Tokenize the string w = [w_1, w_2, ..., w_n].
3. Remove stop words (and, is, are, at, has, for, a, ...) and punctuation (, . : ! " ').
4. Stemming: convert every word to its stem (use the Porter stemmer [Por80]).
5. Convert words to lowercase.

1.3 Feature Extraction with Frequencies

X^(m): feature vector of a sentence m; it is a row vector

    X^(m) = [ 1, Σ_w freq(w, pos), Σ_w freq(w, neg) ]

where the leading 1 is the bias term.

All m examples can then be represented as the matrix X:

    X = [ 1  X_1^(1)  X_2^(1)
          1  X_1^(2)  X_2^(2)
          ⋮     ⋮        ⋮
          1  X_1^(m)  X_2^(m) ]                                        (1.1)

1.4 Logistic Regression: Regression and Sigmoid

The logit z^(i) for an example i can be calculated as:

    z^(i) = θ^T x^(i) = θ_0 x_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n    (1.2)

The hypothesis function h (sigmoid function σ):

    h(x^(i), θ) = h(z^(i)) = σ(z^(i)) = 1 / (1 + e^(−z^(i)))           (1.3)

Note: all h values are between 0 and 1.

Figure 1.1: Training Logistic Regression.

1.5 Cost Function

The loss function for a single training example is:

    L(θ) = −[ y^(i) log h(z^(i)) + (1 − y^(i)) log(1 − h(z^(i))) ]

The cost function used for logistic regression is the average of the log loss across all training examples:

    J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log h(z^(i)) + (1 − y^(i)) log(1 − h(z^(i))) ]    (1.4)

Where:
• m is the number of training examples.
• y^(i) is the actual label of the i-th training example.
• h(z^(i)) is the model prediction for the i-th training example.

1.6 Gradient Descent

The gradient of the cost function J with respect to one of the weights θ_j is

    ∇_{θ_j} J(θ) = (1/m) Σ_{i=1}^{m} ( h(z^(i)) − y^(i) ) x_j^(i)      (1.5)

To update the weight θ_j using gradient descent:

    θ_j := θ_j − α ∇_{θ_j} J(θ)                                        (1.6)

Where α is the learning rate, a value that controls how big a single update will be. In vector form the update is θ := θ − α ∇_θ J(θ).

1.7 Vectorized Implementation

Putting all the examples in a matrix X (Equation 1.1), the previous equations become:

    z = Xθ                                                             (1.2)
    h(X, θ) = h(z) = σ(z) = 1 / (1 + e^(−z))                           (1.3)
    J(θ) = −(1/m) [ y^T · log(h(z)) + (1 − y)^T · log(1 − h(z)) ]      (1.4)
    ∇_θ J(θ) = (1/m) X^T · (h(z) − y)                                  (1.5)

1.8 Testing Logistic Regression

m^(val): total number of examples (sentences) in the validation set.
y_i^(val): ground-truth label for an example i ∈ {1, ..., m^(val)} in the validation set; 1 for positive sentiment, 0 for negative sentiment.
ŷ_i^(val): predicted label (sentiment) for the i-th example in the validation set.

1. Perform testing on unseen validation data X^(val), y^(val) using the trained weights θ.
2. Calculate h(X^(val), θ) = h(z).
3. Predict ŷ_i^(val) for each example as follows:

       ŷ_i^(val) = 1 if h(z)_i ≥ 0.5, otherwise 0.

4. Calculate the accuracy score over all examples in the validation set:

       accuracy = (1/m^(val)) Σ_{i=1}^{m^(val)} ( ŷ_i^(val) == y_i^(val) )
                = 1 − (1/m^(val)) Σ_{i=1}^{m^(val)} | ŷ_i^(val) − y_i^(val) |

   where the second sum is the error.

2 Naïve Bayes

2.1 Conditional Probability and Bayes Rule

Conditional probability:

    P(A|B) = P(A ∩ B) / P(B)                                           (1.7)

Bayes rule:

    P(A|B) = P(B|A) P(A) / P(B)                                        (1.8)
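Before moving on to Naïve Bayes, the logistic-regression recipe of Sections 1.4-1.8 can be written out in a few lines of NumPy. This is a minimal sketch, not part of the formula sheet: the function names, learning rate, and iteration count are illustrative assumptions.

Python sketch (illustrative):

    import numpy as np

    def sigmoid(z):
        # Equation 1.3: element-wise sigmoid
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_descent(X, y, theta, alpha=1e-3, num_iters=1000):
        """Vectorized logistic-regression training (Equations 1.2-1.6).
        X: (m, n+1) feature matrix with a leading column of ones (bias).
        y: (m, 1) labels in {0, 1}.  theta: (n+1, 1) initial weights."""
        m = X.shape[0]
        for _ in range(num_iters):
            z = X @ theta                                   # z = X theta          (1.2)
            h = sigmoid(z)                                  # hypothesis           (1.3)
            J = -(1.0 / m) * (y.T @ np.log(h) + (1 - y).T @ np.log(1 - h))  # cost (1.4)
            grad = (1.0 / m) * X.T @ (h - y)                # gradient             (1.5)
            theta = theta - alpha * grad                    # update               (1.6)
        return J, theta

    def predict(X, theta):
        # Section 1.8: predict 1 when h(z) >= 0.5, else 0
        return (sigmoid(X @ theta) >= 0.5).astype(int)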
2.2 Naïve Bayes Assumptions

• Independence of events: P(A ∩ B) = P(A)P(B). It assumes that the words in a piece of text are independent of one another, which is not true in reality, but it works well.
• Relative frequency in the corpus: it relies on the distribution of the training data sets. A good data set would contain the same proportion of positive and negative tweets as a random sample would. However, most available annotated corpora are artificially balanced; in reality, positive sentences occur more frequently than negative ones.

2.3 Notation

class ∈ {pos, neg}.
w: a unique word in the vocabulary.
ratio(w_i): ratio of the probability that the word w_i is positive to the probability that it is negative.
N_class: the total number of words in a class.
N: total number of words in the corpus.

2.4 Naïve Bayes Introduction

    N_class = Σ_{i=1}^{V} freq(w_i, class)                             (1.9)

    P(class) = N_class / N                                             (1.10)

    N = N_pos + N_neg
    P(neg) = 1 − P(pos)

    P(w|class) = freq(w, class) / N_class
               ≈ ( freq(w, class) + 1 ) / ( N_class + V )   (Laplacian smoothing)   (1.11)

    Σ_{i=1}^{V} P(w_i|class) = 1

The Naïve Bayes inference condition rule for binary classification (of a sentence):

    ∏_{i=1}^{n} P(w_i|pos) / P(w_i|neg)

where n is the number of words in the sentence.

Likelihood

    ratio(w) = P(w|pos) / P(w|neg)
             ≈ the same ratio computed with the Laplacian-smoothed probabilities of Equation 1.11    (1.12)

    ratio(w) between 0 and 1: negative sentiment.
    ratio(w) = 1: neutral sentiment.
    ratio(w) between 1 and ∞: positive sentiment.

By Bayes rule (1.8):

    P(class|w_i) = P(class) P(w_i|class) / P(w_i)                      (1.13)

    P(pos|w_i) / P(neg|w_i) = P(pos) P(w_i|pos) / ( P(neg) P(w_i|neg) )    (using 1.13)   (1.14)

For a whole sentence of n words:

    P(pos|sentence) / P(neg|sentence) = ( P(pos) / P(neg) ) ∏_{i=1}^{n} P(w_i|pos) / P(w_i|neg)    (1.15)
                                      = ( P(pos) / P(neg) ) ∏_{i=1}^{n} ratio(w_i)
                                      ≈ ( P(pos) / P(neg) ) ∏_{i=1}^{n} P(w_i|pos) / P(w_i|neg),
                                        with P(w_i|class) estimated by the smoothed Equation 1.11    (1.16)

where the first factor is the prior and the product is the likelihood (the ratio term).

Log Likelihood Score

Carrying out the repeated multiplications in 1.16 can result in numerical underflow. This problem is solved by taking the log of both sides of the equation to calculate the log likelihood score of a sentence:

    log [ P(pos|sentence) / P(neg|sentence) ] = log [ ( P(pos)/P(neg) ) ∏_{i=1}^{n} ratio(w_i) ]    (using 1.16)
        = log( P(pos)/P(neg) ) + Σ_{i=1}^{n} log( ratio(w_i) )
        = log( P(pos)/P(neg) ) + Σ_{i=1}^{n} λ(w_i)                    (1.17)

where the first term is the logprior and the sum is the log likelihood, and

    λ(w_i) = log( ratio(w_i) ) = log [ P(w_i|pos) / P(w_i|neg) ]   (smoothed probabilities of Equation 1.11)    (1.18)

    λ(w_i) < 0: negative word.
    λ(w_i) = 0: neutral word.
    λ(w_i) > 0: positive word.

A word not in the corpus is treated as neutral (λ(w) = 0).
If the log likelihood score is > 0, the sentence is positive; if it is < 0, the sentence is negative.

2.5 Training Naïve Bayes

1. Collect and annotate the corpus. Preprocess the text:
   • Lowercase.
   • Remove punctuation, URLs, names.
   • Remove stop words.
   • Stemming [Por80].
   • Tokenize sentences w = [w_1, w_2, ..., w_n].
2. Word count:
   (a) Compute freq(w, class) for every word in the vocabulary.
   (b) Compute N_class [Equation 1.9].
3. Compute the conditional probabilities P(w|pos), P(w|neg) [Equation 1.11].
4. Calculate the lambda score λ(w) for each word [Equation 1.18].
5. Get the logprior:

       log( P(pos)/P(neg) ) = log( N_pos / N_neg )    (using 1.10)

   If you are working with a balanced dataset (N_pos = N_neg), then logprior = 0.

2.6 Testing Naïve Bayes

m^(val): total number of examples (sentences) in the validation set.
y_i^(val): ground-truth label for an example i ∈ {1, ..., m^(val)} in the validation set; 1 for positive sentiment, 0 for negative sentiment.
ŷ_i^(val): predicted label (sentiment) for the i-th example in the validation set.

1. Perform testing on unseen validation data X^(val), y^(val).
2. First, calculate the log likelihood score for each sentence in the examples [Equation 1.17].
3. Predict ŷ_i^(val) for each example as follows:

       ŷ_i^(val) = 1 if the log likelihood score > 0, otherwise 0.

4. Calculate the accuracy score for all examples in the validation set:

       accuracy = (1/m^(val)) Σ_{i=1}^{m^(val)} ( ŷ_i^(val) == y_i^(val) )
                = 1 − (1/m^(val)) Σ_{i=1}^{m^(val)} | ŷ_i^(val) − y_i^(val) |

   where the second sum is the error.
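The training and testing recipe of Sections 2.5 and 2.6 fits in a few lines of Python. This is a minimal sketch, assuming the tweets are already preprocessed into token lists; the function names and the freqs dictionary layout are illustrative, not from the formula sheet.

Python sketch (illustrative):

    import math
    from collections import defaultdict

    def train_naive_bayes(tweets, labels):
        """tweets: list of token lists; labels: list of 0/1 (negative/positive)."""
        freqs = defaultdict(lambda: [0, 0])        # word -> [freq in neg, freq in pos]
        for tokens, y in zip(tweets, labels):
            for w in tokens:
                freqs[w][y] += 1
        V = len(freqs)                             # vocabulary size
        n_neg = sum(f[0] for f in freqs.values())  # N_neg: total words in the negative class
        n_pos = sum(f[1] for f in freqs.values())  # N_pos: total words in the positive class
        lam = {}
        for w, (f_neg, f_pos) in freqs.items():
            p_w_pos = (f_pos + 1) / (n_pos + V)    # Equation 1.11 (Laplacian smoothing)
            p_w_neg = (f_neg + 1) / (n_neg + V)
            lam[w] = math.log(p_w_pos / p_w_neg)   # lambda(w), Equation 1.18
        logprior = math.log(n_pos / n_neg)         # step 5 of Section 2.5
        return logprior, lam

    def predict_naive_bayes(tokens, logprior, lam):
        # Equation 1.17: logprior + sum of lambdas; unknown words are neutral (lambda = 0)
        score = logprior + sum(lam.get(w, 0.0) for w in tokens)
        return 1 if score > 0 else 0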
3 Vector Space Models

• Represent words and documents as vectors.
• A representation that captures relative meaning.

3.1 Word by Word and Word by Doc.

Word by Word Design (W/W)
Counts the co-occurrence of two different words, i.e. the number of times they occur together within a certain distance k. With the word-by-word design you get a representation matrix with n × n entries, where n equals the vocabulary size V.

Word by Document Design (W/D)
Counts the number of times a word occurs within a certain category. Represented by a matrix with n × c entries, where c is the number of categories.

3.2 Euclidean Distance

The Euclidean distance between two n-dimensional vectors:

    d(v, w) = d(w, v) = ||v − w||
            = sqrt( (v_1 − w_1)^2 + (v_2 − w_2)^2 + ... + (v_n − w_n)^2 )
            = sqrt( Σ_{i=1}^{n} (v_i − w_i)^2 )

Where
• n is the number of elements in the vector.
• The more similar the words, the closer the Euclidean distance is to 0.

3.3 Cosine Similarity

The main advantage of this metric over the Euclidean distance is that it is not biased by the size difference between the representations.

Vector norm:   ||v|| = sqrt( Σ_{i=1}^{n} v_i^2 )
Dot product:   v · w = Σ_{i=1}^{n} v_i w_i
Cosine similarity:

    cos(θ) = (v · w) / ( ||v|| ||w|| )

Cosine similarity gives values between −1 and 1:
• cos(θ) = 1: parallel and in the same direction.
• cos(θ) = 0: orthogonal (perpendicular).
• cos(θ) = −1: pointing in exactly opposite directions.
• Numbers in the range [0, 1] indicate a similarity score.
• Numbers in the range [−1, 0] indicate a dissimilarity score.

3.4 Manipulating Words in Vector Spaces

Reference: [Mik+13]

3.5 Visualization and PCA

PCA is used to visualize the embeddings on a k-dimensional subspace of the original n-dimensional space of the word embeddings.
Eigenvector: uncorrelated features for your data.
Eigenvalue: the amount of information retained by each feature.

Perform PCA on a data matrix X = [x_1|x_2|...|x_m]^T ∈ R^{m×n}, where m is the number of examples and n is the dimension (length) of a word embedding.

Steps of PCA:

1. Mean-normalize the data and obtain the normalized data matrix X̄:

       μ = (1/m) Σ_{i=1}^{m} x_i ,    σ = sqrt( (1/m) Σ_{i=1}^{m} x_i^2 − μ^2 )
       x̄_i = (x_i − μ) / σ
       X̄ = [x̄_1|x̄_2|...|x̄_m]^T

2. Get the n × n covariance matrix Σ:

       Σ = (1/m) X̄^T X̄

3. Perform a singular value decomposition to get the eigenvectors U ∈ R^{n×n} and the diagonal matrix of eigenvalues S ∈ R^{n×n}:

       U, S = SVD(Σ)

4. Project the data onto the k-dimensional principal subspace: multiply the normalized data by B, the first k eigenvectors (those associated with the k largest eigenvalues), to compute the projection X' ∈ R^{m×k}:

       B = (U_ij), 1 ≤ i ≤ n, 1 ≤ j ≤ k
       X' = X̄ B

The percentage of retained variance can be calculated from the eigenvalues:

       Σ_{i=1}^{k} S_ii / Σ_{j=1}^{n} S_jj

4 Machine Translation and Document Search

4.1 Machine Translation

Transforming Word Vectors

Assume that we have a subset of a source-language dataset of word embeddings X = [x_1|x_2|...|x_m]^T and a translation subset of the destination-language dataset Y = [y_1|y_2|...|y_m]^T. We want to find a transformation matrix R such that:

    XR ≈ Y

Cost function:

    J = (1/m) ||XR − Y||_F^2

where:
• m is the number of examples.
• ||A||_F is the Frobenius norm, ||A||_F = sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} |a_ij|^2 ).
• The reason for taking the square is that it is easier to compute the gradient of the squared Frobenius norm.

The gradient of the cost function with respect to the transformation matrix:

    ∂J/∂R = ∂/∂R [ (1/m) ||XR − Y||_F^2 ] = (2/m) X^T (XR − Y)

Then we use gradient descent to optimize the transformation matrix:

    R := R − α ∂J/∂R

The predictions can be obtained using the trained R matrix:

    Ŷ = XR

The translation of a word i can be found using the k-nearest neighbors of ŷ_i in Y with k = 1.

4.2 Document Search

Document Representation

1. Bag-of-words (BOW) document models
   Text documents are sequences of words, and the ordering of words makes a difference.
2. Document embeddings
   A document can be represented as a document vector by summing up the word embeddings of every word in the document. If we don't know the embedding of a word, we can ignore that word.
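The pieces of Sections 3.3 and 4.2 combine naturally into a small search routine: build a document embedding by summing word vectors, then rank candidate documents by cosine similarity. This is a minimal sketch; the word_vectors dictionary, the 300-dimensional size, and the function names are illustrative assumptions.

Python sketch (illustrative):

    import numpy as np

    def document_embedding(tokens, word_vectors, dim=300):
        """Section 4.2: sum the embeddings of every known word; unknown words are ignored."""
        doc_vec = np.zeros(dim)
        for w in tokens:
            if w in word_vectors:
                doc_vec += word_vectors[w]
        return doc_vec

    def cosine_similarity(v, w):
        """Section 3.3: cos(theta) = (v . w) / (||v|| ||w||)."""
        return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    def nearest_document(query_vec, doc_vecs):
        """Return the index of the document embedding most similar to the query."""
        scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
        return int(np.argmax(scores))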
Locality Sensitive Hashing

A more efficient version of k-nearest neighbors can be implemented using locality sensitive hashing. Instead of searching the whole vector space, we only search a subspace for the nearest neighboring vectors.
Assume we have a plane (hyperplane) π that divides the vector space and has a normal vector p. Then for any point with position vector v:

    p · v > 0 : the point is above the plane.
    p · v = 0 : the point is on the plane.
    p · v < 0 : the point is below the plane.

Multiplanes Hash Functions

• Multiplanes hash functions are based on the idea of numbering every single region that is formed by the intersection of n planes.
• We can divide the vector space into 2^n parts (hash buckets).

The hash value for the position of a vector v with respect to a plane p_i is:

    h_i = 1 if sign(p_i · v) ≥ 0,  h_i = 0 if sign(p_i · v) < 0,  where i ∈ {1, ..., n}

The combined hash bucket number for a vector (over all planes):

    hash = Σ_{i=1}^{n} 2^(i−1) × h_i

Chapter 2
Probabilistic Models

1 Autocorrect and Minimum Edit Distance

1.1 Autocorrect

Reference: [Nor07]

How it works

1. Identify a misspelled word. Words not in the dictionary are misspelled words.
2. Find strings n edit distances away.
   Edit: an operation performed on a string to change it. Examples (for a string with n letters):

       Operation   Description                  Output count
       Insert      Add a letter                 26(n+1)
       Delete      Remove a letter              n
       Replace     Change 1 letter to another   25n
       Switch      Swap 2 adjacent letters      n−1

3. Filter candidates. Given a vocabulary, filter the edit list for candidate words found in the vocabulary.
4. Calculate word probabilities:

       P(w) = count(w) / M

   Where:
   • P(w): probability of a word.
   • count(w): number of times the word appears.
   • M: total number of words in the corpus.

   Then select the word with the highest probability as your autocorrect replacement.

Algorithm 1: Autocorrect

def autocorrect(word, n):
    Data:
        probs: a dictionary that maps each word w in the corpus to its probability P(w) = count(w)/M (0 for words not in the corpus).
        vocab: a set containing all the vocabulary.
    Result:
        n-best: a set of tuples with the n most probable corrected words and their probabilities.

    suggestions = ∅ ; n-best = ∅
    if word ∈ vocab:
        suggestions = suggestions ∪ {word}
    else:
        one-edit-set = one-edit-distance(word) ∩ vocab
        if one-edit-set ≠ ∅:
            suggestions = suggestions ∪ one-edit-set
        else:
            two-edit-set = two-edit-distance(word) ∩ vocab
            if two-edit-set ≠ ∅:
                suggestions = suggestions ∪ two-edit-set
            else:
                suggestions = suggestions ∪ {word}
    best-words = { w → probs(w) | w ∈ suggestions }
    n-best = the set of the top n words from best-words, sorted by probability

1.2 Minimum Edit Distance

Reference: [Jur12]

Minimum edit distance is the sum of the costs of the edits needed to transform one string into the other. It evaluates the similarity between two strings.
It is used in spelling correction, document similarity, machine translation, DNA sequencing, and more.

Edits (operations) are:

    Operation   Description                  Cost
    Insert      Add a letter                 1
    Delete      Remove a letter              1
    Replace     Change 1 letter to another   2

Minimum Edit Distance Algorithm

Minimum edit distance can be calculated using dynamic programming. It breaks a problem down into subproblems that can be combined to form the final solution. To do this efficiently, we use a table (see Figure 2.1) to maintain the previously computed distances for substrings and use those to calculate larger substrings.

Initialization:

    D[0, 0] = 0                                                        (2.1)
    D[i, 0] = D[i−1, 0] + del_cost(source[i])                          (2.2)
    D[0, j] = D[0, j−1] + ins_cost(target[j])                          (2.3)

Per-cell operations:

    D[i, j] = min of:
        D[i−1, j] + del_cost
        D[i, j−1] + ins_cost
        D[i−1, j−1] + ( rep_cost if source[i] ≠ target[j] ; 0 if source[i] = target[j] )    (2.4)

    Minimum edit distance = D[m, n]                                    (2.5)
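The recurrence in Equations 2.1-2.5 (and Algorithm 2 below) translates directly into Python. A minimal sketch, assuming the costs from the edit table above (insert 1, delete 1, replace 2):

Python sketch (illustrative):

    def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
        """Fill the (m+1) x (n+1) table D of Figure 2.1 and return (D, D[m][n])."""
        m, n = len(source), len(target)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):                      # Equation 2.2: first column
            D[i][0] = D[i - 1][0] + del_cost
        for j in range(1, n + 1):                      # Equation 2.3: first row
            D[0][j] = D[0][j - 1] + ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):                  # Equation 2.4: per-cell minimum
                r = 0 if source[i - 1] == target[j - 1] else rep_cost
                D[i][j] = min(D[i - 1][j] + del_cost,
                              D[i][j - 1] + ins_cost,
                              D[i - 1][j - 1] + r)
        return D, D[m][n]                              # Equation 2.5

    # Reproduces Figure 2.2: "play" -> "stay" has minimum edit distance 4.
    _, med = min_edit_distance("play", "stay")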
Figure 2.1: Minimum Edit Distance Table — D is an (m+1) × (n+1) table whose rows are indexed by i (prefixes of the source string, starting from the empty string #) and whose columns are indexed by j (prefixes of the target string, starting from #); cell D[i, j] holds the minimum edit distance from source[:i] to target[:j].

Example: source "play" −→ target "stay"

    D[i, j] = source[:i] −→ target[:j]
    D[2, 3] = "pl" −→ "sta"
    D[0, 0] = # −→ #
    D[m, n] = source −→ target

where # is the empty string.

Figure 2.2: Minimum edit distance table of "play" −→ "stay":

        j   0   1   2   3   4
    i       #   s   t   a   y
    0   #   0   1   2   3   4
    1   p   1   2   3   4   5
    2   l   2   3   4   5   6
    3   a   3   4   5   4   5
    4   y   4   5   6   5   4

Algorithm 2: Minimum Edit Distance

def min-edit-distance(source, target):
    Data:
        source: a string corresponding to the string you are starting with.
        target: a string corresponding to the string you want to end with.
    Result:
        D: a matrix of size (m+1) × (n+1) containing minimum edit distances (see Figure 2.1).
        med: the minimum edit distance required to convert the source string to the target.

    D[0, 0] = 0                                                        (2.1)
    for i ∈ {1, 2, ..., m}:
        D[i, 0] = D[i−1, 0] + del_cost                                 (2.2)
    for j ∈ {1, 2, ..., n}:
        D[0, j] = D[0, j−1] + ins_cost                                 (2.3)
    for i ∈ {1, 2, ..., m}:
        for j ∈ {1, 2, ..., n}:
            if source[i−1] = target[j−1]: r_cost = 0
            else: r_cost = 2
            D[i, j] = min( D[i−1, j] + del_cost,
                           D[i, j−1] + ins_cost,
                           D[i−1, j−1] + r_cost )                      (2.4)
    med = D[m, n]                                                      (2.5)

2 Part of Speech Tagging and Hidden Markov Models

Reference: [JM19, Chapter 8]

2.1 Part of Speech Tagging

Part-of-speech (POS) tagging is the process of assigning tags that represent categories of parts of speech to the words of a corpus.

Applications of POS tagging:
• Identifying named entities: "Eiffel tower is located in Paris."
• Co-reference resolution: "The Eiffel tower is located in Paris, it is 324 meters high."
• Speech recognition.

    Lexical term   Tag   Example
    noun           NN    something, nothing
    verb           VB    learn, study
    determiner     DT    the, a
    wh-adverb      WRB   why, where
    ...            ...   ...

Table 2.1: Part-of-Speech Tags. Source: [San90] and [LJP03]

    No.  Tag    Description
    1.   CC     Coordinating conjunction
    2.   CD     Cardinal number
    3.   DT     Determiner
    4.   EX     Existential there
    5.   FW     Foreign word
    6.   IN     Preposition or subordinating conjunction
    7.   JJ     Adjective
    8.   JJR    Adjective, comparative
    9.   JJS    Adjective, superlative
    10.  LS     List item marker
    11.  MD     Modal
    12.  NN     Noun, singular or mass
    13.  NNS    Noun, plural
    14.  NNP    Proper noun, singular
    15.  NNPS   Proper noun, plural
    16.  PDT    Predeterminer
    17.  POS    Possessive ending
    18.  PRP    Personal pronoun
    19.  PRP$   Possessive pronoun
    20.  RB     Adverb
    21.  RBR    Adverb, comparative
    22.  RBS    Adverb, superlative
    23.  RP     Particle
    24.  SYM    Symbol
    25.  TO     to
    26.  UH     Interjection
    27.  VB     Verb, base form
    28.  VBD    Verb, past tense
    29.  VBG    Verb, gerund or present participle
    30.  VBN    Verb, past participle
    31.  VBP    Verb, non-3rd person singular present
    32.  VBZ    Verb, 3rd person singular present
    33.  WDT    Wh-determiner
    34.  WP     Wh-pronoun
    35.  WP$    Possessive wh-pronoun
    36.  WRB    Wh-adverb

2.2 Markov Chains

States:

    S = {s_1, s_2, ..., s_N}

Markov property: the probability of the next event depends only on the current event.

Initial Probability Vector

    π = [π_1, π_2, ..., π_N]

Example:

                 NN    VB    O
    π (initial)  0.4   0.1   0.5
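As a tiny illustration of the Markov property, the sketch below scores a tag sequence as π[t_1] · ∏ A[t_{i−1}, t_i]. The transition-matrix values are taken from the example in the next section ("The Transition Matrix"); the variable names are illustrative, not from the formula sheet.

Python sketch (illustrative):

    import numpy as np

    tags = ["NN", "VB", "O"]
    idx = {t: i for i, t in enumerate(tags)}
    pi = np.array([0.4, 0.1, 0.5])          # initial probabilities from the example above
    A = np.array([[0.2, 0.2, 0.6],          # example transition matrix (see the next section)
                  [0.4, 0.3, 0.3],
                  [0.2, 0.3, 0.5]])

    def sequence_probability(tag_sequence):
        """Markov property: P(t1, ..., tn) = pi[t1] * product of A[t_{i-1}, t_i]."""
        p = pi[idx[tag_sequence[0]]]
        for prev, cur in zip(tag_sequence[:-1], tag_sequence[1:]):
            p *= A[idx[prev], idx[cur]]
        return p

    prob = sequence_probability(["NN", "VB", "O"])   # 0.4 * 0.2 * 0.3 = 0.024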
The Transition Matrix

The transition matrix has dimension (N × N):

    A = [ a_{1,1}  a_{1,2}  ...  a_{1,N}
          a_{2,1}  a_{2,2}  ...  a_{2,N}
            ⋮        ⋮             ⋮
          a_{N,1}  a_{N,2}  ...  a_{N,N} ]                             (2.6)

      = [ P(s_1|s_1)  P(s_2|s_1)  ...  P(s_N|s_1)
          P(s_1|s_2)  P(s_2|s_2)  ...  P(s_N|s_2)
              ⋮            ⋮                ⋮
          P(s_1|s_N)  P(s_2|s_N)  ...  P(s_N|s_N) ]

For all the outgoing transition probabilities:

    Σ_{j=1}^{N} a_ij = 1

Example:

                    NN    VB    O
    A = NN (noun)   0.2   0.2   0.6
        VB (verb)   0.4   0.3   0.3
        O  (other)  0.2   0.3   0.5

Emission Matrix

    B = [ b_11  b_12  ...  b_1V
          b_21  b_22  ...  b_2V
           ⋮     ⋮           ⋮
          b_N1  b_N2  ...  b_NV ]                                      (2.7)

      = [ P(o_1|s_1)  P(o_2|s_1)  ...  P(o_V|s_1)
          P(o_1|s_2)  P(o_2|s_2)  ...  P(o_V|s_2)
              ⋮            ⋮                ⋮
          P(o_1|s_N)  P(o_2|s_N)  ...  P(o_V|s_N) ]

The emission matrix B has dimension (N × V), where N is the number of hidden states (parts-of-speech tags) and V is the number of observables (words in the corpus). For each row:

    Σ_{j=1}^{V} b_ij = 1

Example:

                    going  to    eat   ...
    B = NN (noun)   0.5    0.1   0.02
        VB (verb)   0.3    0.1   0.5
        O  (other)  0.3    0.5   0.68

A word can have different parts of speech assigned to it depending on the context in which it appears:
• "He lay on his back" — here "back" is a noun (NN).
• "I will be back" — here "back" is an adverb (RB).

2.3 Hidden Markov Models

Hidden states: parts of speech. States that are hidden and not directly observable from the text data.
Emission probabilities: the probability of a visible observation when we are in a particular state. Emission probabilities connect the hidden states S = {s_1, s_2, ..., s_N} (parts of speech) of the hidden Markov model to the observables or emissions (words of the corpus) O = {o_1, o_2, ..., o_V}.

Figure 2.3: Hidden Markov Model — hidden states s_1, s_2, s_3 with an initial distribution π and transition edges between them, each state emitting observables o_1, o_2, o_3.

2.4 Calculating Transition Probabilities

1. Count the occurrences of tag pairs: the number of times each tag t_i ∈ S at time step i appears next to another tag t_{i−1} ∈ S at time step i−1:

       C(t_{i−1}, t_i)

2. Calculate probabilities: divide the counts by the row sum to normalize them. The probability of a tag at position i given the tag at position i−1 becomes:

       P(t_i|t_{i−1}) = a_{rindex(t_{i−1}), cindex(t_i)}    (an entry of A, Equation 2.6)
                      = C(t_{i−1}, t_i) / Σ_{j=1}^{N} C(t_{i−1}, t_j)    (2.8)

   Where N is the total number of tags.

Smoothing

To avoid division by zero and zero probabilities, apply smoothing to Equation 2.8:

    P(t_i|t_{i−1}) = ( C(t_{i−1}, t_i) + ε ) / ( Σ_{j=1}^{N} C(t_{i−1}, t_j) + N·ε )
                   = ( C(t_{i−1}, t_i) + ε ) / ( C(t_{i−1}) + N·ε )

Where
• C(t_{i−1}) is the count of occurrences of the previous POS tag in the corpus.
• ε is a smoothing parameter.

2.5 Calculating Emission Probabilities

Calculate the number of times a (tag, word) pair appears in the training set:

    C(t_i, w_i)

Compute the probability of a word given its tag:

    P(w_i|t_i) = b_{rindex(t_i), cindex(w_i)}    (an entry of B, Equation 2.7)
               = ( C(t_i, w_i) + ε ) / ( Σ_{j=1}^{V} C(t_i, w_j) + V·ε )
               = ( C(t_i, w_i) + ε ) / ( C(t_i) + V·ε )

Where
• w is a word (observable) in the corpus.
• C(t_i) is the number of times the tag has occurred in the corpus.
• V is the number of words in the vocabulary.

2.6 The Viterbi Algorithm

The Viterbi algorithm computes the most likely sequence of parts-of-speech tags for a given sentence (sequence of observations) w = [w_1, w_2, ..., w_K].
The joint (combined) probability of observing a word is calculated by multiplying the transition probability by the emission probability.
The total probability of a tag sequence is calculated by multiplying all the joint probabilities along the sequence:

    P(t_1|π)P(w_1|t_1) × P(t_2|t_1)P(w_2|t_2) × ... × P(t_K|t_{K−1})P(w_K|t_K) = total probability
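Sections 2.4 and 2.5 amount to counting tag pairs and (tag, word) pairs and then row-normalizing with ε smoothing. A minimal sketch, assuming the training data is a list of sentences of (word, tag) pairs; the function and variable names are illustrative.

Python sketch (illustrative):

    import numpy as np

    def build_hmm_matrices(tagged_sentences, tags, vocab, eps=1e-3):
        """Return the smoothed transition matrix A (N x N) and emission matrix B (N x V).
        tagged_sentences: list of sentences, each a list of (word, tag) pairs."""
        tag_idx = {t: i for i, t in enumerate(tags)}
        word_idx = {w: i for i, w in enumerate(vocab)}
        N, V = len(tags), len(vocab)
        trans_counts = np.zeros((N, N))     # C(t_{i-1}, t_i)
        emit_counts = np.zeros((N, V))      # C(t_i, w_i)
        for sent in tagged_sentences:
            prev = None
            for word, tag in sent:
                if prev is not None:
                    trans_counts[tag_idx[prev], tag_idx[tag]] += 1
                if word in word_idx:
                    emit_counts[tag_idx[tag], word_idx[word]] += 1
                prev = tag
        # Equation 2.8 with epsilon smoothing: divide each count row by its smoothed row sum
        A = (trans_counts + eps) / (trans_counts.sum(axis=1, keepdims=True) + N * eps)
        B = (emit_counts + eps) / (emit_counts.sum(axis=1, keepdims=True) + V * eps)
        return A, B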
Auxiliary Matrices

Given your transition and emission probabilities, you first populate and then use the auxiliary matrices C and D.
The matrix C ∈ R^{N×K} holds the intermediate optimal probabilities.
The matrix D ∈ R^{N×K} holds the indices of the visited states (the best paths, i.e. the different states you traverse when finding the most likely sequence of parts-of-speech tags for the given sequence of words).
Both matrices have one row per tag t_1, ..., t_N and one column per word w_1, ..., w_K.

Steps

1. Initialization:
   The first column of each of the matrices C and D is populated.

       c_{i,1} = P(s_i|π) P(w_1|s_i) = π_i · b_{i,cindex(w_1)}         (2.9)

   For matrix D, set all entries in the first column to zero, as there are no preceding parts-of-speech tags we have traversed:

       d_{i,1} = 0

   Vectorized:

       c_1 = π ⊙ b_{cindex(w_1)}                                       (2.10)
       d_1 = 0 (the zero vector)

2. Forward pass:

       c_{i,j} = max_k  c_{k,j−1} · a_{k,i} · b_{i,cindex(w_j)}        (2.11)
       d_{i,j} = argmax_k  c_{k,j−1} · a_{k,i} · b_{i,cindex(w_j)}     (2.12)

   where
   (a) a_{k,i} is the transition probability from the parts-of-speech tag t_k to the current tag t_i.
   (b) c_{k,j−1} represents the probability of the preceding path you have traversed.

   Vectorized, with A^T = [a'_1 | a'_2 | ... | a'_N] (a'_i is the i-th column of A):

       c_j[i] = max_k ( c_{j−1}[k] · a'_i[k] · b_{i,cindex(w_j)} )     (2.13)
       d_j[i] = argmax_k ( c_{j−1}[k] · a'_i[k] · b_{i,cindex(w_j)} )  (2.14)

3. Backward pass:
   The probability at c_{i,K} is the probability of the most likely sequence of hidden states generating the given sequence of words.
   Get the index of the largest entry in the last column of C:

       z_K = argmax_i  c_{i,K}

   Use this index z_K to traverse backwards through the matrix D to reconstruct the sequence of parts-of-speech tags.

Implementation note:
For numerical stability use log probabilities:

    log(c_{i,1}) = log(π_i) + log(b_{i,cindex(w_1)})                                    (from 2.9)
    log(c_1) = log(π) + log(b_{cindex(w_1)})                                            (from 2.10)
    log(c_{i,j}) = max_k [ log(c_{k,j−1}) + log(a_{k,i}) + log(b_{i,cindex(w_j)}) ]     (from 2.11)
    d_{i,j} = argmax_k [ log(c_{k,j−1}) + log(a_{k,i}) + log(b_{i,cindex(w_j)}) ]       (from 2.12)
    log(c_j[i]) = max_k [ log(c_{j−1}[k]) + log(a'_i[k]) + log(b_{i,cindex(w_j)}) ]     (from 2.13)
    d_j[i] = argmax_k [ log(c_{j−1}[k]) + log(a'_i[k]) + log(b_{i,cindex(w_j)}) ]       (from 2.14)

Algorithm 3: Viterbi Algorithm

Data:
• O = {o_1, o_2, ..., o_V}, the observation space.
• S = {s_1, s_2, ..., s_N}, the state space, where s_n is a tag.
• π = [π_1, π_2, ..., π_N], an array of initial probabilities, where π_i = P(x_1 = s_i).
• w = [w_1, w_2, ..., w_K], a sequence of observations, w_i ∈ O.
• A ∈ R^{N×N}, the transition matrix.
• B ∈ R^{N×V}, the emission matrix, where B_ij = P(o_j|s_i).
Result: t = [t_1, t_2, ..., t_K], the most likely hidden-state sequence of parts-of-speech tags.

def VITERBI(O, S, π, w, A, B):
    /* Initialization */
    for each state i = 1, 2, ..., N:
        C_{i,1} ← log(π_i) + log(B_{i,cindex(w_1)})
        D_{i,1} ← 0
    /* Forward pass */
    for each observation j = 2, 3, ..., K:
        for each state i = 1, 2, ..., N:
            C_{i,j} ← max_k [ C_{k,j−1} + log(A_{k,i}) + log(B_{i,cindex(w_j)}) ]
            D_{i,j} ← argmax_k [ C_{k,j−1} + log(A_{k,i}) + log(B_{i,cindex(w_j)}) ]
    /* Backward pass */
    z_K ← argmax_i C_{i,K}
    t_K ← s_{z_K}
    for j = K, K−1, ..., 2:
        z_{j−1} ← D_{z_j, j}
        t_{j−1} ← s_{z_{j−1}}
    return t
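Algorithm 3 can be written compactly with NumPy, working in log space as the implementation note recommends. A minimal sketch, assuming the sentence has already been converted to column indices into B; the function and variable names are illustrative.

Python sketch (illustrative):

    import numpy as np

    def viterbi(pi, A, B, obs_idx):
        """Log-space Viterbi (Algorithm 3).
        pi: (N,) initial probabilities; A: (N, N) transition matrix;
        B: (N, V) emission matrix; obs_idx: word indices (columns of B) for w_1..w_K.
        Returns the most likely sequence of state (tag) indices."""
        N, K = A.shape[0], len(obs_idx)
        C = np.full((N, K), -np.inf)               # intermediate optimal log probabilities
        D = np.zeros((N, K), dtype=int)            # back-pointers to the best previous state
        C[:, 0] = np.log(pi) + np.log(B[:, obs_idx[0]])              # initialization
        for j in range(1, K):                                        # forward pass
            # scores[k, i] = C[k, j-1] + log(A[k, i]) + log(B[i, w_j])
            scores = C[:, j - 1, None] + np.log(A) + np.log(B[:, obs_idx[j]])[None, :]
            C[:, j] = scores.max(axis=0)
            D[:, j] = scores.argmax(axis=0)
        states = [int(np.argmax(C[:, K - 1]))]                       # backward pass
        for j in range(K - 1, 0, -1):
            states.append(int(D[states[-1], j]))
        return states[::-1]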
3 Autocomplete and Language Models

Definitions:
• Text corpus: a large database of text documents.
• Language model: a tool that calculates the probabilities of sentences; it can also estimate the probability of an upcoming word given a history of previous words.
• Sentence: a sequence of words.

Applications:
• Speech recognition. Example: P("I saw a van") > P("eyes awe of an").
• Spelling correction. Example: "He entered the ship to buy some groceries"; P("entered the shop to buy") > P("entered the ship to buy").
• Augmentative communication.

3.1 N-grams and Probabilities

An N-gram is a sequence of N elements, which can be words, characters, or other elements.

Example
Corpus: "I am happy because I am learning"
• Unigrams: {"I", "am", "happy", "because", "learning"}
• Bigrams: {"I am", "am happy", "happy because", ...}
• Trigrams: {"I am happy", "am happy because", ...}

Sequence Notation
Corpus: "This is great ... teacher drinks tea", with words w_1 w_2 w_3 ... w_498 w_499 w_500, so m = 500.

    w_1^m = w_1 w_2 ... w_m
    w_1^3 = w_1 w_2 w_3
    w_{m−2}^m = w_{m−2} w_{m−1} w_m

N-gram Probability
Probability of an N-gram:

    P(w_n | w_{n−N+1}^{n−1}) = C(w_{n−N+1}^{n−1} w_n) / Σ_{w∈V} C(w_{n−N+1}^{n−1} w)
                             = C(w_{n−N+1}^{n−1} w_n) / C(w_{n−N+1}^{n−1})            (2.15)

    C(w_{n−N+1}^{n−1} w_n) = C(w_{n−N+1}^{n})

Example:
Corpus: "I am happy because I am learning"
• Unigram probability: P(w) = C(w)/m. Size of the corpus m = 7.

      P("I") = 2/7 ,  P("happy") = 1/7

• Bigram probability:

      P(w_i|w_{i−1}) = C(w_{i−1} w_i) / Σ_{w∈V} C(w_{i−1} w) = C(w_{i−1} w_i) / C(w_{i−1})    (2.16)

      P("am"|"I") = C("I am")/C("I") = 2/2 = 1
      P("learning"|"am") = C("am learning")/C("am") = 1/2

• Trigram probability:

      P("happy"|"I am") = C("I am happy")/C("I am") = 1/2

3.2 Sequence Probabilities

Conditional probability and the chain rule:

    P(B|A) = P(A, B) / P(A)    (from 1.7)
    ⇒ P(A, B) = P(A) P(B|A)
    P(A, B, C, D) = P(A) P(B|A) P(C|A, B) P(D|A, B, C)

    P("the teacher drinks tea") = P("the") P("teacher"|"the") P("drinks"|"the teacher") P("tea"|"the teacher drinks")    (2.17)

The problem is that the corpus almost never contains the exact sentence we are interested in, so:

    P("tea"|"the teacher drinks") = C("the teacher drinks tea") / C("the teacher drinks") = 0

Approximation of Sequence Probability
Markov assumption: only the last N words matter.
• Bigram: P(w_n|w_1^{n−1}) ≈ P(w_n|w_{n−1})
• N-gram: P(w_n|w_1^{n−1}) ≈ P(w_n|w_{n−N+1}^{n−1})

    P("tea"|"the teacher drinks") ≈ P("tea"|"drinks")
    P("the teacher drinks tea") ≈ P("the") P("teacher"|"the") P("drinks"|"teacher") P("tea"|"drinks")    (cf. 2.17)

• Entire sequence modeled with bigrams:

      P(w_1^n) ≈ ∏_{i=1}^{n} P(w_i|w_{i−1}) ≈ P(w_1) P(w_2|w_1) ... P(w_n|w_{n−1})    (2.18)

3.3 Starting and Ending Sentences

Sentence: "the teacher drinks tea"

Start and End of Sentence Tokens for N-grams: add N−1 start tokens <s> and one end token </s>.
Example:
• Bigram:

      "the teacher drinks tea" ⇒ "<s> the teacher drinks tea </s>"
      P("the teacher drinks tea") = P("the"|<s>) P("teacher"|"the") P("drinks"|"teacher") P("tea"|"drinks") P(</s>|"tea")

• Trigram:

      "the teacher drinks tea" ⇒ "<s> <s> the teacher drinks tea </s>"
      P(w_1^n) ≈ P(w_1|<s> <s>) P(w_2|<s> w_1) ... P(</s>|w_{n−1} w_n)

3.4 The N-gram Language Model

Count Matrix
Represents the numerator of Equation 2.15, where:
• Rows: unique corpus (N−1)-grams.
• Columns: unique corpus words.
The count matrix can be built in a single pass through the corpus.

Example:
Corpus: "<s> I study I learn </s>"
Bigram count matrix:

             <s>   </s>   "I"   "study"   "learn"
    <s>       0     0      1       0         0
    </s>      0     0      0       0         0
    "I"       0     0      0       1         1
    "study"   0     0      1       0         0
    "learn"   0     1      0       0         0
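The count matrix and the probability matrix that follows can indeed be built in one pass. A minimal Python sketch using plain dictionaries and the example corpus; the function names are illustrative.

Python sketch (illustrative):

    from collections import defaultdict

    def bigram_count_matrix(tokens):
        """One pass through the corpus: count C(w_{i-1}, w_i) for every adjacent pair."""
        counts = defaultdict(int)
        for prev, cur in zip(tokens[:-1], tokens[1:]):
            counts[(prev, cur)] += 1
        return counts

    def bigram_probabilities(counts):
        """Equation 2.16: divide each count by its row sum C(w_{i-1})."""
        row_sums = defaultdict(int)
        for (prev, _), c in counts.items():
            row_sums[prev] += c
        return {(prev, cur): c / row_sums[prev] for (prev, cur), c in counts.items()}

    tokens = "<s> I study I learn </s>".split()
    probs = bigram_probabilities(bigram_count_matrix(tokens))
    # probs[("I", "study")] == 0.5, matching the probability matrix that follows.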
Probability Matrix

    sum(row) = C(w_{n−N+1}^{n−1}) = Σ_{w∈V} C(w_{n−N+1}^{n−1}, w)

Divide each cell of the bigram count matrix by its row sum:

             <s>   </s>   "I"   "study"   "learn"
    <s>       0     0      1       0         0
    </s>      0     0      0       0         0
    "I"       0     0      0      0.5       0.5
    "study"   0     0      1       0         0
    "learn"   0     1      0       0         0

Log Probability
To avoid numerical underflow when multiplying numbers ≤ 1, use log probabilities:

    log(P(w_1^n)) ≈ log( ∏_{i=1}^{n} P(w_i|w_{i−1}) ) ≈ Σ_{i=1}^{n} log[ P(w_i|w_{i−1}) ]    (from 2.18)

Sentence Probability

    P("<s> I learn </s>") = P("I"|<s>) P("learn"|"I") P(</s>|"learn") = 1 × 0.5 × 1 = 0.5

Next Word Prediction (Generative Language Model)
Algorithm:
• Choose a sentence start.
• Choose the next bigram starting with the previous word.
• Continue until </s> is picked.

3.5 Language Model Evaluation

Test Data
• Train: used to train the model.
• Validation: used for tuning hyperparameters.
• Test: test using a metric that reflects how well your model performs on unseen data.
For smaller corpora: 80% train, 10% validation, and 10% test.
For large corpora (typical for text): 98% train, 1% validation, and 1% test.

Perplexity Metric
Perplexity is defined as a state of confusion or uncertainty. You can think of it as a measure of complexity in a sample of text. It is used to tell us whether a set of sentences looks like it was written by humans rather than by a simple program choosing words at random.

Perplexity for Bigram Models:

    PP(W) = P(s_1, s_2, ..., s_m)^(−1/m)
          = [ ∏_{i=1}^{m} ∏_{j=1}^{|s_i|} 1 / P(w_j^(i)|w_{j−1}^(i)) ]^(1/m)    (2.19)

Where:
• W → the test set containing m sentences s.
• s_i → the i-th sentence in the test set, each ending with </s>.
• m → the count of all sentences in the entire test set W.
• |s_i| → the number of words in sentence s_i, including </s> but not <s>.
• w_j^(i) → the j-th word in the i-th sentence.

Concatenating all sentences in W:

    PP(W) = [ ∏_{i=1}^{M} 1 / P(w_i|w_{i−1}) ]^(1/M)                   (2.20)

Where:
• M → the number of words in the entire test set W, including </s> but not <s>.
• w_i → the i-th word in the test set.

Log perplexity:

    log_2(PP(W)) = −(1/M) Σ_{i=1}^{M} log_2[ P(w_i|w_{i−1}) ]    (from 2.20)

Properties of the Perplexity Score:
• A text written by a human is more likely to have a lower perplexity score.
• Smaller perplexity = better model.
• Good language models have perplexity scores between 20 and 60, sometimes even lower for English.
• Character-level models have lower perplexity than word-based models.
• In a good model with perplexity between 20 and 60, log perplexity would be between 4.3 and 5.9.

3.6 Out of Vocabulary Words

Using <UNK> in the Corpus
• Create a vocabulary V.
• Replace any word in the corpus not in V by <UNK>.
• Count probabilities with <UNK> as with any other word.

Example
Minimum frequency f = 2

    Corpus:                                 Corpus with <UNK>:
    "<s> Lyn drinks chocolate </s>"    ⇒    "<s> Lyn drinks chocolate </s>"
    "<s> John drinks tea </s>"         ⇒    "<s> <UNK> drinks <UNK> </s>"
    "<s> Lyn eats chocolate </s>"      ⇒    "<s> Lyn <UNK> chocolate </s>"

Vocabulary: {"Lyn", "drinks", "chocolate"}
Input query:

    "<s> Adam drinks chocolate </s>"  ⇒  "<s> <UNK> drinks chocolate </s>"

How to Create the Vocabulary V
• Criteria:
  – Minimum word frequency f.
  – Maximum |V|: include words by frequency (most common words).
• Use <UNK> sparingly.
• Perplexity: only compare language models with the same V.

3.7 Smoothing

Problem: N-grams made of known words might still be missing from the training corpus. Their counts cannot be used for probability estimation and will evaluate to 0.

Smoothing:
• Add-one smoothing (Laplacian smoothing). For bigrams:

      P(w_n|w_{n−1}) = ( C(w_{n−1}, w_n) + 1 ) / Σ_{w∈V} [ C(w_{n−1}, w) + 1 ]
                     = ( C(w_{n−1}, w_n) + 1 ) / ( C(w_{n−1}) + |V| )    (cf. 2.16)

• Add-k smoothing:

      P(w_n|w_{n−N+1}^{n−1}) = ( C(w_{n−N+1}^{n−1} w_n) + k ) / Σ_{w∈V} [ C(w_{n−N+1}^{n−1} w) + k ]
                             = ( C(w_{n−N+1}^{n−1} w_n) + k ) / ( C(w_{n−N+1}^{n−1}) + k|V| )    (2.21)

  For N-grams that have a zero count, the probability in Equation 2.21 becomes 1/|V|.
  For bigrams:

      P(w_n|w_{n−1}) = ( C(w_{n−1}, w_n) + k ) / Σ_{w∈V} [ C(w_{n−1}, w) + k ]
                     = ( C(w_{n−1}, w_n) + k ) / ( C(w_{n−1}) + k|V| )    (cf. 2.16)

• Advanced methods:
  – Kneser-Ney smoothing.
  – Good-Turing smoothing.
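A minimal sketch of the log-perplexity computation of Equation 2.20, given bigram probabilities (for instance from the earlier bigram sketch). It assumes every test bigram has a probability; in practice you would combine it with the <UNK> mapping of Section 3.6 and the smoothing of Section 3.7. Names are illustrative.

Python sketch (illustrative):

    import math

    def log_perplexity(test_sentences, bigram_prob):
        """Equation 2.20 in log form: -(1/M) * sum of log2 P(w_i | w_{i-1}) over the test set.
        test_sentences: lists of tokens already wrapped in <s> ... </s>.
        bigram_prob: dict mapping (previous word, word) -> probability."""
        log_sum, M = 0.0, 0
        for sent in test_sentences:
            for prev, cur in zip(sent[:-1], sent[1:]):
                log_sum += math.log2(bigram_prob[(prev, cur)])
                M += 1                     # counts </s> but not <s>
        return -log_sum / M

    def perplexity(test_sentences, bigram_prob):
        return 2 ** log_perplexity(test_sentences, bigram_prob)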
Back-off

• If an N-gram is missing, use the (N−1)-gram; if the (N−1)-gram is missing, use the (N−2)-gram, and so on.
  – Probability discounting, e.g. Katz backoff.
  – "Stupid" back-off: use lower-order N-grams and multiply by a constant. A constant of about 0.4 was experimentally shown to work well.

Example (using the corpus in Section 3.6):

    P("chocolate"|"John drinks") = 0.4 × P("chocolate"|"drinks")

Interpolation

Example for a trigram:

    P̂(w_n|w_{n−2} w_{n−1}) = λ_1 × P(w_n|w_{n−2} w_{n−1})
                            + λ_2 × P(w_n|w_{n−1})
                            + λ_3 × P(w_n)

where Σ_i λ_i = 1. The λ_i weights come from optimization on a validation set.

Example:

    P̂("chocolate"|"John drinks") = 0.7 × P("chocolate"|"John drinks")
                                  + 0.2 × P("chocolate"|"drinks")
                                  + 0.1 × P("chocolate")
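A minimal sketch of the trigram interpolation above; the λ weights and the probability tables are placeholders that you would estimate and tune on a validation set.

Python sketch (illustrative):

    def interpolated_trigram_prob(w2, w1, w, trigram_p, bigram_p, unigram_p,
                                  lambdas=(0.7, 0.2, 0.1)):
        """P_hat(w | w2 w1) = l1*P(w | w2 w1) + l2*P(w | w1) + l3*P(w).
        trigram_p, bigram_p, unigram_p: dicts of estimated probabilities;
        missing entries default to 0, so the lower-order terms take over."""
        l1, l2, l3 = lambdas               # the lambdas must sum to 1
        return (l1 * trigram_p.get((w2, w1, w), 0.0)
                + l2 * bigram_p.get((w1, w), 0.0)
                + l3 * unigram_p.get(w, 0.0))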
References

[JM19]   Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 2019. URL: https://web.stanford.edu/~jurafsky/slp3/.
[Jur12]  Dan Jurafsky. Minimum Edit Distance. Stanford University, Jan. 2012. URL: https://web.stanford.edu/class/cs124/lec/med.pdf (visited on 08/07/2020).
[LJP03]  Mark Lieberman, Yoon-Kyoung Joh, and Marjorie Pak. Alphabetical list of part-of-speech tags used in the Penn Treebank Project. 2003. URL: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html (visited on 08/08/2020).
[Mik+13] Tomas Mikolov et al. "Distributed Representations of Words and Phrases and their Compositionality". In: Advances in Neural Information Processing Systems 26. Ed. by C. J. C. Burges et al. Vol. 26. Curran Associates, Inc., Oct. 2013, pp. 3111-3119. URL: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (visited on 07/22/2020).
[Nor07]  Peter Norvig. How to Write a Spelling Corrector. Feb. 2007. URL: https://norvig.com/spell-correct.html (visited on 08/04/2020).
[Por80]  Martin F. Porter. "An algorithm for suffix stripping". In: Program 14.3 (1980), pp. 130-137. DOI: 10.1108/eb046814. URL: https://tartarus.org/martin/PorterStemmer/.
[San90]  Beatrice Santorini. "Part-of-speech tagging guidelines for the Penn Treebank Project (3rd revision)". In: Technical Reports (CIS) (1990), p. 570. URL: https://catalog.ldc.upenn.edu/docs/LDC99T42/tagguid1.pdf.

© 2020 Fady Morris Ebeid
https://github.com/FadyMorris/formula-sheets
DOI: 10.5281/zenodo.3987960