Natural Language Processing Specialization
Formula Sheet
Fady Morris Ebeid (2020)

Chapter 1
Classification and Vector Spaces

1 Logistic Regression

corpus: A language resource consisting of a large and structured set of texts.

1.1 Notation

V : Vocabulary size, the number of unique words in the entire set of sentences.
θ: Parameter vector, θ = [θ_0, θ_1, ..., θ_n].
m: Number of examples (sentences).
P(class): Probability that a sentence is in a given class, class ∈ {pos, neg}.
freq(w_i, class): Frequency of a word w_i in a specific class.

1.2 Preprocessing

1. Eliminate handles and URLs.
2. Tokenize the string w = [w_1, w_2, ..., w_n].
3. Remove stop words (and, is, are, at, has, for, a, ...) and punctuation (, . : ! " ').
4. Stemming: Convert every word to its stem (use the Porter Stemmer [Por80]).
5. Convert words to lowercase.

1.3 Feature Extraction with Frequencies

X^(m): Features vector of a sentence m. It is a row vector:

    X^{(m)} = \Big[ \underbrace{1}_{\text{bias}},\ \sum_w \mathrm{freq}(w, \mathrm{pos}),\ \sum_w \mathrm{freq}(w, \mathrm{neg}) \Big]

Then all the examples m can be represented as the matrix X:

    X = \begin{bmatrix} 1 & X_1^{(1)} & X_2^{(1)} \\ 1 & X_1^{(2)} & X_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & X_1^{(m)} & X_2^{(m)} \end{bmatrix}    (1.1)

1.4 Logistic Regression: Regression and Sigmoid

(Figure 1.1: Training Logistic Regression)

The logits z^(i) for an example i can be calculated as:

    z^{(i)} = \theta^T x^{(i)} = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n    (1.2)

The hypothesis function h (sigmoid function σ):

    h(x^{(i)}, \theta) = h(z^{(i)}) = \sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}    (1.3)

Note: All the h values are between 0 and 1.

1.5 Cost Function

The loss function for a single training example is:

    L(\theta) = -\left[ y^{(i)} \log h(z^{(i)}) + (1 - y^{(i)}) \log\big(1 - h(z^{(i)})\big) \right]

The cost function used for logistic regression is the average of the log loss across all training examples:

    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(z^{(i)}) + (1 - y^{(i)}) \log\big(1 - h(z^{(i)})\big) \right]    (1.4)

Where:
• m is the number of training examples.
• y^(i) is the actual label of the ith training example.
• h(z^(i)) is the model prediction for the ith training example.
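Below is a minimal NumPy sketch of equations (1.3) and (1.4). It assumes the feature matrix X already carries the bias column of equation (1.1); the function and variable names are illustrative rather than taken from the sheet.

    import numpy as np

    def sigmoid(z):
        # Equation (1.3): elementwise 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def cost(theta, X, y):
        # Equation (1.4): average log loss over the m training examples
        h = sigmoid(X @ theta)                       # h(z) for every example
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

    # Toy features [bias, sum of positive freqs, sum of negative freqs] for 2 sentences
    X = np.array([[1.0, 8.0, 1.0],
                  [1.0, 2.0, 9.0]])
    y = np.array([1.0, 0.0])
    print(cost(np.zeros(3), X, y))                   # log(2) = 0.693... for all-zero weights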
1.6 Gradient Descent

The gradient of the cost function J with respect to one of the weights θ_j is:

    \nabla_{\theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \big( h(z^{(i)}) - y^{(i)} \big) x_j^{(i)}    (1.5)

To update the weight θ_j using gradient descent:

    \theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)    (1.6)

Where α is the learning rate, a value that controls how big a single update will be.

1.7 Vectorized Implementation

Putting all the examples in a matrix X (Equation 1.1), the previous equations become:

    z = X\theta    (1.2)

    h(X, \theta) = h(z) = \sigma(z) = \frac{1}{1 + e^{-z}}    (1.3)

    J(\theta) = -\frac{1}{m} \left[ y^T \cdot \log(h(z)) + (1 - y)^T \cdot \log(1 - h(z)) \right]    (1.4)

    \nabla_\theta J(\theta) = \frac{1}{m} X^T \cdot (h(z) - y)    (1.5)

    \theta := \theta - \alpha \nabla_\theta J(\theta)    (1.6)

1.8 Testing Logistic Regression

m^(val): Total number of examples (sentences) in the validation set.
y_i^(val): Ground truth label for an example i ∈ {1, ..., m^(val)} in the validation set; 1 for positive sentiment, 0 for negative sentiment.
ŷ_i^(val): Predicted label (sentiment) for the ith example in the validation set.

1. Perform testing on unseen validation data X^(val), y^(val) using the trained weights θ.
2. Calculate h(X^(val), θ) = h(z).
3. Predict ŷ_i^(val) for each example as follows:

    \hat{y}_i^{(val)} = \begin{cases} 1, & \text{if } h(z)_i \geq 0.5 \\ 0, & \text{otherwise} \end{cases}

4. Calculate the accuracy score over all examples in the validation set:

    \mathrm{accuracy} = \frac{1}{m^{(val)}} \sum_{i=1}^{m^{(val)}} \big( \hat{y}_i^{(val)} == y_i^{(val)} \big)
                      = 1 - \underbrace{\frac{1}{m^{(val)}} \sum_{i=1}^{m^{(val)}} \big| \hat{y}_i^{(val)} - y_i^{(val)} \big|}_{\text{error}}
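A hedged NumPy sketch of the vectorized updates of Section 1.7 and the evaluation of Section 1.8; the learning rate, iteration count, and toy data are arbitrary choices, not values from the sheet.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic_regression(X, y, alpha=0.1, num_iters=1000):
        # Gradient descent on the vectorized forms of equations (1.5) and (1.6)
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(num_iters):
            h = sigmoid(X @ theta)                  # h(z) = sigma(X theta)
            grad = (X.T @ (h - y)) / m              # (1/m) X^T (h(z) - y)
            theta -= alpha * grad                   # theta := theta - alpha * grad
        return theta

    def accuracy(theta, X_val, y_val):
        # Section 1.8: threshold h(z) at 0.5, then compare with the labels
        y_hat = (sigmoid(X_val @ theta) >= 0.5).astype(float)
        return np.mean(y_hat == y_val)

    # Toy data: [bias, positive-frequency sum, negative-frequency sum]
    X = np.array([[1.0, 8.0, 1.0], [1.0, 7.0, 2.0], [1.0, 1.0, 6.0], [1.0, 2.0, 9.0]])
    y = np.array([1.0, 1.0, 0.0, 0.0])
    theta = train_logistic_regression(X, y)
    print(accuracy(theta, X, y))   # 1.0 on this separable toy set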
2 Naïve Bayes

2.1 Conditional Probability and Bayes Rule

Conditional probability:

    P(A|B) = \frac{P(A \cap B)}{P(B)}    (1.7)

Bayes rule:

    P(A|B) = \frac{P(B|A) P(A)}{P(B)}    (1.8)

2.2 Naïve Bayes Assumptions

• Independence of events, P(A ∩ B) = P(A)P(B). It assumes that the words in a piece of text are independent of one another, which is not true in reality, but it works well.
• Relative frequency in corpus: it relies on the distribution of the training data sets. A good data set would contain the same proportion of positive and negative tweets as a random sample would. However, most available annotated corpora are artificially balanced; in reality positive sentences occur more frequently than negative ones.

2.3 Notation

class ∈ {pos, neg}.
w: A unique word in the vocabulary.
ratio(w_i): Ratio of the probability of the word w_i being positive to being negative.
N_class: The total number of words in a class.
N: Total number of words in the corpus.

2.4 Naïve Bayes Introduction

    N_{class} = \sum_{i=1}^{V} \mathrm{freq}(w_i, class)    (1.9)

    P(class) = \frac{N_{class}}{N}    (1.10)

    N = N_{pos} + N_{neg}

    P(neg) = 1 - P(pos)

    P(w|class) = \frac{\mathrm{freq}(w, class)}{N_{class}} \approx \frac{\mathrm{freq}(w, class) + 1}{N_{class} + V} \quad \text{(Laplacian smoothing)}    (1.11)

    \sum_{i=1}^{V} P(w_i|class) = 1

The Naive Bayes inference condition rule for binary classification (of a sentence):

    \prod_{i=1}^{n} \frac{P(w_i|pos)}{P(w_i|neg)}

Where n is the number of words in a sentence.

Likelihood:

    \mathrm{ratio}(w) = \frac{P(w|pos)}{P(w|neg)} \stackrel{(1.11)}{\approx} \frac{P(w|pos) + 1}{P(w|neg) + 1} \quad \text{(Laplacian smoothing)}    (1.12)

    ratio(w) in 0 to 1: negative sentiment.
    ratio(w) = 1: neutral sentiment.
    ratio(w) in 1 to ∞: positive sentiment.

    P(class|w_i) = \frac{P(class) P(w_i|class)}{P(w_i)}    (1.13)

    \frac{P(pos|w_i)}{P(neg|w_i)} \stackrel{(1.13)}{=} \frac{P(pos) P(w_i|pos)}{P(neg) P(w_i|neg)}    (1.14)

    \frac{P(pos|sentence)}{P(neg|sentence)} \stackrel{(1.14)}{=} \frac{P(pos)}{P(neg)} \prod_{i=1}^{n} \frac{P(w_i|pos)}{P(w_i|neg)}    (1.15)

    = \frac{P(pos)}{P(neg)} \prod_{i=1}^{n} \mathrm{ratio}(w_i)

    \stackrel{(1.12)}{\approx} \underbrace{\frac{P(pos)}{P(neg)}}_{\text{prior}} \ \underbrace{\prod_{i=1}^{n} \frac{P(w_i|pos) + 1}{P(w_i|neg) + 1}}_{\text{likelihood (ratio)}}    (1.16)

Where n is the number of words in a sentence.

Log Likelihood Score

Carrying out the repeated multiplications in (1.16) can result in numerical underflow. This problem is solved by taking the log of both sides of the equation to calculate the log likelihood score of a sentence:

    \log \frac{P(pos|sentence)}{P(neg|sentence)} \stackrel{(1.16)}{=} \log \left[ \frac{P(pos)}{P(neg)} \prod_{i=1}^{n} \mathrm{ratio}(w_i) \right]

    = \log \frac{P(pos)}{P(neg)} + \sum_{i=1}^{n} \log(\mathrm{ratio}(w_i))

    = \underbrace{\log \frac{P(pos)}{P(neg)}}_{\text{logprior}} + \underbrace{\sum_{i=1}^{n} \log \frac{P(w_i|pos) + 1}{P(w_i|neg) + 1}}_{\text{log likelihood}}

    = \underbrace{\log \frac{P(pos)}{P(neg)}}_{\text{logprior}} + \underbrace{\sum_{i=1}^{n} \lambda(w_i)}_{\text{log likelihood}}    (1.17)

Where

    \lambda(w_i) = \log(\mathrm{ratio}(w_i)) \stackrel{(1.12)}{=} \log \frac{P(w_i|pos) + 1}{P(w_i|neg) + 1}    (1.18)

    λ(w_i) < 0: negative word.
    λ(w_i) = 0: neutral word.
    λ(w_i) > 0: positive word.

If the log likelihood score is > 0, the sentence is positive. If it is < 0, the sentence is negative.
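As a sketch of how equations (1.11) and (1.18) combine in practice, the Python snippet below smooths the two conditional probabilities first and then takes their log ratio. The names freqs, n_pos, and n_neg (a (word, class) count dictionary and the class word totals) are assumed inputs, not names defined in the sheet.

    import math

    def lambda_score(word, freqs, vocab_size, n_pos, n_neg):
        # Smoothed conditional probabilities, equation (1.11)
        p_w_pos = (freqs.get((word, "pos"), 0) + 1) / (n_pos + vocab_size)
        p_w_neg = (freqs.get((word, "neg"), 0) + 1) / (n_neg + vocab_size)
        # Word-level log ratio, in the spirit of equation (1.18)
        return math.log(p_w_pos / p_w_neg)

    # Toy counts: "happy" shows up mostly in positive sentences
    freqs = {("happy", "pos"): 20, ("happy", "neg"): 2}
    print(lambda_score("happy", freqs, vocab_size=100, n_pos=200, n_neg=200))  # > 0: a positive word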
2.5 Training Naïve Bayes

1. Collect and annotate the corpus. Preprocess the text:
   • Lowercase.
   • Remove punctuation, URLs, names.
   • Remove stop words.
   • Stemming [Por80].
   • Tokenize sentences w = [w_1, w_2, ..., w_n].
2. Word count:
   (a) Compute freq(w, class) for every word in the vocabulary.
   (b) Compute N_class [equation 1.9].
3. Compute the conditional probabilities P(w|pos), P(w|neg) [equation 1.11].
4. Calculate the lambda score λ(w) for each word [equation 1.18].
5. Get the logprior:

    \log \frac{P(pos)}{P(neg)} \stackrel{(1.10)}{=} \log \frac{N_{pos}}{N_{neg}}

   If you are working with a balanced dataset (N_pos = N_neg), then logprior = 0.

2.6 Testing Naïve Bayes

m^(val): Total number of examples (sentences) in the validation set.
y_i^(val): Ground truth label for an example i ∈ {1, ..., m^(val)} in the validation set; 1 for positive sentiment, 0 for negative sentiment.
ŷ_i^(val): Predicted label (sentiment) for the ith example in the validation set.

1. Perform testing on unseen validation data X^(val), y^(val).
2. First, calculate the log likelihood score for each sentence in the examples [equation 1.17].
3. Predict ŷ_i^(val) for each example as follows:

    \hat{y}_i^{(val)} = \begin{cases} 1, & \text{if log likelihood score} > 0 \\ 0, & \text{otherwise} \end{cases}

4. Calculate the accuracy score over all examples in the validation set:

    \mathrm{accuracy} = \frac{1}{m^{(val)}} \sum_{i=1}^{m^{(val)}} \big( \hat{y}_i^{(val)} == y_i^{(val)} \big)
                      = 1 - \underbrace{\frac{1}{m^{(val)}} \sum_{i=1}^{m^{(val)}} \big| \hat{y}_i^{(val)} - y_i^{(val)} \big|}_{\text{error}}

A word not in the corpus is treated as neutral (λ(w) = 0).
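A hedged Python sketch of the scoring and testing steps of Sections 2.5 and 2.6, assuming logprior and a lambdas dictionary have already been computed (for instance as in the previous sketch); as noted above, words outside the vocabulary contribute 0.

    def naive_bayes_score(sentence_words, logprior, lambdas):
        # Equation (1.17): logprior plus the sum of lambda(w) for the words we know
        return logprior + sum(lambdas.get(w, 0.0) for w in sentence_words)

    def naive_bayes_accuracy(sentences, labels, logprior, lambdas):
        # Section 2.6: predict 1 when the log likelihood score is positive
        correct = 0
        for words, label in zip(sentences, labels):
            y_hat = 1 if naive_bayes_score(words, logprior, lambdas) > 0 else 0
            correct += (y_hat == label)
        return correct / len(labels)

    # Toy model: balanced corpus (logprior = 0) and two scored words
    lambdas = {"happy": 1.2, "sad": -1.5}
    sentences = [["happy", "today"], ["sad", "news"]]
    print(naive_bayes_accuracy(sentences, [1, 0], 0.0, lambdas))  # 1.0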
3 Vector Space Models

• Represent words and documents as vectors.
• Representation that captures relative meaning.

3.1 Word by Word and Word by Doc.

Word by Word Design (W/W)

Counts the co-occurrence of two different words, which is the number of times they occur together within a certain distance k. With the word by word design you get a representation matrix with n × n entries, where n equals the vocabulary size V.

Word by Document Design (W/D)

Counts the number of times a word occurs within a certain category. Represented by a matrix with n × c entries, where c is the number of categories.

3.2 Euclidean Distance

The Euclidean distance between two n-dimensional vectors:

    d(\vec{v}, \vec{w}) = d(\vec{w}, \vec{v}) = \|\vec{v} - \vec{w}\|
    = \sqrt{(v_1 - w_1)^2 + (v_2 - w_2)^2 + \dots + (v_n - w_n)^2}
    = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2}

Where
• n is the number of elements in the vector.
• The more similar the words, the more likely the Euclidean distance will be close to 0.

3.3 Cosine Similarity

The main advantage of this metric over the Euclidean distance is that it isn't biased by the size difference between the representations.

Vector norm:

    \|\vec{v}\| = \sqrt{\sum_{i=1}^{n} v_i^2}

Dot product:

    \vec{v} \cdot \vec{w} = \sum_{i=1}^{n} v_i \cdot w_i

Cosine similarity:

    \cos(\theta) = \frac{\vec{v} \cdot \vec{w}}{\|\vec{v}\| \, \|\vec{w}\|}

Cosine similarity gives values between −1 and 1:

    cos(θ) = 1: parallel and in the same direction.
    cos(θ) = 0: orthogonal (perpendicular).
    cos(θ) = −1: point exactly in opposite directions.

• Numbers in the range [0, 1] indicate a similarity score.
• Numbers in the range [−1, 0] indicate a dissimilarity score.
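A small NumPy sketch of the two metrics above; the toy vectors are chosen so that the two measures disagree in the way the text describes (far apart in distance, yet perfectly aligned in direction).

    import numpy as np

    def euclidean_distance(v, w):
        # Section 3.2: the norm of the difference vector
        return np.linalg.norm(v - w)

    def cosine_similarity(v, w):
        # Section 3.3: dot product divided by the product of the norms
        return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    v = np.array([1.0, 2.0, 3.0])
    w = np.array([2.0, 4.0, 6.0])
    print(euclidean_distance(v, w))   # about 3.74: not close by distance
    print(cosine_similarity(v, w))    # 1.0: parallel, same direction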
3.4 Manipulating Words in Vector Spaces

Reference: [Mik+13]

3.5 Visualization and PCA

PCA is used to visualize the embeddings on a k-dimensional subspace of the original n-dimensional space of the word embeddings.

Eigenvector: Uncorrelated features for your data.
Eigenvalue: The amount of information retained by each feature.

Perform PCA on a data matrix X = [x_1 | x_2 | ... | x_n]^T ∈ R^{m×n}, where m is the number of examples and n is the dimension (length) of a word embedding.

Steps of PCA:

1. Mean-normalize the data and obtain the normalized data matrix X̄:

    \mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma = \sqrt{\frac{1}{m} \sum_{i=1}^{m} x_i^2 - \mu^2}

    \bar{x}_i = \frac{x_i - \mu}{\sigma}, \qquad \bar{X} = [\bar{x}_1 | \bar{x}_2 | \dots | \bar{x}_n]^T

2. Get the n × n covariance matrix Σ:

    \Sigma = \frac{1}{m} \bar{X}^T \bar{X}

3. Perform a singular value decomposition to get the eigenvectors U ∈ R^{n×n} and the diagonal matrix of eigenvalues S ∈ R^{n×n}:

    U, S = \mathrm{SVD}(\Sigma)

4. Project the data onto the k-dimensional principal subspace: multiply the normalized data by the first k eigenvectors, those associated with the k largest eigenvalues, to compute the projection X' ∈ R^{m×k}:

    B = (U_{ij})_{1 \leq i \leq n,\ 1 \leq j \leq k}, \qquad X' = \bar{X} B

The percentage of retained variance can be calculated as the ratio of the kept eigenvalues to all the eigenvalues:

    \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{j=1}^{n} S_{jj}}
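A NumPy sketch of the four PCA steps above, assuming the rows of X are word embeddings; np.linalg.svd stands in for the SVD step, and k = 2 is just a visualization choice.

    import numpy as np

    def pca_project(X, k=2):
        # Step 1: mean-normalize each feature (column) of the data
        X_bar = (X - X.mean(axis=0)) / X.std(axis=0)
        # Step 2: covariance matrix of the normalized data
        m = X_bar.shape[0]
        sigma = (X_bar.T @ X_bar) / m
        # Step 3: SVD of the covariance matrix gives eigenvectors U, eigenvalues S
        U, S, _ = np.linalg.svd(sigma)
        # Step 4: project onto the first k eigenvectors
        B = U[:, :k]
        X_proj = X_bar @ B
        retained = S[:k].sum() / S.sum()   # fraction of variance kept
        return X_proj, retained

    X = np.random.default_rng(0).normal(size=(100, 10))   # 100 fake 10-d embeddings
    X_proj, retained = pca_project(X, k=2)
    print(X_proj.shape, round(retained, 3))               # (100, 2) and the kept variance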
4 Machine Translation and Document Search

4.1 Machine Translation

Transforming Word Vectors

Assume that we have a subset of a source-language dataset of word embeddings X = [x_1 | x_2 | ... | x_m]^T and a translation subset of the destination-language dataset Y = [y_1 | y_2 | ... | y_m]^T. We want to find a transformation matrix R such that:

    XR \approx Y

Cost function:

    J = \frac{1}{m} \|XR - Y\|_F^2

where:
• m is the number of examples.
• ||A||_F is the Frobenius norm,

    \|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2}

• The reason for taking the square is that it is easier to compute the gradient of the squared Frobenius norm.

The gradient of the cost function with respect to the transformation matrix:

    \frac{\partial J}{\partial R} = \frac{\partial}{\partial R} \frac{1}{m} \|XR - Y\|_F^2 = \frac{2}{m} X^T (XR - Y)

Then we use gradient descent to optimize the transformation matrix:

    R := R - \alpha \frac{\partial J}{\partial R}

The predictions can be obtained using the trained R matrix:

    \hat{Y} = XR

The translation of a word i can be found using the k-nearest neighbors of ŷ_i from Y with k = 1.
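A hedged NumPy sketch of the gradient-descent loop for R described above; the learning rate, step count, and synthetic data are illustrative assumptions, not values from the sheet.

    import numpy as np

    def train_translation_matrix(X, Y, alpha=0.05, num_steps=300):
        # Gradient descent on J = (1/m) ||XR - Y||_F^2
        m, n = X.shape
        R = np.zeros((n, Y.shape[1]))              # start from the zero matrix
        for _ in range(num_steps):
            grad = (2.0 / m) * X.T @ (X @ R - Y)   # dJ/dR from Section 4.1
            R -= alpha * grad
        return R

    # Toy data: Y is a fixed linear map of X plus a little noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))
    R_true = rng.normal(size=(4, 4))
    Y = X @ R_true + 0.01 * rng.normal(size=(50, 4))
    R = train_translation_matrix(X, Y)
    print(np.linalg.norm(X @ R - Y))               # residual is small once R has been fit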
4.2 Document Search

Document Representation

1. Bag-of-words (BOW) document models
   Text documents are sequences of words, and the ordering of the words makes a difference.
2. Document embeddings
   A document can be represented as a document vector by summing up the word embeddings of every word in the document. If we don't know the embedding of a word, we can ignore that word.

Locality Sensitive Hashing

A more efficient version of k-nearest neighbors can be implemented using locality sensitive hashing. Instead of searching the whole vector space, we search only in a subspace for the nearest neighboring vectors.

Assume we have a plane (hyperplane) π that divides the vector space and has a normal vector p. Then for any point with a position vector v:

    p \cdot v \begin{cases} > 0, & \text{the point is above the plane.} \\ = 0, & \text{the point is on the plane.} \\ < 0, & \text{the point is below the plane.} \end{cases}

Multiplanes Hash Functions

• Multiplanes hash functions are based on the idea of numbering every single region that is formed by the intersection of n planes.
• We can divide the vector space into 2^n parts (hash buckets).

The hash value for the position of a vector v with respect to a plane p_i is:

    h_i = \begin{cases} 1, & \text{if } \mathrm{sign}(p_i \cdot v) \geq 0 \\ 0, & \text{if } \mathrm{sign}(p_i \cdot v) < 0 \end{cases}

where i ∈ {1, ..., n}.

The combined hash bucket number for a vector (over all planes):

    \mathrm{hash} = \sum_{i=1}^{n} 2^{i-1} \times h_i
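A small NumPy sketch of the multiplane hash above; the three random planes are an arbitrary choice, giving 2^3 = 8 buckets.

    import numpy as np

    def hash_value(v, planes):
        # planes holds one normal vector per row; the sign of p . v says which side v falls on
        signs = planes @ v >= 0                       # h_i = 1 if sign(p_i . v) >= 0, else 0
        # Combined bucket number: sum over i of 2^(i-1) * h_i (enumerate is 0-based here)
        return int(sum(2**i * h for i, h in enumerate(signs)))

    rng = np.random.default_rng(0)
    planes = rng.normal(size=(3, 2))                  # 3 random planes in 2-d
    print(hash_value(np.array([1.0, 2.0]), planes))   # bucket id in {0, ..., 7}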
Chapter 2
Probabilistic Models

1 Autocorrect and Minimum Edit Distance

1.1 Autocorrect

Reference: [Nor07]

How it works:

1. Identify a misspelled word.
   Words not in the dictionary are misspelled words.
2. Find strings n edit distance away.
   Edit: an operation performed on a string to change it.
   Examples (for a string with n letters):

   Operation   Description                   Output Count
   Insert      Add a letter                  26(n+1)
   Delete      Remove a letter               n
   Replace     Change 1 letter to another    25n
   Switch      Swap 2 adjacent letters       n−1

3. Filter candidates.
   Given a vocabulary, filter the edit list for candidate words found in the vocabulary.
4. Calculate word probabilities:

    P(w) = \frac{\mathrm{count}(w)}{M}

   Where:
   • P(w): probability of a word.
   • count(w): number of times the word appears.
   • M: total number of words in the corpus.

   Then select the word with the highest probability as your autocorrect replacement.

Algorithm 1: Autocorrect

    def autocorrect(word, n):
    Data:
        probs: a dictionary that maps each word w in the corpus to its probability
               P(w) = count(w)/M (0 for any word not in the corpus).
        vocab: a set containing all the vocabulary.
    Result:
        n-best: a set of tuples with the n most probable corrected words and their probabilities.

    suggestions = ∅
    n-best = ∅
    if word ∈ vocab:
        suggestions = suggestions ∪ {word}
    else:
        one-edit-set = one-edit-distance(word) ∩ vocab
        if one-edit-set ≠ ∅:
            suggestions = suggestions ∪ one-edit-set
        else:
            two-edit-set = two-edit-distance(word) ∩ vocab
            if two-edit-set ≠ ∅:
                suggestions = suggestions ∪ two-edit-set
            else:
                suggestions = suggestions ∪ {word}
    best-words = {(w, probs(w)) | w ∈ suggestions}
    n-best = the top n words from best-words, sorted by probability
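A hedged Python sketch of steps 2-4 and the spirit of Algorithm 1, restricted to one edit distance for brevity; the toy vocabulary and probabilities are made up.

    import string

    def one_edit_distance(word):
        # Step 2: all strings one edit away (insert, delete, replace, switch)
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        inserts  = {L + c + R for L, R in splits for c in letters}
        deletes  = {L + R[1:] for L, R in splits if R}
        replaces = {L + c + R[1:] for L, R in splits if R for c in letters}
        switches = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
        return inserts | deletes | replaces | switches

    def autocorrect(word, probs, vocab, n=1):
        # Steps 3-4: keep candidates that are real words, rank by corpus probability
        if word in vocab:
            suggestions = {word}
        else:
            suggestions = (one_edit_distance(word) & vocab) or {word}
        return sorted(suggestions, key=lambda w: probs.get(w, 0.0), reverse=True)[:n]

    vocab = {"days", "dads", "baby"}
    probs = {"days": 0.0004, "dads": 0.0001, "baby": 0.0009}
    print(autocorrect("dats", probs, vocab, n=2))   # ['days', 'dads']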
1.2 Minimum Edit Distance

Reference: [Jur12]

Minimum edit distance is the sum of the costs of the edits needed to transform one string into the other. It evaluates the similarity between two strings.

It is used in spelling correction, document similarity, machine translation, DNA sequencing and more.

Edits (operations) are:

   Operation   Description                   Cost
   Insert      Add a letter                  1
   Delete      Remove a letter               1
   Replace     Change 1 letter to another    2

Minimum Edit Distance Algorithm

Minimum edit distance can be calculated using dynamic programming. It breaks a problem down into subproblems which can be combined to form the final solution. To do this efficiently, we use a table (see Figure 2.1) to maintain the previously computed substrings and use those to calculate larger substrings.

Initialization:

    D[0, 0] = 0    (2.1)
    D[i, 0] = D[i-1, 0] + \mathrm{del\_cost}(source[i])    (2.2)
    D[0, j] = D[0, j-1] + \mathrm{ins\_cost}(target[j])    (2.3)

Per-cell operations:

    D[i, j] = \min \begin{cases} D[i-1, j] + \mathrm{del\_cost} \\ D[i, j-1] + \mathrm{ins\_cost} \\ D[i-1, j-1] + \begin{cases} \mathrm{rep\_cost}, & \text{if } source[i] \neq target[j] \\ 0, & \text{if } source[i] = target[j] \end{cases} \end{cases}    (2.4)

    \text{Minimum edit distance} = D[m, n]    (2.5)

Figure 2.1: Minimum Edit Distance Table (rows indexed by the source prefixes #, S_0, ..., S_{m-1}; columns by the target prefixes #, T_0, ..., T_{n-1}; cell (i, j) holds D[i, j]).

Algorithm 2: Minimum Edit Distance

    def min-edit-distance(source, target):
    Data:
        source: a string corresponding to the string you are starting with.
        target: a string corresponding to the string you want to end with.
    Result:
        D: a matrix of size (m+1) × (n+1) containing minimum edit distances (see Figure 2.1).
        med: the minimum edit distance required to convert the source string to the target.

    D[0, 0] = 0                                        (2.1)
    for i ∈ {1, 2, ..., m}:
        D[i, 0] = D[i-1, 0] + del_cost                 (2.2)
    for j ∈ {1, 2, ..., n}:
        D[0, j] = D[0, j-1] + ins_cost                 (2.3)
    for i ∈ {1, 2, ..., m}:
        for j ∈ {1, 2, ..., n}:
            if source[i-1] = target[j-1]:
                r_cost = 0
            else:
                r_cost = 2
            D[i, j] = min{D[i-1, j] + del_cost,
                          D[i, j-1] + ins_cost,
                          D[i-1, j-1] + r_cost}        (2.4)
    med = D[m, n]                                      (2.5)

Example:

    source → target
    "play" → "stay"

    D[i, j] = source[:i] → target[:j]
    D[2, 3] = "pl" → "sta"
    D[0, 0] = # → #
    D[m, n] = source → target

where #: the empty string.

Figure 2.2: Minimum Edit Distance of "play" → "stay":

        #   s   t   a   y
    #   0   1   2   3   4
    p   1   2   3   4   5
    l   2   3   4   5   6
    a   3   4   5   4   5
    y   4   5   6   5   4
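A Python sketch of equations (2.1)-(2.5) as implemented by Algorithm 2; the default costs match the table in Section 1.2.

    def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
        # Dynamic-programming table D of size (m+1) x (n+1), equations (2.1)-(2.5)
        m, n = len(source), len(target)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):                 # first column: deletions only
            D[i][0] = D[i - 1][0] + del_cost
        for j in range(1, n + 1):                 # first row: insertions only
            D[0][j] = D[0][j - 1] + ins_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                r_cost = 0 if source[i - 1] == target[j - 1] else rep_cost
                D[i][j] = min(D[i - 1][j] + del_cost,      # delete
                              D[i][j - 1] + ins_cost,      # insert
                              D[i - 1][j - 1] + r_cost)    # replace (or keep)
        return D[m][n]

    print(min_edit_distance("play", "stay"))   # 4, matching Figure 2.2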
2 Part of Speech Tagging and Hidden Markov Models

Reference: [JM19, Chapter 8]

2.1 Part of Speech Tagging

Part of speech (POS) tagging is the process of assigning tags that represent categories of parts of speech to the words of a corpus.

Applications of POS tagging:

• Identifying named entities.
  Eiffel tower is located in Paris.
• Co-reference resolution.
  The Eiffel tower is located in Paris, it is 324 meters high.
• Speech recognition.

    lexical term    tag    example
    noun            NN     something, nothing
    verb            VB     learn, study
    determiner      DT     the, a
    w-adverb        WRB    why, where
    ...             ...    ...

    No.  Tag    Description
    1.   CC     Coordinating conjunction
    2.   CD     Cardinal number
    3.   DT     Determiner
    4.   EX     Existential there
    5.   FW     Foreign word
    6.   IN     Preposition or subordinating conjunction
    7.   JJ     Adjective
    8.   JJR    Adjective, comparative
    9.   JJS    Adjective, superlative
    10.  LS     List item marker
    11.  MD     Modal
    12.  NN     Noun, singular or mass
    13.  NNS    Noun, plural
    14.  NNP    Proper noun, singular
    15.  NNPS   Proper noun, plural
    16.  PDT    Predeterminer
    17.  POS    Possessive ending
    18.  PRP    Personal pronoun
    19.  PRP$   Possessive pronoun
    20.  RB     Adverb
    21.  RBR    Adverb, comparative
    22.  RBS    Adverb, superlative
    23.  RP     Particle
    24.  SYM    Symbol
    25.  TO     to
    26.  UH     Interjection
    27.  VB     Verb, base form
    28.  VBD    Verb, past tense
    29.  VBG    Verb, gerund or present participle
    30.  VBN    Verb, past participle
    31.  VBP    Verb, non-3rd person singular present
    32.  VBZ    Verb, 3rd person singular present
    33.  WDT    Wh-determiner
    34.  WP     Wh-pronoun
    35.  WP$    Possessive wh-pronoun
    36.  WRB    Wh-adverb

Table 2.1: Part-of-Speech Tags
Source: [San90] and [LJP03]

2.2 Markov Chains

States:

    S = \{s_1, s_2, \dots, s_N\}

Markov property: the probability of the next event only depends on the current event.

Initial probability vector:

    \pi = [\pi_1, \pi_2, \dots, \pi_N]

Example:

                  NN    VB    O
    π (initial)   0.4   0.1   0.5
The Transition Matrix

The transition matrix has dimension (N × N):

    A = \begin{bmatrix} a_{1,1} & a_{1,2} & \dots & a_{1,N} \\ a_{2,1} & a_{2,2} & \dots & a_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N,1} & a_{N,2} & \dots & a_{N,N} \end{bmatrix}
      = \begin{bmatrix} P(s_1|s_1) & P(s_2|s_1) & \dots & P(s_N|s_1) \\ P(s_1|s_2) & P(s_2|s_2) & \dots & P(s_N|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1|s_N) & P(s_2|s_N) & \dots & P(s_N|s_N) \end{bmatrix}    (2.6)

For all the outgoing transition probabilities:

    \sum_{j=1}^{N} a_{ij} = 1

Example:

                  NN    VB    O
    NN (noun)     0.2   0.2   0.6
    VB (verb)     0.4   0.3   0.3
    O (other)     0.2   0.3   0.5

2.3 Hidden Markov Models

Hidden states: parts of speech. States that are hidden and not directly observable from the text data.

Emission probabilities: the probability of a visible observation when we are in a particular state. Emission probabilities describe the probabilities of going from the hidden states S = {s_1, s_2, ..., s_N} (parts of speech) of the hidden Markov model to the observables or emissions (words of the corpus) O = {o_1, o_2, ..., o_V}.

Figure 2.3: Hidden Markov Model (hidden states s_1, s_2, s_3 with initial distribution π, each emitting the observables o_1, o_2, o_3).

Emission Matrix

    B = \begin{bmatrix} b_{11} & b_{12} & \dots & b_{1V} \\ b_{21} & b_{22} & \dots & b_{2V} \\ \vdots & \vdots & \ddots & \vdots \\ b_{N1} & b_{N2} & \dots & b_{NV} \end{bmatrix}
      = \begin{bmatrix} P(o_1|s_1) & P(o_2|s_1) & \dots & P(o_V|s_1) \\ P(o_1|s_2) & P(o_2|s_2) & \dots & P(o_V|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(o_1|s_N) & P(o_2|s_N) & \dots & P(o_V|s_N) \end{bmatrix}    (2.7)

The emission matrix B has dimension (N × V), where N is the number of hidden states (parts of speech tags) and V is the number of observables (words in the corpus).

For every row:

    \sum_{j=1}^{V} b_{ij} = 1

Example:

                  going   to    eat   ...
    NN (noun)     0.5     0.1   0.02
    VB (verb)     0.3     0.1   0.5
    O (other)     0.3     0.5   0.68

A word can have different parts of speech assigned depending on the context in which it appears:

• "He lay on his back"  (back: NN)
• "I will be back"      (back: RB)

2.4 Calculating Transition Probabilities

1. Count the occurrences of tag pairs (the number of times each tag t_i ∈ S at time step i happened next to another tag t_{i-1} ∈ S at time step i − 1):

    C(t_{i-1}, t_i)

2. Calculate the probabilities by dividing the counts by the row sum to normalize them. The probability of a tag at position i given the tag at position i − 1 becomes:

    P(t_i|t_{i-1}) \stackrel{(2.6)}{=} a_{\mathrm{rindex}(t_{i-1}),\, \mathrm{cindex}(t_i)} = \frac{C(t_{i-1}, t_i)}{\sum_{j=1}^{N} C(t_{i-1}, t_j)}    (2.8)

Where
• N is the total number of tags.

Smoothing

To avoid division by zero and zero probabilities, apply smoothing to equation (2.8):

    P(t_i|t_{i-1}) = \frac{C(t_{i-1}, t_i) + \varepsilon}{\sum_{j=1}^{N} C(t_{i-1}, t_j) + N \cdot \varepsilon} = \frac{C(t_{i-1}, t_i) + \varepsilon}{C(t_{i-1}) + N \cdot \varepsilon}

Where
• C(t_{i-1}) is the count of the previous POS tag's occurrences in the corpus.
• ε is a smoothing parameter.

2.5 Calculating Emission Probabilities

Calculate the number of times a (tag, word) pair showed up in the training set:

    C(t_i, w_i)

Compute the probability of a word given its tag:

    P(w_i|t_i) \stackrel{(2.7)}{=} b_{\mathrm{rindex}(t_i),\, \mathrm{cindex}(w_i)} = \frac{C(t_i, w_i) + \varepsilon}{\sum_{j=1}^{V} C(t_i, w_j) + V \cdot \varepsilon} = \frac{C(t_i, w_i) + \varepsilon}{C(t_i) + V \cdot \varepsilon}

Where
• w is a word (observable) in the corpus.
• C(t_i) is the number of times the tag has occurred in the corpus.
• V is the number of words in the vocabulary.
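A hedged NumPy sketch of the smoothed transition probabilities of Section 2.4; the tag set and pair counts are toy values. The emission matrix of Section 2.5 is built the same way, using (tag, word) counts and V in place of N.

    import numpy as np

    def transition_matrix(tag_pair_counts, tags, epsilon=0.001):
        # Smoothed P(t_i | t_{i-1}) from Section 2.4, one row per previous tag
        N = len(tags)
        A = np.zeros((N, N))
        for r, prev in enumerate(tags):
            row_total = sum(tag_pair_counts.get((prev, t), 0) for t in tags)
            for c, cur in enumerate(tags):
                A[r, c] = (tag_pair_counts.get((prev, cur), 0) + epsilon) / (row_total + N * epsilon)
        return A

    tags = ["NN", "VB", "O"]
    tag_pair_counts = {("NN", "VB"): 8, ("NN", "O"): 2, ("VB", "NN"): 5,
                       ("VB", "O"): 5, ("O", "NN"): 6, ("O", "VB"): 4}
    A = transition_matrix(tag_pair_counts, tags)
    print(A.round(3), A.sum(axis=1))   # each row sums to 1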
2.6 The Viterbi Algorithm

The Viterbi algorithm computes the most likely sequence of parts of speech tags for a given sentence (sequence of observations):

    w = [w_1, w_2, \dots, w_K]

The joint probability (combined probability) of observing a word is calculated by multiplying the transition probability with the emission probability. The total probability is calculated by multiplying the joint probabilities of all the steps of the sequence:

    P(t_1|\pi) P(w_1|t_1) \times P(t_2|t_1) P(w_2|t_2) \times \dots \times P(t_K|t_{K-1}) P(w_K|t_K) = \text{total probability}

(Diagram: the tag chain π → t_1 → t_2 → ... → t_K, where each tag t_i emits the word w_i.)
Auxiliary Matrices

Given your transition and emission probabilities, you first populate and then use the auxiliary matrices C and D.

The matrix C ∈ R^{N×K} holds the intermediate optimal probabilities.

The matrix D ∈ R^{N×K} holds the indices of the visited states (the best paths, i.e. the different states you traverse when finding the most likely sequence of parts of speech tags for the given sequence of words).

Both matrices have one row per tag t_1, ..., t_N and one column per word w_1, ..., w_K.

Vectorized:

    c_j = \max\left( c_{j-1} \, a'_i \, b_{i,\, \mathrm{cindex}(w_j)} \right)    (2.13)

    d_j = \arg\max\left( c_{j-1} \, a'_i \, b_{i,\, \mathrm{cindex}(w_j)} \right)    (2.14)

where A^T = [a'_1 | a'_2 | \dots | a'_N].

3. Backward pass:

The probability at c_{i,K} is the probability of the most likely sequence of hidden states generating the given sequence of words. Get the index of the entry c_{i,K} with the highest probability; it gives the last tag of the sequence, and the remaining tags are recovered by walking backwards through D.

Algorithm 3: Viterbi Algorithm

    Data:
    • O = {o_1, o_2, ..., o_V}, the observation space.
    • S = {s_1, s_2, ..., s_N}, the state space, where s_n is a tag.
    • π = [π_1, π_2, ..., π_N], an array of initial probabilities, where π_i = P(x_1 = s_i).
    • w = [w_1, w_2, ..., w_K], a sequence of observations, w_i ∈ O.
    • A ∈ R^{N×N}, the transition matrix.
    • B ∈ R^{N×V}, the emission matrix, where B_ij = P(o_j|s_i).
    Result:
    • t = [t_1, t_2, ..., t_K], the most likely hidden state sequence of parts of speech tags.

    def VITERBI(O, S, π, w, A, B):
        /* Initialization */
        for each state i = 1, 2, ..., N:
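The pseudocode of Algorithm 3 breaks off above, so the following NumPy sketch is only a hedged reconstruction of the computation the section describes: initialization of C, the forward pass of equations (2.13)-(2.14), and the backward pass through D. The initialization C[:, 0] = π · B[:, first word] is the usual choice and is assumed here rather than taken from the sheet.

    import numpy as np

    def viterbi(pi, A, B, word_indices):
        # C[i, j]: best probability of any tag path ending in tag i at position j
        # D[i, j]: index of the previous tag on that best path (for the backward pass)
        N, K = A.shape[0], len(word_indices)
        C = np.zeros((N, K))
        D = np.zeros((N, K), dtype=int)
        C[:, 0] = pi * B[:, word_indices[0]]           # assumed initialization step
        for j in range(1, K):                          # forward pass, eqs (2.13)-(2.14)
            for i in range(N):
                scores = C[:, j - 1] * A[:, i] * B[i, word_indices[j]]
                C[i, j] = scores.max()
                D[i, j] = scores.argmax()
        # Backward pass: start from the best final tag and follow D to the left
        path = [int(C[:, K - 1].argmax())]
        for j in range(K - 1, 0, -1):
            path.append(int(D[path[-1], j]))
        return path[::-1]

    # Toy model reusing the example matrices from Sections 2.2 and 2.3
    tags = ["NN", "VB", "O"]
    pi = np.array([0.4, 0.1, 0.5])
    A = np.array([[0.2, 0.2, 0.6], [0.4, 0.3, 0.3], [0.2, 0.3, 0.5]])
    B = np.array([[0.5, 0.1, 0.02], [0.3, 0.1, 0.5], [0.3, 0.5, 0.68]])  # columns: going, to, eat
    print([tags[i] for i in viterbi(pi, A, B, [0, 1, 2])])               # ['NN', 'O', 'O']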