A Survey of Text Mining
How to solve complex problems with text
While not normally known for his musical
talent, Elon Musk is releasing a debut album.
The "Elon Musk" is a collection of eight new
songs which are inspired by the founder's life.
The music, which is available for pre-order on
iTunes, was created by one-man-band and
fellow Tesla Motors and SpaceX executive,
Paul Kasmin, who's known for playing guitar at
Tesla events. The album is a collaboration
between Kasmin and Musk himself, although
it's also being marketed under the Tesla brand.
Example: Quantify Survey Results
Note: These are not real examples from our dataset.
I believe the training classes
led to a valuable benefit to my
work life.
I liked the training so much
that I decided to try out what I
learned on a personal project.
I thought the training was a
huge waste of my time.
I think the training is leading us
down the wrong path, and I
reported this to my manager.
(Responses plotted on two axes: Sentiment and Activation)
Typical Regression
Example: Boston housing dataset
Challenge: Numeric Features from Text
Convert "plain text" written by humans
into vectors understandable by computers
I liked the training so much
that I decided to try out what I
learned on a personal project.
→ numeric vector [0.25, 0.3, 0.1, 0.75, 0.4, 0.1, 0] → Sentiment and Activation scores
Bag of Words (BOW)
Split words by spaces
Create a vector of word counts (a code sketch follows the example below)
I liked the training so much
that I decided to try out what I
learned on a personal project.
I 3
liked 1
the 1
training 1
so 1
much 1
that 1
decided 1
... ...
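A minimal sketch of this counting step in Python (plain whitespace splitting, as on the slide; real pipelines usually tokenize more carefully):

```python
# Minimal bag-of-words sketch: split on spaces and count, as on the slide.
# Splitting on spaces keeps punctuation attached (e.g. "project."); real
# pipelines usually tokenize more carefully.
from collections import Counter

text = ("I liked the training so much that I decided to try out "
        "what I learned on a personal project.")
counts = Counter(text.split())
print(counts["I"], counts["liked"], counts["training"])   # 3 1 1

# The same counts laid out as a fixed-length vector over a vocabulary
# (scikit-learn's CountVectorizer builds this for a whole corpus at once).
vocab = sorted(counts)                  # one position per known word
vector = [counts[w] for w in vocab]     # mostly zeros once the vocab is large
print(vector)
```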
BOW Vector Properties
Vectors are sparse
Size of vocab = size of vector
I 3
liked 1
the 1
training 1
so 1
much 1
that 1
decided 1
... ...
(Vector length = size of vocabulary; zero entries omitted)
BOW Issues
Frequent words dominate the representation
Word order removed
Similar words become totally different features
I 3
liked 1
the 1
training 1
so 1
much 1
that 1
decided 1
... ...
(Vector length = size of vocabulary; zero entries omitted)
Working with BOW Vectors
Use cosine similarity to relate different texts
Documents have a similarity between 0 and 1
(when the vectors A and B are nonnegative, as word counts are)
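A small sketch of cosine similarity over bag-of-words counts, using the two example sentences from the following slides:

```python
# Cosine similarity over bag-of-words counts; the two documents match the
# toy example on the following slides.
import math

def cosine_similarity(a: dict, b: dict) -> float:
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

doc_a = {"I": 1, "didn't": 1, "like": 1, "the": 1, "training": 1}
doc_b = {"I": 1, "liked": 1, "the": 1, "training": 1, "a": 1, "lot": 1}
print(cosine_similarity(doc_a, doc_b))   # 3 / (√5 * √6) ≈ 0.5477
```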
Cosine Similarity Example
Document A: "I didn't like the training" → I 1, didn't 1, like 1, the 1, training 1
Document B: "I liked the training a lot" → I 1, liked 1, the 1, training 1, a 1, lot 1
Text A B Ai*Bi
I 1 1 1
like 1 0 0
liked 0 1 0
the 1 1 1
training 1 1 1
a 0 1 0
lot 0 1 0
didn't 1 0 0
‖A‖ = √5   ‖B‖ = √6   Σ Ai*Bi = 3
similarity(A, B)
= 3 / ( √5 * √6 )
≈ 0.5477
Text A B Ai*Bi
I 1 1 1
like 1 0 0
liked 0 1 0
the 1 1 1
training 1 1 1
a 0 1 0
lot 0 1 0
didn't 1 0 0
‖A‖ = √5   ‖B‖ = √6   Σ Ai*Bi = 3
similarity(A, B)
= 3 / ( √5 * √6 )
≈ 0.5477
Issue:
Most similar words aren't relevant.
Stopwords
Words that we know aren't relevant
Often we can just remove these
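A sketch of stopword removal using NLTK's English stopword list (assumes nltk is installed and its "stopwords" data can be downloaded):

```python
# Stopword removal with NLTK's English stopword list.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)   # fetch the word list if missing
stop = set(stopwords.words("english"))

tokens = "I liked the training a lot".lower().split()
print([t for t in tokens if t not in stop])   # e.g. ['liked', 'training', 'lot']
```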
Term Frequency Inverse Document Frequency
(TF-IDF)
Prioritize rare words
Demote common words
"The man saw the thief escaping" (an example document from a corpus of many)
Term Frequency: # occurrences within a document
Document Frequency: # docs containing the word
Let t be a term (e.g. t = "The")
The man saw the
thief escaping
The 2
man 1
saw 1
thief 1
escaping 1
Let t be a term, d be a document
The man saw the
thief escaping
The 2
man 1
saw 1
thief 1
escaping 1
The 100
man 75
saw 10
thief 5
escaping 3
150 total
documents
Let t be a term, d be a document, and C be a corpus.
The man saw the
thief escaping
(size = 6)
The 2/6
man 1/6
saw 1/6
thief 1/6
escaping 1/6
The 100
man 75
saw 10
thief 5
escaping 3
150 total
documents
Let t be a term, d be a document, and C be a corpus.
TF(t, d) = # times t occurs in d / size of d
The man saw the
thief escaping
(size = 6)
The 2/6
man 1/6
saw 1/6
thief 1/6
escaping 1/6
The log(150/100)
man log(150/75)
saw log(150/10)
thief log(150/5)
escaping log(150/3)
150 total
documents
Let t be a term, d be a document, and C be a corpus.
TF(t, d) = # times t occurs in d / size of d
IDF(t, C) = log(size of C / number of documents in C containing t)
The man saw the
thief escaping
(size = 6)
The 2/6
man 1/6
saw 1/6
thief 1/6
escaping 1/6
The log(150/100)
man log(150/75)
saw log(150/10)
thief log(150/5)
escaping log(150/3)
150 total
documents
Let t be a term, d be a document, and C be a corpus.
TF(t, d) = # times t occurs in d / size of d
IDF(t, C) = log(size of C / number of documents in C containing t)
TF-IDF(t, d, C) = TF(t, d) × IDF(t, C)
The man saw the
thief escaping
(size = 6)
The 2/6
man 1/6
saw 1/6
thief 1/6
escaping 1/6
The log(150/100)
man log(150/75)
saw log(150/10)
thief log(150/5)
escaping log(150/3)
150 total
documents
Let t be a term, d be a document, and C be a corpus.
TF(t, d) = # times t occurs in d / size of d
IDF(t, C) = log(size of C / number of documents in C containing t)
TF-IDF(t, d, C) = TF(t, d) × IDF(t, C)
TF-IDF("the", d, C) = (2/6) × log(150/100) ≈ 0.059
TF-IDF("thief", d, C) = (1/6) × log(150/5) ≈ 0.246
Text A B Ai*Bi
I 1 1 1
like 1 0 0
liked 0 1 0
the 1 1 1
training 1 1 1
a 0 1 0
lot 0 1 0
didn't 1 0 0
‖A‖ = √5   ‖B‖ = √6   Σ Ai*Bi = 3
similarity(A, B)
= 3 / ( √5 * √6 )
≈ 0.5477
Issue:
Most similar words aren't relevant.
Text A B Ai*Bi
I 0.01 0.01 0.0001
like 0.25 0 0
liked 0 0.2 0
the 0.01 0.01 0.0001
training 0.5 0.5 0.25
a 0 0.002 0
lot 0 0.03 0
didn't 0.02 0 0
‖A‖ ≈ 0.5596   ‖B‖ ≈ 0.5395   Σ Ai*Bi ≈ 0.2502
similarity(A, B)
with raw counts: 3 / ( √5 * √6 ) ≈ 0.5477
with TF-IDF: 0.2502 / (0.5596 * 0.5395) ≈ 0.8287
Issue:
Most similar words aren't relevant.
Solved!
Text A B Ai*Bi
I 0.01 0.01 0.0001
like 0.25 0 0
liked 0 0.2 0
the 0.01 0.01 0.0001
training 0.5 0.5 0.25
a 0 0.002 0
lot 0 0.03 0
didn't 0.02 0 0
‖A‖ ≈ 0.5596   ‖B‖ ≈ 0.5395   Σ Ai*Bi ≈ 0.2502
similarity(A, B)
with raw counts: 3 / ( √5 * √6 ) ≈ 0.5477
with TF-IDF: 0.2502 / (0.5596 * 0.5395) ≈ 0.8287
Issue:
Many similar words are treated differently.
Stemming and Lemmatization
Reduce words to their "root" or "lemma"
Stemming: find/replace word endings
Lemmatization: look up the "dictionary form" of a word
Lemmatization requires part-of-speech tagging (see the sketch after the table below).
Original Stemmed Lemmatized
running runn (-ing) run
ran ran run
is is be
was wa (-s) be
studies studi (-es) study
studying study (-ing) study
better bett (-er) good
betting bett (-ing) bet
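A sketch of both techniques with NLTK; actual outputs can differ slightly from the illustrative table above (for example, the Porter stemmer reduces "running" all the way to "run"):

```python
# Stemming vs. lemmatization with NLTK; outputs can differ slightly from the
# illustrative table above (e.g. Porter stems "running" all the way to "run").
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # data needed by the lemmatizer
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The lemmatizer needs a part-of-speech hint: "v" = verb, "a" = adjective.
for word, pos in [("running", "v"), ("studies", "v"), ("was", "v"), ("better", "a")]:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos=pos))
```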
Part-of-Speech Tagging
Assign a "tag" to each word, such as:
● noun
● verb
● article
● adjective
● preposition
● pronoun
● adverb
● conjunction
● interjection
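A sketch of POS tagging with NLTK's default tagger (resource names can vary slightly between NLTK versions):

```python
# Part-of-speech tagging with NLTK's default tagger.
# (Resource names can vary slightly between NLTK versions.)
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I liked the training a lot")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('liked', 'VBD'), ('the', 'DT'),
#       ('training', 'NN'), ('a', 'DT'), ('lot', 'NN')]
```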
Text A B Ai*Bi
I 0.01 0.01 0.0001
like 0.25 0 0
liked 0 0.2 0
the 0.01 0.01 0.0001
training 0.5 0.5 0.25
a 0 0.002 0
lot 0 0.03 0
didn't 0.02 0 0
‖A‖ ≈ 0.5596   ‖B‖ ≈ 0.5395   Σ Ai*Bi ≈ 0.2502
similarity(A, B)
with raw counts: 3 / ( √5 * √6 ) ≈ 0.5477
with TF-IDF: 0.2502 / (0.5596 * 0.5395) ≈ 0.8287
Issue:
Many similar words are treated differently.
Text A B Ai*Bi
I 0.01 0.01 0.0001
like 0.2 0.2 0.04
the 0.01 0.01 0.0001
training 0.5 0.5 0.25
a 0 0.002 0
lot 0 0.03 0
didn't 0.02 0 0
‖A‖ ≈ 0.5391   ‖B‖ ≈ 0.5395   Σ Ai*Bi ≈ 0.2902
similarity(A, B)
= 0.2902 / (0.5391 * 0.5395) ≈ 0.998
Issue:
Many similar words are treated differently.
Solved!
Text A B Ai*Bi
I 0.01 0.01 0.0001
like 0.2 0.2 0.04
the 0.01 0.01 0.0001
training 0.5 0.5 0.25
a 0 0.002 0
lot 0 0.03 0
didn't 0.02 0 0
‖A‖ ≈ 0.5391   ‖B‖ ≈ 0.5395   Σ Ai*Bi ≈ 0.2902
similarity(A, B)
= 0.2902 / (0.5391 * 0.5395) ≈ 0.998
Issue:
This measure doesn't incorporate semantics
Beyond Bag of Words
We would like to come up with a vector representation that captures meaning
Other desired benefits:
● Smaller vectors
● Dense
● Reusable
Super Basics of Neural Networks
Given input data, target outputs
Learn parameters to minimize loss
Training consists of feedforward and backpropagation
Basic building block: perceptron
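A minimal sketch of a single perceptron-style unit trained by gradient descent on made-up data (all names and numbers here are illustrative, not from the talk):

```python
# A single sigmoid "neuron" trained by gradient descent on synthetic data:
# feedforward, a cross-entropy loss, and a manual backpropagation step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # 100 examples, 3 features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0) * 1.0    # made-up binary targets

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # feedforward: weighted sum + activation
    grad = (p - y) / len(y)                  # gradient of the loss w.r.t. the pre-activation
    w -= lr * (X.T @ grad)                   # update learnable weights
    b -= lr * grad.sum()

print("training accuracy:", ((p > 0.5) == y).mean())
```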
Super Basics of Neural Networks
Stack perceptrons to make network
Arrows indicate learnable weights
Circles sum all inputs and apply activation
functions
Super Basics of Neural Networks
Stack perceptrons to make network
Arrows indicate learnable weights
Circles sum all inputs and apply activation
functions
People often really simplify these
diagrams
(Diagram labels: Inputs, Hidden Layers, Outputs)
Word2Vec
Create a neural network to learn useful word
representations
Words are "known by the company they keep"
Neural network learns to predict words by
their co-occurrences
Sampling: Sliding Window
Record "center word" (in blue) and
"context words"
Training
Lookup center word
Predict context words in order
Extract embeddings from the model's internal weights
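A training sketch with gensim's Word2Vec implementation (the tiny corpus and parameter values are illustrative only):

```python
# Word2Vec with gensim; the toy corpus and parameters are only illustrative.
from gensim.models import Word2Vec

sentences = [
    "i liked the training so much".split(),
    "i learned a lot from the training".split(),
    "the training was a waste of time".split(),
]

model = Word2Vec(
    sentences,
    vector_size=50,   # embedding dimension
    window=2,         # context words on each side of the center word
    min_count=1,      # keep rare words in this tiny corpus
    sg=1,             # 1 = skip-gram: predict context words from the center word
)

print(model.wv["training"][:5])            # first few dimensions of one embedding
print(model.wv.most_similar("training"))   # neighbors ranked by cosine similarity
```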
Embedding Visualized
Properties of embeddings
Additional Vector Properties
Improve performance
Preprocessing techniques such as POS tagging, lemmatization, and stopword removal all improve the
performance of Word2Vec embeddings.
Corpus size: Google trained on Google News (3 billion words)
Embedding size: Between 100- and 500-dimensional embeddings
Text A B Ai*Bi
I 0.01 0.01 0.0001
like 0.2 0.2 0.04
the 0.01 0.01 0.0001
training 0.5 0.5 0.25
a 0 0.002 0
lot 0 0.03 0
didn't 0.02 0 0
‖A‖ ≈ 0.5391   ‖B‖ ≈ 0.5395   Σ Ai*Bi ≈ 0.2902
similarity(A, B)
= 0.2902 / (0.5391 * 0.5395) ≈ 0.998
Issue:
This measure doesn't incorporate semantics
Cosine Similarity Embedding Example
I didn't like the training
I liked the training a lot
I 0.1 0
didn't 0.1 -1
like 0 0.2
the 0 0.1
train -0.1 0.1
a 0.1 0
lot -1 0
(Embedding table: each word maps to a 2-dimensional vector)
Document A: "I didn't like the training"  vs.  Document B: "I liked the training a lot"
A: I (0.1, 0), didn't (0.1, -1), like (0, 0.2), the (0, 0.1), train (-0.1, 0.1)
B: I (0.1, 0), like (0, 0.2), the (0, 0.1), train (-0.1, 0.1), a (0.1, 0), lot (-1, 0)
For short texts, average embeddings
to get document representation.
Document A: "I didn't like the training"  vs.  Document B: "I liked the training a lot"
A: I (0.1, 0), didn't (0.1, -1), like (0, 0.2), the (0, 0.1), train (-0.1, 0.1) → average: (0.02, -0.12)
B: I (0.1, 0), like (0, 0.2), the (0, 0.1), train (-0.1, 0.1), a (0.1, 0), lot (-1, 0) → average: (-0.15, 0.06)
For short texts, average embeddings
to get document representation.
Document A: "I didn't like the training"  vs.  Document B: "I liked the training a lot"
A: I (0.1, 0), didn't (0.1, -1), like (0, 0.2), the (0, 0.1), train (-0.1, 0.1) → average: (0.02, -0.12)
B: I (0.1, 0), like (0, 0.2), the (0, 0.1), train (-0.1, 0.1), a (0.1, 0), lot (-1, 0) → average: (-0.15, 0.06)
Dim Doc A Doc B Ai*Bi
1 0.02 -0.15 -0.003
2 -0.12 0.06 -0.007
‖A‖ ≈ 0.122   ‖B‖ ≈ 0.162   Σ Ai*Bi ≈ -0.01
Replace BOW columns with
embeddings
For short texts, average embeddings
to get document representation.
Document A: "I didn't like the training"  vs.  Document B: "I liked the training a lot"
A: I (0.1, 0), didn't (0.1, -1), like (0, 0.2), the (0, 0.1), train (-0.1, 0.1) → average: (0.02, -0.12)
B: I (0.1, 0), like (0, 0.2), the (0, 0.1), train (-0.1, 0.1), a (0.1, 0), lot (-1, 0) → average: (-0.15, 0.06)
Dim Doc A Doc B Ai*Bi
1 0.02 -0.15 -0.003
2 -0.12 0.06 -0.007
‖A‖ ≈ 0.122   ‖B‖ ≈ 0.162   Σ Ai*Bi ≈ -0.01
Replace BOW columns with
embeddings
=-0.01 / (0.122 * 0.162)
≈ −0.505
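The same toy calculation in numpy; note the slide rounds the averages before taking the cosine, which is why it reports ≈ -0.505 while unrounded values give roughly -0.55:

```python
# The toy averaged-embedding similarity, computed with numpy.
import numpy as np

emb = {  # toy 2-d embedding table from the slides
    "i": [0.1, 0], "didn't": [0.1, -1], "like": [0, 0.2],
    "the": [0, 0.1], "train": [-0.1, 0.1], "a": [0.1, 0], "lot": [-1, 0],
}

def doc_vector(tokens):
    return np.mean([emb[t] for t in tokens], axis=0)   # average the word vectors

doc_a = doc_vector(["i", "didn't", "like", "the", "train"])
doc_b = doc_vector(["i", "like", "the", "train", "a", "lot"])

cos = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(doc_a, doc_b, cos)   # ≈ [0.02, -0.12], [-0.15, 0.07], cosine ≈ -0.55
```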
Smaller Issues with Word2Vec
1) Word2Vec cannot handle out-of-vocabulary words
Bad solution: add an "Unknown" embedding
2) Large vocabularies require very large embedding tables
Bad solution: remove any word that only occurs once
Bad solution: make "meta" tokens, such as "[NUMBER]" or "[NAME]"
Big Issue
with Word2Vec
Embeddings are fixed after training
Homographs: same spelling, different word
BERT &
Transformer Models
First there was ELMo
Then, there was BERT
Now there's Grover, and ERNIE, and so many
Muppet names
Super Basics of Transformers
The model itself is well beyond the scope of this talk
Designed around machine translation
Super Basics of Transformers
Embeddings are learned weighted averages of other words in the same sentence
Super Basics of Transformers
Per-word weights are called "attentions" and are
interpretable
Big Issue
with Word2Vec
Embeddings are fixed after training
Homographs: same spelling, different word
Solved!
Smaller Issues with Word2Vec
1) Word2Vec cannot handle out-of-vocabulary words
Bad solution: add an "Unknown" embedding
2) Large vocabularies require very large embedding tables
Bad solution: remove any word that only occurs once
Bad solution: make "meta" tokens, such as "[NUMBER]" or "[NAME]"
Beyond "Words"
BERT uses the "WordPiece" tokenizer
Rather than split text on spaces, learn useful character sequences
Helps with out-of-vocabulary words
Provides fixed vocab size
Input: "I saw a girl with a telescope."
Output: [I][▁saw][▁a][▁girl][▁with][▁a][▁][te][le][s][c][o][pe][.]
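A tokenization sketch with the Hugging Face transformers library; note that BERT's WordPiece marks continuation pieces with "##", and the exact splits depend on the learned vocabulary:

```python
# Subword tokenization with a pretrained BERT vocabulary (Hugging Face).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("I saw a girl with a telescope."))
# Common words stay whole; rare or unseen words are split into smaller
# pieces, with continuation pieces marked by a leading "##".
```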
What can we do with BERT?
Question Answering
What can we do with BERT?
Question Answering
Entity Extraction
What can we do with BERT?
Question Answering
Entity Extraction
Part of Speech Tagging
What can we do with BERT?
Question Answering
Entity Extraction
Part of Speech Tagging
Sentiment Analysis / Regression
I liked the training so much
that I decided to try out what I
learned on a personal project.
(Outputs: Sentiment and Activation scores)
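A sketch of sentiment scoring with a pretrained transformer via the Hugging Face pipeline API (the default model and its scores are illustrative, not the model used in the talk):

```python
# Sentiment scoring with a pretrained transformer via the pipeline API.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default fine-tuned model

text = ("I liked the training so much that I decided to try out "
        "what I learned on a personal project.")
print(classifier(text))   # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```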
Generating Text
Language model
Given tokens t_0 ... t_(i-1), return the probability distribution of the next token t_i
"If you're happy and you know it"
clap 0.7
your 0.2
head 0.01
hands 0.09
... ...
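A sketch of reading the next-token distribution out of GPT-2 with Hugging Face transformers (the probabilities will differ from the illustrative table above):

```python
# Reading the next-token distribution out of GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("If you're happy and you know it", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, sequence_length, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token
top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(repr(tokenizer.decode(int(idx))), float(p))
```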
GPT-2
Changes the attention mechanism of BERT
GPT-2 as a Language Model
Masked attention allows us to predict every word
given all PREVIOUS words.
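A generation sketch with GPT-2's sampling interface (sampled continuations vary from run to run):

```python
# Sampling a continuation from GPT-2.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "While not normally known for his musical talent,"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,                        # sample from the predicted distributions
    top_p=0.9,                             # nucleus sampling
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token by default
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```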
Talk to Transformer: Play with generation!
https://talktotransformer.com/
While not normally known for his musical
talent, Elon Musk is releasing a debut album.
The "Elon Musk" is a collection of eight new
songs which are inspired by the founder's life.
The music, which is available for pre-order on
iTunes, was created by one-man-band and
fellow Tesla Motors and SpaceX executive,
Paul Kasmin, who's known for playing guitar at
Tesla events. The album is a collaboration
between Kasmin and Musk himself, although
it's also being marketed under the Tesla brand.
Summary
● BOW
○ Stopwords
○ TFIDF
○ Stemming & Lemmatization
○ Part-of-speech tagging
○ Cosine Similarity
● Word2Vec
○ CBOW + SkipGram
○ Learn words by company they keep
○ Embeddings capture semantic properties
● BERT
○ Change word embeddings based on company
○ Uses smaller wordpiece vocab
