An Embedding is Worth 1000 Words
Start using Word Embeddings in your Business!
Jaya Zenchenko
March 15, 2018
Women in Analytics
About Me
- BA Math
- MS Math Education
- MS Applied Math
- Research Scientist
- DoD: Images, Video, Graphs
- Data Scientist
- NLP, Unsupervised Learning
- Data Science Manager
- Images, Text
- Founder of “Women in Data Science - Austin” Meetup
@datanerd_jaya bijaya.zenchenko@gmail.com
Credit: http://briasimpson.com/wp-content/uploads/2013/04/Swamp-Overwhelm.jpg
Pop Quiz!
- What is an embedding and why do I care?
- Is an embedding really worth 1000 words?
- What does it mean for words to be related?
- What are some gotchas in dealing with text data?
- What approaches can I use to get insights from my text data?
Word Embeddings
- Coined in 2003
- Also called “Word Vectors”
- Text being converted into numbers
- Not just for words! Sentences, documents, etc.
- Not magic!
Why am I here?
- Word Embeddings are cool!
- No magic
- Highlight built-in functionality of the “gensim” and “scikit-learn” Python
packages to quickly tackle your next text-based project
What Can We Do?
- Get insights
- Search/Retrieval
- Clustering (grouping)
- Identify Topics (i.e. themes)
- Apply techniques to other data sets (images, click through, etc)!!
Definitions
Vector/Embedding
A vector (or embedding) is just an ordered list of numbers:
2-dimensional vector = [2, 3]
3-dimensional vector = [2, 3, 1]
N-dimensional vector = [2, 3, 1, 5, …, nth value]
Credit: https://maths-wiki.wikispaces.com/Vectors
Clustering
Grouping points that are close - what is “close”?
Credit: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html
Text Based Similarity
Credit: https://alexn.org/blog/2012/01/16/cosine-similarity-euclidean-distance.html
Topics
Themes found in our data, e.g. “restaurants”, “amenities”, “activities”
Credit: https://nlpforhackers.io/topic-modeling/
Brief History of Word Embeddings
- 17th century - philosophers such as Leibniz and Descartes put forward
proposals for codes to relate words in different languages
- 1920s - patent filed by Emanuel Goldberg for a “Statistical Machine” that
searched for documents stored on film
Credit: http://museen-dresden.de/index.php?lang=de&node=termine&resartium=events&tempus=week&locus=technischesammlungen&event=2680
Resource: https://en.wikipedia.org/wiki/History_of_natural_language_processing
Brief History of Word Embeddings
- 1945 - Vannevar Bush - Inspired by Goldberg
- Desire for collective “memory” machine to make knowledge accessible
Credit: https://en.wikipedia.org/wiki/Vannevar_Bush
Themes In History - NLP
- Machine Translation
- Information Retrieval
- Language Understanding
“You shall know a word by the company it keeps” — J. R. Firth (1957)
Distributional hypothesis
“You shall know a word by the company it keeps” — J. R. Firth (1957)
Words that occur in similar contexts tend to have similar meanings
(Harris, 1954)
Similar Meaning
- Words that occur in similar contexts tend to have similar meanings
Types of “similarity”:
- Semantic Relatedness - any relation between words - e.g. Car/Road, Bee/Honey
- Semantic Similarity - words used in the same way - e.g. Car/Auto, Doctor/Nurse
Many more in Computational Linguistics!
Resource: https://arxiv.org/pdf/1003.1141.pdf
Context
- Words that occur in similar contexts tend to have similar meanings
- What is context?
- A word?
- A sentence?
- A whole document?
- A group (or window) of words?
- ...
Similar
- Words that occur in similar contexts tend to have similar meanings
- What does ‘similar context’ mean mathematically?
- Count words that appear together in context?
- Should we count words within the context?
- Count how far apart they are?
- Weight them?
- …
- Thinking about “context”, “similar”, and “meaning” in so many ways leads to
the evolution of different word embeddings
- Words that occur in similar contexts tend to have similar meanings
- What is a word??
- Printed words are letters surrounded by white space and/or punctuation
(, . ? !)
- Words are combined to form sentences that follow language rules
- What is a sentence??
EASY!
Separate the text by ‘.’, ‘!’, ‘?’ (see the sketch below)
Dr. Ford did not ask Col. Mustard the name of Mr. Smith’s dog.
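A minimal sketch (not in the original slides) of why the “easy” rule fails on exactly this sentence:

```python
import re

text = "Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog."

# Naive rule: split on sentence-ending punctuation.
fragments = re.split(r"[.!?]", text)
print(fragments)
# ['Dr', ' Ford did not ask Col', ' Mustard the name of Mr', " Smith's dog", '']
# One sentence becomes four fragments - abbreviations like "Dr." and "Col."
# look like sentence boundaries. A trained sentence tokenizer (e.g. NLTK's
# sent_tokenize) handles these cases.
```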
Data Preprocessing - Clean and Tokenize
- Very important - your results may change drastically
- Tokenize - to split text into “tokens”
- Required for gensim
- To keep or not to keep (see the sketch after this list):
- Numbers
- Punctuation
- Stop words (Common words - no universal list)
- Sparse words
- HTML tags
- ...
- Other languages may tokenize differently!
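A minimal sketch of those keep-or-drop decisions using gensim’s parsing.preprocessing filters (highlighted later in the deck); the sample review text is made up:

```python
from gensim.parsing.preprocessing import (
    preprocess_string, strip_tags, strip_punctuation,
    strip_numeric, remove_stopwords, stem_text,
)

doc = "<b>We stayed 3 nights</b> and the pool was amazing!!"

# Each filter is a choice: keep or drop HTML tags, punctuation,
# numbers, stop words; stem or not.
filters = [lambda s: s.lower(), strip_tags, strip_punctuation,
           strip_numeric, remove_stopwords, stem_text]
print(preprocess_string(doc, filters=filters))
# roughly: ['stai', 'night', 'pool', 'amaz']
```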
- “i made her duck”
Credit: https://emojipedia.org/duck/
https://design.tutsplus.com/tutorials/how-to-animate-a-character-throwing-a-ball--cms-26207
Named Entity Extraction - Annotate Example
- “i love my apple”
Credit: https://www.dabur.com/realfruitpower/fruit-juices/apple-fruit-history
https://www.apple.com/shop/product/MK0C2AM/A/apple-pencil-for-ipad-pro
Data Preprocessing - Annotate
- Annotate
- Part of speech (Noun, Verb, etc)
- Named entity recognition (Organization, Money, Time, Locations, etc.)
- Choose to append, keep or ignore (see the sketch below)
More at https://spacy.io/usage/linguistic-features#section-named-entities
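A minimal spaCy sketch of both annotation types; it assumes the small English model has been downloaded (`python -m spacy download en_core_web_sm`). The example sentence is from spaCy’s docs:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.pos_)   # part-of-speech tags (NOUN, VERB, ...)

for ent in doc.ents:
    print(ent.text, ent.label_)     # e.g. Apple/ORG, U.K./GPE, $1 billion/MONEY
```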
Data Preprocessing - Reduce Words
- Stemming
- Reduce word to “word stem”
- Crude heuristics to chop off word endings
- Many approaches - Porter Stemming in Gensim
- Fast!
- Running -> run
- Lemmatize
- Properly reduce the word based on part of speech annotation
- Available in gensim
- Slow
- Better -> good
- Both differ based on language!
Resource: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
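A minimal sketch of both reductions. gensim ships the Porter stemmer; the lemmatizer gensim used to bundle was removed in gensim 4, so this sketch substitutes NLTK’s WordNet lemmatizer:

```python
from gensim.parsing.porter import PorterStemmer
import nltk
from nltk.stem import WordNetLemmatizer

print(PorterStemmer().stem("running"))          # 'run' - fast, crude chopping

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()
# Lemmatizing needs the part of speech to reduce properly.
print(lemmatizer.lemmatize("better", pos="a"))  # 'good' (adjective)
print(lemmatizer.lemmatize("running", pos="v")) # 'run'  (verb)
```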
Preprocessing Example
- Remove stop words and stem
Credit: https://www.slideshare.net/mgrcar/text-and-text-stream-mining-tutorial-15137759
Embeddings
- Word
- Context (sentence, document, etc)
One-Hot-Encoding
- Convert words to vector - available in scikit-learn
quick 1 0 0 0 0
brown 0 1 0 0 0
dog 0 0 1 0 0
jump 0 0 0 1 0
lazy 0 0 0 0 1
Boolean Embedding
- Convert document to vector - 1 if the word exists, 0 otherwise
- Available in scikit-learn
- Example document vector: [1, 1, 1, 1, 1]
Credit: https://www.slideshare.net/mgrcar/text-and-text-stream-mining-tutorial-15137759
Bag of Words
- “Term Frequency” - count the frequency of the word in a document
- “CountVectorizer” in scikit-learn
Document Term Matrix
- Term frequency for multiple documents
- “CountVectorizer” in scikit-learn (see the sketch below)
Credit: http://ryanheuser.org/word-vectors-2/
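A minimal CountVectorizer sketch covering the last few slides: the default counts term frequency (bag of words), `binary=True` gives the boolean embedding, and stacking several documents gives the document-term matrix. The toy documents are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox", "the lazy dog", "the quick dog jumps"]

vectorizer = CountVectorizer()          # binary=True for the boolean embedding
dtm = vectorizer.fit_transform(docs)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(dtm.toarray())   # one row per document, one column per word
```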
Weighting
- Why do we want to weight words?
- “The”, “and”, …
- Term Frequency Inverse Document Frequency (TFIDF)
- Reduce the weight of very common words that appear in many of the documents
- Applies to the Document-Term Matrix (see the sketch below)
Credit: http://trimc-nlp.blogspot.com/2013/04/tfidf-with-google-n-grams-and-pos-tags.html
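A minimal TFIDF sketch with scikit-learn’s TfidfVectorizer (made-up documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the pool was great",
        "the pool was closed",
        "great location and great staff"]

tfidf = TfidfVectorizer().fit_transform(docs)

# Words that appear in many documents ("the", "pool") get down-weighted;
# words concentrated in one document ("location", "staff") keep high weight.
print(tfidf.toarray().round(2))
```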
Word Co-Occurrence Matrix
Word-Word Matrix for a given Context Window (# words to consider on each
side).
Word Co-Occurrence Matrix
Context Window Size = 1
Credit: https://towardsdatascience.com/word-to-vectors-natural-language-processing-b253dd0b0817
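A minimal sketch (pure Python, made-up sentence) of building the co-occurrence counts for a context window of 1:

```python
from collections import Counter

tokens = "the quick brown fox jumps over the lazy dog".split()
window = 1   # number of words to consider on each side

cooc = Counter()
for i, word in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[(word, tokens[j])] += 1

print(cooc[("quick", "brown")])   # 1
print(cooc[("the", "lazy")])      # 1
```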
Weighting
- Why do we want to weight word pairs?
- “New” “York” vs “in” “the”
- Pointwise Mutual Information (PMI)
- Higher weight for word pairs that are mutually common but individually infrequent
Pointwise Mutual Information (PMI)
Credit: https://en.wikipedia.org/wiki/Pointwise_mutual_information
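The definition is PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). A minimal sketch on a made-up corpus, using adjacent-word pairs (window of 1) as the context:

```python
import math
from collections import Counter

tokens = "new york is a big city and new york never sleeps".split()

word_counts = Counter(tokens)
pair_counts = Counter(zip(tokens, tokens[1:]))   # adjacent pairs only

n_words, n_pairs = sum(word_counts.values()), sum(pair_counts.values())

def pmi(x, y):
    p_xy = pair_counts[(x, y)] / n_pairs
    p_x, p_y = word_counts[x] / n_words, word_counts[y] / n_words
    return math.log2(p_xy / (p_x * p_y))

print(round(pmi("new", "york"), 2))  # high: the pair co-occurs far more than chance
```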
Topic Embedding - Latent Spaces
One approach: Latent Semantic Analysis (LSA) - Singular Value Decomposition (SVD) on the
Document-Term Matrix or TFIDF Weighted Matrix
Credit: https://tex.stackexchange.com/questions/258811/diagram-for-svd
(diagram: SVD factors the matrix into lower-dimensional components - the reduced dimensions are the topics)
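A minimal LSA sketch: TFIDF followed by truncated SVD, via scikit-learn (made-up reviews; 2 topics chosen arbitrarily):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the pool and hot tub were great",
        "kids loved the pool and the hot tub",
        "walking distance to restaurants and bars",
        "so many restaurants nearby"]

tfidf = TfidfVectorizer().fit_transform(docs)

lsa = TruncatedSVD(n_components=2, random_state=0)   # SVD on the TFIDF matrix
doc_topics = lsa.fit_transform(tfidf)
print(doc_topics.round(2))   # each document embedded in a 2-dimensional topic space
```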
Topic Embeddings
- LSA/LSI - Latent Semantic Analysis or Indexing
- Used for Search and Retrieval
- Can only capture linear relationships
- Use Non-Negative Matrix Factorization for “understandable” topics
- LDA (Latent Dirichlet Allocation) - see the sketch below
- Can capture non-linear relationships
- Guided LDA (Semi-Supervised LDA)
- Seed the topics with a few words!
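A minimal LDA sketch with gensim (toy tokenized reviews; GuidedLDA seeding is a separate package and not shown):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["pool", "hot", "tub", "kids"],
         ["pool", "swim", "kids", "fun"],
         ["restaurant", "dinner", "walk", "downtown"],
         ["restaurant", "bar", "downtown"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)   # top words per topic - still labeled by hand
```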
Prediction Based Embeddings
- Frequency Based -> Prediction Based
- Neural architecture to train
- Predict words based on other words (or characters!)
Neural Word Embedding - Word2Vec
vec(“king”) - vec(“man”) + vec(“woman”) = vec(“queen”)
Trained on Google News Data - 3 million words and phrases.
Credit: https://www.slideshare.net/dnldimitri/student-lecture-for-master-course-on-d-dimitrihermans
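A minimal sketch of the analogy using gensim’s downloader (the Google News vectors are a ~1.6 GB download the first time):

```python
import gensim.downloader as api

model = api.load("word2vec-google-news-300")   # pretrained Google News vectors

result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # [('queen', ...)]
```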
Popular Neural Word Embeddings by Mikolov
- Word2Vec (2013) - Better at semantic similarity
- brother : sister
- fastText - Better at syntactic similarity due to character n-grams
- great : greater
- Similar architecture for both - one trained on words, the other on
character n-grams
Resources: https://en.wikipedia.org/wiki/Word2vec
Pretrained Embeddings
- Quickly leverage pretrained embeddings trained on different data sets
(Google News, Wikipedia, etc.)
- Many available in baseline gensim package
- To gain deeper domain specific insights - train your own model !
- Additional models available to download - different languages, etc
https://github.com/Hironsan/awesome-embedding-models
Popular Neural Word Embedding
- GloVe (2014) - by Pennington, Socher, Manning
- Closest to a variant of word co-occurrence matrix
- Pre-trained model available in gensim
- Comparison of Word2Vec and GloVe by Radim Řehůřek (gensim author)
Resource: https://nlp.stanford.edu/projects/glove/
Breaking News!
- Omer Levy shows Word2Vec is basically SVD on a Pointwise Mutual
Information (PMI) weighted word co-occurrence matrix!
- Levy & Goldberg, NIPS 2014 - “Neural Word Embeddings as Implicit Matrix
Factorization”
- Chris Moody @ Stitch Fix - Oct 2017 - “Stop Using Word2Vec”
Credit: http://www.keepcalmandposters.com/poster/5718140_keep_calm_and_show_me_the_data
Why Word Embeddings?
- Too much text data!
- I want to know ALL THE THINGS!
Credit: https://blog.beeminder.com/allthethings/
Data & Packages
Credit: https://disneyworld.disney.go.com/entertainment/magic-kingdom/move-it-shake-it-dance-play-it
https://pragmaticarchitect.files.wordpress.com/2013/06/mabi87.png
Example of Data Preprocessing
Search Example
Code to Create Embeddings
Models Created
- Word2Vec - window size = 3, window size = 10
- fastText - window size = 3, window size = 10 (see the sketch below)
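A minimal sketch of creating those four models (gensim 4 parameter names; `sentences` is a toy stand-in for the tokenized reviews):

```python
from gensim.models import Word2Vec, FastText

# Toy stand-in for the tokenized review corpus.
sentences = [["pool", "was", "great"], ["kids", "loved", "the", "pool"]]

w2v_w3 = Word2Vec(sentences, vector_size=100, window=3, min_count=1)
w2v_w10 = Word2Vec(sentences, vector_size=100, window=10, min_count=1)
ft_w3 = FastText(sentences, vector_size=100, window=3, min_count=1)
ft_w10 = FastText(sentences, vector_size=100, window=10, min_count=1)

print(w2v_w3.wv.most_similar("pool", topn=3))   # meaningless on a toy corpus
```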
fastText FTW
* “most_similar” returns words ordered by similarity score
Credit: https://medium.com/data-science-group-iitr/word-embedding-2d05d270b285
Where Can I Go?
Find me some grub!
Clustering
- Clustered the word embeddings using KMeans clustering
- Word size in the wordclouds reflects each word’s TFIDF weight
- Results from Word2Vec - Window Size = 3 (see the sketch below)
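A minimal sketch of the clustering step (toy corpus and cluster count; the talk’s wordclouds additionally size words by TFIDF weight):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [["pool", "hot", "tub"], ["kids", "pool", "fun"],
             ["restaurant", "bar", "downtown"], ["walk", "downtown", "restaurant"]]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=0)

words = model.wv.index_to_key                      # vocabulary
vectors = np.array([model.wv[w] for w in words])   # one embedding per word

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for label in range(2):
    print(label, [w for w, l in zip(words, kmeans.labels_) if l == label])
```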
What do people love?
What do people hate?
How to stock the house?
What activities?
Who do people travel with?
When do they travel?
Window Size Matters!
Credit: https://machinelearningmastery.com/what-are-word-embeddings/
Surprise me!
Dirty Data!
Pop Quiz!
- What does it mean for words to be related?
- Many things - semantic, syntactic, similar, related, etc
- What is an embedding and why do I care?
- Way to convert text data into numbers so we can use computers to do the work
- Is an embedding really worth 1000 words?
- It can be worth 10,000 words, depending on how big the data set is!
- What are some gotchas in dealing with text data?
- Preprocessing, window size, and other hyperparameters in using Word2Vec or FastText
- What approaches can I use to find insights in my data?
- Latent Semantic Indexing, similar words, clustering, Word2Vec, fastText
Thank you!
Questions?
Appendix
Embedding Pros and Cons

- Bag-of-Words
  - Pros: simple, fast
  - Cons: sparse, high dimension; does not capture position in text; does not capture semantics
- TFIDF
  - Pros: easy to compute; easy to compare similarity between 2 documents
  - Cons: sparse, high dimension; does not capture position in text; does not capture semantics
- Topic Space
  - Pros: lower dimension; captures semantics; handles synonyms in documents; used for search/retrieval
  - Cons: number of topics needs to be defined; could be slow in high dimension; topics need to be “hand labeled”
- Word2Vec
  - Pros: can leverage pretrained models; understands relationships between words; better for analogies
  - Cons: cannot handle unseen words; going from word vectors to sentence vectors is still active research
- fastText
  - Pros: character based; can deal with unseen words; can leverage pretrained models
  - Cons: longer to train than Word2Vec; going from word vectors to sentence vectors is still active research
Resources..
- https://www.kdnuggets.com/2017/12/general-approach-preprocessing-text-data.html
- https://web.stanford.edu/class/cs124/
- https://datascience.stackexchange.com/questions/11402/preprocessing-text-before-use-rnn/11421
- https://nlp.stanford.edu/IR-book/pdf/12lmodel.pdf
- http://ruder.io/word-embeddings-1/index.html
- https://www.gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/
- https://www.ischool.utexas.edu/~ssoy/organizing/l391d2c.htm
- https://en.wikipedia.org/wiki/Information_retrieval
- https://en.wikipedia.org/wiki/As_We_May_Think
- https://en.wikipedia.org/wiki/Latent_semantic_analysis
- https://pdfs.semanticscholar.org/5b5c/a878c534aee3882a038ef9e82f46e102131b.pdf - “A survey of text similarity”
- http://www.jair.org/media/2934/live-2934-4846-jair.pdf
More Resources...
- https://cs224d.stanford.edu/lecture_notes/notes1.pdf
- https://www.quora.com/What-are-the-advantages-and-disadvantages-of-TF-IDF
- https://simonpaarlberg.com/post/latent-semantic-analyses/
- https://www.quora.com/What-are-the-advantages-and-disadvantages-of-Latent-Semantic-Analysis
- http://elliottash.com/wp-content/uploads/2017/07/Text-class-05-word-embeddings-1.pdf
- http://ruder.io/secret-word2vec/index.html#addingcontextvectors
- https://www.linkedin.com/pulse/what-main-difference-between-word2vec-fasttext-federico-cesconi/
- https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/
Highlight of Gensim Functions
- Gensim:
- parsing.preprocessing
- models.tfidfmodel
- models.lsimodel
- models.word2vec
- models.fasttext
- models.keyedvectors
- Descriptions at: https://radimrehurek.com/gensim/apiref.html
- Tutorials at: https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
Highlight of Other Python Functions
- Sklearn:
- TfidfVectorizer
- KMeans Clustering
- Incredible documentation overall : http://scikit-learn.org/stable/index.html
- Wordcloud:
- https://github.com/amueller/word_cloud/tree/master/wordcloud
Open Source Tool for Text Search
- Really Fast!!
- Built on Lucene - Apache Project - almost 20 years old and still evolving
- Lucene - set the standard for search and indexing
Example Code - Data Preprocessing
Example Code - TFIDF and Topic Embedding (LSA)
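The original slide was a code screenshot that didn’t survive text extraction; a minimal stand-in using the gensim modules highlighted above (toy tokenized documents):

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LsiModel

texts = [["pool", "hot", "tub"], ["pool", "kids", "fun"],
         ["restaurant", "downtown", "walk"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

tfidf = TfidfModel(corpus)                                        # models.tfidfmodel
lsi = LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)   # models.lsimodel

print(lsi.print_topics())
print(lsi[tfidf[corpus[0]]])   # the first document in topic space
```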
Semantic vs Syntactic
(figure: examples of semantic vs syntactic word relationships)
Credit: https://arxiv.org/pdf/1301.3781v3.pdf
Fun Reviews - Amazon