[course site]
Day 2 Lecture 4
Word Embeddings
Word2Vec
Antonio Bonafonte
(Figure slide: Christopher Olah, Visualizing Representations)
Representation of categorical features
A categorical feature is a variable that can take a limited number of possible values, e.g. gender, blood types, countries, ..., letters, words, phonemes.
Newsgroup task:
Input: 1000 words (from a fixed vocabulary V, of size |V|)
Phonetic transcription (CMU DICT):
Input: letters (a b c ... z ' - .) (30 symbols)
One-hot (one-of-n) encoding
Example: letters, |V| = 30
'a': xᵀ = [1, 0, 0, ..., 0]
'b': xᵀ = [0, 1, 0, ..., 0]
'c': xᵀ = [0, 0, 1, ..., 0]
⋮
'.': xᵀ = [0, 0, 0, ..., 1]
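A minimal sketch of this encoding in NumPy (the exact 30-symbol alphabet is an assumption; here a-z plus apostrophe, hyphen, period and space):

import numpy as np

vocab = list("abcdefghijklmnopqrstuvwxyz") + ["'", "-", ".", " "]  # |V| = 30 (assumed set)

def one_hot(symbol, vocab):
    # return a |V|-dimensional vector with a single 1 at the symbol's index
    x = np.zeros(len(vocab))
    x[vocab.index(symbol)] = 1.0
    return x

print(one_hot("a", vocab))   # [1. 0. 0. ... 0.]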
One-hot (one-of-n) encoding
Example: words.
cat: xᵀ = [1, 0, 0, ..., 0]
dog: xᵀ = [0, 1, 0, ..., 0]
⋮
mamaguy: xᵀ = [0, 0, 0, ..., 0, 1, 0, ..., 0]
⋮
How many words, |V|?
● B2: 5K
● C2: 18K
● LVCSR (large-vocabulary speech recognition): 50-100K
● Wikipedia (1.6B tokens): 400K
● Crawl data (42B tokens): 2M
One-hot encoding of words: limitations
● Large dimensionality
● Sparse representation (mostly zeros)
● Blind representation
○ The only available operators are '==' and '!=': no notion of similarity between words
Word embeddings
● Represent words using vectors of dimension d (~100-500)
● Meaningful (semantic, syntactic) distances
● A dominant research topic of recent years at NLP conferences (e.g. EMNLP)
● Good embeddings are useful for many other tasks
(Figure slides: GloVe (Stanford) example visualizations)
Evaluation of the representation
Word analogy: a is to b as c is to ....
Find d such that wd is closest to wb − wa + wc
● Athens is to Greece as Berlin is to ....
● Dance is to dancing as fly is to ....
Word similarity: closest word to ...
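A sketch of the analogy test with cosine similarity, assuming emb is a dict mapping words to trained vectors (the function name and dict layout are illustrative):

import numpy as np

def analogy(a, b, c, emb):
    # a is to b as c is to ?  Return the word closest to w_b - w_a + w_c
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best, best_score = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue                       # exclude the query words
        score = (vec @ target) / np.linalg.norm(vec)
        if score > best_score:
            best, best_score = word, score
    return best

# e.g. analogy("athens", "greece", "berlin", emb) should return "germany"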
How to define word representations?
"You shall know a word by the company it keeps." (Firth, J. R., 1957)
Some relevant examples:
Latent semantic analysis (LSA):
● Define the co-occurrence matrix of words wi in documents dj
● Apply SVD to reduce dimensionality (see the sketch below)
GloVe (Global Vectors):
● Start from the co-occurrence counts of word wj in the context of word wi
● Fit a log-bilinear regression model of the embeddings
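To make the LSA step concrete, a toy sketch of the SVD dimensionality reduction (the count matrix here is random stand-in data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(8, 5)).astype(float)   # toy word-document counts: 8 words x 5 docs

U, S, Vt = np.linalg.svd(X, full_matrices=False)  # truncated SVD
d = 3
word_vectors = U[:, :d] * S[:d]   # one d-dimensional embedding per word (row)
print(word_vectors.shape)         # (8, 3)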
Using a NN to define embeddings
One-hot encoding + fully connected layer
→ embedding (projection) layer
Example: a language model (predict the next word given the previous words) produces word embeddings as a by-product (Bengio 2003)
(Figure from the TensorFlow word2vec tutorial: https://www.tensorflow.org/tutorials/word2vec/)
Toy example: predict next word
Corpus:
the dog saw a cat
the dog chased a cat
the cat climbed a tree
|V| = 8
One-hot encoding:
a: [1,0,0,0,0,0,0,0]
cat: [0,1,0,0,0,0,0,0]
chased: [0,0,1,0,0,0,0,0]
climbed: [0,0,0,1,0,0,0,0]
dog: [0,0,0,0,1,0,0,0]
saw: [0,0,0,0,0,1,0,0]
the: [0,0,0,0,0,0,1,0]
tree: [0,0,0,0,0,0,0,1]
Toy example: predict next word
Architecture:
Input layer: h1(x) = WI · x
Hidden layer: h2(x) = g(WH · h1(x))
Output layer: z(x) = o(WO · h2(x))
Training sample: cat → climbed
% Octave/Matlab: one forward pass of the toy network
x = zeros(8,1); y = zeros(8,1);   % one-hot input and target
x(2) = 1; y(4) = 1;               % x = 'cat', y = 'climbed'
WI = rand(3,8) - 0.5;             % projection (embedding) weights
WO = rand(8,3) - 0.5;             % output weights
WH = rand(3,3) - 0.5;             % hidden weights
h1 = WI * x;                      % projection: selects column 2 of WI
a2 = WH * h1;
h2 = tanh(a2);                    % hidden activation
a3 = WO * h2;
z3 = exp(a3);
z3 = z3 / sum(z3);                % softmax over the 8 words
Input layer: h1(x) = WI · x (projection layer)
Note that a non-linear activation function in the projection layer is irrelevant: with a one-hot input, the product WI · x simply copies one column of WI.
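This is easy to check numerically; a small sketch with the toy dimensions:

import numpy as np

V, d = 8, 3
WI = np.random.rand(d, V) - 0.5   # projection weights
x = np.zeros(V)
x[1] = 1.0                        # one-hot for 'cat'

# multiplying by a one-hot vector selects one column of WI
assert np.allclose(WI @ x, WI[:, 1])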
Hidden layer: h2(x) = g(WH · h1(x))
Softmax layer: z(x) = o(WO · h2(x))
Computational complexity
Example:
● Training data: 1B words
● Vocabulary size: 100K
● Context: 3 previous words
● Embedding dimension: 100
● Hidden layer: 300 units
Cost per training sample:
● Projection layer: ~free (copy one row per word)
● Hidden layer: 300 · 300 products, 300 tanh(·)
● Softmax layer: 100 · 100K products, 100K exp(·)
● Total: 90K + 10M products!!
The softmax is the network's main bottleneck.
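A two-line check of the slide's per-sample counts (the 100 for the softmax input follows the slide's own figures):

hidden = 300 * 300        # 90_000 products (90K)
softmax = 100 * 100_000   # 10_000_000 products (10M)
print(hidden, softmax)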
Word embeddings: requirements
We can get embeddings implicitly from any task that involves words. However, good generic embeddings are useful for other tasks that may have much less data (transfer learning). Sometimes the embeddings are then fine-tuned to the final task.
Word embeddings: requirements
How to get good embeddings:
● Very large lexicon
● Huge amount of training data
● Unsupervised (or with trivial labels)
● Computationally efficient
Word2Vec [Mikolov 2013]
An architecture designed specifically for producing embeddings, learnt from huge amounts of data.
Simplify the architecture: remove the hidden layer.
Simplify the cost: avoid the full softmax.
Two variants:
● CBOW (continuous bag of words)
● Skip-gram
CBOW: Continuous Bag of Words
the cat climbed a tree
Given the context: a, cat, the, tree
estimate the probability of: climbed
Skip-gram
the cat climbed a tree
Given the word: climbed
estimate the probability of the context words: a, cat, the, tree
(The context length is chosen at random for each word, up to a maximum of 10 left + 10 right.)
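A sketch of how skip-gram training pairs are generated (CBOW groups the same pairs the other way round, context → center; max_window = 10 matches the slide):

import random

def training_pairs(tokens, max_window=10):
    # for each center word, draw a window size in [1, max_window],
    # then emit one (center, context) pair per neighbour in the window
    for i, center in enumerate(tokens):
        w = random.randint(1, max_window)
        lo, hi = max(0, i - w), min(len(tokens), i + w + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

pairs = list(training_pairs("the cat climbed a tree".split(), max_window=2))
# e.g. [('the', 'cat'), ('cat', 'the'), ('cat', 'climbed'), ...]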
Reduce cost: subsampling
The most frequent words (is, the) can appear hundreds of millions of times, and their co-occurrences are less informative:
Paris ~ France OK
Paris ~ the ?
→ Each input word wi is discarded with probability P(wi) = 1 − √(t / f(wi)), where f(wi) is the word's frequency and t is a threshold (e.g. 10⁻⁵) [Mikolov 2013].
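A sketch of the subsampling rule (the toy frequencies are illustrative):

import numpy as np

def keep_prob(freq, t=1e-5):
    # probability of keeping a word with unigram frequency `freq`
    return min(1.0, np.sqrt(t / freq))

rng = np.random.default_rng(0)
freqs = {"the": 0.05, "a": 0.03, "cat": 1e-4, "climbed": 1e-5, "tree": 1e-4}
tokens = "the cat climbed a tree".split()
kept = [w for w in tokens if rng.random() < keep_prob(freqs[w])]
print(kept)   # frequent words like 'the' and 'a' are usually dropped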
Simplify cost: negative sampling
The softmax is required to get proper probabilities, but our goal here is just to get good embeddings.
Cost function: maximize σ(WOj · WIi) for each observed (word, context) pair, but add a negative term for randomly selected words that do not appear in the context.
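A sketch of this objective written as a per-pair loss (variable names are illustrative; rows of WI are input embeddings, rows of WO output embeddings):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_in, v_out_pos, v_out_negs):
    # v_in: input embedding of the center word
    # v_out_pos: output embedding of the observed context word
    # v_out_negs: output embeddings of k randomly sampled "negative" words
    loss = -np.log(sigmoid(v_in @ v_out_pos))      # pull the true pair together
    for v_neg in v_out_negs:
        loss -= np.log(sigmoid(-v_in @ v_neg))     # push negatives apart
    return loss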
Discussion
Word2Vec is not deep, but it is used in many tasks that do use deep learning.
It is a very popular toolkit, shipped with trained embeddings, but there are other approaches (e.g. GloVe).
Why does it work? See the GloVe paper [Pennington et al.].
References
Word2Vec:
● Mikolov, Tomas, et al. "Efficient Estimation of Word Representations in Vector Space"
● Mikolov, Tomas, et al. "Linguistic Regularities in Continuous Space Word Representations"
● TensorFlow tutorial: https://www.tensorflow.org/tutorials/word2vec/
GloVe: http://nlp.stanford.edu/projects/glove/ (and paper)
Blog: Sebastian Ruder, sebastianruder.com/word-embeddings-1/