Deep learning - Chatbot

Chatbot
Sequence to Sequence Learning
29 Mar 2017
Presented By:
Jin Zhang
Yang Zhou
Fred Qin
Liam Bui
Overview
Network
Architecture
Loss Function
Improvement
Techniques

Overview
Network
Architecture
Loss Function
Improvement
Techniques
Chatbot Concept
Deep Learning for Chatbot: http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/

Overview
Network
Architecture
Loss Function
Improvement
Techniques
LSTM for Language Model
• Language Model
Predict next word given the previous words
• RNN
Unable to learn long term dependency, not suitable for language model
• LSTM
3 sigmoid gates to control info flow
Understanding LSTM Networks: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Overview
Network
Architecture
Loss Function
Improvement
Techniques
• First step: which previous information to throw
away from the cell state
LSTM for Language Model
• Second step: what new information to be
stored in the cell state
- A sigmoid layer decides which values to
update
- A tanh layer creates new candidate values
C~t that could be added to the state
- Combine these two to create an update to
the state
• Third step: filter Ct and output only what we
want to output
Understanding LSTM Networks: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Seq2Seq model comprises of two language models:
• Encoder: a language model to encode input sequence into a fixed length vector (thought vector)
• Decoder: another language model to look at both thought vector and previous output to generate next words
Overview
Network
Architecture
Loss Function
Improvement
Techniques
Sequence To Sequence Model
Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473

Overview
Network
Architecture
Loss Function
Improvement
Techniques
Which“Crane”?
I like crane because …

Overview
Network
Architecture
Loss Function
Improvement
Techniques
Sequence Model with Neural Network: https://indico.io/blog/sequence-modeling-neuralnets-part1/

Overview
Network
Architecture
Loss Function
Improvement
Techniques
Generating a word is a multi-class classification task over all possible words, i.e. vocabulary.
W* = argmaxW P(W|Previous words)
Example :
I always order pizza with cheese and ……
mushrooms 0.15
pepperoni 0.12
anchovies 0.01
….
rice 0.0001
and 1e-100
Loss Function

Cross Entropy Loss:
Cross-Entropy:
Cross-Entropy for a sentence w1, w2, …, wn:
Overview
Network
Architecture
Loss Function
Improvement
Techniques
Evaluating Language Model: https://courses.engr.illinois.edu/cs498jh/Slides/Lecture04.pdf
Perplexity:
In practice, a variant called perplexity is usually used as metric to evaluate language models.

• Cross entropy can be seen as a measure of uncertainty
• Perplexity can be seen as “number of choices”
Overview
Network
Architecture
Loss Function
Improvement
Techniques
Cross entropy loss vs Perplexity:
• Entropy: ~2.58
• Perplexity: 6 choices
• Which statement do you prefer?
- The die has 6 faces
- The die has 2.58 entropy
• We can see perplexity as the average choices each time. The higher it is, the more “choices” of words
you have, then the more uncertain the language model is.
• Example: 6 faced balanced die. Each face is numbered from 1 to 6 so we have

Overview
Network
Architecture
Loss Function
Improvement
Techniques
Problem:
- The last state of the encoder contains mostly
information from the last elements of the
encoder sequence
- Inverse Input Sequence helps in some cases
How are you ?
I am fine .
Attention Mechanism:
- Allow each stage in decoder to look at any
encoder stages
- Decoder understand the input sentence more
and look at suitable positions to generate words

Overview
Network
Architecture
Loss Function
Improvement
Techniques
Problem:
- The last state of the encoder contains mostly
information from the last elements of the
encoder sequence
- Inverse Input Sequence helps in some cases
Attention Mechanism:
- Allow each stage in decoder to look at any
encoder stages
- Decoder understand the input sentence more
and look at suitable positions to generate words
Seq2Seq Seq2Seq with
attention
Sentence Length - 30 13.93 21.50
Sentence Length - 50 17.82 28.45
BLEU score on English-French Translation corpus

Overview
Network
Architecture
Loss Function
Improvement
Techniques
Problem:
- Maximizing conditional probabilities at each stage
might not lead to maximum full-joint probability.
- Storing all possible generated sentences are not
feasible due to resource limitation.
Possible output 2: Never been better
Possible output 1: I am fine
Beam Search:
- At each stage in decoder, store best M possible
outputs
Sequence to Sequence Learning: https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
Conditional Probability: 0.6 0.4 1
Conditional Probability: 0.4 0.9 1
Full-joint
Probability:
0.24
0.36
Possible Output 1:
Possible Output 2:
Possible Output M:
How are you ?
I am fine .
…

Overview
Network
Architecture
Loss Function
Improvement
Techniques
Problem:
- Maximizing conditional probabilities at each stage
might not lead to maximum full-joint probability.
- Storing all possible generated sentences are not
feasible due to memory limitation.
Beam Search:
- At each stage in decoder, store best M possible
outputs
Sequence to Sequence Learning: https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
Seq2Seq with
beam-size = 1
Seq2Seq with
beam size = 12
28.45 30.59
BLEU score on English-French Translation corpus.
Max sentence length 50

Overview
Network
Architecture
Loss Function
Improvement
Techniques
…

Cross Entropy Loss:
Cross-Entropy:
Cross-Entropy for a sentence w1, w2, …, wn:
Overview
Network
Architecture
Loss Function
Improvement
Techniques
= −𝑙𝑜𝑔2 𝑚(𝑥∗
)
= − 𝑙𝑜𝑔2 𝑚(𝑤1
∗
, … , 𝑤 𝑛
∗
)
= − [𝑙𝑜𝑔2 𝑚 𝑤 𝑛
∗
|𝑤1
∗
, … , 𝑤 𝑛−1
∗
+ 𝑙𝑜𝑔2 𝑚 𝑤 𝑛−1
∗
|𝑤1
∗
, … , 𝑤 𝑛−2
∗
+ … + 𝑙𝑜𝑔2 𝑚 𝑤1
∗
]
sum of log-probability in decoding steps

Overview
Network
Architecture
Loss Function
Improvement
Techniques
1. Reinforcement Learning:
Longer sentence is usually more interesting. So, we can use sentence length as rewards to further train the model:
• Action: Word choice
• State: Current generated sentence
• Reward: Sentence Length
2. Adversarial Training:
Make generated sentences look real using Adversarial training:
• Generative Model: generate sentences based on inputs
• Discriminant Model: tries to tell if a sentence is true response or generated response
• Objective: train generative model to “fool“ discriminant model
Adversarial Learning for Neural Dialogue Generation: https://arxiv.org/abs/1701.06547

Deep learning - Chatbot

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to Deep learning - Chatbot (20)

Recently uploaded (20)

Deep learning - Chatbot

Editor's Notes