Convolutional Neural Networks and Natural Language Processing

Thomas Delteil – github.com/thomasdelteil – linkedin.com/in/thomasdelteil
Applied Scientist @ AWS Deep Engine

Goals
§ Explain what convolutions are
§ Show how to handle textual data
§ Analyze a reference neural network architecture for text classification
§ Demonstrate how to train and deploy a CNN for Natural Language Processing
Learn Data Science, Vancouver – Deep Learning and NLP - CNNs and NLP - Thomas Delteil - github.com/thomasdelteil - linkedin.com/in/thomasdelteil

Convolutions
And where to find them

2012 - ImageNet Classification with Deep Convolutional Neural Networks
ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, Advances in Neural Information Processing Systems, 2012
AlexNet architecture

ImageNet competition
Classify images among 1000 classes:
AlexNet Top-5 error-rate, 25% => 16%!

Actual photo of the reaction from the computer vision community*
*might just be a stock photo

I told you
so!

What made Convolutional Neural Networks viable?

GPUs!
- Nvidia V100, float16 ops: ~120 TFLOPS, 5000+ CUDA cores
- #1 supercomputer in 2005: ~135 TFLOPS
Source: Mathworks

Sea/Land segmentation via satellite images
DeepUNet: A Deep Fully Convolutional Network for Pixel-level Sea-Land Segmentation, Ruirui Li et al, 2017

Automatic galaxy classification
Deep Galaxy: Classification of Galaxies based on Deep Convolutional Neural Networks, Nour Eldeen M. Khalifa, 2017

Medical Imaging, MRI, X-ray, surgical cameras
Review of MRI-based Brain Tumor Image Segmentation Using Deep Learning Methods, Ali Işın et al., 2016

What is a convolution?
It is the cross-channel sum of the element-wise
multiplication of a convolutional filter (kernel/mask)
computed over a sliding window on an input tensor
given a certain stride and padding, plus a bias term.
The result is called a feature map.
Input matrix (3x3), no padding, 1 channel:
 2  2  1
 3  1 -1
 4  3  2
Kernel (2x2), stride 1, bias = 2:
 1 -1
-1  0
Feature map (2x2):
-1  2
 0  1
1*2 - 1*2 - 1*3 + 0*1 + 2 = -1
1*2 - 1*1 - 1*1 + 0*(-1) + 2 = 2
1*3 - 1*1 - 1*4 + 0*3 + 2 = 0
1*1 - 1*(-1) - 1*3 + 0*2 + 2 = 1
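The worked example can be checked with a few lines of plain Python. This is a minimal sketch for a single channel without padding; the function name `conv2d` is chosen here for illustration and does not come from the slides.

```python
def conv2d(image, kernel, bias=0, stride=1):
    """Single-channel 2D convolution (as used in CNNs), no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the window element-wise by the kernel, sum, add the bias.
            total = bias
            for di in range(kh):
                for dj in range(kw):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [[2, 2, 1], [3, 1, -1], [4, 3, 2]]
kernel = [[1, -1], [-1, 0]]
feature_map = conv2d(image, kernel, bias=2)
print(feature_map)
```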

What is a convolution? Padding
Source: Machine Learning guru - Neural Networks CNN

What is a convolution? Stride = 2

What is a convolution? Multi Channel
1 convolutional filter (3)x(3x3)

What is a convolution? Multi Channel
Source: Convolutional Neural Networks on the iPhone with VGGNet
N: Number of input channels
W: Width of the kernel
H: Height of the kernel
M: Number of output channels
Kernel size = $N \times W \times H$
#Params = $M \times N \times W \times H + M$
256 convolutions of kernel (3,3) on 256 input channels:
256*256*3*3 = 589,824 weights (~0.6M), plus 256 biases
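The parameter count above is easy to reproduce. A small sketch, where `conv_param_count` is an illustrative helper name rather than a library function:

```python
def conv_param_count(in_channels, out_channels, kernel_w, kernel_h):
    """Weights of all filters plus one bias per output channel."""
    weights = out_channels * in_channels * kernel_w * kernel_h
    biases = out_channels
    return weights + biases


# 256 filters of size 3x3 applied to 256 input channels.
n_params = conv_param_count(256, 256, 3, 3)
print(n_params)
```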

Easily parallelizable
Convolution computations are:
- Independent (across filters and within
filter)
- Simple (multiplication and sums)

Why does it work?
Sharpening filter
Laplacian filter
Sobel x-axis filter
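To make the three filters concrete, here is a sketch that applies them to a tiny synthetic image. The kernel values are the usual textbook 3x3 versions and are an assumption, since the slide only shows the resulting pictures; the test images `edge` and `flat` are made up for illustration.

```python
def conv2d(image, kernel):
    """Single-channel 2D convolution, stride 1, no padding, no bias."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]


# Assumed textbook kernels (not taken from the slides).
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

# A 5x5 image with a vertical edge, and a uniform image with no pattern.
edge = [[0, 0, 10, 10, 10] for _ in range(5)]
flat = [[7] * 5 for _ in range(5)]

for name, kernel in [("sharpen", SHARPEN), ("laplacian", LAPLACIAN), ("sobel_x", SOBEL_X)]:
    print(name, conv2d(edge, kernel)[0], conv2d(flat, kernel)[0])
```

The edge detectors respond only where the intensity changes, which is the kind of hand-designed pattern detector that a CNN learns on its own.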

Why does it work?
- Detect patterns at larger and larger scales by stacking convolution layers on top of each other to grow the receptive field
- Applicable to spatially correlated data
Source: AlexNet, the 96 first-layer filters (11x11x3) learned by the network, rendered in RGB space (3 input channels)

Growing receptive field
Source: ML Review, A guide to receptive field arithmetic
Deeper in the network

Visualize convolutions
http://scs.ryerson.ca/~aharley/vis/conv/flat.html

Visualize convolutions
Source: Neural Network 3D Simulation
(warning flashing lights)

State of the art networks are getting deeper and more complex
Source: Inception v3

High number of parameters => Requires a lot of data to train

Advanced types of convolutions
Source: An introduction to different types of convolutions
- Transposed convolutions (deconvolution): EnhanceNet
- Dilated convolutions: WaveNet
- Depth-wise separable convolutions: MobileNet

On to Natural Language Processing

NLP Domains
- Machine translation
- OCR
- Q&A
- Sentiment Analysis
- Speech Recognition
- TTS
- Topic Modelling
- Information Retrieval
- Natural Language Understanding
- Document Classification

NLP Industry Facts
- 8.4PB of information per second as of 2020 (source: business2community, 2016)
- 70% of companies use customer feedback (source: business2community, 2016)
- £1.3T value of company data (source: IDC, 2014)
- 10% of organizations expect to commercialise their data by 2020 (source: Gartner, 2016)
Source: Ticary, What is natural language processing

Convolutions and Natural Language Processing

Data Representation?
Source: Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional Neural Networks for Speech Recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014

Encoding Data – Word-level
- Word-level embedding (word2vec): word -> N-dimensional vector
Source: Convolutional Neural Networks for Sentence Classification, Yoon Kim, 2014
[Figure: sentence matrix; axes: N (embedding dimension) and time (word position); channels: different embeddings]

Encoding Data – Character-level
- One-hot encoding
- Alphabet
- Sparse representation
- Character embedding
One-hot encoding of "VANCOUVER NLP" (_ denotes the space character):
  V A N C O U V E R _ N L P …
_ 0 0 0 0 0 0 0 0 0 1 0 0 0
- 0 0 0 0 0 0 0 0 0 0 0 0 0
. 0 0 0 0 0 0 0 0 0 0 0 0 0
A 0 1 0 0 0 0 0 0 0 0 0 0 0
B 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 1 0 0 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 0 1 0 0 0 0 0
F 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0 0 0 0 0 0
J 0 0 0 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 0 0 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 0 0 0 0 1 0
M 0 0 0 0 0 0 0 0 0 0 0 0 0
N 0 0 1 0 0 0 0 0 0 0 1 0 0
O 0 0 0 0 1 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 1
Q 0 0 0 0 0 0 0 0 0 0 0 0 0
R 0 0 0 0 0 0 0 0 1 0 0 0 0
S 0 0 0 0 0 0 0 0 0 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0
U 0 0 0 0 0 1 0 0 0 0 0 0 0
V 1 0 0 0 0 0 1 0 0 0 0 0 0
W 0 0 0 0 0 0 0 0 0 0 0 0 0
X 0 0 0 0 0 0 0 0 0 0 0 0 0
Y 0 0 0 0 0 0 0 0 0 0 0 0 0
Z 0 0 0 0 0 0 0 0 0 0 0 0 0
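A character-level one-hot encoder takes only a few lines. This sketch assumes a small illustrative alphabet (space, "-", ".", then A to Z, matching the rows of the table); the alphabet used by the actual model is larger.

```python
import string

# Assumed alphabet for illustration: space, '-', '.', then A-Z.
ALPHABET = " -." + string.ascii_uppercase


def one_hot_encode(text, alphabet=ALPHABET):
    """Return a len(alphabet) x len(text) matrix (rows: characters, columns: positions)."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0] * len(text) for _ in alphabet]
    for pos, ch in enumerate(text.upper()):
        if ch in index:  # characters outside the alphabet stay all-zero columns
            matrix[index[ch]][pos] = 1
    return matrix


encoded = one_hot_encode("VANCOUVER NLP")
for ch, row in zip(ALPHABET, encoded):
    if any(row):
        print(repr(ch), row)
```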

Text classification, N categories

Neural Network
- Fiction: 0%
- Biography: 6%
…
- Play: 80%
…
- Documentation: 0%

Deep Neural Network: Crepe Model
Source: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. NIPS 2015
Visualization with Netron
Intuition: convolutions act similarly to n-grams

Temporal Convolution (256 filters, kernel 69*7, stride 1)
Input: one-hot matrix, 69x1014x1 = ~70k values
[Figure: one-hot encoding of "VANCOUVER …", 69 rows (one per character of the alphabet) x 1014 columns (character positions 0 to 1013)]
Output: 256 feature maps (x 256) of length 1008 (x 1008), 1x1008x256 = ~258k values
       0     1     2     3     4    …   1007
0     6.4   1.1   3.2   0.3  -0.4   …    …
1    -2.1   0.2  -3.4    …     …    …    …
…      …     …     …     …     …    …    …
254    …     …     …     …     …    …    …
255   1.2   3.4  -1     1.2   3.2   …    …
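The n-gram intuition can be illustrated with a toy temporal convolution. In this sketch the three-letter alphabet, the text and the hand-made filter are all made up for illustration; a trained network learns its filter weights instead. The filter spans every row of the one-hot matrix and slides along the time axis only, so an input of length 1014 with a kernel of width 7 yields 1014 - 7 + 1 = 1008 outputs.

```python
def one_hot(text, alphabet):
    """Rows: characters of the alphabet, columns: positions in the text."""
    return [[1 if ch == a else 0 for ch in text] for a in alphabet]


def temporal_conv(x, kernel, bias=0):
    """Slide a (channels x width) kernel along the time axis of a (channels x length) input."""
    width = len(kernel[0])
    length = len(x[0])
    return [bias + sum(kernel[c][k] * x[c][t + k]
                       for c in range(len(x)) for k in range(width))
            for t in range(length - width + 1)]


alphabet = "ABC"
x = one_hot("ABCABCAB", alphabet)
kernel = one_hot("CAB", alphabet)  # hand-made filter that matches the trigram "CAB"
response = temporal_conv(x, kernel)
print(response)  # the response peaks wherever "CAB" occurs in the text
```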

Activation Function: Rectified Linear Unit (ReLU)
$$f(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
Input, 1x1008x256 = ~258k:
       0     1     2     3     4     5    …   1007
0     6.4   1.1   3.2   0.3  -0.4   0.2   …    …
…      …     …     …     …     …     …    …    …
255   1.2   3.4  -1     1.2   3.2   2.8   …    …
Output, 1x1008x256 = ~258k:
       0     1     2     3     4     5    …   1007
0     6.4   1.1   3.2   0.3   0     0.2   …    …
…      …     …     …     …     …     …    …    …
255   1.2   3.4   0     1.2   3.2   2.8   …    …
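ReLU is applied element-wise, so the two rows shown on the slide can be reproduced directly. A minimal sketch:

```python
def relu(values):
    """Element-wise rectified linear unit: negative values become 0."""
    return [max(0.0, v) for v in values]


row_0 = [6.4, 1.1, 3.2, 0.3, -0.4, 0.2]
row_255 = [1.2, 3.4, -1, 1.2, 3.2, 2.8]
print(relu(row_0))
print(relu(row_255))
```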

Down-sampling: Max-Pooling (256 channels, window 1*3, stride 3)
Source: Stanford's CS231n
Input, 1x1008x256 = ~258k:
       0     1     2     3     4     5    …   1007
0     6.4   1.1   3.2   0.3   0     0.2   …    …
…      …     …     …     …     …     …    …    …
255   1.2   3.4   0     1.2   3.2   2.8   …    …
Output, 1x336x256 = ~86k (x 256 channels, x 336 positions):
       0     1    …   335
0     6.4   0.3   …    …
…      …     …    …    …
255   3.4   3.2   …    …
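Max-pooling with a window of 3 and a stride of 3 keeps the largest value of each group of three consecutive positions. A short sketch that reproduces the values on the slide (`max_pool_1d` is an illustrative name):

```python
def max_pool_1d(row, size=3, stride=3):
    """Keep the maximum of each window of `size` values, moving by `stride`."""
    return [max(row[i:i + size]) for i in range(0, len(row) - size + 1, stride)]


row_0 = [6.4, 1.1, 3.2, 0.3, 0, 0.2]
row_255 = [1.2, 3.4, 0, 1.2, 3.2, 2.8]
print(max_pool_1d(row_0))
print(max_pool_1d(row_255))
print(len(max_pool_1d([0.0] * 1008)))  # length of a pooled 1008-long feature map
```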

Fast forward…
1x336x256 = ~86k <- after 1 convolution layer (69*7/1) and 1 max-pooling (1*3/3)
1x330x256 = ~84k <- after 1 convolution layer (1*7/1)
1x110x256 = ~28k <- after 1 max-pooling (1*3/3)
1x102x256 = ~26k <- after 4 convolution layers (1*3/1)
1x34x256 = ~9k <- after 1 max-pooling (1*3/3)
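The sequence of shapes follows from the standard output-length formula, floor((L + 2p - k) / s) + 1, for an input of length L, kernel width k, stride s and padding p. The sketch below traces the layer list given on the slide; the helper `out_len` and the tuple layout are illustrative choices.

```python
def out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution or pooling layer."""
    return (length + 2 * padding - kernel) // stride + 1


# (layer type, kernel width, stride), in the order listed on the slide.
layers = [("conv", 7, 1), ("pool", 3, 3),
          ("conv", 7, 1), ("pool", 3, 3),
          ("conv", 3, 1), ("conv", 3, 1), ("conv", 3, 1), ("conv", 3, 1),
          ("pool", 3, 3)]

length = 1014  # number of character positions in the input
lengths = []
for name, kernel, stride in layers:
    length = out_len(length, kernel, stride)
    lengths.append(length)
    print(f"{name} k={kernel} s={stride}: 1x{length}x256 = {length * 256} values")
```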

Flattening Layer
Input, 1x34x256 = ~9k (x 256 channels):
       0     1     2    …   33
0     6.4   0.1    …    …    …
1     2.1  24.9    …    …    …
…      …     …     …    …    …
255    …     …     …    …   9.9
Output, 8704x1x1 = ~9k:
0       6.4
1       0.1
…        …
34      2.1
35     24.9
…        …
8703    9.9
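Flattening simply concatenates the rows of the feature map, which is why index 34 of the output holds the first value of row 1. A sketch with a two-row excerpt (the zeros are placeholders for the values elided on the slide):

```python
def flatten(feature_map):
    """Concatenate the rows of a 2D feature map into a single vector."""
    return [value for row in feature_map for value in row]


# Two rows of 34 positions each; only the values shown on the slide are filled in.
row_0 = [6.4, 0.1] + [0.0] * 32
row_1 = [2.1, 24.9] + [0.0] * 32
flat = flatten([row_0, row_1])
print(flat[0], flat[1], flat[34], flat[35])
print(256 * 34)  # length of the full flattened vector
```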

Fully Connected / Dense layer (1024)
Input, 8704x1x1 = ~9k:
0       6.4
1       0.1
…        …
8703    9.9
Units 0, 1, …, k, …, 1023 (x 1024), each computing:
$$f_k(x) = \sum_{i=0}^{8703} w_{ki} \, x_i + b_k$$
Output, 1024x1x1 = ~1k:
0       8.7
1      -2.1
…        …
1023   32.1
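Each unit of a dense layer is a weighted sum of all inputs plus a bias. A minimal sketch with toy sizes (3 inputs, 2 units); the weight values are made up for illustration.

```python
def dense(x, weights, biases):
    """weights[k][i] connects input i to unit k; returns one value per unit."""
    return [sum(w_ki * x_i for w_ki, x_i in zip(w_k, x)) + b_k
            for w_k, b_k in zip(weights, biases)]


x = [1.0, 2.0, 3.0]
weights = [[1.0, 0.0, -1.0],
           [0.5, 0.5, 0.5]]
biases = [0.0, 1.0]
output = dense(x, weights, biases)
print(output)

# Number of parameters for the layer on the slide: 8704 inputs, 1024 units.
print(8704 * 1024 + 1024)
```

Most of the parameters of such a network sit in this first dense layer, far more than in any single convolution layer.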

Dropout (p=0.5) + Fully Connected Layer (1024)
Input after dropout, 1024x1x1 = ~1k (dropped-out inputs are ignored):
0       8.7
1       0
…        …
1023   32.1
Units 0, 1, …, k, …, 1023 (x 1024), each computing:
$$f_k(x) = \sum_{i=0}^{1023} w_{ki} \, x_i + b_k$$
Output, 1024x1x1 = ~1k:
0       9.2
1       5.3
…        …
1023    0.1
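Dropout zeroes each input with probability p during training and does nothing at inference time. The sketch below uses the common "inverted dropout" convention, which rescales the kept values by 1/(1-p); that rescaling is an assumption here, since the slide shows the kept values unchanged.

```python
import random


def dropout(x, p=0.5, training=True, rng=random):
    """Zero each value with probability p; rescale the others by 1/(1-p) (assumed convention)."""
    if not training:
        return list(x)  # dropout is disabled at inference time
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in x]


x = [8.7, -2.1, 4.0, 32.1]
rng = random.Random(0)  # seeded generator so the run is repeatable
dropped = dropout(x, p=0.5, training=True, rng=rng)
print(dropped)
print(dropout(x, training=False))
```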

Output: Dropout + Dense + Softmax for N categories
Input after dropout, 1024x1x1 = ~1k (dropped-out inputs are ignored):
0       6.4
1       0.1
…        …
1023    9.9
Dense layer, units 0, …, N-1 (x N), output Nx1x1 = N:
0       2.7
1       0.1
…        …
N-1    12.5
Softmax:
$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=0}^{N-1} e^{x_j}}$$
Output, Nx1x1 = N:
0       0.1
1       0.01
…        …
N-1     0.8
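Softmax turns the N raw scores into probabilities that sum to 1. A minimal sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result. The example scores are made up.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]
probabilities = softmax(scores)
print(probabilities)
print(sum(probabilities))
```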

Neural Network
- Fiction: 0%
- Biography: 6%
…
- Play: 6%
…
- Documentation: 80%

How to train the network? Backward propagation!

Backward propagation – Efficient Gradient Descent
Neural Network output vs. expected output:
- Fiction: 0%
- Biography: 6% (expected 0%)
…
- Play: 6% (expected 100%)
…
- Documentation: 80% (expected 0%)
Update the weights of the convolutional masks and fully connected units so that the error will be minimized next time:
$$w_{ij} = w_{ij} - \alpha \, \frac{\partial E}{\partial w_{ij}}$$
where $\alpha$ is the learning rate and $E$ the error.
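The update rule is the same for every weight: move against the gradient of the error, scaled by the learning rate. The sketch below applies it to a toy one-weight problem, E(w) = (w - 3)^2, chosen purely for illustration; in a real network the gradients come from backward propagation.

```python
def sgd_step(weights, gradients, learning_rate):
    """One gradient descent update: w <- w - learning_rate * dE/dw."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


# Toy error function E(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
weights = [0.0]
for _ in range(50):
    gradients = [2 * (weights[0] - 3)]
    weights = sgd_step(weights, gradients, learning_rate=0.1)
print(weights[0])  # approaches the minimum at w = 3
```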

Training Parameters: Learning Rate
Learning rate $\alpha$: how much to update the weights for every batch of documents?
Source: Towards Data Science, Gradient descent in a nutshell

Training parameters: Batch Size
Batch size: How many examples to learn from in one step?

Training parameters: Number of epochs
Number of epochs: How many times should we feed the network the entire training set?
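The three training parameters fit together in a mini-batch training loop. This is a sketch on a toy one-weight model (y = w * x, with data generated from w = 2), not the Crepe training code; the data and parameter values are made up, and shuffling between epochs is left out to keep the run deterministic.

```python
def train(xs, ys, learning_rate, batch_size, epochs):
    """Mini-batch gradient descent on the mean squared error of y = w * x."""
    w = 0.0
    steps = 0
    for _ in range(epochs):                          # one epoch = one pass over the training set
        for start in range(0, len(xs), batch_size):  # one step per batch
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Gradient of the mean squared error over the batch with respect to w.
            grad = sum(2 * (w * x - y) * x for x, y in zip(batch_x, batch_y)) / len(batch_x)
            w -= learning_rate * grad
            steps += 1
    return w, steps


xs = [i / 10 for i in range(1, 11)]
ys = [2 * x for x in xs]
w, steps = train(xs, ys, learning_rate=0.5, batch_size=5, epochs=50)
print(w, steps)
```

With 10 examples and a batch size of 5, each epoch performs 2 weight updates, so 50 epochs give 100 updates in total.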

Jupyter notebook demo – Crepe in Apache MXNet/Gluon
https://github.com/ThomasDelteil/CNN_NLP_MXNet

Results
Traditional approaches
Word-level CNN
Character-level CNN

Data Augmentation
For images
For text
- Humans to rephrase the examples
- Synonyms
- Similar semantic meaning

Data Augmentation
The quick brown fox jumps over the lazy dog

Data Augmentation
Synonyms for each word of the example:
- quick: fast, swift, speedy
- brown: hazel, brunette, chestnut
- jumps: leaps, springs, bounds, hops
- lazy: idle, indolent, slothful
- dog: hound, pup, mutt

Data Augmentation
The quick brown fox jumps over the lazy dog
=> The swift brunette fox leaps over the slothful pup
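The synonym replacement shown on the data augmentation slides can be sketched as follows. The dictionary holds the synonyms from the slides; the `augment` function and the random choice are an illustrative assumption, and a real pipeline would draw synonyms from a thesaurus.

```python
import random

SYNONYMS = {
    "quick": ["fast", "swift", "speedy"],
    "brown": ["hazel", "brunette", "chestnut"],
    "jumps": ["leaps", "springs", "bounds", "hops"],
    "lazy": ["idle", "indolent", "slothful"],
    "dog": ["hound", "pup", "mutt"],
}


def augment(sentence, synonyms=SYNONYMS, rng=random):
    """Replace every word that has known synonyms with a randomly chosen one."""
    return " ".join(rng.choice(synonyms[word]) if word in synonyms else word
                    for word in sentence.split())


rng = random.Random(0)  # seeded generator so the run is repeatable
sentence = "The quick brown fox jumps over the lazy dog"
augmented = augment(sentence, rng=rng)
print(augmented)
```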

You need a large dataset

Live Demo – Classification of product category for Amazon Reviews
https://thomasdelteil.github.io/CNN_NLP_MXNet/

Workflow and Operationalization
- Develop model using a Jupyter notebook
- Train model on GPU instance
- Package model behind web API in a Docker container, e.g. using MXNet Model Server
- Upload container to container registry
- Deploy container to an elastic container service
- Enjoy quick and linear scaling
- Put the API behind a load balancer with SSL termination
- Enjoy :)
[Diagram components: GPU instance, Container Registry, Elastic Container Service with auto-scaling containers, Load Balancer]
HTTPS request:
"Loved this book"
HTTPS response:
{
  "prediction": {
    "book": 0.99
  }
}

Advanced use-cases for Convolutions and NLP

CNN + LSTM: Spatially and Temporally Deep Neural Networks
- CNN for feature extraction
- LSTM for temporal representation
Applications:
- Video (CNN for frames, LSTM to
combine them temporally)
- Text tasks
- Audio (Language detection)
Source: Combining CNN and RNN for spoken language detection

Advanced use-case: Speech Generation WaveNet
Source: DeepMind Wavenet generative model raw audio

WaveNet: Dilated Causal Convolution
Source: DeepMind Wavenet generative model raw audio

Summary
§ Learned about convolutions
§ Applied them to textual data
§ Studied the Crepe architecture from Zhang et al. in detail
§ Learned about advanced use cases and operationalization

Thank you!
Connect here
github.com/thomasdelteil
linkedin.com/in/thomasdelteil
tdelteil@amazon.com
Photo credits: https://pexels.com and https://unsplash.com/
