Natural Language Processing
Info 159/259
Lecture 4: Text classification 2 (Jan 27, 2022)
David Bamman, UC Berkeley
Binary logistic regression
P(y = 1 | x, β) = 1 / (1 + exp(−∑_{i=1}^F x_i β_i))
output space Y = {0, 1}
Multiclass logistic regression
P(Y = y | X = x; β) = exp(x · β_y) / ∑_{y′ ∈ Y} exp(x · β_{y′})
output space Y = {y_1, ..., y_K}
Features
• As a discriminative classifier, logistic regression doesn't assume features are independent.
• Its power partly comes from the ability to create richly expressive features without the burden of independence.
• We can represent text through features that are not just the identities of individual words, but any feature that is scoped over the entirety of the input (a sketch of such features follows below).

Example features:
• contains "like"
• has a word that shows up in a positive sentiment dictionary
• review begins with "I like"
• at least 5 mentions of positive affectual verbs (like, love, etc.)
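A minimal sketch (not part of the slides) of how features like these might be computed for a review; the positive_words set and the feature names here are illustrative assumptions.

```python
# Hypothetical feature extractor for a movie review (a sketch, not the
# course's reference implementation). `positive_words` stands in for a
# positive sentiment dictionary; the feature names mirror the examples above.
positive_words = {"like", "love", "adore", "enjoy", "great"}

def featurize(review: str) -> dict:
    tokens = review.lower().split()
    feats = {}
    feats["contains_like"] = int("like" in tokens)
    feats["has_positive_dictionary_word"] = int(any(t in positive_words for t in tokens))
    feats["begins_with_i_like"] = int(review.lower().startswith("i like"))
    feats["at_least_5_positive_affect_verbs"] = int(
        sum(t in {"like", "love"} for t in tokens) >= 5
    )
    return feats

print(featurize("I like this movie and I love the ending"))
```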
Logistic regression
• We want to find the value of β that leads to the highest value of the conditional log likelihood:
ℓ(β) = ∑_{i=1}^N log P(y_i | x_i, β)
L2 regularization
ℓ(β) = ∑_{i=1}^N log P(y_i | x_i, β) − η ∑_{j=1}^F β_j²
• We can do this by changing the function we're trying to optimize, adding a penalty for values of β that are large.
• This is equivalent to saying that each β element is drawn from a Normal distribution centered on 0.
• η controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data).
L1 regularization
ℓ(β) = ∑_{i=1}^N log P(y_i | x_i, β) − η ∑_{j=1}^F |β_j|
• L1 regularization encourages coefficients to be exactly 0.
• η again controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data).
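As a hedged illustration (scikit-learn is not part of the slides), both penalties are available in sklearn's LogisticRegression, where C is the inverse of the regularization strength (roughly 1/η):

```python
# Sketch: L2- and L1-regularized logistic regression with scikit-learn
# (an assumption of this example, not shown in the slides). C is the
# inverse regularization strength, so smaller C = stronger penalty.
from sklearn.linear_model import LogisticRegression

X = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]   # toy feature vectors
y = [1, 0, 1, 0]                                    # toy labels

l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print(l2_model.coef_)   # dense coefficients
print(l1_model.coef_)   # some coefficients may be driven to exactly 0
```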
[Image credit: https://www.forbes.com/sites/kevinmurnane/2016/04/01/what-is-deep-learning-and-how-is-it-useful]
History of NLP
• Foundational insights, 1940s/1950s
• Two camps (symbolic/stochastic), 1957-1970
• Four paradigms (stochastic, logic-based, NLU, discourse modeling), 1970-1983
• Empiricism and FSM (1983-1993)
• Field comes together (1994-1999)
• Machine learning (2000–today)
• Neural networks (~2014–today)
(J&M 2008, ch. 1)
“Word embedding” in NLP papers
[Plot: proportion of papers mentioning “word embedding” per year, 2001–2019 (y-axis from 0 to 0.7)]
Data from ACL papers in the ACL Anthology (https://www.aclweb.org/anthology/)
Neural networks in NLP
• Language modeling [Mikolov et al. 2010]
• Text classification [Kim 2014; Iyyer et al. 2015]
• Syntactic parsing [Chen and Manning 2014, Dyer et al. 2015, Andor et al. 2016]
• CCG super tagging [Lewis and Steedman 2014]
• Machine translation [Cho et al. 2014, Sutskever et al. 2014]
• Dialogue agents [Sordoni et al. 2015, Vinyals and Le 2015, Ji et al. 2016]
• (for overview, see Goldberg 2017, 1.3.1)
Neural networks
• Discrete, high-dimensional representation of inputs (one-hot vectors) ->
low-dimensional “distributed” representations.
• Static representations -> contextual representations, where
representations of words are sensitive to local context.
• Non-linear interactions of input features
• Multiple layers to capture hierarchical structure
Neural network libraries
Logistic regression
P(ŷ = 1) = 1 / (1 + exp(−∑_{i=1}^F x_i β_i))

feature   x    β
not       1   -0.5
bad       1   -1.7
movie     0    0.3
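A minimal numpy sketch of this computation, plugging in the x and β values from the table above:

```python
import numpy as np

# Feature values and weights from the table above: not=1, bad=1, movie=0
x = np.array([1.0, 1.0, 0.0])
beta = np.array([-0.5, -1.7, 0.3])

p_hat = 1.0 / (1.0 + np.exp(-x @ beta))   # sigmoid of the dot product
print(p_hat)                              # ≈ 0.10
```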
SGD
Calculate the derivative of some loss function with respect to parameters we
can change, update accordingly to make predictions on training data a little
less wrong next time.
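A minimal numpy sketch of one such update for logistic regression; the gradient (y − ŷ)·x used here is the standard one for the conditional log likelihood (it matches the chain-rule result derived later in the deck).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One SGD step for logistic regression on a single (x, y) training pair.
# eta here is the learning rate (not the regularization strength above).
def sgd_step(beta, x, y, eta=0.1):
    y_hat = sigmoid(x @ beta)
    gradient = (y - y_hat) * x       # derivative of the log likelihood w.r.t. beta
    return beta + eta * gradient     # ascend the likelihood (descend the loss)

beta = np.zeros(3)
x, y = np.array([1.0, 1.0, 0.0]), 1
for _ in range(100):
    beta = sgd_step(beta, x, y)
print(beta)   # weights move toward predicting y=1 for this example
```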
Logistic regression
[Diagram: inputs x1, x2, x3 connect directly to output y with weights β1, β2, β3]
P(ŷ = 1) = 1 / (1 + exp(−∑_{i=1}^F x_i β_i))

feature   x    β
not       1   -0.5
bad       1   -1.7
movie     0    0.3
Feedforward neural network
• Input and output are mediated by at least one hidden layer.
[Diagram: inputs x1, x2, x3 → hidden nodes h1, h2 → output y]
*For simplicity, we’re leaving out the bias term, but assume most layers have them as well.
[Diagram: input layer (x1, x2, x3) connected to hidden layer (h1, h2) by weights W (W1,1, W1,2, W2,1, W2,2, W3,1, W3,2); hidden layer connected to output y by weights V (V1, V2)]
[Diagram: x1, x2, x3 → h1, h2 (weights W) → y (weights V)]

         x     W (3×2)         V (2×1)    y
not      1    [-0.5   1.3 ]    [ 4.1]     1
bad      1    [ 0.4   0.08]    [-0.9]
movie    0    [ 1.7   3.1 ]
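A minimal numpy sketch of the forward pass with exactly these numbers (applying the σ nonlinearity to the hidden layer and output, as in the following slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 1.0, 0.0])              # not=1, bad=1, movie=0
W = np.array([[-0.5, 1.3],
              [ 0.4, 0.08],
              [ 1.7, 3.1]])                # input -> hidden weights
V = np.array([4.1, -0.9])                  # hidden -> output weights

h = sigmoid(x @ W)        # hidden layer, ≈ [0.48, 0.80]
y_hat = sigmoid(h @ V)    # output, ≈ 0.77
print(h, y_hat)
```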
[Diagram: x1, x2, x3 → h1, h2 (weights W) → y (weights V)]
The hidden nodes are completely determined by the input and weights:
h1 = σ(∑_{i=1}^F x_i W_{i,1})
h2 = σ(∑_{i=1}^F x_i W_{i,2})
Activation functions
σ(z) = 1 / (1 + exp(−z))
[Plot: sigmoid curve over z ∈ [−10, 10], ranging from 0 to 1]
Logistic regression
P(ŷ = 1) = 1 / (1 + exp(−∑_{i=1}^F x_i β_i)) = σ(∑_{i=1}^F x_i β_i)
[Diagram: x1, x2, x3 connect directly to y with weights β1, β2, β3]
We can think about logistic regression as a neural network with no hidden layers.
Activation functions
tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))
[Plot: tanh curve over z ∈ [−10, 10], ranging from −1 to 1]
Activation functions
ReLU(z) = max(0, z)
[Plot: ReLU over z ∈ [−10, 10], ranging from 0 to 10]
[Figure: activation functions and their derivatives; Goldberg 2017, p. 46]
• ReLU and tanh are both used extensively in modern systems.
• Sigmoid is useful for final layer to scale output between 0 and 1, but
is not often used in intermediate layers.
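A minimal numpy sketch of the three activations (np.tanh implements the definition above):

```python
import numpy as np

# The three activation functions from the slides above.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)   # equivalent to (exp(z) - exp(-z)) / (exp(z) + exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```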
[Diagram: x1, x2, x3 → h1, h2 (weights W) → y (weights V)]
h1 = σ(∑_{i=1}^F x_i W_{i,1})
h2 = σ(∑_{i=1}^F x_i W_{i,2})
ŷ = σ(V1 h1 + V2 h2)
[Diagram: x1, x2, x3 → h1, h2 (weights W) → y (weights V)]
We can express ŷ as a function only of the input x and the weights W and V:
ŷ = σ( V1 σ(∑_{i=1}^F x_i W_{i,1}) + V2 σ(∑_{i=1}^F x_i W_{i,2}) )
This is hairy, but differentiable.
Backpropagation: Given training samples of <x,y> pairs, we can use
stochastic gradient descent to find the values of W and V that minimize
the loss.
[Diagram: x1, x2, x3 → h1, h2 (weights W) → y (weights V)]
Neural networks are a series of functions chained together:
xW → σ(xW) → σ(xW)V → σ(σ(xW)V)
The loss is another function chained on top:
log σ(σ(xW)V)
Chain rule
Let’s take the likelihood for a single training example with label y = 1; we want this value to be as high as possible:
log σ(σ(xW)V)
∂ log σ(σ(xW)V) / ∂V = [∂ log σ(σ(xW)V) / ∂ σ(xW)V] · [∂ σ(xW)V / ∂V]
Writing h = σ(xW), this is:
∂ log σ(hV) / ∂V = [∂ log σ(hV) / ∂ hV] · [∂ hV / ∂V]
Chain rule
∂ log σ(hV) / ∂V = [∂ log σ(hV) / ∂ hV] · [∂ hV / ∂V]
= [1 / σ(hV)] · [σ(hV)(1 − σ(hV))] · h
= (1 − σ(hV)) · h
= (1 − ŷ) · h
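A minimal numpy check of this result (for a y = 1 example), comparing the closed form (1 − ŷ)·h against a finite-difference gradient; the values of h and V are arbitrary assumptions for the check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.array([0.48, 0.80])        # hidden activations (arbitrary example values)
V = np.array([4.1, -0.9])         # hidden -> output weights

def log_lik(V):                   # log sigma(hV) for a y=1 example
    return np.log(sigmoid(h @ V))

y_hat = sigmoid(h @ V)
analytic = (1.0 - y_hat) * h      # the (1 - y_hat) * h result derived above

eps = 1e-6                        # finite-difference estimate of each component
numeric = np.array([(log_lik(V + eps * np.eye(2)[j]) - log_lik(V - eps * np.eye(2)[j])) / (2 * eps)
                    for j in range(2)])
print(analytic, numeric)          # the two should agree closely
```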
Neural networks
• Tremendous flexibility on design choices (exchange feature
engineering for model engineering)
• Articulate model structure and use the chain rule to derive parameter
updates.
Neural network structures
[Diagram: x1, x2, x3 → h1, h2 → a single output y (= 1 in this training example)]
Binary: output one real value; a sigmoid function for the output gives a single probability between 0 and 1.
Neural network structures
[Diagram: x1, x2, x3 → h1, h2 → three outputs, with training label (0, 1, 0)]
Multiclass: output 3 values, only one = 1 in the training data; a softmax function for the output gives a probability between 0 and 1 for each class (all class probabilities sum to 1); classes compete with each other.
Neural network structures
[Diagram: inputs nsubj_walks (x1), dobj_hits (x2), nsubj_says (x3) → h1, h2 → three outputs: glad = 1, happy = 1, sad = 0]
Multilabel: output 3 values, several = 1 in the training data; a sigmoid function for each output gives the probability of the presence of that label; classes do not compete with each other, since multiple can be present together.
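A minimal numpy sketch contrasting the three output layers (the 3-class / 3-label sizes mirror the diagrams above; the weights are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))     # shift for numerical stability
    return e / e.sum()

h = np.array([0.48, 0.80])        # hidden layer from the earlier example

V_binary = np.random.randn(2)     # 2 -> 1: one probability
V_multi  = np.random.randn(2, 3)  # 2 -> 3: probabilities over competing classes
V_label  = np.random.randn(2, 3)  # 2 -> 3: one independent probability per label

p_binary     = sigmoid(h @ V_binary)    # scalar in (0, 1)
p_multiclass = softmax(h @ V_multi)     # sums to 1; classes compete
p_multilabel = sigmoid(h @ V_label)     # each in (0, 1); labels don't compete

print(p_binary, p_multiclass, p_multilabel)
```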
Regularization
• Increasing the number of parameters = increasing the possibility for
overfitting to training data
Regularization
• L2 regularization: penalize W and V for being too large.
• Dropout: when training on an <x,y> pair, randomly remove some nodes and weights (a sketch follows this list).
• Early stopping: stop backpropagation before the training error gets too small.
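A minimal sketch of (inverted) dropout on a hidden layer, assuming a drop probability of 0.5; this is an illustrative implementation, not the slides' reference code:

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True):
    """Inverted dropout: randomly zero out hidden units during training,
    scaling the survivors so the expected activation stays the same."""
    if not training:
        return h
    mask = (np.random.rand(*h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)

h = np.array([0.48, 0.80, 0.31, 0.95])
print(dropout(h))                   # roughly half the units zeroed at train time
print(dropout(h, training=False))   # unchanged at test time
```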
Deeper networks
[Diagram: inputs x1, x2, x3 → first hidden layer (weights W1) → second hidden layer (weights W2) → output y (weights V)]
Densely connected layer
[Diagram: input x (x1 ... x7) fully connected by weights W to hidden layer h (h1, h2)]
h = σ(xW)
Convolutional networks
• With convolutional networks, the same operation (i.e., the same set of parameters) is applied to different regions of the input.
2D convolution
0 0 0 0 0
0 1 1 1 0
0 1 1 1 0
0 1 1 1 0
0 0 0 0 0
[Image: 2D convolution used for blurring]
1D convolution
convolution kernel K = [1/3, 1/3, 1/3]
input x = [0, 1, 3, −1, 4, 2, 0]
output = [1⅓, 1, 2, 1⅔, 2]
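A minimal numpy check of this example using np.convolve (the kernel is symmetric, so the flip in the convolution definition doesn't change anything):

```python
import numpy as np

K = np.array([1/3, 1/3, 1/3])              # averaging kernel from the slide
x = np.array([0, 1, 3, -1, 4, 2, 0], dtype=float)

out = np.convolve(x, K, mode="valid")      # slide the kernel over x
print(out)                                 # [1.333..., 1.0, 2.0, 1.666..., 2.0]
```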
Convolutional networks
Input: the token sequence “I hated it I really hated it” (x1 ... x7)
Parameters: W1, W2, W3 (shared across all windows)
h1 = f(I, hated, it)
h2 = f(hated, it, I)
h3 = f(it, I, really)
h4 = f(I, really, hated)
h5 = f(really, hated, it)
h1 = σ(x1 W1 + x2 W2 + x3 W3)
Indicator vector
• Every token is a V-dimensional vector (size of the vocab) with a single 1 identifying the word.

vocab item   indicator
a            0
aa           0
aal          0
aalii        0
aam          0
aardvark     1
aardwolf     0
aba          0
[Diagram: one-hot token vectors x1, x2, x3 each multiplied by a weight matrix (W1, W2, W3) and summed into h1]
h1 = σ(x1 W1 + x2 W2 + x3 W3)
[Diagram: the same one-hot token vectors x1, x2, x3 and weight matrices W1, W2, W3]
For indicator vectors, we’re just adding these numbers together:
h1 = σ(W1[x1_id] + W2[x2_id] + W3[x3_id])
(Where xn_id specifies the location of the 1 in the vector, i.e., the vocabulary id, and Wn[xn_id] is that row of Wn)
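A minimal numpy sketch showing why: multiplying a one-hot vector by W just selects one row of W (i.e., an embedding lookup). The vocabulary and weight values here are toy assumptions:

```python
import numpy as np

vocab = {"I": 0, "hated": 1, "it": 2, "really": 3}
W1 = np.random.randn(len(vocab), 2)          # toy position-1 weight matrix (V x 2)

x1 = np.zeros(len(vocab))
x1[vocab["I"]] = 1.0                         # one-hot indicator vector for "I"

print(np.allclose(x1 @ W1, W1[vocab["I"]]))  # True: the product is just a row lookup
```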
[Diagram: dense input vectors x1, x2, x3 (e.g., embeddings) each multiplied by a weight matrix W1, W2, W3]
For dense input vectors (e.g., embeddings), we need the full dot product:
h1 = σ(x1 W1 + x2 W2 + x3 W3)
Pooling
• Down-samples a layer by selecting a single point from some set.
• Max-pooling selects the largest value.
[Diagram: a column of values down-sampled by taking the max within each window]
Global pooling
• Down-samples a layer by selecting a single point from some set.
• Max-pooling over time (global max pooling) selects the largest value over an entire sequence.
• Very common for NLP problems.
[Diagram: global max pooling selects the single largest value (9) from the whole sequence]
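A minimal numpy sketch of global max pooling over a sequence of per-position filter activations (the values are arbitrary):

```python
import numpy as np

# Activations of 3 filters at 5 token positions (rows = positions, cols = filters).
activations = np.array([[1.0, 0.2, -0.5],
                        [9.0, 0.1,  0.3],
                        [2.0, 4.0,  0.0],
                        [0.0, 1.5,  2.2],
                        [5.0, 0.7,  1.1]])

pooled = activations.max(axis=0)   # one number per filter, over the whole sequence
print(pooled)                      # [9.  4.  2.2]
```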
Convolutional networks
[Diagram: a convolution over tokens x1 ... x7 produces one value per window; max pooling selects the largest. This defines one filter.]
We can specify multiple filters; each filter is a separate set of parameters to be learned.
[Diagram: four filters Wa, Wb, Wc, Wd applied over tokens x1 ... x7]
h1 = σ(xW) ∈ ℝ⁴
Convolutional networks
• With max pooling, we select a single number for each filter over all tokens
• (e.g., with 100 filters, the output of the max-pooling stage is a 100-dimensional vector)
• If we specify multiple filters, we can also scope each filter over different
window sizes
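A hedged PyTorch sketch of this kind of architecture (PyTorch and every name below are assumptions of this example, not part of the slides): convolutions with several window sizes, each producing many filters, followed by global max pooling over time and a linear output layer.

```python
import torch
import torch.nn as nn

class CNNTextClassifier(nn.Module):
    """Sketch of a CNN sentence classifier: convolutions over word embeddings
    with several window sizes, global max pooling over time, then a linear layer."""
    def __init__(self, vocab_size, embed_dim=100, num_filters=100,
                 window_sizes=(2, 3, 4), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in window_sizes]
        )
        self.out = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)             # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                     # Conv1d expects (batch, channels, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1)) # (batch, num_classes) logits

model = CNNTextClassifier(vocab_size=10000)
logits = model(torch.randint(0, 10000, (8, 20)))  # batch of 8 sequences of length 20
print(logits.shape)                               # torch.Size([8, 2])
```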
Zhang and Wallace 2016, “A Sensitivity Analysis of
(and Practitioners’ Guide to) Convolutional Neural
Networks for Sentence Classification”
CNN as important ngram detector
Higher-order ngrams are much more informative than just unigrams (e.g., the phrase “i don’t like this movie” vs. its unigrams [“I”, “don’t”, “like”, “this”, “movie”]).
We can think about a CNN as providing a mechanism for detecting important (sequential) ngrams without having the burden of creating them as unique features.

Unique ngrams (1–4) in the Cornell movie review dataset:
             unique types
unigrams          50,921
bigrams          451,220
trigrams         910,694
4-grams        1,074,921