Natural Language Processing
Info 159/259
Lecture 4: Text classification 2 (Jan 27, 2022)
David Bamman, UC Berkeley
Binary logistic regression
P(y = 1 | x, β) = 1 / (1 + exp(−∑_{i=1}^F x_i β_i))
output space Y = {0, 1}
Multiclass logistic regression
P(Y = y | X = x; β) = exp(x · β_y) / ∑_{y′ ∈ Y} exp(x · β_{y′})
output space Y = {y_1, ..., y_K}
Features
• As a discriminative classifier, logistic regression doesn't assume features are independent.
• Its power partly comes from the ability to create richly expressive features without the burden of independence.
• We can represent text through features that are not just the identities of individual words, but any feature that is scoped over the entirety of the input (a sketch of such features follows below).

Example features:
• contains "like"
• has a word that shows up in a positive sentiment dictionary
• review begins with "I like"
• at least 5 mentions of positive affectual verbs (like, love, etc.)
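A minimal sketch (not part of the slides) of how features like these might be computed for a review; the positive_words set and the feature names here are illustrative assumptions.

```python
# Hypothetical feature extractor for a movie review (a sketch, not the
# course's reference implementation). `positive_words` stands in for a
# positive sentiment dictionary; the feature names mirror the examples above.
positive_words = {"like", "love", "adore", "enjoy", "great"}

def featurize(review: str) -> dict:
    tokens = review.lower().split()
    feats = {}
    feats["contains_like"] = int("like" in tokens)
    feats["has_positive_dictionary_word"] = int(any(t in positive_words for t in tokens))
    feats["begins_with_i_like"] = int(review.lower().startswith("i like"))
    feats["at_least_5_positive_affect_verbs"] = int(
        sum(t in {"like", "love"} for t in tokens) >= 5
    )
    return feats

print(featurize("I like this movie and I love the ending"))
```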
Logistic regression
• We want to find the value of β that leads to the highest value of the conditional log likelihood:
ℓ(β) = ∑_{i=1}^N log P(y_i | x_i, β)
L2 regularization
ℓ(β) = ∑_{i=1}^N log P(y_i | x_i, β) − η ∑_{j=1}^F β_j²
• We can do this by changing the function we're trying to optimize, adding a penalty for values of β that are large.
• This is equivalent to saying that each β element is drawn from a Normal distribution centered on 0.
• η controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data).
L1 regularization
ℓ(β) = ∑_{i=1}^N log P(y_i | x_i, β) − η ∑_{j=1}^F |β_j|
• L1 regularization encourages coefficients to be exactly 0.
• η again controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data).
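As a hedged illustration (scikit-learn is not part of the slides), both penalties are available in sklearn's LogisticRegression, where C is the inverse of the regularization strength (roughly 1/η):

```python
# Sketch: L2- and L1-regularized logistic regression with scikit-learn
# (an assumption of this example, not shown in the slides). C is the
# inverse regularization strength, so smaller C = stronger penalty.
from sklearn.linear_model import LogisticRegression

X = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]   # toy feature vectors
y = [1, 0, 1, 0]                                    # toy labels

l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print(l2_model.coef_)   # dense coefficients
print(l1_model.coef_)   # some coefficients may be driven to exactly 0
```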
[Image credit: https://www.forbes.com/sites/kevinmurnane/2016/04/01/what-is-deep-learning-and-how-is-it-useful]
History of NLP
• Foundational insights, 1940s/1950s
• Two camps (symbolic/stochastic), 1957-1970
• Four paradigms (stochastic, logic-based, NLU, discourse modeling), 1970-1983
• Empiricism and FSM (1983-1993)
• Field comes together (1994-1999)
• Machine learning (2000–today)
• Neural networks (~2014–today)
(J&M 2008, ch. 1)
“Word embedding” in NLP papers
[Plot: proportion of papers mentioning “word embedding” per year, 2001–2019 (y-axis from 0 to 0.7)]
Data from ACL papers in the ACL Anthology (https://www.aclweb.org/anthology/)
Neural networks in NLP
• Language modeling [Mikolov et al. 2010]
• Text classification [Kim 2014; Iyyer et al. 2015]
• Syntactic parsing [Chen and Manning 2014, Dyer et al. 2015, Andor et al. 2016]
• CCG super tagging [Lewis and Steedman 2014]
• Machine translation [Cho et al. 2014, Sutskever et al. 2014]
• Dialogue agents [Sordoni et al. 2015, Vinyals and Le 2015, Ji et al. 2016]
• (for overview, see Goldberg 2017, 1.3.1)
Neural networks
• Discrete, high-dimensional representation of inputs (one-hot vectors) ->
low-dimensional “distributed” representations.
• Static representations -> contextual representations, where
representations of words are sensitive to local context.
• Non-linear interactions of input features
• Multiple layers to capture hierarchical structure
Neural network libraries
Logistic regression
P(ŷ = 1) = 1 / (1 + exp(−∑_{i=1}^F x_i β_i))

feature   x    β
not       1   -0.5
bad       1   -1.7
movie     0    0.3
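A minimal numpy sketch of this computation, plugging in the x and β values from the table above:

```python
import numpy as np

# Feature values and weights from the table above: not=1, bad=1, movie=0
x = np.array([1.0, 1.0, 0.0])
beta = np.array([-0.5, -1.7, 0.3])

p_hat = 1.0 / (1.0 + np.exp(-x @ beta))   # sigmoid of the dot product
print(p_hat)                              # ≈ 0.10
```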
SGD
Calculate the derivative of some loss function with respect to parameters we
can change, update accordingly to make predictions on training data a little
less wrong next time.
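A minimal numpy sketch of one such update for logistic regression; the gradient (y − ŷ)·x used here is the standard one for the conditional log likelihood (it matches the chain-rule result derived later in the deck).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One SGD step for logistic regression on a single (x, y) training pair.
# eta here is the learning rate (not the regularization strength above).
def sgd_step(beta, x, y, eta=0.1):
    y_hat = sigmoid(x @ beta)
    gradient = (y - y_hat) * x       # derivative of the log likelihood w.r.t. beta
    return beta + eta * gradient     # ascend the likelihood (descend the loss)

beta = np.zeros(3)
x, y = np.array([1.0, 1.0, 0.0]), 1
for _ in range(100):
    beta = sgd_step(beta, x, y)
print(beta)   # weights move toward predicting y=1 for this example
```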
Logistic regression
[Diagram: inputs x1, x2, x3 connect directly to output y with weights β1, β2, β3]
P(ŷ = 1) = 1 / (1 + exp(−∑_{i=1}^F x_i β_i))

feature   x    β
not       1   -0.5
bad       1   -1.7
movie     0    0.3
Feedforward neural network
• Input and output are mediated by at least one hidden layer.
[Diagram: inputs x1, x2, x3 → hidden nodes h1, h2 → output y]
*For simplicity, we’re leaving out the bias term, but assume most layers have them as well.
[Diagram: input layer (x1, x2, x3) connected to hidden layer (h1, h2) by weights W (W1,1, W1,2, W2,1, W2,2, W3,1, W3,2); hidden layer connected to output y by weights V (V1, V2)]
[Diagram: x1, x2, x3 → h1, h2 (weights W) → y (weights V)]

         x     W (3×2)         V (2×1)    y
not      1    [-0.5   1.3 ]    [ 4.1]     1
bad      1    [ 0.4   0.08]    [-0.9]
movie    0    [ 1.7   3.1 ]
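A minimal numpy sketch of the forward pass with exactly these numbers (applying the σ nonlinearity to the hidden layer and output, as in the following slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 1.0, 0.0])              # not=1, bad=1, movie=0
W = np.array([[-0.5, 1.3],
              [ 0.4, 0.08],
              [ 1.7, 3.1]])                # input -> hidden weights
V = np.array([4.1, -0.9])                  # hidden -> output weights

h = sigmoid(x @ W)        # hidden layer, ≈ [0.48, 0.80]
y_hat = sigmoid(h @ V)    # output, ≈ 0.77
print(h, y_hat)
```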
[Diagram: x1, x2, x3 → h1, h2 (weights W) → y (weights V)]
The hidden nodes are completely determined by the input and weights:
h1 = σ(∑_{i=1}^F x_i W_{i,1})
h2 = σ(∑_{i=1}^F x_i W_{i,2})
Activation functions
σ(z) = 1 / (1 + exp(−z))
[Plot: sigmoid curve over z ∈ [−10, 10], ranging from 0 to 1]
Logistic regression
P(ŷ = 1) = 1 / (1 + exp(−∑_{i=1}^F x_i β_i)) = σ(∑_{i=1}^F x_i β_i)
[Diagram: x1, x2, x3 connect directly to y with weights β1, β2, β3]
We can think about logistic regression as a neural network with no hidden layers.
Activation functions
tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))
[Plot: tanh curve over z ∈ [−10, 10], ranging from −1 to 1]
Activation functions
ReLU(z) = max(0, z)
[Plot: ReLU over z ∈ [−10, 10], ranging from 0 to 10]
[Figure: activation functions and their derivatives; Goldberg 2017, p. 46]
• ReLU and tanh are both used extensively in modern systems.
• Sigmoid is useful for final layer to scale output between 0 and 1, but
is not often used in intermediate layers.
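A minimal numpy sketch of the three activations (np.tanh implements the definition above):

```python
import numpy as np

# The three activation functions from the slides above.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)   # equivalent to (exp(z) - exp(-z)) / (exp(z) + exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```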
[Diagram: x1, x2, x3 → h1, h2 (weights W) → y (weights V)]
h1 = σ(∑_{i=1}^F x_i W_{i,1})
h2 = σ(∑_{i=1}^F x_i W_{i,2})
ŷ = σ(V1 h1 + V2 h2)
[Diagram: x1, x2, x3 → h1, h2 (weights W) → y (weights V)]
We can express ŷ as a function only of the input x and the weights W and V:
ŷ = σ( V1 σ(∑_{i=1}^F x_i W_{i,1}) + V2 σ(∑_{i=1}^F x_i W_{i,2}) )
This is hairy, but differentiable.
Backpropagation: Given training samples of <x,y> pairs, we can use
stochastic gradient descent to find the values of W and V that minimize
the loss.
[Diagram: x1, x2, x3 → h1, h2 (weights W) → y (weights V)]
Neural networks are a series of functions chained together:
xW → σ(xW) → σ(xW)V → σ(σ(xW)V)
The loss is another function chained on top:
log σ(σ(xW)V)
Chain rule
Let’s take the likelihood for a single training example with label y = 1; we want this value to be as high as possible:
log σ(σ(xW)V)
∂ log σ(σ(xW)V) / ∂V = [∂ log σ(σ(xW)V) / ∂ σ(xW)V] · [∂ σ(xW)V / ∂V]
Writing h = σ(xW), this is:
∂ log σ(hV) / ∂V = [∂ log σ(hV) / ∂ hV] · [∂ hV / ∂V]
Chain rule
∂ log σ(hV) / ∂V = [∂ log σ(hV) / ∂ hV] · [∂ hV / ∂V]
= [1 / σ(hV)] · [σ(hV)(1 − σ(hV))] · h
= (1 − σ(hV)) · h
= (1 − ŷ) · h
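A minimal numpy check of this result (for a y = 1 example), comparing the closed form (1 − ŷ)·h against a finite-difference gradient; the values of h and V are arbitrary assumptions for the check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.array([0.48, 0.80])        # hidden activations (arbitrary example values)
V = np.array([4.1, -0.9])         # hidden -> output weights

def log_lik(V):                   # log sigma(hV) for a y=1 example
    return np.log(sigmoid(h @ V))

y_hat = sigmoid(h @ V)
analytic = (1.0 - y_hat) * h      # the (1 - y_hat) * h result derived above

eps = 1e-6                        # finite-difference estimate of each component
numeric = np.array([(log_lik(V + eps * np.eye(2)[j]) - log_lik(V - eps * np.eye(2)[j])) / (2 * eps)
                    for j in range(2)])
print(analytic, numeric)          # the two should agree closely
```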
Neural networks
• Tremendous flexibility on design choices (exchange feature
engineering for model engineering)
• Articulate model structure and use the chain rule to derive parameter
updates.
Neural network structures
[Diagram: x1, x2, x3 → h1, h2 → a single output y (= 1 in this training example)]
Binary: output one real value; a sigmoid function for the output gives a single probability between 0 and 1.
Neural network structures
[Diagram: x1, x2, x3 → h1, h2 → three outputs, with training label (0, 1, 0)]
Multiclass: output 3 values, only one = 1 in the training data; a softmax function for the output gives a probability between 0 and 1 for each class (all class probabilities sum to 1); classes compete with each other.
Neural network structures
[Diagram: inputs nsubj_walks (x1), dobj_hits (x2), nsubj_says (x3) → h1, h2 → three outputs: glad = 1, happy = 1, sad = 0]
Multilabel: output 3 values, several = 1 in the training data; a sigmoid function for each output gives the probability of the presence of that label; classes do not compete with each other, since multiple can be present together.
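A minimal numpy sketch contrasting the three output layers (the 3-class / 3-label sizes mirror the diagrams above; the weights are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))     # shift for numerical stability
    return e / e.sum()

h = np.array([0.48, 0.80])        # hidden layer from the earlier example

V_binary = np.random.randn(2)     # 2 -> 1: one probability
V_multi  = np.random.randn(2, 3)  # 2 -> 3: probabilities over competing classes
V_label  = np.random.randn(2, 3)  # 2 -> 3: one independent probability per label

p_binary     = sigmoid(h @ V_binary)    # scalar in (0, 1)
p_multiclass = softmax(h @ V_multi)     # sums to 1; classes compete
p_multilabel = sigmoid(h @ V_label)     # each in (0, 1); labels don't compete

print(p_binary, p_multiclass, p_multilabel)
```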
Regularization
• Increasing the number of parameters = increasing the possibility for
overfitting to training data
Regularization
• L2 regularization: penalize W and V for being too large.
• Dropout: when training on an <x,y> pair, randomly remove some nodes and weights (a sketch follows this list).
• Early stopping: stop backpropagation before the training error gets too small.
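A minimal sketch of (inverted) dropout on a hidden layer, assuming a drop probability of 0.5; this is an illustrative implementation, not the slides' reference code:

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True):
    """Inverted dropout: randomly zero out hidden units during training,
    scaling the survivors so the expected activation stays the same."""
    if not training:
        return h
    mask = (np.random.rand(*h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)

h = np.array([0.48, 0.80, 0.31, 0.95])
print(dropout(h))                   # roughly half the units zeroed at train time
print(dropout(h, training=False))   # unchanged at test time
```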
Deeper networks
[Diagram: inputs x1, x2, x3 → first hidden layer (weights W1) → second hidden layer (weights W2) → output y (weights V)]
Densely connected layer
[Diagram: input x (x1 ... x7) fully connected by weights W to hidden layer h (h1, h2)]
h = σ(xW)
Convolutional networks
• With convolutional networks, the same operation (i.e., the same set of parameters) is applied to different regions of the input.
2D convolution
0 0 0 0 0
0 1 1 1 0
0 1 1 1 0
0 1 1 1 0
0 0 0 0 0
[Image: 2D convolution used for blurring]
1D convolution
convolution kernel K = [1/3, 1/3, 1/3]
input x = [0, 1, 3, −1, 4, 2, 0]
output = [1⅓, 1, 2, 1⅔, 2]
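A minimal numpy check of this example using np.convolve (the kernel is symmetric, so the flip in the convolution definition doesn't change anything):

```python
import numpy as np

K = np.array([1/3, 1/3, 1/3])              # averaging kernel from the slide
x = np.array([0, 1, 3, -1, 4, 2, 0], dtype=float)

out = np.convolve(x, K, mode="valid")      # slide the kernel over x
print(out)                                 # [1.333..., 1.0, 2.0, 1.666..., 2.0]
```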
Convolutional networks
Input: the token sequence “I hated it I really hated it” (x1 ... x7)
Parameters: W1, W2, W3 (shared across all windows)
h1 = f(I, hated, it)
h2 = f(hated, it, I)
h3 = f(it, I, really)
h4 = f(I, really, hated)
h5 = f(really, hated, it)
h1 = σ(x1 W1 + x2 W2 + x3 W3)
Indicator vector
• Every token is a V-dimensional vector (size of the vocab) with a single 1 identifying the word.

vocab item   indicator
a            0
aa           0
aal          0
aalii        0
aam          0
aardvark     1
aardwolf     0
aba          0
[Diagram: one-hot token vectors x1, x2, x3 each multiplied by a weight matrix (W1, W2, W3) and summed into h1]
h1 = σ(x1 W1 + x2 W2 + x3 W3)
[Diagram: the same one-hot token vectors x1, x2, x3 and weight matrices W1, W2, W3]
For indicator vectors, we’re just adding these numbers together:
h1 = σ(W1[x1_id] + W2[x2_id] + W3[x3_id])
(Where xn_id specifies the location of the 1 in the vector, i.e., the vocabulary id, and Wn[xn_id] is that row of Wn)
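A minimal numpy sketch showing why: multiplying a one-hot vector by W just selects one row of W (i.e., an embedding lookup). The vocabulary and weight values here are toy assumptions:

```python
import numpy as np

vocab = {"I": 0, "hated": 1, "it": 2, "really": 3}
W1 = np.random.randn(len(vocab), 2)          # toy position-1 weight matrix (V x 2)

x1 = np.zeros(len(vocab))
x1[vocab["I"]] = 1.0                         # one-hot indicator vector for "I"

print(np.allclose(x1 @ W1, W1[vocab["I"]]))  # True: the product is just a row lookup
```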
[Diagram: dense input vectors x1, x2, x3 (e.g., embeddings) each multiplied by a weight matrix W1, W2, W3]
For dense input vectors (e.g., embeddings), we need the full dot product:
h1 = σ(x1 W1 + x2 W2 + x3 W3)
Pooling
• Down-samples a layer by selecting a single point from some set.
• Max-pooling selects the largest value.
[Diagram: a column of values down-sampled by taking the max within each window]
Global pooling
• Down-samples a layer by selecting a single point from some set.
• Max-pooling over time (global max pooling) selects the largest value over an entire sequence.
• Very common for NLP problems.
[Diagram: global max pooling selects the single largest value (9) from the whole sequence]
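A minimal numpy sketch of global max pooling over a sequence of per-position filter activations (the values are arbitrary):

```python
import numpy as np

# Activations of 3 filters at 5 token positions (rows = positions, cols = filters).
activations = np.array([[1.0, 0.2, -0.5],
                        [9.0, 0.1,  0.3],
                        [2.0, 4.0,  0.0],
                        [0.0, 1.5,  2.2],
                        [5.0, 0.7,  1.1]])

pooled = activations.max(axis=0)   # one number per filter, over the whole sequence
print(pooled)                      # [9.  4.  2.2]
```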
Convolutional networks
[Diagram: a convolution over tokens x1 ... x7 produces one value per window; max pooling selects the largest. This defines one filter.]
We can specify multiple filters; each filter is a separate set of parameters to be learned.
[Diagram: four filters Wa, Wb, Wc, Wd applied over tokens x1 ... x7]
h1 = σ(xW) ∈ ℝ⁴
Convolutional networks
• With max pooling, we select a single number for each filter over all tokens
• (e.g., with 100 filters, the output of the max-pooling stage is a 100-dimensional vector)
• If we specify multiple filters, we can also scope each filter over different
window sizes
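A hedged PyTorch sketch of this kind of architecture (PyTorch and every name below are assumptions of this example, not part of the slides): convolutions with several window sizes, each producing many filters, followed by global max pooling over time and a linear output layer.

```python
import torch
import torch.nn as nn

class CNNTextClassifier(nn.Module):
    """Sketch of a CNN sentence classifier: convolutions over word embeddings
    with several window sizes, global max pooling over time, then a linear layer."""
    def __init__(self, vocab_size, embed_dim=100, num_filters=100,
                 window_sizes=(2, 3, 4), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in window_sizes]
        )
        self.out = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)             # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                     # Conv1d expects (batch, channels, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1)) # (batch, num_classes) logits

model = CNNTextClassifier(vocab_size=10000)
logits = model(torch.randint(0, 10000, (8, 20)))  # batch of 8 sequences of length 20
print(logits.shape)                               # torch.Size([8, 2])
```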
Zhang and Wallace 2016, “A Sensitivity Analysis of
(and Practitioners’ Guide to) Convolutional Neural
Networks for Sentence Classification”
CNN as important ngram detector
Higher-order ngrams are much more informative than just unigrams (e.g., the phrase “i don’t like this movie” vs. its unigrams [“I”, “don’t”, “like”, “this”, “movie”]).
We can think about a CNN as providing a mechanism for detecting important (sequential) ngrams without having the burden of creating them as unique features.

Unique ngrams (1–4) in the Cornell movie review dataset:
             unique types
unigrams          50,921
bigrams          451,220
trigrams         910,694
4-grams        1,074,921