Presenter: Z. Chen
Points
• A neural discriminative constituency parser: F1 93.55
• Chart parser/decoder
• Encoder-decoder style discriminative constituency parsing: the architecture
• The structural meaning of multi-headed self-attention for constituency parsing
• 8-layer, 8-head transformer encoder + BiLSTM decoder
• Analysis by input ablation: word, POS, and position
• Position or content (POS ⟺ morphology; ELMo / CharConcat)
• Metric of tree-structure accuracy: PARSEVAL
Constituency Parsing
Grammar structure: CKY algorithm, Chomsky CFG, transition-based; chart parser
NLP tutorial 10 (11↑):
• Probability as score
• Bottom-up combination (bracketing per se)
• Beam search
[Slide diagram, roadmap of sections 3–7: the chart parser as godfather; a Transformer over word+POS+position input; decomposition; a BiLSTM for fence points.]
Incrementally build up
[Slide diagram: <bos> W0 W1 W2 W3 W4 <eos>, with fence points between the tokens; CKY builds up spans over them.]
Incrementally build up
Score for a bracket s(i, j, l): computed by the decoder, where i, j are fence points and l is a label.
How to deal with a non-phrase (a span that is not a constituent)?
• CKY: assign it little probability (PCFG)
• Chen (me): a <nil> tag / vector
• This research: fix s(i, j, ∅) = 0
↕ i.e., train with ∅ or with <nil>
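The s(i, j, ∅) = 0 convention can be sketched as a small CKY-style chart decoder. This is my own minimal illustration, not the paper's implementation; the span scores below are made-up numbers:

```python
def chart_decode(scores):
    """CKY-style chart decoding over fence points 0..T.

    `scores` maps a span (i, j) of fence points to a {label: score}
    dict. Every span may also take the empty label with score fixed
    at 0 (the s(i, j, 0) = 0 convention), so spans that are not
    constituents neither help nor hurt the tree score.
    Returns (best total score, list of labelled brackets).
    """
    T = max(j for _, j in scores)
    best, back = {}, {}
    for length in range(1, T + 1):
        for i in range(T - length + 1):
            j = i + length
            # Best label for (i, j); (None, 0.0) stands for the empty label.
            label, label_score = max(
                [(None, 0.0)] + list(scores.get((i, j), {}).items()),
                key=lambda kv: kv[1])
            if length == 1:
                best[(i, j)], back[(i, j)] = label_score, (label, None)
            else:
                split = max(range(i + 1, j),
                            key=lambda k: best[(i, k)] + best[(k, j)])
                best[(i, j)] = label_score + best[(i, split)] + best[(split, j)]
                back[(i, j)] = (label, split)

    def read(i, j):
        label, split = back[(i, j)]
        kids = [] if split is None else read(i, split) + read(split, j)
        return ([(i, j, label)] if label is not None else []) + kids

    return best[(0, T)], read(0, T)

# Toy 3-word sentence: the best tree is (S (NP w0 w1) w2).
score, brackets = chart_decode(
    {(0, 3): {'S': 2.0}, (0, 2): {'NP': 1.0}, (1, 3): {'VP': 0.5}})
# → score 3.0, brackets [(0, 3, 'S'), (0, 2, 'NP')]
```

Note how the unlabelled single-word spans and the unused (1, 3) span simply fall back to the zero-scoring empty label instead of needing an explicit <nil> symbol.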
Encoder: linguistic information
Word embedding wt, POS embedding mt, position embedding pt
Input Z ∈ ℝ^(T × dmodel)
Component-wise add: zt = wt + mt + pt
From then on, zt is fed into the Transformer, and the width dmodel is kept throughout the encoder.
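The component-wise sum can be sketched as follows; the table sizes and dmodel here are illustrative placeholders, not the paper's hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model = 5, 64                              # illustrative sizes only

# Hypothetical learned lookup tables, one row per symbol.
word_table = rng.normal(size=(1000, d_model))   # word embeddings w
tag_table = rng.normal(size=(50, d_model))      # POS-tag embeddings m
pos_table = rng.normal(size=(512, d_model))     # position embeddings p

word_ids = np.array([3, 14, 15, 92, 6])         # a 5-token sentence
tag_ids = np.array([1, 2, 2, 3, 4])

# zt = wt + mt + pt: the three tables must share d_model, and the sum
# keeps that width, so Z has shape (T, d_model) entering the encoder.
Z = word_table[word_ids] + tag_table[tag_ids] + pos_table[np.arange(T)]
assert Z.shape == (T, d_model)
```

Addition (rather than concatenation) is what makes the three signals overlap in the same dmodel dimensions, which the ablation slides below probe.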
Encoder: linguistic information
[Slide diagram: per-layer vectors zt, xt, yt flowing through the encoder.]
Encoder: linguistic information
qt = W_Q^T xt,  kt = W_K^T xt,  vt = W_V^T xt
p(i → j) is the attention weight from qi ⋅ kj;  v̄i = Σ_j p(i → j) vj
“gather information from up to 8 remote locations”
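A minimal single-head sketch of the attention above (the model runs 8 such heads per layer, hence "up to 8 remote locations"); variable names are mine, not the paper's:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention.

    X: (T, d) inputs; qt = W_Q^T xt, kt = W_K^T xt, vt = W_V^T xt.
    p(i -> j) = softmax_j(qi . kj / sqrt(d_k)); output is
    the weighted sum over values, one vector per position i.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(logits - logits.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)          # row i of P is p(i -> j)
    return P @ V, P

rng = np.random.default_rng(0)
T, d, d_k = 4, 8, 8
X = rng.normal(size=(T, d))
out, P = self_attention(X, rng.normal(size=(d, d_k)),
                        rng.normal(size=(d, d_k)),
                        rng.normal(size=(d, d_k)))
assert out.shape == (T, d_k)
assert np.allclose(P.sum(axis=1), 1.0)          # each row is a distribution
```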
Decoder again
Wi … Wj
Run a BiRNN once; run an FFN T(T+1)/2 times, once per span.
“92.67 F1 on Penn Treebank WSJ dev set”
We must be the 2018 champion! (my heart almost cries out)
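The cost breakdown above can be sketched as follows. This is a toy illustration under my own assumptions: the sentence-level network runs once to produce one vector per fence point, and a per-span scorer (here a stand-in linear "FFN" built from differences of fence-point vectors, a common chart-parser choice) runs once per span:

```python
import numpy as np

def score_all_spans(fence, score_ffn):
    """Score every span (i, j) from fence-point vectors.

    fence: (T+1, d) vectors, one per fence point, produced by a single
    pass of a sentence-level network (e.g. a BiRNN). The per-span
    feature here is fence[j] - fence[i]; score_ffn is then applied
    once per span, i.e. T(T+1)/2 calls in total.
    """
    T = fence.shape[0] - 1
    spans = [(i, j) for i in range(T) for j in range(i + 1, T + 1)]
    assert len(spans) == T * (T + 1) // 2
    return {(i, j): score_ffn(fence[j] - fence[i]) for (i, j) in spans}

rng = np.random.default_rng(0)
T, d = 5, 16
fence = rng.normal(size=(T + 1, d))             # BiRNN output, run once
W = rng.normal(size=(d,))
scores = score_all_spans(fence, lambda f: float(W @ f))  # toy linear "FFN"
assert len(scores) == T * (T + 1) // 2          # 15 spans for T = 5
```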
Analysis by Input Ablation
zt = wt + mt + pt
Word, POS, and position embeddings are added together, so they overlap:
qt = W_Q^T zt,  kt = W_K^T zt,  vt = W_V^T zt → p(i → j), v̄t
Ablated variant, attending by position only: qt = W_Q^T pt,  kt = W_K^T pt,  vt = W_V^T zt
Disabled layer by layer.
“it seems strange that content-based attention benefits our model to such a small degree.”
Decomposition of input and weights
1. Decompose the input:
zt = wt + mt + pt → F1 92.60
zt = [wt + mt; pt] → F1 92.67
2. Decompose the attention:
With q = q(c) + q(p) and k = k(c) + k(p),
q ⋅ k = (q(c) + q(p)) ⋅ (k(c) + k(p)),
so everything mixes up: the cross-terms q(c) ⋅ k(p) + q(p) ⋅ k(c) appear.
An example of a cross-term: “the word the always attends to the 5th position in the sentence”
Factored instead: xt = [x(c); x(p)], c = Wx = [c(c); c(p)] = [W(c) x(c); W(p) x(p)]
→ F1 93.15 (+0.5)
All on the dev set.
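The factored attention can be sketched as keeping separate content and position halves with their own projections, so the attention logit is q(c)⋅k(c) + q(p)⋅k(p) and the cross-terms cannot arise. A minimal sketch under my own naming, not the paper's code:

```python
import numpy as np

def factored_attention(Xc, Xp, Wc, Wp):
    """Factored self-attention: content and position halves kept apart.

    With xt = [x(c); x(p)] and block-diagonal projections, the logit
    splits into q(c).k(c) + q(p).k(p); cross-terms like q(c).k(p)
    (e.g. "the word 'the' always attends to the 5th position") are
    ruled out by construction.
    """
    def head(X, W):
        Q, K, V = X @ W['Q'], X @ W['K'], X @ W['V']
        return Q @ K.T / np.sqrt(Q.shape[-1]), V

    logit_c, Vc = head(Xc, Wc)
    logit_p, Vp = head(Xp, Wp)
    logits = logit_c + logit_p                  # only same-half dot products
    P = np.exp(logits - logits.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    # Values stay factored too: the output is [P@Vc; P@Vp].
    return np.concatenate([P @ Vc, P @ Vp], axis=-1)

rng = np.random.default_rng(0)
T, dc, dp = 4, 8, 8
make = lambda d: {k: rng.normal(size=(d, d)) for k in 'QKV'}
out = factored_attention(rng.normal(size=(T, dc)), rng.normal(size=(T, dp)),
                         make(dc), make(dp))
assert out.shape == (T, dc + dp)
```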
Analysis by Constraints
“When we began to investigate how the model makes use of long-distance attention, we found that there are particular attention heads at some layers in our model that almost always attend to the start token.”
RECALL: there are 8 heads in each of the transformer layers.
“This suggests that the start token is being used as the location for some sentence-wide pooling/processing, or perhaps as a dummy target location when a head fails to find the particular phenomenon that it’s learned to search for.”
In short, it is a dustbin for redundant attention.
WinA vs. WinA + some spec
← Train with a window and then test on dev
8 layers :)
5 Lexical Models
POS tags from the Stanford parser; zt = [wt + mt; pt]
-4 layers at ELMo
pneumonoultramicroscopicsilicovolcanoconiosis
>> Longtu’s
Finale

N20181126
