Presenter: Z. Chen
Points
• A neural discriminative constituency parser: F1 93.55
• Chart parser/decoder
• Encoder-decoder style discriminative constituency parsing: the architecture
• The structural meaning of multi-headed self-attention for constituency parsing
• 8-layer, 8-head transformer encoder + BiLSTM decoder
• Analysis by input ablation: word, POS, and position
• Position or content (POS ⟺ morphology; ELMo / CharConcat)
• Metric of tree-structure accuracy: PARSEVAL
Constituency Parsing
Grammar structure: CKY algorithm, Chomsky CFG, transition-based; chart parser
NLP tutorial 10 (11↑):
• Probability as score
• Bottom-up combination (bracketing per se)
• Beam search
[Slide diagram, roadmap of sections 3–7: the chart parser as godfather; a Transformer over word+POS+position input; decomposition; a BiLSTM for fence points.]
Incrementally build up
[Slide diagram: <bos> W0 W1 W2 W3 W4 <eos>, with fence points between the tokens; CKY builds up spans over them.]
Incrementally build up
Score for a bracket s(i, j, l): computed by the decoder, where i, j are fence points and l is a label.
How to deal with a non-phrase (a span that is not a constituent)?
• CKY: assign it little probability (PCFG)
• Chen (me): a <nil> tag / vector
• This research: fix s(i, j, ∅) = 0
↕ i.e., train with ∅ or with <nil>
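The s(i, j, ∅) = 0 convention can be sketched as a small CKY-style chart decoder. This is my own minimal illustration, not the paper's implementation; the span scores below are made-up numbers:

```python
def chart_decode(scores):
    """CKY-style chart decoding over fence points 0..T.

    `scores` maps a span (i, j) of fence points to a {label: score}
    dict. Every span may also take the empty label with score fixed
    at 0 (the s(i, j, 0) = 0 convention), so spans that are not
    constituents neither help nor hurt the tree score.
    Returns (best total score, list of labelled brackets).
    """
    T = max(j for _, j in scores)
    best, back = {}, {}
    for length in range(1, T + 1):
        for i in range(T - length + 1):
            j = i + length
            # Best label for (i, j); (None, 0.0) stands for the empty label.
            label, label_score = max(
                [(None, 0.0)] + list(scores.get((i, j), {}).items()),
                key=lambda kv: kv[1])
            if length == 1:
                best[(i, j)], back[(i, j)] = label_score, (label, None)
            else:
                split = max(range(i + 1, j),
                            key=lambda k: best[(i, k)] + best[(k, j)])
                best[(i, j)] = label_score + best[(i, split)] + best[(split, j)]
                back[(i, j)] = (label, split)

    def read(i, j):
        label, split = back[(i, j)]
        kids = [] if split is None else read(i, split) + read(split, j)
        return ([(i, j, label)] if label is not None else []) + kids

    return best[(0, T)], read(0, T)

# Toy 3-word sentence: the best tree is (S (NP w0 w1) w2).
score, brackets = chart_decode(
    {(0, 3): {'S': 2.0}, (0, 2): {'NP': 1.0}, (1, 3): {'VP': 0.5}})
# → score 3.0, brackets [(0, 3, 'S'), (0, 2, 'NP')]
```

Note how the unlabelled single-word spans and the unused (1, 3) span simply fall back to the zero-scoring empty label instead of needing an explicit <nil> symbol.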
Encoder: linguistic information
Word embedding wt, POS embedding mt, position embedding pt
Input Z ∈ ℝ^(T × dmodel)
Component-wise add: zt = wt + mt + pt
From then on, zt is fed into the Transformer, and the width dmodel is kept throughout the encoder.
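The component-wise sum can be sketched as follows; the table sizes and dmodel here are illustrative placeholders, not the paper's hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model = 5, 64                              # illustrative sizes only

# Hypothetical learned lookup tables, one row per symbol.
word_table = rng.normal(size=(1000, d_model))   # word embeddings w
tag_table = rng.normal(size=(50, d_model))      # POS-tag embeddings m
pos_table = rng.normal(size=(512, d_model))     # position embeddings p

word_ids = np.array([3, 14, 15, 92, 6])         # a 5-token sentence
tag_ids = np.array([1, 2, 2, 3, 4])

# zt = wt + mt + pt: the three tables must share d_model, and the sum
# keeps that width, so Z has shape (T, d_model) entering the encoder.
Z = word_table[word_ids] + tag_table[tag_ids] + pos_table[np.arange(T)]
assert Z.shape == (T, d_model)
```

Addition (rather than concatenation) is what makes the three signals overlap in the same dmodel dimensions, which the ablation slides below probe.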
Encoder: linguistic information
[Slide diagram: per-layer vectors zt, xt, yt flowing through the encoder.]
Encoder: linguistic information
qt = W_Q^T xt,  kt = W_K^T xt,  vt = W_V^T xt
p(i → j) is the attention weight from qi ⋅ kj;  v̄i = Σ_j p(i → j) vj
“gather information from up to 8 remote locations”
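A minimal single-head sketch of the attention above (the model runs 8 such heads per layer, hence "up to 8 remote locations"); variable names are mine, not the paper's:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention.

    X: (T, d) inputs; qt = W_Q^T xt, kt = W_K^T xt, vt = W_V^T xt.
    p(i -> j) = softmax_j(qi . kj / sqrt(d_k)); output is
    the weighted sum over values, one vector per position i.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(logits - logits.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)          # row i of P is p(i -> j)
    return P @ V, P

rng = np.random.default_rng(0)
T, d, d_k = 4, 8, 8
X = rng.normal(size=(T, d))
out, P = self_attention(X, rng.normal(size=(d, d_k)),
                        rng.normal(size=(d, d_k)),
                        rng.normal(size=(d, d_k)))
assert out.shape == (T, d_k)
assert np.allclose(P.sum(axis=1), 1.0)          # each row is a distribution
```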
Decoder again
Wi … Wj
Run a BiRNN once; run an FFN T(T+1)/2 times, once per span.
“92.67 F1 on Penn Treebank WSJ dev set”
We must be the 2018 champion! (my heart almost cries out)
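The cost breakdown above can be sketched as follows. This is a toy illustration under my own assumptions: the sentence-level network runs once to produce one vector per fence point, and a per-span scorer (here a stand-in linear "FFN" built from differences of fence-point vectors, a common chart-parser choice) runs once per span:

```python
import numpy as np

def score_all_spans(fence, score_ffn):
    """Score every span (i, j) from fence-point vectors.

    fence: (T+1, d) vectors, one per fence point, produced by a single
    pass of a sentence-level network (e.g. a BiRNN). The per-span
    feature here is fence[j] - fence[i]; score_ffn is then applied
    once per span, i.e. T(T+1)/2 calls in total.
    """
    T = fence.shape[0] - 1
    spans = [(i, j) for i in range(T) for j in range(i + 1, T + 1)]
    assert len(spans) == T * (T + 1) // 2
    return {(i, j): score_ffn(fence[j] - fence[i]) for (i, j) in spans}

rng = np.random.default_rng(0)
T, d = 5, 16
fence = rng.normal(size=(T + 1, d))             # BiRNN output, run once
W = rng.normal(size=(d,))
scores = score_all_spans(fence, lambda f: float(W @ f))  # toy linear "FFN"
assert len(scores) == T * (T + 1) // 2          # 15 spans for T = 5
```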
Analysis by Input Ablation
zt = wt + mt + pt
Word, POS, and position embeddings are added together, so they overlap:
qt = W_Q^T zt,  kt = W_K^T zt,  vt = W_V^T zt → p(i → j), v̄t
Ablated variant, attending by position only: qt = W_Q^T pt,  kt = W_K^T pt,  vt = W_V^T zt
Disabled layer by layer.
“it seems strange that content-based attention benefits our model to such a small degree.”
Decomposition of input and weights
1. Decompose the input:
zt = wt + mt + pt → F1 92.60
zt = [wt + mt; pt] → F1 92.67
2. Decompose the attention:
With q = q(c) + q(p) and k = k(c) + k(p),
q ⋅ k = (q(c) + q(p)) ⋅ (k(c) + k(p)),
so everything mixes up: the cross-terms q(c) ⋅ k(p) + q(p) ⋅ k(c) appear.
An example of a cross-term: “the word the always attends to the 5th position in the sentence”
Factored instead: xt = [x(c); x(p)], c = Wx = [c(c); c(p)] = [W(c) x(c); W(p) x(p)]
→ F1 93.15 (+0.5)
All on the dev set.
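The factored attention can be sketched as keeping separate content and position halves with their own projections, so the attention logit is q(c)⋅k(c) + q(p)⋅k(p) and the cross-terms cannot arise. A minimal sketch under my own naming, not the paper's code:

```python
import numpy as np

def factored_attention(Xc, Xp, Wc, Wp):
    """Factored self-attention: content and position halves kept apart.

    With xt = [x(c); x(p)] and block-diagonal projections, the logit
    splits into q(c).k(c) + q(p).k(p); cross-terms like q(c).k(p)
    (e.g. "the word 'the' always attends to the 5th position") are
    ruled out by construction.
    """
    def head(X, W):
        Q, K, V = X @ W['Q'], X @ W['K'], X @ W['V']
        return Q @ K.T / np.sqrt(Q.shape[-1]), V

    logit_c, Vc = head(Xc, Wc)
    logit_p, Vp = head(Xp, Wp)
    logits = logit_c + logit_p                  # only same-half dot products
    P = np.exp(logits - logits.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    # Values stay factored too: the output is [P@Vc; P@Vp].
    return np.concatenate([P @ Vc, P @ Vp], axis=-1)

rng = np.random.default_rng(0)
T, dc, dp = 4, 8, 8
make = lambda d: {k: rng.normal(size=(d, d)) for k in 'QKV'}
out = factored_attention(rng.normal(size=(T, dc)), rng.normal(size=(T, dp)),
                         make(dc), make(dp))
assert out.shape == (T, dc + dp)
```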
Analysis by Constraints
“When we began to investigate how the model makes use of long-distance attention, we found that there are particular attention heads at some layers in our model that almost always attend to the start token.”
RECALL: there are 8 heads in each of the transformer layers.
“This suggests that the start token is being used as the location for some sentence-wide pooling/processing, or perhaps as a dummy target location when a head fails to find the particular phenomenon that it’s learned to search for.”
In short, it is a dustbin for redundant attention.
WinA vs. WinA + some spec
← Train with a window and then test on dev
8 layers :)
5 Lexical Models
POS tags from the Stanford parser; zt = [wt + mt; pt]
-4 layers at ELMo
pneumonoultramicroscopicsilicovolcanoconiosis
>> Longtu’s
Finale

N20181126
