An Evolution of Deep Learning Models for AI2 Reasoning Challenge

Traian Rebedea
traian.rebedea@cs.pub.ro
Associate Professor, University Politehnica of Bucharest
Co-founder & Chief Data Scientist, RoboSelf
** Joint work with George-Sebastian Pirtoaca and Stefan Ruseti

About me
• Academic profile
• PhD in Natural Language Processing (NLP) applied in Technology Enhanced Learning - 2013
• Generating feedback to learners engaged in multi-party computer supported collaborative conversations
• Research projects involving NLP, information extraction and machine learning
• Conversational agents, question-answering, natural language interfaces to databases, opinion mining,
information extraction from public data about companies and persons
• Industrial profile
• Co-founded RoboSelf in 2019, a technology startup developing virtual personal assistants
• Innovation grant for startups - EU funded Open Data Incubator in Europe (Wholi)
• Two research projects in collaboration with companies (Bitdefender, Autonomous Systems)
• Community
• Co-founder of Bucharest Deep Learning meetup
• Co-organizer of Eastern European Machine Learning (EEML) summer school 2019
6th Mar 2020
An Evolution of Deep Learning Models for AI2 Reasoning
Challenge
2

Outline
• Introduction to Question Answering (QA)
• AI2 Reasoning Challenge (ARC)
• Strong Baselines for ARC
• Two-Stage Inference Model
• Attentive Ranker (BERT)
• Attentive Ranker (Multi)
• QA Going Further
• Conclusions

Introduction to Question Answering (QA)
• QA is one of the most studied topics in Natural Language Processing and
Information Retrieval
• Several flavours
• Factoid / Non-factoid
• Closed / Open
• Using other types of data
• VisualQA
• MovieQA
• Multimodal QA
• E.g. RecipeQA
• Knowledge-base QA
• E.g. QALD (QA over Linked Data)
• Reading Comprehension vs QA? Reasoning Challenge? Sentence Selection?

Factoid vs Non-factoid

Stanford Question Answering Dataset
(SQuAD)
• Closed reading comprehension dataset
• Some questions are factoid
• Others are simple non-factoid
• Articles from Wikipedia
• Several crowdsourced questions and spans
from the article containing the answer
• SQuAD 2.0: added more complex questions,
added negative examples
• https://rajpurkar.github.io/SQuAD-explorer/

HotpotQA
• More complex QA dataset
• Factoid questions requiring multi-hop reasoning
• Articles from Wikipedia
• Two versions
• Open (all Wikipedia)
• Closed (added several distractors)
• Two tasks
• Finding the correct answer
• Providing supporting facts
• Questions split into easy/medium/hard
• https://hotpotqa.github.io/

AI2 Reasoning Challenge (ARC)
• “Think you have Solved Question Answering?
Try ARC, the AI2 Reasoning Challenge”
• Grade-school science questions (authored for human tests)
• Multiple choice, most of them with 4 candidate answers
• Open QA, mixed factoid and non-factoid
• Largest public-domain set of this kind (7,787 questions)
• Challenge Set (2,590 questions): questions answered incorrectly by both an IR (Information
Retrieval) ranker and a word co-occurrence algorithm (PMI)
• Easy Set (5,197 questions): the remaining questions
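
The PMI baseline named above scores a candidate answer by how strongly its words co-occur with the question's words in a corpus. A minimal sketch, assuming sentence-level co-occurrence over a toy corpus; the function names and the corpus are hypothetical and this is not the ARC authors' implementation:

```python
import math

def pmi(x, y, sentences):
    """Pointwise mutual information of words x and y (co-occurrence per sentence)."""
    n = len(sentences)
    count_x = sum(1 for s in sentences if x in s)
    count_y = sum(1 for s in sentences if y in s)
    count_xy = sum(1 for s in sentences if x in s and y in s)
    if count_xy == 0:
        return 0.0  # assumption: pairs never seen together are treated as uninformative
    return math.log((count_xy / n) / ((count_x / n) * (count_y / n)))

def pmi_score(question_words, answer_words, sentences):
    """Average PMI over all (question word, answer word) pairs."""
    pairs = [(q, a) for q in question_words for a in answer_words]
    return sum(pmi(q, a, sentences) for q, a in pairs) / len(pairs)

# Toy corpus: each sentence is a set of lowercase words (hypothetical data).
corpus = [
    {"plants", "need", "sunlight", "to", "grow"},
    {"plants", "absorb", "sunlight"},
    {"rocks", "are", "hard"},
    {"sunlight", "is", "energy"},
]

related = pmi_score(["plants"], ["sunlight"], corpus)
unrelated = pmi_score(["plants"], ["rocks"], corpus)
```

A question lands in the Challenge Set when this kind of surface statistic (and the IR ranker) picks a wrong candidate, which is why co-occurrence alone is not enough there.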

AI2 Reasoning Challenge (ARC)
• ARC is a refinement of previous science
reasoning challenge datasets proposed
by AI2
• Challenge dataset requires various types
of reasoning
• Some of them are multi-hop

Strong Baselines for ARC
• The Challenge set was very difficult to
solve, not only for the co-occurrence
baselines (IR, PMI) but also for state-of-the-art
deep learning models from 2018
• BiDAF and Decomposable Attention are deep
learning models
• TableILP is symbolic, using integer linear
programming; DGEM mixes deep learning with
statistical/rule-based components (OpenIE)
• Most models that perform very well on the
Easy set have poor results on the Challenge set
• No model was significantly better than the
random-guess baseline

Two-Stage Inference Model
• Premise: complex questions require models that can (at least partially)
understand the context of the question and perform some form of
inference to determine the correct answer
• Two-stage model that combines an information retrieval (IR) engine
with several deep learning architectures (called solvers)

Two-Stage Inference Model – Stage 1
• Extract relevant contexts for each
(question, candidate answer) pair
using an IR engine
• Use Lucene for indexing and searching
English Wikipedia, science books
collected from CK-12, and ARC Corpus
• Term-based weighting for Lucene
using a semantic essentialness score
computed by a simple NN trained on
semantic and syntactic word features
(2.2k questions manually annotated
with term essentialness)
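
The term-weighted retrieval step can be illustrated with a small stand-in for the IR engine. This is a sketch only: it uses a plain TF-IDF score in pure Python instead of Lucene, and the `essentialness` values are hand-written placeholders for the scores that the talk's neural network would predict. The `retrieve` function and the sample documents are hypothetical.

```python
import math
from collections import Counter

def retrieve(query_terms, essentialness, docs, top_k=2):
    """Rank docs by essentialness-weighted TF-IDF overlap with the query terms."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    scored = []
    for doc, toks in zip(docs, tokenized):
        tf = Counter(toks)
        score = sum(
            # Terms without an essentialness score default to weight 1.0 (assumption).
            essentialness.get(t, 1.0) * tf[t] * math.log((n + 1) / (df[t] + 1))
            for t in query_terms
        )
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

docs = [
    "the moon orbits the earth",
    "the earth orbits the sun",
    "plants need water",
]
# Question + candidate answer terms, with hypothetical essentialness scores.
query = ["which", "body", "orbits", "the", "earth", "moon"]
weights = {"moon": 1.0, "earth": 0.8, "orbits": 0.6, "body": 0.3, "which": 0.05, "the": 0.05}
top = retrieve(query, weights, docs, top_k=1)
```

Down-weighting function words such as "which" and "the" keeps the ranking driven by the terms that carry the question's meaning.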

Two-Stage Inference Model – Stage 2
• Construct several (more complex) models to predict if an answer is
correct based on additional information inferred from the contexts
• Called solvers
• Several deep learning models fed with a (question, answer, context)
triplet and trained to predict the likelihood that the answer is correct
given the question and the current context
• Models pretrained on different NLP tasks and fine-tuned for multiple-
choice QA
• Ensemble model with a simple voting NN that computes the final
score

Two-Stage Inference Model - Solvers
• First solver computes a more efficient semantic
similarity using word embeddings and RNNs
• Adapted the Bidirectional Attention Flow (BiDAF)
architecture proposed for SQuAD to process (Q, A, C)
triplets
• Pre-trained on SQuAD v1.1, after transforming it into a
dataset suitable for multiple-choice QA by generating
wrong candidate answers
• Second solver employs neural models for natural
language inference (NLI)
• Reframe (Q, A, C) triplets as NLI: transform the pair
(Q, A) into an affirmative sentence that forms the
hypothesis; the context from the IR engine acts as
the premise
• Adapted the BiDAF architecture to perform NLI by
changing the output layer to a 3-way softmax layer:
entailment, neutral, or contradiction
• Pre-trained on three large NLI datasets: SNLI, MultiNLI,
and SciTail
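
The reframing of a (Q, A, C) triplet as an NLI example can be sketched as follows. The rewriting rules here are deliberately naive and hypothetical (fill a blank, or swap a leading wh-word for the answer); the slides do not specify the rules actually used, and real questions need more careful handling.

```python
WH_WORDS = ("what", "which", "who", "where", "when")

def to_hypothesis(question, answer):
    """Turn a (question, answer) pair into a declarative sentence using naive rules."""
    q = question.strip().rstrip("?").strip()
    words = q.split()
    if "___" in q:
        # Fill-in-the-blank question: substitute the answer for the blank.
        statement = q.replace("___", answer)
    elif words and words[0].lower() in WH_WORDS:
        # Leading wh-word: replace it with the candidate answer.
        statement = " ".join([answer] + words[1:])
    else:
        # Fallback: keep the question text and append the answer.
        statement = f"{q}: {answer}"
    statement = statement[0].upper() + statement[1:]
    return statement.rstrip(".") + "."

def to_nli_example(question, answer, context):
    """The retrieved context is the premise; the rewritten (Q, A) is the hypothesis."""
    return {"premise": context, "hypothesis": to_hypothesis(question, answer)}

example = to_nli_example(
    "What orbits the Earth?",
    "the Moon",
    "The Moon is the only natural satellite of the Earth.",
)
```

An NLI model then labels each (premise, hypothesis) pair, and the candidate whose hypothesis is most strongly entailed by its contexts is preferred.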

Two-Stage Inference Model - Results
• The only model in early 2019 that obtained good performance on both the
Challenge and Easy datasets
• 2nd place for Easy; 8th place for Challenge (without BERT or symbolic components)
• Possible improvements
• Using a better knowledge base to find candidate contexts
• Adding additional solvers (more powerful, e.g. BERT based)

Attentive Ranker (BERT)
Improvements over the previous model
1. Introduce a self-attention based neural network, called Attentive
Ranker, that latently learns to rank documents (answering questions
by L2R) by their importance to a given question, whilst
optimizing the objective of predicting the correct answer (L2R by
answering questions)
2. Add several candidate contexts for each candidate answer
3. Use BERT to combine (Q, A) and all candidate contexts

Attentive Ranker: Answering Questions by L2R
• The Attentive Ranker latently learns to rank
supporting documents (contexts) for each
candidate answer at a semantic level
• Semantically rank the first N retrieved
documents vs. sort them by a lexical metric
(e.g. TF-IDF, BM25) => improves question
answering
• Computing if a document is relevant given a
(question, candidate answer) pair uses a set
of weak discriminators:
• Document Relevance Discriminator (DRD,
trained on modified SQuAD)
• Answer Verifier Discriminator (AVD, trained on
RACE)
• TF-IDF Discriminator

Attentive Ranker: L2R by Answering Questions
• The Attentive Ranker is trained to
predict the correct answer to a
question, given a set of top documents
supporting each candidate answer, in a
bootstrapping fashion
• In the forward pass, the model first
computes the document importance
scores, which are further used to predict
the correct answer.
• During backpropagation, the ranking
parameters are also optimized, latently
improving the L2R quality.
• In the next iteration, a better L2R
performance leads to more accurate
question answering.
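
The forward pass described above can be sketched as attention-weighted pooling. This is a strong simplification under stated assumptions: the relevance logits and per-document support scores are given as plain numbers (in the talk they come from the discriminators and BERT), and `answer_probabilities` is a hypothetical name, not the authors' code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def answer_probabilities(relevance_logits, support_scores):
    """relevance_logits[i][j] and support_scores[i][j]: candidate answer i, document j."""
    candidate_scores = []
    for rel, sup in zip(relevance_logits, support_scores):
        weights = softmax(rel)  # document importance scores for this candidate
        # Each document's support for the answer counts in proportion to its importance.
        candidate_scores.append(sum(w * s for w, s in zip(weights, sup)))
    return softmax(candidate_scores)  # distribution over candidate answers

# Two candidate answers, two retrieved documents each (hypothetical numbers).
relevance = [[2.0, 0.0], [0.0, 0.0]]
support = [[3.0, -1.0], [0.5, 0.5]]
probs = answer_probabilities(relevance, support)
```

Because the answer loss is computed on the output of the weighted sum, its gradient also reaches the parameters that produce the relevance logits in a trainable version, which is how ranking can improve without any explicit ranking labels.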

Attentive Ranker – Results
• The proposed model achieved 1st place on both the Easy and Challenge datasets
at the time it was proposed
• Later, it was surpassed by BERT pretrained on larger datasets of science-related
texts
• And by more powerful transformers, e.g. ALBERT
• Replacing TF-IDF/doc2vec sorted documents with our Attentive Ranker substantially
improves the accuracy of various downstream decision models (e.g. BERT)

Attentive Ranker – Results
• Combining several weak discriminators improves accuracy
• Using multiple candidate documents is better (~20 for Easy, ~50 for Challenge)

Attentive Ranker (Multi)
• Add more powerful transformer-based discriminators
• XLNet, RoBERTa, ALBERT
• Their decisions are correlated, but only moderately
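
One way to check how correlated two discriminators are is to compare their per-question correctness. A minimal sketch, assuming each model's results are stored as a 0/1 vector over the same questions; the vectors below are made-up illustration data, not results from the talk:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical per-question correctness (1 = correct) for two discriminators.
model_a = [1, 1, 0, 1, 0, 1, 1, 0]
model_b = [1, 0, 0, 1, 1, 1, 0, 0]
r = pearson(model_a, model_b)
```

A moderate correlation means the models make different mistakes, which is the situation where combining them is most likely to help.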

QA Going Further
• https://leaderboard.allenai.org/arc/submissions/public

QA Going Further
• Finetune transformers on larger texts similar to the QA dataset?
• E.g. science; maybe simpler, but not very easy
• Adding more QA pairs in the dataset?
• Difficult, takes time and human annotators
• Humans are able to learn without looking at any QA pairs, only by reading texts
• Adversarial training?
• This seems to be the next technological advance for NLP
• E.g. FreeLB - https://arxiv.org/abs/1909.11764 (improves results on several applied
NLP tasks, e.g. QA, NLI, semantic similarity); accepted with maximum scores at ICLR
2020
• Previously, FreeAT obtained very good results for other QA tasks
• New ideas??? 

Conclusions
• Question Answering comes in various flavors
• Deep learning models for text representation (esp. RNNs, transformers) have improved results for
all datasets / tasks
• Human-level performance is still far off for most tasks
• For some simpler datasets (e.g. SQuAD), there is a claim of surpassing human performance
• For more complex datasets (e.g. ARC, MultihopQA) that require (some) reasoning, top solutions are still (far)
below human performance
• For small datasets, performance is quite poor
• Open QA is also particularly hard because we still rely on an IR engine to get supporting
documents (candidate contexts)
• Improve this component by adding new terms to the question (maybe use Reinforcement learning for this?)
• Interesting results from adversarial training for NLP
• More on QA progress: http://nlpprogress.com/english/question_answering.html

Thank you!
traian.rebedea@cs.pub.ro