An Evolution of
Deep Learning Models
for AI2 Reasoning Challenge
Traian Rebedea
traian.rebedea@cs.pub.ro
Associate Professor, University Politehnica of Bucharest
Co-founder & Chief Data Scientist, RoboSelf
** work with George-Sebastian Pirtoaca and Stefan Ruseti
About me
• Academic profile
• PhD in Natural Language Processing (NLP) applied in Technology Enhanced Learning - 2013
• Generating feedback to learners engaged in multi-party computer supported collaborative conversations
• Research projects involving NLP, information extraction and machine learning
• Conversational agents, question-answering, natural language interfaces to databases, opinion mining,
information extraction from public data about companies and persons
• Industrial profile
• Co-founded Roboself in 2019, a technological startup developing virtual personal assistants
• Innovation grant for startups - EU funded Open Data Incubator in Europe (Wholi)
• Two research projects in collaboration with companies (Bitdefender, Autonomous Systems)
• Community
• Co-founder of Bucharest Deep Learning meetup
• Co-organizer of Eastern European Machine Learning (EEML) summer school 2019
6th Mar 2020
An Evolution of Deep Learning Models for AI2 Reasoning
Challenge
2
Outline
• Introduction to Question Answering (QA)
• AI2 Reasoning Challenge (ARC)
• Strong Baselines for ARC
• Two-Stage Inference Model
• Attentive Ranker (BERT)
• Attentive Ranker (Multi)
• QA Going Further
• Conclusions
Introduction to Question Answering (QA)
• QA is one of the most studied topics in Natural Language Processing and
Information Retrieval
• Several flavours
• Factoid / Non-factoid
• Closed / Open
• Using other types of data
• VisualQA
• MovieQA
• Multimodal QA
• E.g. RecipeQA
• Knowledge-base QA
• E.g. QALD (QA over Linked Data)
• Reading Comprehension vs QA? Reasoning Challenge? Sentence Selection?
Factoid vs Non-factoid
Stanford Question Answering Dataset
(SQuAD)
• Closed reading comprehension dataset
• Some questions are factoid
• Others are simple non-factoid
• Articles from Wikipedia
• Several crowdsourced questions and spans
from the article containing the answer
• SQuAD 2.0: added more complex questions,
added negative examples
• https://rajpurkar.github.io/SQuAD-explorer/
HotpotQA
• More complex QA dataset
• Factoid questions requiring multi-hop reasoning
• Articles from Wikipedia
• Two versions
• Open (all Wikipedia)
• Closed (added several distractors)
• Two tasks
• Finding the correct answer
• Providing supporting facts
• Questions split into easy/medium/hard
• https://hotpotqa.github.io/
AI2 Reasoning Challenge (ARC)
• “Think you have Solved Question Answering?
Try ARC, the AI2 Reasoning Challenge”
• Grade-school science questions (authored for human tests)
• Multiple choice, most of them with 4 candidate answers
• Open QA, mixed factoid and non-factoid
• Largest public-domain set of this kind (7,787 questions)
• Challenge Set (2590 questions): questions answered incorrectly by an IR (Information
Retrieval) ranker and a word co-occurrence algorithm (PMI)
• Easy Set (5197 questions): rest of them
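The PMI baseline used to define the Challenge Set scores a candidate answer by how strongly its words co-occur with the question words in a corpus. A minimal sketch of that idea follows, assuming a toy in-memory corpus and treating each sentence as one co-occurrence window; the names `pmi` and `score_answer` are illustrative and not taken from the ARC baseline code.

```python
import math
from itertools import product

# Toy corpus; every sentence is one co-occurrence window (assumption).
CORPUS = [
    "plants use sunlight to make food",
    "sunlight gives plants energy",
    "rocks are made of minerals",
    "minerals form rocks over time",
]
DOCS = [set(s.split()) for s in CORPUS]


def pmi(x: str, y: str) -> float:
    """Pointwise mutual information of two words over the toy corpus."""
    n = len(DOCS)
    p_x = sum(x in d for d in DOCS) / n
    p_y = sum(y in d for d in DOCS) / n
    p_xy = sum(x in d and y in d for d in DOCS) / n
    if p_xy == 0:
        return 0.0  # unseen pairs contribute nothing in this sketch
    return math.log(p_xy / (p_x * p_y))


def score_answer(question: str, answer: str) -> float:
    """Average PMI over all (question word, answer word) pairs."""
    pairs = list(product(question.split(), answer.split()))
    return sum(pmi(q, a) for q, a in pairs) / len(pairs)


good = score_answer("plants energy", "sunlight")
bad = score_answer("plants energy", "minerals")
```

A question lands in the Challenge Set when this kind of co-occurrence score (and the IR ranker) picks a wrong candidate, so by construction surface statistics alone are not enough there.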
AI2 Reasoning Challenge (ARC)
• ARC is a refinement of previous science
reasoning challenge datasets proposed
by AI2
• Challenge dataset requires various types
of reasoning
• Some of them are multi-hop
Strong Baselines for ARC
• Challenge dataset was very difficult to
solve not only by the co-occurrence
baselines (IR, PMI), but also by state of the
art deep learning models from 2018
• BiDAF and Decomposable Attention are deep
learning models
• TableILP is symbolic, using integer linear
programming; DGEM is a mix of deep learning
and statistical/rules (OpenIE)
• Most models with very good performance on the
Easy set have poor results on the Challenge set
• No models significantly better than random
guess baseline
Two-Stage Inference Model
• Premise: Complex questions require models that should be able to
(partially) understand the context of the question and to perform
some kind of inference to determine the correct answer
• Two-stage model that combines an information retrieval (IR) engine
with several deep learning architectures (called solvers)
Two-Stage Inference Model – Stage 1
• Extract relevant contexts for each
(question, candidate answer) pair
using an IR engine
• Use Lucene for indexing and searching
English Wikipedia, science books
collected from CK-12, and ARC Corpus
• Term-based weighting for Lucene
using a semantic essentialness score
computed by a simple NN trained on
semantic and syntactic word features
(2.2k questions manually annotated
with term essentialness)
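The term-weighting idea of Stage 1 can be sketched in a few lines: each query term contributes to a document's score in proportion to its essentialness. This is only an illustration in plain Python, not the Lucene setup from the talk; the essentialness values are hypothetical (in the talk they come from a small NN), and `weighted_score` and `retrieve` are made-up helper names.

```python
import math
from collections import Counter

DOCS = [
    "a magnet attracts iron nails",
    "iron is a metal found in rocks",
    "students often ask which object a teacher uses",
]


def idf(term, docs):
    """Smoothed inverse document frequency."""
    df = sum(term in d.split() for d in docs)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


def weighted_score(query_weights, doc, docs):
    """Sum of tf * idf * essentialness over the query terms."""
    tf = Counter(doc.split())
    return sum(w * tf[t] * idf(t, docs) for t, w in query_weights.items())


def retrieve(query_weights, docs, k=2):
    """Return the k documents with the highest weighted score."""
    ranked = sorted(
        docs,
        key=lambda d: weighted_score(query_weights, d, docs),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical essentialness scores in [0, 1] for "which object attracts iron ... magnet".
query = {"which": 0.05, "object": 0.2, "magnet": 0.9, "attracts": 0.8, "iron": 0.7}
top = retrieve(query, DOCS, k=1)
```

Down-weighting filler words such as "which" keeps a document that merely repeats the question's phrasing from outranking one that mentions the essential terms.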
Two-Stage Inference Model – Stage 2
• Construct several (more complex) models to predict if an answer is
correct based on additional information inferred from the contexts
• Called solvers
• Several deep learning models fed with a (question, answer, context)
triplet and trained to predict the likelihood that the answer is correct
given the question and the current context
• Models pretrained on different NLP tasks and fine-tuned for multiple-
choice QA
• Ensemble model with a simple voting NN that computes the final
score
Two-Stage Inference Model - Solvers
• First solver computes a more efficient semantic
similarity using word embeddings and RNNs
• Adapted the Bidirectional Attention Flow (BiDAF)
architecture proposed for SQuAD to process (Q, A, C)
triplets
• Pre-trained on SQuAD v1.1, after transforming it into a
dataset suitable for multiple-choice QA by generating
wrong candidate answers
• Second solver employs neural models for natural
language inference (NLI)
• Reframe (Q, A, C) triples as NLI: Transform the pair
(Q, A) into an affirmative sentence that forms the
hypothesis. The context from the IR engine will act as
the premise.
• BiDAF architecture to perform NLI by modifying the
output layer to a 3-way softmax layer: entailment,
neutral, or contradiction
• Pre-trained on three large NLI datasets: SNLI, MultiNLI,
and SciTail
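To make the NLI solver's output concrete, the sketch below applies a 3-way softmax to per-candidate logits and picks the candidate whose hypothesis is most strongly entailed by its context. Using the entailment probability directly as the answer score is an assumption made for illustration; the logits are made-up numbers, and `pick_answer` is a hypothetical helper.

```python
import numpy as np

LABELS = ("entailment", "neutral", "contradiction")


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def pick_answer(nli_logits):
    """nli_logits has shape (num_candidates, 3), columns ordered as LABELS.

    Returns the index of the candidate with the highest entailment
    probability, together with the full probability matrix.
    """
    probs = softmax(nli_logits)
    return int(np.argmax(probs[:, 0])), probs


# Made-up logits for four candidate answers (one row per hypothesis).
logits = np.array([
    [0.2, 1.0, 0.5],
    [2.5, 0.3, -1.0],
    [-0.5, 0.4, 1.8],
    [0.1, 0.9, 0.2],
])
best, probs = pick_answer(logits)
```

In the two-stage model these per-solver scores are not used on their own; they are passed to the voting NN that produces the final score.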
Two-Stage Inference Model - Results
• The only model in early 2019 that obtained good performance for both
Challenge and Easy datasets
• 2nd place for Easy; 8th place for Challenge (but with no BERT and no symbolic)
• Possible improvements
• Using a better knowledge base to find candidate contexts
• Adding additional solvers (more powerful, e.g. BERT based)
Attentive Ranker (BERT)
Improve previous model
1. Introduce a self-attention based neural network, called Attentive
Ranker, that latently learns to rank documents (answering questions
by L2R) by their importance related to a given question, whilst
optimizing the objective of predicting the correct answer (L2R by
answering questions)
2. Adding several candidate contexts for each candidate answer
3. Use BERT to combine (Q, A) and all candidate contexts
Attentive Ranker: Answering Questions by L2R
• The Attentive Ranker latently learns to rank
supporting documents (contexts) for each
candidate answer at a semantic level
• Semantically rank the first N retrieved
documents vs. sort them by a lexical metric
(e.g. TF-IDF, BM25) => improves question
answering
• Computing if a document is relevant given a
(question, candidate answer) pair uses a set
of weak discriminators:
• Document Relevance Discriminator (DRD,
trained on modified SQuAD)
• Answer Verifier Discriminator (AVD, trained on
RACE)
• TF-IDF Discriminator
Attentive Ranker: L2R by Answering Questions
• The Attentive Ranker is trained to
predict the correct answer to a
question, given a set of top documents
supporting each candidate answer, in a
bootstrapping fashion
• In the forward pass, the model first
computes the document importance
scores, which are further used to predict
the correct answer.
• During backpropagation, the ranking
parameters are also optimized, latently
improving the L2R quality.
• In the next iteration, a better L2R
performance leads to more accurate
question answering.
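The forward pass described above can be sketched with NumPy. This is a heavily simplified stand-in for the Attentive Ranker, not the actual architecture: it assumes the weak discriminator scores are combined linearly by a weight vector `w`, that document importance is a softmax over those combined scores, and that a separate per-document evidence score (for example from a BERT-like scorer) is already available. All names and shapes are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attentive_ranker_forward(disc_scores, evidence, w):
    """Simplified forward pass.

    disc_scores: (A, D, K) scores from K weak discriminators for each of
                 D documents supporting each of A candidate answers.
    evidence:    (A, D) how strongly each document supports its candidate.
    w:           (K,) learnable weights that combine the discriminators.
    """
    relevance = disc_scores @ w                      # (A, D)
    importance = softmax(relevance, axis=1)          # document importance
    answer_logits = (importance * evidence).sum(axis=1)  # (A,)
    return softmax(answer_logits), importance


rng = np.random.default_rng(0)
A, D, K = 4, 5, 3  # candidates, documents per candidate, discriminators
probs, importance = attentive_ranker_forward(
    rng.normal(size=(A, D, K)),
    rng.normal(size=(A, D)),
    rng.normal(size=K),
)
```

Because the answer probabilities depend on `importance`, a cross-entropy loss on the correct answer also sends gradients into `w`; that is the sense in which ranking is learned latently from the QA objective.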
Attentive Ranker – Results
• The proposed model achieved 1st place for both Easy and Challenge datasets, at
the moment it was proposed
• Later, it was surpassed by BERT pretrained on larger datasets related to science
texts
• And by more powerful transformers, e.g. ALBERT
• Replacing TF-IDF/doc2vec sorted documents with our Attentive Ranker highly
improves the accuracy of various downstream decision models (e.g. BERT)
Attentive Ranker – Results
• Combining several weak discriminators improves accuracy
• Using multiple candidate documents is better (~20 for Easy, ~50 for Challenge)
Attentive Ranker (Multi)
• Add more powerful transformer-based discriminators
• XLNet, RoBERTa, ALBERT
• Their decisions are correlated, but only moderately
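Moderately correlated decisions are what makes combining discriminators worthwhile: models that err on different questions can correct each other. The sketch below shows one way to check this, using synthetic 0/1 correctness flags (not results from the talk) and placeholder model names.

```python
import numpy as np

# Synthetic per-question correctness flags; NOT results from the talk.
correct = {
    "model_a": np.array([1, 1, 0, 1, 0, 1, 1, 0]),
    "model_b": np.array([1, 0, 1, 1, 0, 1, 0, 1]),
    "model_c": np.array([0, 1, 1, 1, 1, 0, 1, 0]),
}
names = list(correct)
flags = np.stack([correct[n] for n in names])  # shape (models, questions)

# Pairwise Pearson correlation between the models' correctness vectors.
corr = np.corrcoef(flags)

# If at least two of the three models are right, they agree on the
# correct answer, so a majority vote is right as well.
majority_correct = (flags.sum(axis=0) >= 2).astype(int)

individual_acc = flags.mean(axis=1)
ensemble_acc = majority_correct.mean()
```

In this toy setup each model answers 5 of 8 questions correctly, while the majority is right on 6 of 8; the gain shrinks as the off-diagonal entries of `corr` approach 1.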
QA Going Further
• https://leaderboard.allenai.org/arc/submissions/public
• Finetune transformers on larger texts similar to the QA dataset?
• E.g. science; maybe simpler, but not very easy
• Adding more QA pairs in the dataset?
• Difficult, takes time and human annotators
• Humans are able to learn without looking at any QA pairs, only by reading texts
• Adversarial training?
• This seems to be the next technological advance for NLP
• E.g. FreeLB - https://arxiv.org/abs/1909.11764 (improves results on several applied
NLP tasks, e.g. QA, NLI, semantic similarity); accepted with maximum scores at ICLR
2020
• Previously, FreeAT obtained very good results for other QA tasks
• New ideas??? 
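The basic ingredient of adversarial training for NLP is a small, norm-bounded perturbation of the input embeddings in the direction that increases the loss. The sketch below shows a single such ascent step on a toy logistic model; it is not FreeLB, which takes several ascent steps and accumulates parameter gradients along the way. The weights, embedding and `eps` are arbitrary illustrative values.

```python
import numpy as np


def logistic_loss(w, x, y):
    """Binary logistic loss for a label y in {-1, +1}."""
    return float(np.log1p(np.exp(-y * (w @ x))))


def input_gradient(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * (w @ x)
    return -y * w / (1.0 + np.exp(margin))


def adversarial_example(w, x, y, eps=0.1):
    """One ascent step of length eps along the normalised input gradient."""
    g = input_gradient(w, x, y)
    return x + eps * g / (np.linalg.norm(g) + 1e-12)


w = np.array([0.5, -1.0, 2.0])   # toy model weights
x = np.array([1.0, 0.2, 0.3])    # toy "embedding" of one input
y = 1

x_adv = adversarial_example(w, x, y)
clean_loss = logistic_loss(w, x, y)
adv_loss = logistic_loss(w, x_adv, y)
```

Adversarial training then minimises the loss on `x_adv` (often together with the clean loss), which pushes the model to be stable in a small neighbourhood of each training input.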
Conclusions
• Question Answering comes in various flavors
• Deep learning models for text representation (esp. RNNs, transformers) have improved results for
all datasets / tasks
• Human-level performance is still far off for most tasks
• For some simpler datasets (e.g. SQuAD), there is a claim of surpassing human performance
• For more complex datasets (e.g. ARC, MultihopQA) that require (some) reasoning, top solutions are still (far)
below human performance
• For small datasets, performance is quite poor
• Open QA is also particularly hard because we still rely on an IR engine to get supporting
documents (candidate contexts)
• Improve this component by adding new terms to the question (maybe use Reinforcement learning for this?)
• Interesting results from adversarial training for NLP
• More on QA progress: http://nlpprogress.com/english/question_answering.html
Thank you!
traian.rebedea@cs.pub.ro
