Strong Baselines for Neural Semi-supervised Learning under Domain Shift
Sebastian Ruder, Barbara Plank
Learning under Domain Shift
‣ State-of-the-art domain adaptation approaches
  ‣ leverage task-specific features
  ‣ evaluate on proprietary datasets or on a single benchmark
  ‣ only compare against weak baselines
‣ Almost none evaluate against approaches from the extensive semi-supervised learning (SSL) literature
Revisiting Semi-Supervised Learning Classics in a Neural World
‣ How do classics in SSL compare to recent advances?
‣ Can we combine the best of both worlds?
‣ How well do these approaches work on out-of-distribution data?
Bootstrapping algorithms
• Self-training
• (Co-training)
• Tri-training
• Tri-training with disagreement

Self-training
1. Train a model on labeled data.
2. Use confident predictions on unlabeled data as training examples. Repeat.
− Drawback: error amplification
Self-training variants
‣ Calibration
  ‣ Output probabilities in neural networks are poorly calibrated.
  ‣ Throttling (Abney, 2007), i.e. selecting the top n highest-confidence unlabeled examples, works best.
‣ Online learning
  ‣ Training until convergence on labeled data and then on unlabeled data works best.
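The loop above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a toy nearest-centroid classifier stands in for a real neural model, and all names (CentroidClassifier, self_train) are illustrative.

```python
import numpy as np

class CentroidClassifier:
    """Toy stand-in for a real model: nearest class centroid, with a
    softmax over negative distances as a (not necessarily well
    calibrated) confidence score."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict_proba(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        e = np.exp(-d)
        return e / e.sum(axis=1, keepdims=True)
    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]

def self_train(X_lab, y_lab, X_unlab, n_per_round=2, rounds=3):
    """Self-training with throttling: each round, move only the top-n
    most confident unlabeled examples into the training set."""
    model = CentroidClassifier()
    pool = X_unlab
    for _ in range(rounds):
        model.fit(X_lab, y_lab)                # train on current labeled set
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        conf = proba.max(axis=1)               # confidence of predicted label
        top = np.argsort(-conf)[:n_per_round]  # throttling: keep top-n only
        X_lab = np.vstack([X_lab, pool[top]])
        y_lab = np.concatenate([y_lab, model.classes_[proba[top].argmax(axis=1)]])
        pool = np.delete(pool, top, axis=0)
    return model

# toy 1-D data: class 0 around -2, class 1 around +2
X_l = np.array([[-2.0], [-1.5], [1.5], [2.0]])
y_l = np.array([0, 0, 1, 1])
X_u = np.array([[-2.2], [-1.8], [-2.6], [1.8], [2.2], [2.6], [-1.2], [1.2]])
clf = self_train(X_l, y_l, X_u)
print(clf.predict(np.array([[-3.0], [3.0]])))
```

Without the throttling step (i.e. adding every prediction above a threshold), poorly calibrated confidences would let many wrong pseudo-labels in at once, which is the error amplification the slides warn about.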
Tri-training
1. Train three models on bootstrapped samples.
2. Use predictions on unlabeled data for the third model if the other two agree.
3. Final prediction: majority voting
(Diagram: two models predict y = 1 for an unlabeled x while the third predicts y = 0, so x is added with pseudo-label y = 1 to the third model's training data.)
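The three steps above can be sketched as follows; this is a rough illustration, with a toy 1-D threshold model standing in for the paper's neural models, and all names (ThresholdModel, tri_train, predict_majority) are illustrative, not from the authors' code.

```python
import numpy as np

class ThresholdModel:
    """Toy binary classifier for 1-D inputs: threshold at the midpoint
    of the two class means (a stand-in for a real neural model)."""
    def fit(self, X, y):
        self.t_ = (X[y == 0].mean() + X[y == 1].mean()) / 2
        return self
    def predict(self, X):
        return (np.ravel(X) > self.t_).astype(int)

def tri_train(X, y, X_unlab, rounds=2, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(3):                     # 1. three bootstrap samples
        idx = rng.integers(0, len(X), len(X))
        while len(np.unique(y[idx])) < 2:  # re-draw if a class is missing
            idx = rng.integers(0, len(X), len(X))
        models.append(ThresholdModel().fit(X[idx], y[idx]))
    for _ in range(rounds):
        preds = np.array([m.predict(X_unlab) for m in models])
        for k in range(3):
            i, j = [m for m in range(3) if m != k]
            agree = preds[i] == preds[j]   # 2. the other two agree
            if agree.any():                # retrain k on labeled + pseudo-labeled
                Xk = np.vstack([X, X_unlab[agree]])
                yk = np.concatenate([y, preds[i][agree]])
                models[k] = ThresholdModel().fit(Xk, yk)
    return models

def predict_majority(models, X):
    """3. Final prediction: majority vote of the three models."""
    votes = np.array([m.predict(X) for m in models])
    return (votes.sum(axis=0) >= 2).astype(int)

X = np.array([[-2.5], [-2.0], [-1.5], [1.5], [2.0], [2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
X_u = np.array([[-2.2], [-1.8], [1.8], [2.2]])
models = tri_train(X, y, X_u)
print(predict_majority(models, np.array([[-3.0], [3.0]])))
```

The bootstrap samples give the three models different views of the data, so agreement between two of them is a (weak) signal that a pseudo-label is trustworthy for the third.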
Tri-training with disagreement
1. Train three models on bootstrapped samples.
2. Use predictions on unlabeled data for the third model if the other two agree and the third model's own prediction differs.
(Diagram: two models predict y = 1 for an unlabeled x while the third predicts y = 0, so x is added to the third model's training data.)
− Drawback: 3 independent models
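The only change from plain tri-training is the selection rule; a minimal sketch of that rule (function and variable names are illustrative, not from the authors' code):

```python
import numpy as np

def select_for_model(preds, k):
    """Tri-training-with-disagreement selection for model k, given
    preds: a (3, n) array of the three models' predictions on n
    unlabeled examples. An example is added to model k's training set
    only if the other two models agree AND model k's own prediction
    differs, so k only receives examples it currently gets wrong."""
    i, j = [m for m in range(3) if m != k]
    agree = preds[i] == preds[j]
    differs = preds[k] != preds[i]
    return agree & differs

preds = np.array([[1, 1, 0, 1],
                  [1, 0, 0, 1],
                  [0, 1, 0, 1]])  # rows: models m1..m3; columns: unlabeled examples
print(select_for_model(preds, k=2))  # only the first example qualifies for m3
```

The intuition is that examples all three models already agree on add no new information, so restricting to disagreements adds less redundant pseudo-labeled data per round.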
Tri-training hyper-parameters
‣ Sampling unlabeled data
  ‣ Producing predictions for all unlabeled examples is expensive.
  ‣ Instead, sample a number of unlabeled examples.
‣ Confidence thresholding
  ‣ Not effective for the classic approaches, but essential for our method.
Multi-task tri-training (MT-Tri)
1. Train one model with 3 objective functions.
2. Use predictions on unlabeled data for the third objective if the other two agree.
3. Restrict the final layers to use different representations.
4. Train the third objective function only on pseudo-labeled data to bridge the domain shift.
Multi-task Tri-training
(Figure: a BiLSTM tagger with character and word representations for tokens w1, w2, w3; each token's hidden state feeds three output layers m1, m2, m3, with an orthogonality constraint (Bousmalis et al., 2016) between the output layers. Tagger architecture as in Plank et al., 2016.)
Loss:
  L_orth = ‖W_m1⊤ W_m2‖²_F
  L(θ) = −Σ_i Σ_{n=1..3} log P_m_n(y | h⃗) + γ L_orth
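The orthogonality term can be computed as below; this is a minimal NumPy sketch under the slide's definition, not the authors' implementation, and the function name is illustrative.

```python
import numpy as np

def orthogonality_penalty(W_m1, W_m2):
    """L_orth = ||W_m1^T W_m2||_F^2: the squared Frobenius norm of the
    product of two output layers' weight matrices. It is zero exactly
    when the layers' weight columns are mutually orthogonal, i.e. the
    layers use non-overlapping representations (Bousmalis et al., 2016)."""
    return float(np.sum((W_m1.T @ W_m2) ** 2))

# orthogonal weight columns -> no penalty
W1 = np.array([[1.0], [0.0]])
W2 = np.array([[0.0], [1.0]])
print(orthogonality_penalty(W1, W2))  # 0.0
# identical columns -> positive penalty, discouraging shared directions
print(orthogonality_penalty(W1, W1))  # 1.0
```

During training this penalty is added to the tagging loss with weight γ, as in the slide's L(θ), pushing m1 and m2 toward different representations of the shared BiLSTM features.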
Data & Tasks
Two tasks and their domains:
‣ Sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2006)
‣ POS tagging on the SANCL 2012 dataset (Petrov and McDonald, 2012)
Sentiment Analysis Results
(Chart: accuracy, averaged over 4 target domains, for VFAE*, DANN*, Asym*, Source only, Self-training, Tri-training, Tri-training-Disagr., and MT-Tri. * result from Saito et al., 2017)
‣ Multi-task tri-training slightly outperforms tri-training, but has higher variance.
POS Tagging Results
Trained on 10% labeled data (WSJ)
(Chart: accuracy, averaged over 5 target domains, for Source (+embeds), Self-training, Tri-training, Tri-training-Disagr., and MT-Tri.)
‣ Tri-training with disagreement works best with little data.
POS Tagging Results
Trained on full labeled data (WSJ)
(Chart: accuracy, averaged over 5 target domains, for TnT, Stanford*, Source (+embeds), Tri-training, Tri-training-Disagr., and MT-Tri. * result from Schnabel & Schütze, 2014)
‣ Tri-training works best in the full data setting.
POS Tagging Analysis
(Chart: accuracy on out-of-vocabulary (OOV) tokens and % OOV tokens per domain: Answers, Emails, Newsgroups, Reviews, Weblogs; methods: Src, Tri, MT-Tri.)
‣ Classic tri-training works best on OOV tokens.
‣ MT-Tri does worse than the source-only baseline on OOV.
POS Tagging Analysis
(Chart: POS accuracy delta vs. the source-only baseline per binned log frequency, bins 0 to 14, for MT-Tri and Tri.)
‣ Tri-training works best on low-frequency tokens (leftmost bins).
POS Tagging Analysis
(Chart: accuracy on unknown word-tag (UWT) tokens and UWT rate per domain: Answers, Emails, Newsgroups, Reviews, Weblogs; methods: Src, Tri, MT-Tri, FLORS*. * result from Schnabel & Schütze, 2014)
‣ No bootstrapping method works well on unknown word-tag combinations (very difficult cases).
‣ The less lexicalized FLORS approach is superior.
Takeaways
‣ Classic tri-training works best: it outperforms recent state-of-the-art methods for sentiment analysis.
‣ We address the drawback of tri-training (space & time complexity) via the proposed MT-Tri model.
‣ MT-Tri works best on sentiment, but not for POS.
‣ Importance of:
  ‣ comparing neural methods to classics (strong baselines)
  ‣ evaluation on multiple tasks & domains