Strong Baselines for Neural Semi-supervised Learning under Domain Shift
Sebastian Ruder, Barbara Plank
Learning under Domain Shift
‣ State-of-the-art domain adaptation approaches
  ‣ leverage task-specific features
  ‣ evaluate on proprietary datasets or on a single benchmark
  ‣ only compare against weak baselines
‣ Almost none evaluate against approaches from the extensive semi-supervised learning (SSL) literature
Revisiting Semi-Supervised Learning Classics in a Neural World
‣ How do classics in SSL compare to recent advances?
‣ Can we combine the best of both worlds?
‣ How well do these approaches work on out-of-distribution data?
Bootstrapping algorithms
• Self-training
• (Co-training)
• Tri-training
• Tri-training with disagreement

Self-training
1. Train a model on labeled data.
2. Use confident predictions on unlabeled data as training examples. Repeat.
− Drawback: error amplification
Self-training variants
‣ Calibration
  ‣ Output probabilities in neural networks are poorly calibrated.
  ‣ Throttling (Abney, 2007), i.e. selecting the top n highest-confidence unlabeled examples, works best.
‣ Online learning
  ‣ Training until convergence on labeled data and then on unlabeled data works best.
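The loop above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a toy nearest-centroid classifier stands in for a real neural model, and all names (CentroidClassifier, self_train) are illustrative.

```python
import numpy as np

class CentroidClassifier:
    """Toy stand-in for a real model: nearest class centroid, with a
    softmax over negative distances as a (not necessarily well
    calibrated) confidence score."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict_proba(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        e = np.exp(-d)
        return e / e.sum(axis=1, keepdims=True)
    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]

def self_train(X_lab, y_lab, X_unlab, n_per_round=2, rounds=3):
    """Self-training with throttling: each round, move only the top-n
    most confident unlabeled examples into the training set."""
    model = CentroidClassifier()
    pool = X_unlab
    for _ in range(rounds):
        model.fit(X_lab, y_lab)                # train on current labeled set
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        conf = proba.max(axis=1)               # confidence of predicted label
        top = np.argsort(-conf)[:n_per_round]  # throttling: keep top-n only
        X_lab = np.vstack([X_lab, pool[top]])
        y_lab = np.concatenate([y_lab, model.classes_[proba[top].argmax(axis=1)]])
        pool = np.delete(pool, top, axis=0)
    return model

# toy 1-D data: class 0 around -2, class 1 around +2
X_l = np.array([[-2.0], [-1.5], [1.5], [2.0]])
y_l = np.array([0, 0, 1, 1])
X_u = np.array([[-2.2], [-1.8], [-2.6], [1.8], [2.2], [2.6], [-1.2], [1.2]])
clf = self_train(X_l, y_l, X_u)
print(clf.predict(np.array([[-3.0], [3.0]])))
```

Without the throttling step (i.e. adding every prediction above a threshold), poorly calibrated confidences would let many wrong pseudo-labels in at once, which is the error amplification the slides warn about.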
Tri-training
1. Train three models on bootstrapped samples.
2. Use predictions on unlabeled data for the third model if the other two agree.
3. Final prediction: majority voting
(Diagram: two models predict y = 1 for an unlabeled x while the third predicts y = 0, so x is added with pseudo-label y = 1 to the third model's training data.)
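The three steps above can be sketched as follows; this is a rough illustration, with a toy 1-D threshold model standing in for the paper's neural models, and all names (ThresholdModel, tri_train, predict_majority) are illustrative, not from the authors' code.

```python
import numpy as np

class ThresholdModel:
    """Toy binary classifier for 1-D inputs: threshold at the midpoint
    of the two class means (a stand-in for a real neural model)."""
    def fit(self, X, y):
        self.t_ = (X[y == 0].mean() + X[y == 1].mean()) / 2
        return self
    def predict(self, X):
        return (np.ravel(X) > self.t_).astype(int)

def tri_train(X, y, X_unlab, rounds=2, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(3):                     # 1. three bootstrap samples
        idx = rng.integers(0, len(X), len(X))
        while len(np.unique(y[idx])) < 2:  # re-draw if a class is missing
            idx = rng.integers(0, len(X), len(X))
        models.append(ThresholdModel().fit(X[idx], y[idx]))
    for _ in range(rounds):
        preds = np.array([m.predict(X_unlab) for m in models])
        for k in range(3):
            i, j = [m for m in range(3) if m != k]
            agree = preds[i] == preds[j]   # 2. the other two agree
            if agree.any():                # retrain k on labeled + pseudo-labeled
                Xk = np.vstack([X, X_unlab[agree]])
                yk = np.concatenate([y, preds[i][agree]])
                models[k] = ThresholdModel().fit(Xk, yk)
    return models

def predict_majority(models, X):
    """3. Final prediction: majority vote of the three models."""
    votes = np.array([m.predict(X) for m in models])
    return (votes.sum(axis=0) >= 2).astype(int)

X = np.array([[-2.5], [-2.0], [-1.5], [1.5], [2.0], [2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
X_u = np.array([[-2.2], [-1.8], [1.8], [2.2]])
models = tri_train(X, y, X_u)
print(predict_majority(models, np.array([[-3.0], [3.0]])))
```

The bootstrap samples give the three models different views of the data, so agreement between two of them is a (weak) signal that a pseudo-label is trustworthy for the third.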
Tri-training with disagreement
1. Train three models on bootstrapped samples.
2. Use predictions on unlabeled data for the third model if the other two agree and the third model's own prediction differs.
(Diagram: two models predict y = 1 for an unlabeled x while the third predicts y = 0, so x is added to the third model's training data.)
− Drawback: 3 independent models
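The only change from plain tri-training is the selection rule; a minimal sketch of that rule (function and variable names are illustrative, not from the authors' code):

```python
import numpy as np

def select_for_model(preds, k):
    """Tri-training-with-disagreement selection for model k, given
    preds: a (3, n) array of the three models' predictions on n
    unlabeled examples. An example is added to model k's training set
    only if the other two models agree AND model k's own prediction
    differs, so k only receives examples it currently gets wrong."""
    i, j = [m for m in range(3) if m != k]
    agree = preds[i] == preds[j]
    differs = preds[k] != preds[i]
    return agree & differs

preds = np.array([[1, 1, 0, 1],
                  [1, 0, 0, 1],
                  [0, 1, 0, 1]])  # rows: models m1..m3; columns: unlabeled examples
print(select_for_model(preds, k=2))  # only the first example qualifies for m3
```

The intuition is that examples all three models already agree on add no new information, so restricting to disagreements adds less redundant pseudo-labeled data per round.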
Tri-training hyper-parameters
‣ Sampling unlabeled data
  ‣ Producing predictions for all unlabeled examples is expensive.
  ‣ Instead, sample a number of unlabeled examples.
‣ Confidence thresholding
  ‣ Not effective for the classic approaches, but essential for our method.
Multi-task tri-training (MT-Tri)
1. Train one model with 3 objective functions.
2. Use predictions on unlabeled data for the third objective if the other two agree.
3. Restrict the final layers to use different representations.
4. Train the third objective function only on pseudo-labeled data to bridge the domain shift.
Multi-task Tri-training
(Figure: a BiLSTM tagger with character and word representations for tokens w1, w2, w3; each token's hidden state feeds three output layers m1, m2, m3, with an orthogonality constraint (Bousmalis et al., 2016) between the output layers. Tagger architecture as in Plank et al., 2016.)
Loss:
  L_orth = ‖W_m1⊤ W_m2‖²_F
  L(θ) = −Σ_i Σ_{n=1..3} log P_m_n(y | h⃗) + γ L_orth
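The orthogonality term can be computed as below; this is a minimal NumPy sketch under the slide's definition, not the authors' implementation, and the function name is illustrative.

```python
import numpy as np

def orthogonality_penalty(W_m1, W_m2):
    """L_orth = ||W_m1^T W_m2||_F^2: the squared Frobenius norm of the
    product of two output layers' weight matrices. It is zero exactly
    when the layers' weight columns are mutually orthogonal, i.e. the
    layers use non-overlapping representations (Bousmalis et al., 2016)."""
    return float(np.sum((W_m1.T @ W_m2) ** 2))

# orthogonal weight columns -> no penalty
W1 = np.array([[1.0], [0.0]])
W2 = np.array([[0.0], [1.0]])
print(orthogonality_penalty(W1, W2))  # 0.0
# identical columns -> positive penalty, discouraging shared directions
print(orthogonality_penalty(W1, W1))  # 1.0
```

During training this penalty is added to the tagging loss with weight γ, as in the slide's L(θ), pushing m1 and m2 toward different representations of the shared BiLSTM features.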
Data & Tasks
Two tasks and their domains:
‣ Sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2006)
‣ POS tagging on the SANCL 2012 dataset (Petrov and McDonald, 2012)
Sentiment Analysis Results
(Chart: accuracy, averaged over 4 target domains, for VFAE*, DANN*, Asym*, Source only, Self-training, Tri-training, Tri-training-Disagr., and MT-Tri. * result from Saito et al., 2017)
‣ Multi-task tri-training slightly outperforms tri-training, but has higher variance.
POS Tagging Results
Trained on 10% labeled data (WSJ)
(Chart: accuracy, averaged over 5 target domains, for Source (+embeds), Self-training, Tri-training, Tri-training-Disagr., and MT-Tri.)
‣ Tri-training with disagreement works best with little data.
POS Tagging Results
Trained on full labeled data (WSJ)
(Chart: accuracy, averaged over 5 target domains, for TnT, Stanford*, Source (+embeds), Tri-training, Tri-training-Disagr., and MT-Tri. * result from Schnabel & Schütze, 2014)
‣ Tri-training works best in the full data setting.
POS Tagging Analysis
(Chart: accuracy on out-of-vocabulary (OOV) tokens and % OOV tokens per domain: Answers, Emails, Newsgroups, Reviews, Weblogs; methods: Src, Tri, MT-Tri.)
‣ Classic tri-training works best on OOV tokens.
‣ MT-Tri does worse than the source-only baseline on OOV.
POS Tagging Analysis
(Chart: POS accuracy delta vs. the source-only baseline per binned log frequency, bins 0 to 14, for MT-Tri and Tri.)
‣ Tri-training works best on low-frequency tokens (leftmost bins).
POS Tagging Analysis
(Chart: accuracy on unknown word-tag (UWT) tokens and UWT rate per domain: Answers, Emails, Newsgroups, Reviews, Weblogs; methods: Src, Tri, MT-Tri, FLORS*. * result from Schnabel & Schütze, 2014)
‣ No bootstrapping method works well on unknown word-tag combinations (very difficult cases).
‣ The less lexicalized FLORS approach is superior.
Takeaways
‣ Classic tri-training works best: it outperforms recent state-of-the-art methods for sentiment analysis.
‣ We address the drawback of tri-training (space & time complexity) via the proposed MT-Tri model.
‣ MT-Tri works best on sentiment, but not for POS.
‣ Importance of:
  ‣ comparing neural methods to classics (strong baselines)
  ‣ evaluation on multiple tasks & domains