Neural Mask Generator:
Learning to Generate Adaptive Word
Maskings for Language Model Adaptation
Minki Kang1*, Moonsu Han1*, and Sung Ju Hwang1,2
KAIST1, Daejeon, South Korea
AITRICS2, Seoul, South Korea
Background
The recent success of neural language models is built on the scheme of
pre-train once, fine-tune everywhere.
[Devlin et al. 19] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
Background
Recent Language Models (LMs) are pre-trained on large and heterogeneous
datasets.
[Figure: a general dataset (e.g. Wikipedia) is used for initial pre-training; further pre-training then adapts the LM to a specific-domain dataset]
Some works propose further pre-training for LM adaptation.
[Beltagy et al. 19] SciBERT: A Pretrained Language Model for Scientific Text, EMNLP 2019.
[Lee et al. 20] BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 2020.
[Gururangan et al. 20] Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks, ACL 2020.
Background
The Masked Language Modeling (MLM) objective has been shown to be effective
for language model pre-training.
[Original] A myocardial infarction, also known as a heart attack, occurs when blood flow decreases.
[Model Input] A myocardial infarction, also known as a [MASK] attack, occurs when blood flow decreases.
[Model Output] heart
[Devlin et al. 19] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
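To make the objective concrete, here is a minimal sketch of MLM inference using the HuggingFace transformers library (an illustration on top of the cited BERT model, not the authors' code):

```python
# Minimal sketch of the MLM objective with HuggingFace transformers
# (illustrative; not the original slide/paper code).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = ("A myocardial infarction, also known as a [MASK] attack, "
        "occurs when blood flow decreases.")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and decode the model's top prediction.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(-1)))  # expected: "heart"
```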
Motivation
Will it be effective to further train the pre-trained language model on a
domain-specific corpus using randomly generated masks?
[Figure: the sentence "A myocardial infarction, also known as a heart attack, occurs when blood flow decreases." passes through the Language Model; some words (e.g. "heart") are important for the domain while others are trivial, so random masks are not equally useful]
Motivation
Although several heuristic masking policies have been proposed, none
is clearly superior to the others.
Original: A myo ##car ##dial in ##farc ##tion occurs when blood flow ...
Whole-word: A [MASK] [MASK] [MASK] in ##farc ##tion occurs when blood flow ...
Span: A myo ##car ##dial in ##farc [MASK] [MASK] [MASK] blood flow ...
Random: A myo [MASK] ##dial [MASK] ##farc ##tion occurs when [MASK] flow ...
In this work, we propose to generate the masks adaptively for the
given domain, by learning the optimal masking policy.
[Joshi et al. 20] SpanBERT: Improving Pre-training by Representing and Predicting Spans, TACL 2020.
[Sun et al. 19] Enhanced Representation through Knowledge Integration, arXiv 2019.
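For concreteness, here are toy re-implementations of the heuristic policies above over WordPiece tokens (a sketch; the mask counts and span length are assumptions, not the cited papers' settings):

```python
# Toy sketches of the heuristic masking policies (illustrative only).
import random

tokens = ["A", "myo", "##car", "##dial", "in", "##farc", "##tion",
          "occurs", "when", "blood", "flow"]

def random_masking(tokens, n=3):
    # Mask n token positions chosen uniformly at random.
    idx = set(random.sample(range(len(tokens)), n))
    return [("[MASK]" if i in idx else t) for i, t in enumerate(tokens)]

def whole_word_masking(tokens, n_words=1):
    # Group sub-word pieces (marked "##") into whole words, then mask
    # every piece of the chosen words.
    starts = [i for i, t in enumerate(tokens) if not t.startswith("##")]
    chosen = set(random.sample(starts, n_words))
    out, masking = [], False
    for i, t in enumerate(tokens):
        if not t.startswith("##"):
            masking = i in chosen
        out.append("[MASK]" if masking else t)
    return out

def span_masking(tokens, span_len=3):
    # Mask one contiguous span of tokens.
    start = random.randrange(len(tokens) - span_len + 1)
    return [("[MASK]" if start <= i < start + span_len else t)
            for i, t in enumerate(tokens)]

print(random_masking(tokens))
print(whole_word_masking(tokens))
print(span_masking(tokens))
```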
Motivation
Our objective is to find the task-dependent masking policy via a
learnable mask generator.
Problem Formulation
Masked Language Model
[Figure: an unannotated text corpus is turned into a masked text corpus by inserting [MASK] tokens; the language model parameters are trained to recover the original context from the masked context]
Problem Formulation
Masked Language Model
Masked Context: A myo [MASK] ##dial [MASK] ##farc ##tion occurs when [MASK] flow ...
Original Context: A myo ##car ##dial in ##farc ##tion occurs when blood flow ...
Words (Tokens): $w_1 = \text{A}$, $w_2 = \text{myo}$, $w_3 = \text{\#\#car}$, $w_4 = \text{\#\#dial}$, $w_5 = \text{in}$, $w_6 = \text{\#\#farc}$, $w_7 = \text{\#\#tion}$, ...
Mask indicators:
$$z_i = \begin{cases} 1, & \text{if the } i\text{-th word is masked} \\ 0, & \text{otherwise} \end{cases}$$
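The mask variables map directly to code; a minimal sketch of applying a given $z$ to the token sequence:

```python
# Applying given mask indicators z_i to the token sequence above.
tokens = ["A", "myo", "##car", "##dial", "in", "##farc", "##tion",
          "occurs", "when", "blood", "flow"]
z = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0]  # z_i = 1 iff the i-th token is masked

masked = ["[MASK]" if zi else w for w, zi in zip(tokens, z)]
print(" ".join(masked))
# -> A myo [MASK] ##dial [MASK] ##farc ##tion occurs when [MASK] flow
```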
Problem Formulation
Bi-level formulation: Masking
[Figure: an arbitrary function parameterized by $\lambda$ assigns each word $i = 1, \dots, N$ (e.g. ##car, in) a probability of being masked; sampling from these probabilities yields the list of word indices to be masked, e.g. {3, 5, 10}]
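A small sketch of this masking step, with a random score vector standing in for the outputs of the learned function parameterized by $\lambda$:

```python
# Sampling mask indices from per-token probabilities; the random scores
# below stand in for the learned generator's outputs (illustrative).
import torch

def sample_mask_indices(token_scores, num_masks):
    probs = torch.softmax(token_scores, dim=-1)  # masking prob. per word
    idx = torch.multinomial(probs, num_masks, replacement=False)
    return sorted(idx.tolist())

scores = torch.randn(11)               # stand-in generator outputs
print(sample_mask_indices(scores, 3))  # e.g. [3, 5, 10]
```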
Problem Formulation
Bi-level formulation: Further Pre-Training (Inner Loop)
[Figure: the inner loop further pre-trains the language model on contexts masked by the function parameterized by $\lambda$, yielding a further pre-trained language model]
Problem Formulation
Bi-level formulation: Fine-tuning on the task (Inner Loop)
[Figure: the further pre-trained LM is fine-tuned as a solver model for the downstream task by minimizing the supervised learning loss on the training dataset]
Problem Formulation
Bi-level formulation: Outer-level objective (Outer Loop)
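Putting the three pieces together, a schematic sketch of one bi-level episode (the stub functions are placeholders, not the paper's implementation):

```python
# Schematic sketch of one bi-level episode; the stubs below stand in for
# real further pre-training, fine-tuning, and evaluation.
import random

def further_pretrain(lm, masked_corpus):
    return lm  # stub: would run MLM training on the masked corpus

def finetune(lm, train_set):
    return lm  # stub: would train a task-specific solver on top of the LM

def evaluate(solver, test_set):
    return random.random()  # stub: would return test accuracy

def bilevel_episode(mask_policy, pretrained_lm, corpus, task):
    # Inner loop, step 1: further pre-train on contexts masked by the policy.
    masked_corpus = [mask_policy(context) for context in corpus]
    lm = further_pretrain(pretrained_lm, masked_corpus)
    # Inner loop, step 2: fine-tune the solver on the downstream task.
    solver = finetune(lm, task["train"])
    # Outer loop: the held-out performance is the signal used to update
    # the masking function's parameters (lambda in the slides).
    return evaluate(solver, task["test"])
```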
Problem Formulation
Reinforcement learning formulation
Policy: the probability of each word $i = 1, \dots, N$ in the input context (A myo ##car ##dial in ##farc ##tion occurs when blood flow ...) being masked.
Actions: the sampled mask positions.
Reward: $R$ = the accuracy on the test set.
Problem Formulation
Reinforcement learning formulation
The probability of masking $T$ tokens factorizes into per-step transition probabilities.
Example (MDP):
t=1: The cat is cute .
t=2: The [MASK] is cute .
t=3: The [MASK] is [MASK] .
Example (Approximation): The cat is cute . → The [MASK] is [MASK] .
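A toy illustration of the two views (the masked positions are chosen at random here purely for illustration):

```python
# MDP view (one mask per step) vs. one-shot approximation (sample all
# T positions at once); positions are random for illustration only.
import random

tokens = ["The", "cat", "is", "cute", "."]
T = 2

# MDP view: each transition masks one more token.
state = list(tokens)
for t in range(T):
    i = random.choice([j for j, w in enumerate(state) if w != "[MASK]"])
    state[i] = "[MASK]"
    print(f"t={t + 2}:", " ".join(state))

# Approximation: sample T positions in a single step.
idx = set(random.sample(range(len(tokens)), T))
print("one-shot:", " ".join("[MASK]" if j in idx else w
                            for j, w in enumerate(tokens)))
```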
Neural Mask Generator
Neural Mask Generator
Training objective
1. Advantage Actor-Critic
2. Off-Policy learning with Prioritized Experience Replay
3. Importance Sampling
Neural Mask Generator
Training objective
[Equation: the policy gradient is estimated from sampled replays, with an entropy regularization term]
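A sketch of what such an objective can look like in PyTorch, combining an advantage actor-critic loss, importance weights for replayed off-policy samples, and an entropy bonus (the exact form and the coefficient are assumptions, not taken from the slides):

```python
# Illustrative off-policy advantage actor-critic loss with importance
# weighting and entropy regularization (assumed form, not the paper's code).
import torch
import torch.nn.functional as F

def nmg_loss(log_pi_new, log_pi_old, values, rewards, entropy,
             ent_coef=0.01):
    # Advantage of the taken actions under the critic's value estimate.
    advantage = (rewards - values).detach()
    # Importance weights correct for replays drawn from an older policy.
    rho = torch.exp(log_pi_new - log_pi_old).detach()
    policy_loss = -(rho * advantage * log_pi_new).mean()
    critic_loss = F.mse_loss(values, rewards)
    # The entropy bonus keeps the masking distribution from collapsing.
    return policy_loss + critic_loss - ent_coef * entropy.mean()
```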
Neural Mask Generator
Some practical problems remain for reinforcement learning.
1. Using the full size of dataset in the inner loop is not feasible.
2. The test dataset is unobservable during training step.
Sample
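A sketch of both workarounds: subsample the corpus for the inner loop, and hold out part of the training set as a proxy for the unseen test set (the helper name and split fractions are illustrative assumptions):

```python
# Building a sampled sub-task: a corpus subsample plus a held-out proxy
# test split (fractions and naming are illustrative).
import random

def make_subtask(corpus, train_set, corpus_frac=0.1, heldout_frac=0.2):
    sub_corpus = random.sample(corpus, int(len(corpus) * corpus_frac))
    shuffled = random.sample(train_set, len(train_set))
    n_held = int(len(shuffled) * heldout_frac)
    proxy_test, sub_train = shuffled[:n_held], shuffled[n_held:]
    return sub_corpus, sub_train, proxy_test
```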
Neural Mask Generator
The NMG model encounters a different sub-task at every new episode.
[Figure: episodes 1 and 2 start from the same pre-trained language model (BERT) but sample different sub-tasks, so their accuracies (e.g. 0.6 vs. 0.35) are not directly comparable]
Neural Mask Generator
We introduce the random policy as an opponent policy.
[Figure: in each episode, the learned policy's accuracy is compared with the random opponent policy's accuracy on the same sub-task, e.g. 0.6 vs. 0.54 in episode 1 and 0.35 vs. 0.4 in episode 2]
Neural Mask Generator
We add another neural policy to induce the Self-Play.
[Figure: the neural player policy competes against both a neural opponent policy and a random opponent policy on the same sub-task (e.g. accuracies 0.62 vs. 0.6 vs. 0.54), with partially overlapping action sets such as a = {1, 5, 7}, {1, 5, 9}, and {4, 5, 7}; outperforming the opponents yields a positive reward, losing yields a negative one; see the sketch below]
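A minimal sketch of the resulting relative reward, assuming a simple sign-based scheme for illustration:

```python
# Sign-based relative reward against an opponent policy on the same
# sub-task (an illustration of the idea, not the paper's exact reward).
def relative_reward(acc_player, acc_opponent):
    if acc_player > acc_opponent:
        return 1.0   # the learned masking helped more than the opponent's
    if acc_player < acc_opponent:
        return -1.0
    return 0.0

print(relative_reward(0.62, 0.54))  # 1.0 (positive reward)
```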
Neural Mask Generator
In each episode, the language model for each policy is initialized.
[Figure: in every episode (1, 2, ...), the language model is re-initialized before further pre-training, fine-tuning, and evaluation; other policies are omitted for brevity]
Neural Mask Generator
Continual adaptation: instead, load the LM from the former episode.
[Figure: the LM is initialized only in episode 1; episode 2 loads the further pre-trained LM from episode 1 before fine-tuning and evaluation; other policies are omitted for brevity; see the sketch below]
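A sketch of the difference between the two schemes (the checkpoint handling is illustrative):

```python
# Per-episode initialization vs. continual adaptation (illustrative).
def episode_start_lm(pretrained_lm, last_checkpoint=None, continual=True):
    # Default scheme: every episode restarts from the pre-trained LM.
    if not continual or last_checkpoint is None:
        return pretrained_lm
    # Continual adaptation: resume from the previous episode's LM.
    return last_checkpoint

lm = episode_start_lm("bert-base")                    # episode 1: initialized
lm = episode_start_lm("bert-base", lm + "+ep1")       # episode 2: loaded
print(lm)  # bert-base+ep1
```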
Experiments
Datasets
1) Question Answering
• SQuAD v1.1
• emrQA
• NewsQA
2) Text Classification
• IMDb
• ChemProt
Language Models
1) Question Answering
• BERT
• DistilBERT
2) Text Classification
• BERT
Experiments
Baselines
• No Pre-training
• Random Masking (Devlin et al. 19)
• Whole-Random Masking (Devlin et al. 19)
• Span-Random Masking (Joshi et al. 20)
• Entity-Random Masking (Sun et al. 19)
• Punctuation-Random Masking
[Joshi et al. 20] SpanBERT: Improving Pre-training by Representing and Predicting Spans, TACL 2020.
[Sun et al. 19] Enhanced Representation through Knowledge Integration, arXiv 2019.
[Devlin et al. 19] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
Results
[Text Classification Results] [Ablation Results]
Results
Analysis
[Example from NewsQA]
[Top-6 Part-of-Speech Tags of Masked Words on NewsQA]
Conclusion
• We proposed Neural Mask Generator (NMG), which learns the adaptive
masking policy to adapt the language model to a new domain.
• We formulated the problem of learning the optimal masking policy as a bi-level
meta-learning framework, optimized with reinforcement learning.
• Experimental results on multiple NLU tasks show that NMG generates
adaptive word maskings for a given domain, yielding performance better than,
or at least comparable to, the best-performing heuristic masking policy.
Code is available at https://github.com/Nardien/NMG
Thank you