
These solutions address the memory limitation problem, but not the communication overhead. In
this paper, we address all of the aforementioned problems by designing A Lite BERT (ALBERT)
architecture that has significantly fewer parameters than a traditional BERT architecture.
ALBERT incorporates two parameter reduction techniques that lift the major obstacles in scaling
pre-trained models. The first one is a factorized embedding parameterization. By decomposing
the large vocabulary embedding matrix into two small matrices, we separate the size of the hidden
layers from the size of vocabulary embedding. This separation makes it easier to grow the hidden
size without significantly increasing the parameter size of the vocabulary embeddings. The second
technique is cross-layer parameter sharing. This technique prevents the number of parameters from growing
with the depth of the network. Both techniques significantly reduce the number of parameters for
BERT without seriously hurting performance, thus improving parameter-efficiency. An ALBERT
configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster.
The parameter reduction techniques also act as a form of regularization that stabilizes the training
and helps with generalization.
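To make the two reduction techniques concrete, the following sketch (in PyTorch, with illustrative sizes and names of our own choosing rather than any exact ALBERT configuration) factorizes the vocabulary embedding into a small lookup table followed by a projection up to the hidden size, and reuses a single Transformer layer at every depth.

import torch
import torch.nn as nn

class FactorizedSharedEncoder(nn.Module):
    """Sketch of ALBERT-style parameter reduction (illustrative sizes only)."""

    def __init__(self, vocab_size=30000, embed_size=128,
                 hidden_size=768, num_layers=12, num_heads=12):
        super().__init__()
        # Factorized embedding: a vocab_size x embed_size lookup followed by
        # an embed_size x hidden_size projection, instead of one large
        # vocab_size x hidden_size embedding matrix.
        self.token_embedding = nn.Embedding(vocab_size, embed_size)
        self.embedding_projection = nn.Linear(embed_size, hidden_size)
        # Cross-layer sharing: one Transformer layer reused at every depth,
        # so parameters do not grow with the number of layers.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, token_ids):
        hidden = self.embedding_projection(self.token_embedding(token_ids))
        for _ in range(self.num_layers):  # same weights at every layer
            hidden = self.shared_layer(hidden)
        return hidden

# Embedding parameters drop from vocab_size * hidden_size to
# vocab_size * embed_size + embed_size * hidden_size, and the encoder
# parameter count is independent of num_layers.
model = FactorizedSharedEncoder()
print(sum(p.numel() for p in model.parameters()))

In this sketch, growing either the hidden size or the depth leaves the embedding table untouched and adds no new layer parameters, which is precisely the decoupling the two techniques are meant to provide.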
To further improve the performance of ALBERT, we also introduce a self-supervised loss for
sentence-order prediction (SOP). SOP primarily focuses on inter-sentence coherence and is designed
to address the ineffectiveness (Yang et al., 2019; Liu et al., 2019) of the next sentence prediction
(NSP) loss proposed in the original BERT.
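As a rough sketch of how SOP training pairs can be constructed (consistent with the description of SOP later in the paper, but with segment sampling and tokenization omitted and the function name our own): positive examples keep two consecutive segments in their original order, while negative examples swap them.

import random

def make_sop_example(segment_a, segment_b):
    """Build one sentence-order prediction pair from two consecutive segments.

    segment_a and segment_b are assumed to be adjacent segments from the
    same document; tokenization and segment sampling are omitted.
    """
    if random.random() < 0.5:
        return (segment_a, segment_b), 1  # positive: original order
    return (segment_b, segment_a), 0      # negative: order swapped

pair, label = make_sop_example(
    "The storm knocked out power across the city.",
    "Crews worked overnight to restore it.")
print(pair, label)

Because both classes draw their segments from the same document, topic cues are uninformative and the classifier must rely on discourse order, which is the coherence signal SOP targets.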
As a result of these design decisions, we are able to scale up to much larger ALBERT configurations
that still have fewer parameters than BERT-large but achieve significantly better performance. We
establish new state-of-the-art results on the well-known GLUE, SQuAD, and RACE benchmarks
for natural language understanding. Specifically, we push the RACE accuracy to 89.4%, the GLUE
benchmark to 89.4, and the F1 score of SQuAD 2.0 to 92.2.
2 RELATED WORK
2.1 SCALING UP REPRESENTATION LEARNING FOR NATURAL LANGUAGE
Learning representations of natural language has been shown to be useful for a wide range of NLP
tasks and has been widely adopted (Mikolov et al., 2013; Le & Mikolov, 2014; Dai & Le, 2015;
Peters et al., 2018; Devlin et al., 2019; Radford et al., 2018; 2019). One of the most significant changes
in the last two years is the shift from pre-training word embeddings, whether standard (Mikolov
et al., 2013; Pennington et al., 2014) or contextualized (McCann et al., 2017; Peters et al., 2018),
to full-network pre-training followed by task-specific fine-tuning (Dai & Le, 2015; Radford et al.,
2018; Devlin et al., 2019). In this line of work, it is often shown that larger model size improves
performance. For example, Devlin et al. (2019) show that across three selected natural language
understanding tasks, using larger hidden size, more hidden layers, and more attention heads always
leads to better performance. However, they stop at a hidden size of 1024, presumably because of the
model size and computation cost problems.
It is difficult to experiment with large models due to computational constraints, especially in terms
of GPU/TPU memory limitations. Given that current state-of-the-art models often have hundreds of
millions or even billions of parameters, we can easily hit memory limits. To address this issue, Chen
et al. (2016) propose a method called gradient checkpointing to reduce the memory requirement to be
sublinear at the cost of an extra forward pass. Gomez et al. (2017) propose a way to reconstruct each
layer’s activations from the next layer so that they do not need to store the intermediate activations.
Both methods reduce memory consumption at the cost of speed. Raffel et al. (2019) propose
to use model parallelization to train a giant model. In contrast, our parameter-reduction techniques
reduce memory consumption and increase training speed.
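As an illustration of the checkpointing trade-off described above, the minimal sketch below uses PyTorch's torch.utils.checkpoint as a stand-in for the method of Chen et al. (2016); the module, block structure, and sizes are hypothetical.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Deep stack whose intermediate activations are recomputed, not stored."""

    def __init__(self, hidden_size=1024, num_blocks=24):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.GELU())
            for _ in range(num_blocks))

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are discarded after the forward pass
            # and recomputed during backward, trading extra compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedStack()
out = model(torch.randn(8, 1024, requires_grad=True))
out.sum().backward()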
2.2 CROSS-LAYER PARAMETER SHARING
The idea of sharing parameters across layers has been previously explored with the Transformer
architecture (Vaswani et al., 2017), but this prior work has focused on training for standard encoder-
decoder tasks rather than the pretraining/finetuning setting. Different from our observations,
Dehghani et al. (2018) show that networks with cross-layer parameter sharing (Universal Transformer,
UT) get better performance on language modeling and subject-verb agreement than the standard