
These solutions address the memory limitation problem, but not the communication overhead. In
this paper, we address all of the aforementioned problems by designing A Lite BERT (ALBERT)
architecture that has significantly fewer parameters than a traditional BERT architecture.
ALBERT incorporates two parameter reduction techniques that lift the major obstacles in scaling
pre-trained models. The first one is a factorized embedding parameterization. By decomposing
the large vocabulary embedding matrix into two small matrices, we separate the size of the hidden
layers from the size of vocabulary embedding. This separation makes it easier to grow the hidden
size without significantly increasing the parameter size of the vocabulary embeddings. The second
technique is cross-layer parameter sharing. This technique prevents the number of parameters from growing
with the depth of the network. Both techniques significantly reduce the number of parameters for
BERT without seriously hurting performance, thus improving parameter-efficiency. An ALBERT
configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster.
The parameter reduction techniques also act as a form of regularization that stabilizes the training
and helps with generalization.
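To make the two reduction techniques concrete, the following sketch (in PyTorch, with illustrative sizes and names of our own choosing rather than any exact ALBERT configuration) factorizes the vocabulary embedding into a small lookup table followed by a projection up to the hidden size, and reuses a single Transformer layer at every depth.

import torch
import torch.nn as nn

class FactorizedSharedEncoder(nn.Module):
    """Sketch of ALBERT-style parameter reduction (illustrative sizes only)."""

    def __init__(self, vocab_size=30000, embed_size=128,
                 hidden_size=768, num_layers=12, num_heads=12):
        super().__init__()
        # Factorized embedding: a vocab_size x embed_size lookup followed by
        # an embed_size x hidden_size projection, instead of one large
        # vocab_size x hidden_size embedding matrix.
        self.token_embedding = nn.Embedding(vocab_size, embed_size)
        self.embedding_projection = nn.Linear(embed_size, hidden_size)
        # Cross-layer sharing: one Transformer layer reused at every depth,
        # so parameters do not grow with the number of layers.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True)
        self.num_layers = num_layers

    def forward(self, token_ids):
        hidden = self.embedding_projection(self.token_embedding(token_ids))
        for _ in range(self.num_layers):  # same weights at every layer
            hidden = self.shared_layer(hidden)
        return hidden

# Embedding parameters drop from vocab_size * hidden_size to
# vocab_size * embed_size + embed_size * hidden_size, and the encoder
# parameter count is independent of num_layers.
model = FactorizedSharedEncoder()
print(sum(p.numel() for p in model.parameters()))

In this sketch, growing either the hidden size or the depth leaves the embedding table untouched and adds no new layer parameters, which is precisely the decoupling the two techniques are meant to provide.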
To further improve the performance of ALBERT, we also introduce a self-supervised loss for
sentence-order prediction (SOP). SOP primarily focuses on inter-sentence coherence and is designed
to address the ineffectiveness (Yang et al., 2019; Liu et al., 2019) of the next sentence prediction
(NSP) loss proposed in the original BERT.
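As a rough sketch of how SOP training pairs can be constructed (consistent with the description of SOP later in the paper, but with segment sampling and tokenization omitted and the function name our own): positive examples keep two consecutive segments in their original order, while negative examples swap them.

import random

def make_sop_example(segment_a, segment_b):
    """Build one sentence-order prediction pair from two consecutive segments.

    segment_a and segment_b are assumed to be adjacent segments from the
    same document; tokenization and segment sampling are omitted.
    """
    if random.random() < 0.5:
        return (segment_a, segment_b), 1  # positive: original order
    return (segment_b, segment_a), 0      # negative: order swapped

pair, label = make_sop_example(
    "The storm knocked out power across the city.",
    "Crews worked overnight to restore it.")
print(pair, label)

Because both classes draw their segments from the same document, topic cues are uninformative and the classifier must rely on discourse order, which is the coherence signal SOP targets.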
As a result of these design decisions, we are able to scale up to much larger ALBERT configurations
that still have fewer parameters than BERT-large but achieve significantly better performance. We
establish new state-of-the-art results on the well-known GLUE, SQuAD, and RACE benchmarks
for natural language understanding. Specifically, we push the RACE accuracy to 89.4%, the GLUE
benchmark to 89.4, and the F1 score of SQuAD 2.0 to 92.2.
2 RELATED WORK
2.1 SCALING UP REPRESENTATION LEARNING FOR NATURAL LANGUAGE
Learning representations of natural language has been shown to be useful for a wide range of NLP
tasks and has been widely adopted (Mikolov et al., 2013; Le & Mikolov, 2014; Dai & Le, 2015;
Peters et al., 2018; Devlin et al., 2019; Radford et al., 2018; 2019). One of the most significant changes
in the last two years is the shift from pre-training word embeddings, whether standard (Mikolov
et al., 2013; Pennington et al., 2014) or contextualized (McCann et al., 2017; Peters et al., 2018),
to full-network pre-training followed by task-specific fine-tuning (Dai & Le, 2015; Radford et al.,
2018; Devlin et al., 2019). In this line of work, it is often shown that larger model size improves
performance. For example, Devlin et al. (2019) show that across three selected natural language
understanding tasks, using larger hidden size, more hidden layers, and more attention heads always
leads to better performance. However, they stop at a hidden size of 1024, presumably because of the
model size and computation cost problems.
It is difficult to experiment with large models due to computational constraints, especially in terms
of GPU/TPU memory limitations. Given that current state-of-the-art models often have hundreds of
millions or even billions of parameters, we can easily hit memory limits. To address this issue, Chen
et al. (2016) propose a method called gradient checkpointing to reduce the memory requirement to be
sublinear at the cost of an extra forward pass. Gomez et al. (2017) propose a way to reconstruct each
layer’s activations from the next layer so that they do not need to store the intermediate activations.
Both methods reduce memory consumption at the cost of speed. Raffel et al. (2019) propose
to use model parallelization to train a giant model. In contrast, our parameter-reduction techniques
reduce memory consumption and increase training speed.
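As an illustration of the checkpointing trade-off described above, the minimal sketch below uses PyTorch's torch.utils.checkpoint as a stand-in for the method of Chen et al. (2016); the module, block structure, and sizes are hypothetical.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Deep stack whose intermediate activations are recomputed, not stored."""

    def __init__(self, hidden_size=1024, num_blocks=24):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.GELU())
            for _ in range(num_blocks))

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are discarded after the forward pass
            # and recomputed during backward, trading extra compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedStack()
out = model(torch.randn(8, 1024, requires_grad=True))
out.sum().backward()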
2.2 CROSS-LAYER PARAMETER SHARING
The idea of sharing parameters across layers has been previously explored with the Transformer
architecture (Vaswani et al., 2017), but this prior work has focused on training for standard encoder-
decoder tasks rather than the pretraining/finetuning setting. Different from our observations,
Dehghani et al. (2018) show that networks with cross-layer parameter sharing (Universal Transformer,
UT) get better performance on language modeling and subject-verb agreement than the standard