Scaling your training pipeline for larger models
To train larger models, we need to employ techniques such as gradient accumulation and mixed precision training. For very large language models that might not fit on a single GPU, the following code introduces a LargeScaleLLMTrainer that relies on two main techniques:
First, gradient accumulation lets us simulate a much larger batch size than the GPU's memory could otherwise hold. Instead of updating the model's parameters after every small batch of data, we process several small batches in a row, accumulating their gradients along the way. Only after a predefined number of batches do we perform an actual parameter update. The model therefore learns as if it had seen one much larger batch of data, without requiring the memory capacity of an extremely large GPU.
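To make the idea concrete, here is a minimal PyTorch-style sketch of gradient accumulation. It is not the LargeScaleLLMTrainer itself; the function name, the accumulation_steps value, and the model, optimizer, dataloader, and loss_fn arguments are placeholders chosen for this illustration:

```python
def train_with_grad_accumulation(model, optimizer, dataloader, loss_fn,
                                 accumulation_steps=8):
    """Accumulate gradients over several micro-batches before each update."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        outputs = model(inputs)
        # Divide the loss so the summed gradients match the average gradient
        # of one large batch of size micro_batch_size * accumulation_steps.
        loss = loss_fn(outputs, targets) / accumulation_steps
        loss.backward()  # gradients are summed into each parameter's .grad
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one real parameter update per window
            optimizer.zero_grad()  # clear accumulated gradients for the next window
```

With accumulation_steps set to 8 and a micro-batch of 16 examples, each optimizer step reflects gradients from 128 examples, while peak memory stays at what a batch of 16 requires.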
Second, it employs mixed precision training, a technique where most calculations are performed using smaller, lower-precision numbers (which require less memory and can be processed faster on modern GPUs), while critical values, such as the master copy of the weights, are kept in full precision to preserve numerical stability.
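As a rough sketch of how these two techniques are commonly combined in PyTorch, rather than a reproduction of the LargeScaleLLMTrainer, the loop below uses autocast and GradScaler from torch.cuda.amp together with gradient accumulation; the function and argument names are again placeholders:

```python
from torch.cuda.amp import autocast, GradScaler

def train_mixed_precision(model, optimizer, dataloader, loss_fn,
                          accumulation_steps=8):
    """Mixed precision forward/backward combined with gradient accumulation."""
    scaler = GradScaler()  # scales the loss so float16 gradients do not underflow
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        # Run the forward pass in reduced precision where it is safe to do so.
        with autocast():
            outputs = model(inputs)
            loss = loss_fn(outputs, targets) / accumulation_steps
        # Backward pass on the scaled loss; gradients accumulate across micro-batches.
        scaler.scale(loss).backward()
        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)   # unscales gradients, then calls optimizer.step()
            scaler.update()          # adjusts the scale factor for the next window
            optimizer.zero_grad()
```

The GradScaler handles the loss-scaling bookkeeping automatically, so the training loop stays close to the plain gradient accumulation version shown earlier.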