Understanding the basics
Quantization refers to reducing the precision of a model's weights and activations, typically from 32-bit floating point (FP32) to lower-precision formats such as 16-bit floating point (FP16) or even 8-bit integers (INT8). The goal is to decrease memory usage, speed up computation, and make the model deployable on hardware with limited computational capacity. Quantization can degrade accuracy, but carefully tuned quantization schemes usually cause only minor losses, especially for large language models, which tend to tolerate reduced precision well.
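To make this concrete, here is a minimal sketch of symmetric INT8 quantization of a single tensor. The helper names are illustrative and do not correspond to any particular library API; they only show the scale-and-round idea behind the formats discussed above.

```python
import torch

def quantize_int8(x: torch.Tensor):
    # Map the FP32 value range onto signed 8-bit integers with a single scale.
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Approximate reconstruction of the original FP32 values.
    return q.float() * scale

weights = torch.randn(4, 4)                    # FP32: 4 bytes per value
q, scale = quantize_int8(weights)              # INT8: 1 byte per value (+ one FP32 scale)
error = (weights - dequantize_int8(q, scale)).abs().max()
print(q.dtype, error)                          # torch.int8, small reconstruction error
```

Storing INT8 values instead of FP32 cuts the memory for those weights by roughly 4x, at the cost of the small reconstruction error printed above.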
There are two primary quantization methods: dynamic quantization and static quantization.
- Dynamic quantization: Calculates quantization parameters on the fly during inference based on the actual input values. This adapts better to varying data distributions but introduces some computational overhead compared to static approaches.
In the following example, we use `torch.quantization.quantize_dynamic` to dynamically quantize the linear layers of a model.
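Below is a minimal, self-contained sketch of that call. The toy model is an illustrative stand-in; by default, `quantize_dynamic` converts the specified module types (here `nn.Linear`) to dynamically quantized counterparts with INT8 weights, while activation scales are computed on the fly at inference time.

```python
import torch
import torch.nn as nn

# Illustrative model; any module containing nn.Linear layers works the same way.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()

# Replace nn.Linear modules with dynamically quantized versions (INT8 weights).
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},          # module types to quantize
    dtype=torch.qint8,    # target weight dtype
)

x = torch.randn(1, 128)
with torch.no_grad():
    out = quantized_model(x)
print(out.shape)  # torch.Size([1, 10])
```

Inputs and outputs remain FP32 tensors; only the matrix multiplications inside the quantized linear layers run in INT8, which is why this approach needs no calibration data.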