Structured versus unstructured pruning
When pruning LLMs, you can either prune weights individually (unstructured pruning) or remove entire structures, such as filters, channels, or attention heads (structured pruning):
- Unstructured pruning: Individual weights are removed based on magnitude or another criterion. This offers finer granularity but produces sparse weight matrices, which are hard to accelerate on standard hardware, as demonstrated with the `prune.l1_unstructured` function described earlier.
- Structured pruning: Entire structures, such as neurons, channels, or layers, are removed. This approach maps more readily onto modern hardware and often yields better inference speedups, even though it may have a larger immediate effect on model quality.
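To make the contrast concrete, here is a minimal unstructured-pruning sketch; the layer size is illustrative, standing in for a weight matrix inside the LLM:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical linear layer; in a real LLM this would be one of the
# model's projection or feed-forward weight matrices.
layer = nn.Linear(128, 128)

# Zero out the 40% of weights with the smallest absolute magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.4)

# Make the pruning permanent by removing the reparameterization hooks.
prune.remove(layer, "weight")

# The matrix is now sparse, but its shape is unchanged, so standard
# dense kernels gain no speed without sparsity-aware hardware or libraries.
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")
```

Note that the zeros are scattered irregularly across the matrix, which is exactly why unstructured sparsity is difficult to exploit on GPUs tuned for dense computation.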
Structured pruning in LLMs can be implemented with PyTorch's built-in utilities, for example by applying L2-norm structured pruning to remove 30% of the neurons in a layer.
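A minimal sketch of this, using `torch.nn.utils.prune.ln_structured` on a hypothetical linear layer (the 512-unit size is illustrative, not from the original):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical linear layer standing in for an LLM feed-forward projection.
layer = nn.Linear(in_features=512, out_features=512)

# L2-norm structured pruning: remove 30% of the output neurons, i.e. whole
# rows of the weight matrix (dim=0), ranked by their L2 norm (n=2).
prune.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)

# Make the pruning permanent by removing the reparameterization hooks.
prune.remove(layer, "weight")

# Roughly 30% of the rows are now entirely zero.
zero_rows = (layer.weight.abs().sum(dim=1) == 0).sum().item()
print(f"zeroed neurons: {zero_rows} / 512")
```

Because entire rows are zeroed, the pruned neurons (and their biases) can be physically dropped from the layer afterward, shrinking the matrix and delivering real speedups on dense hardware, unlike the scattered zeros of unstructured pruning.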