Deep Dive: Parameter-Efficient Model Adaptation with LoRA and Spectrum
Companion video: https://youtu.be/CTncBjRgktk
Julien Simon
https://www.linkedin.com/in/juliensimon
https://www.youtube.com/juliensimonfr
The author of this material is Julien Simon https://www.linkedin.com/in/juliensimon unless explicitly mentioned.
This material is shared under the CC BY-NC 4.0 license https://creativecommons.org/licenses/by-nc/4.0/
You are free to share and adapt this material, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes. You may not apply any restriction on what the license permits.
A typical model adaptation workflow
Pretrained model
→ Continuous pre-training (CPT) on an unlabeled domain dataset → Domain-adapted model
→ Instruction fine-tuning (IFT) on a Q&A dataset → Instruction-tuned model
→ Alignment on a preference dataset → Aligned model
Alternative path: instruction pre-training, which uses the unlabeled domain dataset together with a Q&A dataset.
« Language Models are Few-Shot Learners » https://arxiv.org/abs/2005.14165 (05/2020)
« Finetuned Language Models Are Zero-Shot Learners » https://arxiv.org/abs/2109.01652 (09/2021)
« Efficient Continual Pre-training for Building Domain Specific Large Language Models » https://arxiv.org/abs/2311.08545 (11/2023)
« Instruction Pre-Training: Language Models are Supervised Multitask Learners » https://arxiv.org/abs/2406.14491v1 (06/2024)
« How Do Large Language Models Acquire Factual Knowledge During Pretraining? » https://arxiv.org/abs/2406.11813v1 (06/2024)
Challenges of model adaptation
• Building a "great" model involves time-consuming and compute-intensive steps
• Building datasets is hard work
• Continuous Pre-Training requires a large corpus, at least billions of tokens
• Instruction Fine-Tuning and Alignment require high-quality, diverse Q&A pairs
• Training models: accuracy or cost-efficiency?
• Full fine-tuning (FFT): update all model parameters in original precision (say, BF16)
• Compute-heavy and expensive… assuming you can get the required amount of data and compute (a rough memory estimate follows after this list)
• Parameter-Efficient Fine-Tuning (PEFT), e.g. LoRA or QLoRA
• Learn only a much smaller number of model parameters, with optional quantization
• Much more memory efficient, enabling smaller GPUs and shorter training times
• Very effective for Instruction Fine-Tuning (IFT) and alignment
• Significant accuracy degradation for CPT
• Can we get accuracy and cost-efficiency? Yes: Spectrum
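To make the memory gap concrete, here is a rough back-of-the-envelope sketch (an illustration only: the 8B parameter count, the Adam optimizer layout, and the ~0.5% LoRA trainable fraction are assumptions, and activation memory is ignored):

```python
# Rough training-memory estimate (illustration only, activations ignored)
GB = 1024**3
FULL_PARAMS = 8e9            # assumed 8B-parameter model
LORA_PARAMS = 0.005 * 8e9    # assumed ~0.5% of weights trainable with LoRA

def training_bytes(trainable, frozen=0.0):
    weights = (trainable + frozen) * 2   # all weights stored in BF16 (2 bytes each)
    grads   = trainable * 2              # BF16 gradients for trainable weights only
    adam    = trainable * 8              # two FP32 Adam moments (4 + 4 bytes)
    master  = trainable * 4              # FP32 master copy of trainable weights
    return weights + grads + adam + master

print(f"Full fine-tuning : ~{training_bytes(FULL_PARAMS) / GB:.0f} GB")
print(f"LoRA (~0.5%)     : ~{training_bytes(LORA_PARAMS, frozen=FULL_PARAMS) / GB:.0f} GB")
```

Even this optimistic estimate puts full fine-tuning far beyond a single mid-range GPU, while the PEFT budget is dominated by the frozen BF16 weights.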
Singular Value Decomposition
import numpy as np
C = np.random.rand(1024, 1024)   # random 1024x1024 matrix
U, Sigma, Vt = np.linalg.svd(C)  # full SVD: C = U @ diag(Sigma) @ Vt
• SVD is a general matrix factorization technique.
• U: basis vectors for column space
• VT: basis vectors for row space
• Σ: diagonal matrix of singular values
MIT Linear Algebra course: https://www.youtube.com/watch?v=TX_vooSnhm8
C (1024×1024) = U (1024×1024) · Σ (1024×1024) · Vᵀ (1024×1024)
U holds the left singular vectors, Vᵀ the right singular vectors, and Σ is a diagonal matrix of singular values in descending order.
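As a quick check (continuing the snippet above), the three factors reconstruct C up to floating-point error:

```python
# Rebuild C from its factors and measure the reconstruction error
C_rebuilt = U @ np.diag(Sigma) @ Vt
print(np.allclose(C, C_rebuilt))             # True
print(np.linalg.norm(C - C_rebuilt, 'fro'))  # tiny (floating-point error)
```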
Low-rank approximation with SVD
top_k = 8                                     # rank k of the approximation
U, Sigma, Vt = np.linalg.svd(C)
U_k = U[:, :top_k]                            # keep the top-k left singular vectors
Sigma_k = np.diag(Sigma[:top_k])              # keep the top-k singular values
Vt_k = Vt[:top_k, :]                          # keep the top-k right singular vectors
C_k = np.dot(U_k, np.dot(Sigma_k, Vt_k))      # rank-k approximation of C
diff = np.linalg.norm(C - C_k, 'fro')         # Frobenius norm of the approximation error
C_k (1024×1024) ≈ U_k (1024×k) · Σ_k (k×k) · Vt_k (k×1024)
U_k holds the top k left singular vectors, Σ_k the top k singular values, and Vt_k the top k right singular vectors; C_k is the rank-k reconstruction of C.
• We can approximate C by keeping only the k largest singular values
• k is called the rank
• Far fewer parameters are required (see the sketch below)
• C: 2^20 parameters
• U_k, Sigma_k, Vt_k: 2×(2^10×k) + k^2 parameters
• If k=8: 16,448 parameters (1.56%)
• The Frobenius norm measures the difference between C and C_k
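To see how the rank trades parameters against accuracy, here is a small sweep over k (continuing the snippet above; the specific k values are arbitrary):

```python
# Parameter count and relative reconstruction error for a few ranks
n = C.shape[0]                                        # 1024
for k in (2, 8, 32, 128):
    C_k = U[:, :k] @ np.diag(Sigma[:k]) @ Vt[:k, :]   # rank-k approximation
    rel_err = np.linalg.norm(C - C_k, 'fro') / np.linalg.norm(C, 'fro')
    n_params = 2 * n * k + k * k                      # U_k, Vt_k and Sigma_k
    print(f"k={k:3d}  params={n_params:7d} ({100 * n_params / C.size:.2f}%)  rel. error={rel_err:.3f}")
```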
Low Rank Adaptation (LoRA)
https://arxiv.org/abs/2106.09685 (06/2021)
W′ = W0 + ΔW, where W0 (n×m) is the frozen base model and ΔW (n×m) holds the learned weight updates; W′ is the fine-tuned model.
ΔW = B · A, with B (n×r) and A (r×m).
• LoRA hypothesis: fine-tuning updates can be learned with two low-rank matrices
"For a pre-trained weight matrix W0 ∈ R^(n×m), we constrain its update by representing the latter with a low-rank decomposition W0 + ΔW = W0 + BA, where B ∈ R^(n×r), A ∈ R^(r×m), and the rank r << min(n, m). During training, W0 is frozen and does not receive gradient updates, while A and B contain trainable parameters".
• We only learn r×(n+m) parameters, a MUCH smaller number than W0's n×m (r is typically 4 to 16)
• LoRA is usually applied to attention layers, not to all layers.
• Very easy to run with Hugging Face PEFT (a minimal sketch follows below)
https://github.com/huggingface/peft
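For illustration, a minimal sketch with Hugging Face PEFT (the base model name, rank, and target modules are placeholder choices, not recommendations):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Load a base model (placeholder model name)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# LoRA adapters on the attention projections only, rank 8
lora_config = LoraConfig(
    r=8,                                   # rank of the update matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # which layers receive adapters
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only a small fraction of all weights is trainable
```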
Challenges with LoRA
• Choosing which layers to apply LoRA to can be non-trivial.
• Finding the optimal hyperparameters (particularly the rank) can be a time-consuming process
• The rank determines the number of significant directions used to adapt the pre-trained model.
• If the rank is too low, we risk losing information (under-fitting), i.e. catastrophic forgetting.
• If the rank is too high, we risk introducing noise (over-fitting), i.e. insignificant directions
• Unfortunately, the same rank may not work well for all layers.
• LoRA may not be effective when adapting a model to a new domain that is very different from the pre-training data (continuous pre-training)
https://blog.arcee.ai/why-methods-like-qlora-fall-short-in-domain-knowledge-injection-2/
• « LoRA vs. Full Fine-Tuning: An Illusion of Equivalence » https://arxiv.org/abs/2410.21228 (10/2024)
« LoRA and full fine-tuning produce structurally different parameter updates, characterized by the existence of intruder dimensions (…) These are singular vectors, with large associated singular values, that are approximately orthogonal to the singular vectors in a pre-trained weight matrix (…) LoRA fine-tuned models with intruder dimensions forget more of the pre-training distribution and exhibit less robust continual learning compared to full fine-tuning. »
Spectrum
https://arxiv.org/abs/2406.06623 (06/2024)
https://github.com/cognitivecomputations/spectrum
• Hypothesis: not all layers contribute equally to the output.
• Some layers have a higher signal-to-noise ratio (SNR) than others.
• Spectrum runs full fine-tuning on these layers and leaves the other layers untouched.
• For a given layer, the SNR is the ratio of the sum of the large singular values to the sum of the small singular values, based on a threshold ε (a simplified sketch follows after the step list below).
• For large matrices, the singular values follow a continuous distribution, such as the Marchenko-Pastur distribution, which is bounded by λ− and λ+.
• λ− and λ+ depend on the matrix size and the standard deviation of its singular values.
• Any value within these bounds is considered random.
• Any value larger than λ+ is likely to be significant.
• λ+ is the ε threshold.
1. Run SVD on all model layers
2. For each layer:
   • Compute ε
   • Compute the SNR
3. Keep only the top-SNR layers (typically 25%)
4. Output a configuration file unfreezing the top-SNR layers
5. Run full fine-tuning on the top-SNR layers
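For intuition, here is a simplified sketch of the per-layer SNR computation (an illustration only, not the exact Spectrum implementation; the way σ is estimated and the Marchenko-Pastur-style threshold are assumptions):

```python
import numpy as np

def layer_snr(W: np.ndarray) -> float:
    """Approximate signal-to-noise ratio of a weight matrix (illustrative)."""
    n, m = W.shape
    s = np.linalg.svd(W, compute_uv=False)        # singular values, descending
    # Marchenko-Pastur-style upper edge for a random n x m matrix whose entries
    # have standard deviation sigma (here crudely estimated from W itself).
    sigma = W.std()
    epsilon = sigma * (np.sqrt(n) + np.sqrt(m))   # values above this are treated as signal
    signal = s[s > epsilon].sum()
    noise = s[s <= epsilon].sum()
    return signal / noise if noise > 0 else float("inf")

# Example (hypothetical): rank a dict of weight matrices by SNR
# layers = {name: module.weight.detach().numpy() for name, module in ...}
# ranked = sorted(layers, key=lambda name: layer_snr(layers[name]), reverse=True)
```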
Using Spectrum
https://arxiv.org/abs/2406.06623 (06/2024)
https://github.com/cognitivecomputations/spectrum
python spectrum.py --model-name <insert local or HF repo here>
--top-percent <top % of SNR ratios to target>
unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# input_layernorm layers
- model.layers.0.input_layernorm
- model.layers.1.input_layernorm
- model.layers.2.input_layernorm
- model.layers.3.input_layernorm
- model.layers.4.input_layernorm
- model.layers.5.input_layernorm
# lm_head layers
# mlp.down_proj layers
- model.layers.21.mlp.down_proj
- model.layers.20.mlp.down_proj
- model.layers.22.mlp.down_proj
- model.layers.19.mlp.down_proj
- model.layers.23.mlp.down_proj
- model.layers.24.mlp.down_proj
. . .
"model.layers.10.self_attn.o_proj": {
"snr": 0.25031203031539917,
"type": "self_attn.o_proj"
},
"model.layers.11.self_attn.o_proj": {
"snr": 0.2547757625579834,
"type": "self_attn.o_proj"
},
"model.layers.12.self_attn.o_proj": {
"snr": 0.2616233825683594,
"type": "self_attn.o_proj"
},
"model.layers.13.self_attn.o_proj": {
"snr": 0.2736438810825348,
"type": "self_attn.o_proj"
},
. . .
Spectrum reports an SNR for each model layer and outputs the top % of layers to train as an unfrozen_parameters list, which you can insert into your training code or your Axolotl configuration file (a minimal sketch follows below).
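If you are not using Axolotl, the same list can be applied in plain PyTorch by freezing everything and then unfreezing the matching parameters (a minimal sketch: the pattern list is truncated and `model` is assumed to be an already-loaded Hugging Face model):

```python
import re

# Patterns taken from the Spectrum output above (truncated for brevity)
unfrozen_patterns = [
    r"^lm_head.weight$",
    r"^model.embed_tokens.weight$",
    r"model\.layers\.21\.mlp\.down_proj",
]

# Freeze all parameters, then unfreeze those selected by Spectrum
for name, param in model.named_parameters():
    param.requires_grad = any(re.search(p, name) for p in unfrozen_patterns)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```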
Model quality: Mistral-7b and Llama-3-8b (benchmark charts in the original slides)
GPU RAM usage and training time
Arcee SuperNova Lite (Llama-3.1-8b)

| Method              | Time @ bs=1   | GPU RAM @ bs=1 | MMLU acc       | GSM8K strict-match | Hellaswag acc_norm |
|---------------------|---------------|----------------|----------------|--------------------|--------------------|
| Full fine-tuning    | -             | OOM            | -              | -                  | -                  |
| LoRA (r=32)         | 48 min        | 41.5 GB        | 0.6587         | 0.6520             | 0.7840             |
| QLoRA (4-bit, r=32) | 42 min        | 30.6 GB        | 0.6619         | 0.6785             | 0.7860             |
| Spectrum-25         | 32 min (-31%) | 37.5 GB (+22%) | 0.6870 (+3.8%) | 0.7597 (+12%)      | 0.8027 (+2.1%)     |
| Spectrum-50         | 43 min        | 41.5 GB        | 0.6844         | 0.7445             | 0.7999             |
Single-GPU training (L40S on AWS, 48 GB RAM), 1 epoch
Model: https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite
Dataset: https://huggingface.co/datasets/tatsu-lab/alpaca (52K rows)
accelerate launch -m axolotl.cli.train examples/supernova-lite/<config>.yml
lm_eval --model hf --model_args pretrained=<model> --tasks mmlu,gsm8k,hellaswag --batch_size 8