Deep Dive: Parameter-Efficient Model Adaptation with LoRA and Spectrum
Companion video: https://youtu.be/CTncBjRgktk
Julien Simon
https://www.linkedin.com/in/juliensimon
https://www.youtube.com/juliensimonfr
The author of this material is Julien Simon https://www.linkedin.com/in/juliensimon unless explicitly mentioned.
This material is shared under the CC BY-NC 4.0 license https://creativecommons.org/licenses/by-nc/4.0/
You are free to share and adapt this material, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes. You may not apply any restriction on what the license permits.
A typical model adaptation workflow
Pretrained model
→ Continuous pre-training (CPT) on an unlabeled domain dataset → Domain-adapted model
→ Instruction fine-tuning (IFT) on a Q&A dataset → Instruction-tuned model
→ Alignment on a preference dataset → Aligned model
Alternative path: instruction pre-training, which uses the unlabeled domain dataset together with a Q&A dataset.
« Language Models are Few-Shot Learners » https://arxiv.org/abs/2005.14165 (05/2020)
« Finetuned Language Models Are Zero-Shot Learners » https://arxiv.org/abs/2109.01652 (09/2021)
« Efficient Continual Pre-training for Building Domain Specific Large Language Models » https://arxiv.org/abs/2311.08545 (11/2023)
« Instruction Pre-Training: Language Models are Supervised Multitask Learners » https://arxiv.org/abs/2406.14491v1 (06/2024)
« How Do Large Language Models Acquire Factual Knowledge During Pretraining? » https://arxiv.org/abs/2406.11813v1 (06/2024)
Challenges of model adaptation
• Building a "great" model involves time-consuming and compute-intensive steps
• Building datasets is hard work
• Continuous Pre-Training requires a large corpus, at least billions of tokens
• Instruction Fine-Tuning and Alignment require high-quality, diverse Q&A pairs
• Training models: accuracy or cost-efficiency?
• Full fine-tuning (FFT): update all model parameters in original precision (say, BF16)
• Compute-heavy and expensive… assuming you can get the required amount of data and compute (a rough memory estimate follows after this list)
• Parameter-Efficient Fine-Tuning (PEFT), e.g. LoRA or QLoRA
• Learn only a much smaller number of model parameters, with optional quantization
• Much more memory efficient, enabling smaller GPUs and shorter training times
• Very effective for Instruction Fine-Tuning (IFT) and alignment
• Significant accuracy degradation for CPT
• Can we get accuracy and cost-efficiency? Yes: Spectrum
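To make the memory gap concrete, here is a rough back-of-the-envelope sketch (an illustration only: the 8B parameter count, the Adam optimizer layout, and the ~0.5% LoRA trainable fraction are assumptions, and activation memory is ignored):

```python
# Rough training-memory estimate (illustration only, activations ignored)
GB = 1024**3
FULL_PARAMS = 8e9            # assumed 8B-parameter model
LORA_PARAMS = 0.005 * 8e9    # assumed ~0.5% of weights trainable with LoRA

def training_bytes(trainable, frozen=0.0):
    weights = (trainable + frozen) * 2   # all weights stored in BF16 (2 bytes each)
    grads   = trainable * 2              # BF16 gradients for trainable weights only
    adam    = trainable * 8              # two FP32 Adam moments (4 + 4 bytes)
    master  = trainable * 4              # FP32 master copy of trainable weights
    return weights + grads + adam + master

print(f"Full fine-tuning : ~{training_bytes(FULL_PARAMS) / GB:.0f} GB")
print(f"LoRA (~0.5%)     : ~{training_bytes(LORA_PARAMS, frozen=FULL_PARAMS) / GB:.0f} GB")
```

Even this optimistic estimate puts full fine-tuning far beyond a single mid-range GPU, while the PEFT budget is dominated by the frozen BF16 weights.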
Singular Value Decomposition
import numpy as np
C = np.random.rand(1024, 1024)   # random 1024x1024 matrix
U, Sigma, Vt = np.linalg.svd(C)  # full SVD: C = U @ diag(Sigma) @ Vt
• SVD is a general matrix factorization technique.
• U: basis vectors for column space
• VT: basis vectors for row space
• Σ: diagonal matrix of singular values
MIT Linear Algebra course: https://www.youtube.com/watch?v=TX_vooSnhm8
C (1024×1024) = U (1024×1024) · Σ (1024×1024) · Vᵀ (1024×1024)
U holds the left singular vectors, Vᵀ the right singular vectors, and Σ is a diagonal matrix of singular values in descending order.
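As a quick check (continuing the snippet above), the three factors reconstruct C up to floating-point error:

```python
# Rebuild C from its factors and measure the reconstruction error
C_rebuilt = U @ np.diag(Sigma) @ Vt
print(np.allclose(C, C_rebuilt))             # True
print(np.linalg.norm(C - C_rebuilt, 'fro'))  # tiny (floating-point error)
```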
Low-rank approximation with SVD
top_k = 8                                     # rank k of the approximation
U, Sigma, Vt = np.linalg.svd(C)
U_k = U[:, :top_k]                            # keep the top-k left singular vectors
Sigma_k = np.diag(Sigma[:top_k])              # keep the top-k singular values
Vt_k = Vt[:top_k, :]                          # keep the top-k right singular vectors
C_k = np.dot(U_k, np.dot(Sigma_k, Vt_k))      # rank-k approximation of C
diff = np.linalg.norm(C - C_k, 'fro')         # Frobenius norm of the approximation error
C_k (1024×1024) ≈ U_k (1024×k) · Σ_k (k×k) · Vt_k (k×1024)
U_k holds the top k left singular vectors, Σ_k the top k singular values, and Vt_k the top k right singular vectors; C_k is the rank-k reconstruction of C.
• We can approximate C by keeping only the k largest singular values
• k is called the rank
• Far fewer parameters are required (see the sketch below)
• C: 2^20 parameters
• U_k, Sigma_k, Vt_k: 2×(2^10×k) + k^2 parameters
• If k=8: 16,448 parameters (1.56%)
• The Frobenius norm measures the difference between C and C_k
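To see how the rank trades parameters against accuracy, here is a small sweep over k (continuing the snippet above; the specific k values are arbitrary):

```python
# Parameter count and relative reconstruction error for a few ranks
n = C.shape[0]                                        # 1024
for k in (2, 8, 32, 128):
    C_k = U[:, :k] @ np.diag(Sigma[:k]) @ Vt[:k, :]   # rank-k approximation
    rel_err = np.linalg.norm(C - C_k, 'fro') / np.linalg.norm(C, 'fro')
    n_params = 2 * n * k + k * k                      # U_k, Vt_k and Sigma_k
    print(f"k={k:3d}  params={n_params:7d} ({100 * n_params / C.size:.2f}%)  rel. error={rel_err:.3f}")
```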
Low Rank Adaptation (LoRA)
https://arxiv.org/abs/2106.09685 (06/2021)
W′ = W0 + ΔW, where W0 (n×m) is the frozen base model and ΔW (n×m) holds the learned weight updates; W′ is the fine-tuned model.
ΔW = B · A, with B (n×r) and A (r×m).
• LoRA hypothesis: fine-tuning updates can be learned with two low-rank matrices
"For a pre-trained weight matrix W0 ∈ R^(n×m), we constrain its update by representing the latter with a low-rank decomposition W0 + ΔW = W0 + BA, where B ∈ R^(n×r), A ∈ R^(r×m), and the rank r << min(n, m). During training, W0 is frozen and does not receive gradient updates, while A and B contain trainable parameters".
• We only learn r×(n+m) parameters, a MUCH smaller number than W0's n×m (r is typically 4 to 16)
• LoRA is usually applied to attention layers, not to all layers.
• Very easy to run with Hugging Face PEFT (a minimal sketch follows below)
https://github.com/huggingface/peft
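For illustration, a minimal sketch with Hugging Face PEFT (the base model name, rank, and target modules are placeholder choices, not recommendations):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Load a base model (placeholder model name)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# LoRA adapters on the attention projections only, rank 8
lora_config = LoraConfig(
    r=8,                                   # rank of the update matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # which layers receive adapters
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only a small fraction of all weights is trainable
```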
Challenges with LoRA
• Choosing which layers to apply LoRA to can be non-trivial.
• Finding the optimal hyperparameters (particularly the rank) can be a time-consuming process
• The rank determines the number of significant directions used to adapt the pre-trained model.
• If the rank is too low, we risk losing information (under-fitting), i.e. catastrophic forgetting.
• If the rank is too high, we risk introducing noise (over-fitting), i.e. insignificant directions
• Unfortunately, the same rank may not work well for all layers.
• LoRA may not be effective when adapting a model to a new domain that is very different from the pre-training data (continuous pre-training)
https://blog.arcee.ai/why-methods-like-qlora-fall-short-in-domain-knowledge-injection-2/
• « LoRA vs. Full Fine-Tuning: An Illusion of Equivalence » https://arxiv.org/abs/2410.21228 (10/2024)
« LoRA and full fine-tuning produce structurally different parameter updates, characterized by the existence of intruder dimensions (…) These are singular vectors, with large associated singular values, that are approximately orthogonal to the singular vectors in a pre-trained weight matrix (…) LoRA fine-tuned models with intruder dimensions forget more of the pre-training distribution and exhibit less robust continual learning compared to full fine-tuning. »
Spectrum
https://arxiv.org/abs/2406.06623 (06/2024)
https://github.com/cognitivecomputations/spectrum
• Hypothesis: not all layers contribute equally to the output.
• Some layers have a higher signal-to-noise ratio (SNR) than others.
• Spectrum runs full fine-tuning on these layers and leaves the other layers untouched.
• For a given layer, the SNR is the ratio of the sum of the large singular values to the sum of the small singular values, based on a threshold ε (a simplified sketch follows after the step list below).
• For large matrices, the singular values follow a continuous distribution, such as the Marchenko-Pastur distribution, which is bounded by λ− and λ+.
• λ− and λ+ depend on the matrix size and the standard deviation of its singular values.
• Any value within these bounds is considered random.
• Any value larger than λ+ is likely to be significant.
• λ+ is the ε threshold.
1. Run SVD on all model layers
2. For each layer:
   • Compute ε
   • Compute the SNR
3. Keep only the top-SNR layers (typically 25%)
4. Output a configuration file unfreezing the top-SNR layers
5. Run full fine-tuning on the top-SNR layers
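For intuition, here is a simplified sketch of the per-layer SNR computation (an illustration only, not the exact Spectrum implementation; the way σ is estimated and the Marchenko-Pastur-style threshold are assumptions):

```python
import numpy as np

def layer_snr(W: np.ndarray) -> float:
    """Approximate signal-to-noise ratio of a weight matrix (illustrative)."""
    n, m = W.shape
    s = np.linalg.svd(W, compute_uv=False)        # singular values, descending
    # Marchenko-Pastur-style upper edge for a random n x m matrix whose entries
    # have standard deviation sigma (here crudely estimated from W itself).
    sigma = W.std()
    epsilon = sigma * (np.sqrt(n) + np.sqrt(m))   # values above this are treated as signal
    signal = s[s > epsilon].sum()
    noise = s[s <= epsilon].sum()
    return signal / noise if noise > 0 else float("inf")

# Example (hypothetical): rank a dict of weight matrices by SNR
# layers = {name: module.weight.detach().numpy() for name, module in ...}
# ranked = sorted(layers, key=lambda name: layer_snr(layers[name]), reverse=True)
```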
Using Spectrum
https://arxiv.org/abs/2406.06623 (06/2024)
https://github.com/cognitivecomputations/spectrum
python spectrum.py --model-name <insert local or HF repo here>
--top-percent <top % of SNR ratios to target>
unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# input_layernorm layers
- model.layers.0.input_layernorm
- model.layers.1.input_layernorm
- model.layers.2.input_layernorm
- model.layers.3.input_layernorm
- model.layers.4.input_layernorm
- model.layers.5.input_layernorm
# lm_head layers
# mlp.down_proj layers
- model.layers.21.mlp.down_proj
- model.layers.20.mlp.down_proj
- model.layers.22.mlp.down_proj
- model.layers.19.mlp.down_proj
- model.layers.23.mlp.down_proj
- model.layers.24.mlp.down_proj
. . .
"model.layers.10.self_attn.o_proj": {
"snr": 0.25031203031539917,
"type": "self_attn.o_proj"
},
"model.layers.11.self_attn.o_proj": {
"snr": 0.2547757625579834,
"type": "self_attn.o_proj"
},
"model.layers.12.self_attn.o_proj": {
"snr": 0.2616233825683594,
"type": "self_attn.o_proj"
},
"model.layers.13.self_attn.o_proj": {
"snr": 0.2736438810825348,
"type": "self_attn.o_proj"
},
. . .
Spectrum reports an SNR for each model layer and outputs the top % of layers to train as an unfrozen_parameters list, which you can insert into your training code or your Axolotl configuration file (a minimal sketch follows below).
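If you are not using Axolotl, the same list can be applied in plain PyTorch by freezing everything and then unfreezing the matching parameters (a minimal sketch: the pattern list is truncated and `model` is assumed to be an already-loaded Hugging Face model):

```python
import re

# Patterns taken from the Spectrum output above (truncated for brevity)
unfrozen_patterns = [
    r"^lm_head.weight$",
    r"^model.embed_tokens.weight$",
    r"model\.layers\.21\.mlp\.down_proj",
]

# Freeze all parameters, then unfreeze those selected by Spectrum
for name, param in model.named_parameters():
    param.requires_grad = any(re.search(p, name) for p in unfrozen_patterns)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```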
Model quality: Mistral-7b and Llama-3-8b (benchmark charts in the original slides)
GPU RAM usage and training time
Arcee SuperNova Lite (Llama-3.1-8b)

| Method              | Time @ bs=1   | GPU RAM @ bs=1 | MMLU acc       | GSM8K strict-match | Hellaswag acc_norm |
|---------------------|---------------|----------------|----------------|--------------------|--------------------|
| Full fine-tuning    | -             | OOM            | -              | -                  | -                  |
| LoRA (r=32)         | 48 min        | 41.5 GB        | 0.6587         | 0.6520             | 0.7840             |
| QLoRA (4-bit, r=32) | 42 min        | 30.6 GB        | 0.6619         | 0.6785             | 0.7860             |
| Spectrum-25         | 32 min (-31%) | 37.5 GB (+22%) | 0.6870 (+3.8%) | 0.7597 (+12%)      | 0.8027 (+2.1%)     |
| Spectrum-50         | 43 min        | 41.5 GB        | 0.6844         | 0.7445             | 0.7999             |
Single-GPU training (L40S on AWS, 48 GB RAM), 1 epoch
Model: https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite
Dataset: https://huggingface.co/datasets/tatsu-lab/alpaca (52K rows)
accelerate launch -m axolotl.cli.train examples/supernova-lite/<config>.yml
lm_eval --model hf --model_args pretrained=<model> --tasks mmlu,gsm8k,hellaswag --batch_size 8