Tailoring Small Language Models
for Enterprise Use Cases
Julien Simon, Chief Evangelist
julien@arcee.ai
linkedin.com/in/juliensimon
youtube.com/juliensimonfr
Tailoring Small Language Models for Enterprise Use Cases
Why customers prefer Small Language Models (SLM)
• Accessibility: anyone can use the models, regardless of budget or affiliation
• Transparency: customers have full visibility on model weights
• Privacy: customers don't have to send their data to black box APIs
• IP protection: customers train models on their data, and own them
• Freedom of choice: customers are not locked in. They can switch models anytime
• IT flexibility: customers can train and deploy models anywhere they like, using any technology
• Cost optimization: customers can find the cost/performance sweet spot for each project
• Model quality: a small, tailored model will typically outperform a generic large model on the tasks it was tailored for
A typical model adaptation workflow
Pretrained model → Continuous pre-training (CPT) on an unlabeled domain dataset → Domain-adapted model
Domain-adapted model → Instruction fine-tuning (IFT) on a Q&A dataset → Instruction-tuned model
Instruction-tuned model → Alignment on a preference dataset → Aligned model
Alternative: instruction pre-training combines CPT and IFT in one step, training on the unlabeled domain dataset plus the Q&A dataset
"Language Models are Few-Shot Learners" https://arxiv.org/abs/2005.14165 (05/2020)
"Finetuned Language Models Are Zero-Shot Learners" https://arxiv.org/abs/2109.01652 (09/2021)
"Efficient Continual Pre-training for Building Domain Specific Large Language Models" https://arxiv.org/abs/2311.08545 (11/2023)
"Instruction Pre-Training: Language Models are Supervised Multitask Learners" https://arxiv.org/abs/2406.14491v1 (06/2024)
"How Do Large Language Models Acquire Factual Knowledge During Pretraining?" https://arxiv.org/abs/2406.11813v1 (06/2024)
Continuous pre-training (CPT)
• (Continuous) pre-training involves training the model on a large corpus, often billions of tokens
• Option 1 - Full fine-tuning (FFT): train the full model in original precision (say, BF16)
• Compute-heavy and expensive
• Option 2 - Use Parameter Efficient Fine Tuning (PEFT), e.g. LoRA or QLoRA
• https://arxiv.org/abs/2305.14314 (05/2023)
• Large memory savings, enabling smaller GPUs and larger batch sizes
• Very effective for Instruction Fine-Tuning (IFT) and alignment
• Significant accuracy degradation for CPT
https://blog.arcee.ai/why-methods-like-qlora-fall-short-in-domain-knowledge-injection-2/
• Option 3 - Train only the most contributing layers in original precision
• Spectrum: https://arxiv.org/abs/2406.06623 (06/2024) + https://blog.arcee.ai/optimizing-llm-training-with-spectrum/
• https://github.com/cognitivecomputations/spectrum
• Spectrum-25 outperforms QLoRA on memory usage, training speed, and accuracy
• Spectrum-50 accuracy is on par with, or even better than, FFT, with memory savings within 10% of QLoRA's (see the sketch below)
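As an illustration of Option 3, here is a minimal, hypothetical sketch of selective layer training with PyTorch and transformers. The layer patterns below are invented for the example; Spectrum derives the actual list from a per-layer signal-to-noise analysis of the model's weights.

import re
import torch
from transformers import AutoModelForCausalLM

# Load the base model in its original precision (BF16), as in full fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

# Hypothetical "top 25%" target list; Spectrum would generate the real one
unfrozen_patterns = [
    r"model\.layers\.(2[4-9]|3[01])\.self_attn\.",  # attention in the last 8 blocks
    r"model\.layers\.(2[4-9]|3[01])\.mlp\.",        # MLPs in the last 8 blocks
]

# Freeze everything except the selected layers
for name, param in model.named_parameters():
    param.requires_grad = any(re.search(p, name) for p in unfrozen_patterns)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e9:.2f}B")
# The model can now go through a standard BF16 training loop (e.g., Trainer or an SFT loop).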
Fine-tuning
• Low Rank Adaptation (LoRA) https://arxiv.org/abs/2106.09685
• Hypothesis: weight updates can be approximated by the product of two much smaller low-rank matrices
• LoRA reduces the number of trainable parameters by 1,000x or more, with minimal loss of accuracy
• At inference time, the learned update is simply added to the original weights: no extra latency
• QLoRA: LoRA for quantized models https://arxiv.org/abs/2305.14314
• Quantize a pre-trained model to 4-bit and fine-tune it with LoRA (see the configuration sketch after this list)
• "QLoRA reduces the average memory requirements of fine-tuning a 65B parameter model from >780GB of GPU memory to <48GB without degrading the runtime or predictive performance compared to a 16-bit fully fine-tuned baseline."
• The quality (diversity and complexity) of your Q&A dataset is important
• https://github.com/arcee-ai/EvolKit : a toolkit to enhance Q&A fine-tuning datasets
• Dataset generated with EvolKit: https://huggingface.co/datasets/arcee-ai/EvolKit-20k
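As an illustration, here is a minimal QLoRA configuration sketch using the transformers, bitsandbytes, and peft libraries. The base model, rank, and target modules are illustrative choices, not settings from the talk.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit quantized weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config
)

# Attach low-rank adapters to the attention projections
lora_config = LoraConfig(
    r=16,                 # rank of the two small update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the total

# The wrapped model can then be fine-tuned on a Q&A dataset
# (for example, one generated or enhanced with EvolKit).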
"LoRA Land: 310 Fine-tuned LLMs that rival GPT-4"
https://arxiv.org/abs/2405.00732 (04/2024)
• 10 base models
• 31 tasks in 5 categories
• Classic NLP
• Coding
• Knowledge
• Reasoning
• Math
• Consistent prompting
• Completion
• Zero or single-shot
• Fine-tuning
• 4-bit QLoRA
• A single A10 GPU (!)
• No hyperparameter tuning
301/310 models surpass their base model counterpart.
The best fine-tuned LLM outperforms the best base model by +8.3 to +67.5 points, +25.0 points on average.
All fine-tuned models perform better than GPT-3.5.
224/310 fine-tuned LLMs surpass the benchmark set by GPT-4.
All 7B fine-tuned models perform better than GPT-4, except for gemma-7b and gemma-7b-it.
Reinforcement Learning with Human Feedback (RLHF)
https://huyenchip.com/2023/05/02/rlhf.html
Reward-based RLHF is challenging
• Scalability: building a large human workforce is difficult and time-consuming
• Ethics: RLHF often involves underpaid outsourced workers
• Bias and quality: human feedback can be biased or inconsistent
• Complexity: RLHF requires many steps and datasets
• Cost: RLHF is very compute-intensive
Press coverage of RLHF labor practices: Washington Post, Time, Daily Mail
Reward-free RLHF: Direct Preference Optimization (DPO)
https://arxiv.org/abs/2305.18290 (05/2023)
• DPO eliminates the need for a reward model
• The policy is optimized directly on preference pairs, using a loss derived from a closed-form reparameterization of the reward (see the sketch below)
Example preference dataset: https://huggingface.co/datasets/arcee-ai/general-dpo-datasets
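To make the reward-free idea concrete, here is a minimal PyTorch sketch of the DPO loss itself, a simplified version of what libraries such as TRL implement; the tensor values in the toy batch are purely illustrative.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more likely the policy makes each answer
    # compared to the frozen reference model, scaled by beta
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry objective: push the chosen answer above the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of two preference pairs (summed log-probabilities of full answers)
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -8.0]),
                torch.tensor([-12.5, -9.0]), torch.tensor([-13.0, -8.5]))
print(loss)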
Model Merging
https://arxiv.org/abs/2403.13257 (03/2024) + https://github.com/arcee-ai/mergekit
• Building a "great" model is challenging
• Multiple training and fine-tuning steps are time-consuming and compute-intensive
• Instead, can we build a model by merging several models that already have the properties we need?
• Combine multiple task-specific models into a single multitask model without any additional training
• Not an ensembling technique: there's only one model at the end
• Merging only requires lightweight CPU compute
• Fast process, no extra cost for training or inference, no extra inference latency
models:
  - model: mistralai/Mistral-7B-Instruct-v0.2
    parameters:
      density: 0.5
      weight: 0.5
  - model: BioMistral/BioMistral-7B
    parameters:
      density: 0.5
      weight: 0.5
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  normalize: false
  int8_mask: true
dtype: float16
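With mergekit installed (pip install mergekit), a configuration like this one is typically executed with the mergekit-yaml command-line tool, which writes the merged checkpoint to an output directory that can then be loaded like any other Hugging Face model; see the mergekit README for the exact invocation.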
A modern model adaptation workflow
The same pipeline (Pretrained model → Domain-adapted model → Instruction-tuned model → Aligned model), with two changes:
• Each training step uses an efficient technique: Spectrum for continuous pre-training (CPT) on the unlabeled domain dataset, LoRA with an EvolKit-enhanced Q&A dataset for instruction fine-tuning (IFT), and DPO on a preference dataset for alignment
• Each step can also be replaced by model merging: merging instead of training yields the domain-adapted model, merging instead of fine-tuning yields the instruction-tuned model, merging instead of aligning yields the aligned model
Merging steps can be combined, e.g., merge with a model that is both domain-adapted and aligned
Arcee Cloud
https://app.arcee.ai + https://docs.arcee.ai
Arcee SuperNova 70B (September 10th)
https://blog.arcee.ai/meet-arcee-supernova-our-flagship-70b-model-alternative-to-openai/
https://blog.arcee.ai/arcee-supernova-training-pipeline-and-model-composition/
A distilled version of Llama-3.1-405B, merged with two other in-house Llama-3.1-70B models
Best 70B model available today
Outperforms Llama-3.1-405B, Claude-3.5, and GPT-4o on IFEval
https://arxiv.org/abs/2311.07911
Chat with SuperNova (web)
Available on the AWS Marketplace
Llama-3.1-SuperNova-Lite 8B (September 10th)
https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite
A distilled version of Llama-3.1-405B
Best 8B model available today
#1 on the Hugging Face Open LLM leaderboard
Chat with Llama SuperNova Lite (ollama, Q5_K_S)
SuperNova Lite on Inferentia2
SuperNova Lite on Graviton4
Summing things up
No model rules them all: find the most appropriate one for each use case
Small, tailored open models are the way to go
New training and fine-tuning techniques are changing the model adaptation game
Visit arcee.ai to learn how you can build yours with Arcee Cloud (SaaS) or Arcee Enterprise (VPC deployment)
https://arcee.ai/blog
https://huggingface.co/arcee-ai
https://github.com/arcee-ai/aws-samples
https://youtube.com/c/juliensimonfr
Julien Simon, Chief Evangelist, Arcee AI
julien@arcee.ai