Optimizing Large Language Models with vLLM and Related Tools
By Tamanna
Introduction to LLM Optimization
Large Language Models (LLMs) like LLaMA and Mistral power AI applications but face
challenges:
High memory usage (e.g., LLaMA-13B needs 26GB in FP16).
Slow inference speeds for real-time tasks.
High computational costs.
vLLM: Open-source library for efficient LLM inference and serving.
Developed at UC Berkeley, 40,000+ GitHub stars.
Up to 24x higher throughput than Hugging Face Transformers.
What is vLLM?
vLLM (Virtual Large Language Model) optimizes LLM inference:
Reduces memory waste with PagedAttention.
Boosts throughput with continuous batching.
Supports quantization (e.g., FP8, INT8).
Compatible with Hugging Face models and OpenAI-style APIs.
Ideal for chatbots, code assistants, and text generation.
Core Features of vLLM
PagedAttention: Divides the KV cache into fixed-size blocks, reducing memory waste to <4%.
Example: Shares KV blocks across requests with a common prefix (e.g., "What is the capital of..."); see the toy sketch after this list.
Continuous Batching: Dynamically adds and removes requests from the running batch, minimizing latency.
Quantization: Supports FP8, INT8, AWQ, GPTQ, bitsandbytes.
Example: LLaMA-13B from 26GB (FP16) to 13GB (INT8).
Optimized CUDA Kernels: Uses FlashAttention for speed.
Tensor Parallelism & Speculative Decoding: Scales across GPUs, predicts tokens faster.
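To make the PagedAttention idea concrete, here is a toy Python sketch of a block table that maps each sequence's KV cache onto fixed-size physical blocks, sharing full blocks between requests with an identical prefix. This is illustrative only, not vLLM's real block manager, which also handles reference counting, copy-on-write, and GPU memory:
import itertools

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class ToyBlockManager:
    def __init__(self):
        self.next_block = itertools.count()  # next free physical block id
        self.block_tables = {}               # seq_id -> list of physical block ids
        self.prefix_cache = {}               # full token prefix -> physical block id

    def allocate(self, seq_id, tokens):
        table = []
        for start in range(0, len(tokens), BLOCK_SIZE):
            end = min(start + BLOCK_SIZE, len(tokens))
            # only full blocks are shareable; key on the entire prefix so far
            key = tuple(tokens[:end]) if end - start == BLOCK_SIZE else None
            if key is not None and key in self.prefix_cache:
                table.append(self.prefix_cache[key])  # reuse a shared block
                continue
            block = next(self.next_block)
            if key is not None:
                self.prefix_cache[key] = block        # full blocks become shareable
            table.append(block)
        self.block_tables[seq_id] = table
        return table

mgr = ToyBlockManager()
shared = list(range(40))                          # 40 pretend token ids
print(mgr.allocate("seq1", shared))               # [0, 1, 2]
print(mgr.allocate("seq2", shared[:39] + [999]))  # [0, 1, 3]: first two blocks shared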
vLLM Workflow
Diagram: [Placeholder for vLLM workflow diagram]
Workflow:
User sends prompt to vLLM API server.
AsyncLLM processes requests asynchronously.
EngineCore schedules with PagedAttention and continuous batching.
Quantized model runs on optimized CUDA kernels.
Output returned to user.
Note: Create diagram using tools like Mermaid or TikZ.
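Following that note, a minimal Mermaid sketch of the workflow above (node labels simply paraphrase the five steps on this slide; adapt as needed):
flowchart TD
    U["User prompt"] --> A["vLLM API server"]
    A --> B["AsyncLLM: asynchronous request handling"]
    B --> C["EngineCore: PagedAttention scheduling + continuous batching"]
    C --> D["Quantized model on optimized CUDA kernels"]
    D --> E["Output returned to user"]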
Using vLLM: Example (1/2)
Install: pip install vllm
Offline Inference:
from vllm import LLM, SamplingParams

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
# note: vLLM's quantization argument has no "int8" value; use a supported
# method such as "fp8" (on-the-fly, recent NVIDIA GPUs) or "awq"/"gptq"
# with a pre-quantized checkpoint
llm = LLM(model=model_name, quantization="fp8")
prompts = ["The future of AI is", "Write a short poem about the moon"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}\n")
Using vLLM: Example (2/2)
Online Serving (as above, --quantization has no "int8" value; FP8 shown, or point at an AWQ/GPTQ checkpoint):
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8 --port 8000
Docker Deployment:
docker run --runtime nvidia --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Meta-Llama-3-8B-Instruct
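Once the server is running, it exposes an OpenAI-compatible API, so any OpenAI client can query it. A minimal sketch using the openai Python package against the server started above (the api_key value is arbitrary since vLLM does not check it by default):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the served model
    prompt="The future of AI is",
    max_tokens=50,
    temperature=0.8,
)
print(completion.choices[0].text)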
Other Tools Like vLLM
TensorRT-LLM (NVIDIA):
Optimized for NVIDIA GPUs, FP8 quantization.
High throughput (120–130 tok/s for LLaMA-7B).
Complex setup (model conversion).
DeepSpeed (Microsoft):
Mixed precision (FP16, BF16), ZeRO for distributed setups.
Ideal for large-scale training/inference.
OpenLLM (BentoML):
Focus on quantization (GPTQ, bitsandbytes) and fine-tuning.
Suited for memory-constrained environments.
TGI (Hugging Face):
Continuous batching, FlashAttention.
2.2x–2.5x lower throughput than vLLM.
Additional Tools for LLM Optimization
ONNX Runtime (Microsoft):
Cross-platform (CPUs, GPUs, TPUs), INT8/FP16 quantization.
Use case: Edge devices, mixed hardware.
Llama.cpp:
CPU-optimized, 4-bit/5-bit quantization (Q4_K_M).
Example: Run Mistral-7B on a laptop (4.5GB).
ExLlamaV2:
4-bit GPTQ on NVIDIA GPUs, ~100 tok/s for LLaMA-7B.
Lightweight, single-GPU focus.
Aphrodite-Engine:
vLLM fork with speculative decoding.
Experimental, marginal speed gains.
Quantization Techniques
PTQ (Post-Training Quantization): Fast, slight accuracy loss.
Example: INT8 in vLLM.
QAT (Quantization-Aware Training): High accuracy, resource-intensive.
AWQ: Activation-aware, fast on NVIDIA GPUs.
GPTQ: 4-bit/8-bit quantization, memory-efficient.
Bitsandbytes: 8-bit with hardware acceleration.
Q4_K_M (Llama.cpp): 4-bit for CPU inference.
Example: ./main -m mistral-7b-q4_k_m.gguf -p "Tell me a story" (in newer llama.cpp builds the binary is named llama-cli).
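As a sanity check on the memory figures quoted in these slides, weight memory is roughly parameter count × bits per parameter / 8 (weights only; KV cache and runtime overhead come on top):
def weight_memory_gb(params_billion, bits_per_param):
    # weights only -- excludes KV cache, activations, and framework overhead
    return params_billion * bits_per_param / 8

for name, bits in [("FP16", 16), ("INT8", 8), ("~Q4_K_M (4.5-bit avg)", 4.5)]:
    print(f"LLaMA-13B @ {name}: ~{weight_memory_gb(13, bits):.1f} GB")
    print(f"Mistral-7B @ {name}: ~{weight_memory_gb(7, bits):.1f} GB")
This reproduces the 26GB/13GB figures for LLaMA-13B and 14GB for Mistral-7B FP16; the ~3.9GB result for 4.5-bit weights is roughly consistent with the ~4.5GB Q4_K_M file size quoted earlier once quantization metadata is included.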
Additional Optimization Techniques
Model Distillation:
Trains a smaller "student" model to mimic a larger "teacher"; see the loss sketch after the LoRA example below.
Use case: Edge deployment.
Example: Distill LLaMA-13B to 3B.
LoRA (Low-Rank Adaptation):
Fine-tunes a small set of low-rank adapter weights instead of the full model.
Example in vLLM (LLM has no lora_path argument; adapters are enabled at load time and passed per request via LoRARequest):
from vllm import LLM
from vllm.lora.request import LoRARequest
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True)
outputs = llm.generate(["Hello"], lora_request=LoRARequest("my-adapter", 1, "lora-adapter"))
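Returning to the distillation bullet above, a minimal sketch of the classic soft-target distillation loss (PyTorch assumed; the temperature value is illustrative):
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # soften both output distributions with a temperature, then push the
    # student toward the teacher with KL divergence
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# toy usage: batch of 4 positions over a 32,000-token vocabulary
student = torch.randn(4, 32000)
teacher = torch.randn(4, 32000)
print(distillation_loss(student, teacher))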
Comparison of Tools
Feature       | vLLM                      | TensorRT-LLM  | DeepSpeed           | Llama.cpp
--------------|---------------------------|---------------|---------------------|--------------------
Focus         | High-throughput serving   | GPU inference | Training/Inference  | CPU/Edge inference
Optimization  | PagedAttention, Batching  | FP8, Kernels  | ZeRO, Mixed Precision | Q4/Q5 Quantization
Throughput    | 24x vs. HF                | 120–130 tok/s | High (distributed)  | 10–20 tok/s
Quantization  | FP8, INT8, AWQ, GPTQ      | FP8, INT8     | FP16, BF16, INT8    | Q4, Q5, Q8
Hardware      | NVIDIA, AMD, CPUs         | NVIDIA GPUs   | Multi-GPU, CPUs     | CPUs, Edge
Use Case      | Chatbots, APIs            | NVIDIA setups | Large-scale setups  | Edge devices
Practical Example: Chatbot Deployment
GPU with vLLM (Mistral-7B, 8-bit weights; FP8 shown since vLLM has no "int8" option):
vllm serve mistralai/Mistral-7B-v0.1 --quantization fp8 --port 8000
Memory: 7GB (vs. 14GB FP16).
Output: "Why did the scarecrow become a motivational speaker? ..."
CPU with Llama.cpp (Mistral-7B, Q4_K_M):
./main -m mistral-7b-q4_k_m.gguf -p "Tell me a joke!" -n 100
Memory: ~4.5GB, runs on 16GB RAM laptop.
Conclusion
vLLM: Top choice for high-throughput LLM serving with PagedAttention and quantization.
Alternatives: TensorRT-LLM (NVIDIA GPUs), DeepSpeed (distributed), OpenLLM (fine-tuning),
TGI (Hugging Face), ONNX Runtime (cross-platform), Llama.cpp (CPU/edge), ExLlamaV2 (4-bit
GPTQ), Aphrodite-Engine (experimental).
Techniques: Quantization (PTQ, AWQ, Q4_K_M), distillation, LoRA.
Deploy LLMs efficiently on GPUs, CPUs, or edge devices.
Resources: vLLM documentation, Llama.cpp GitHub, Runpod.
Thank you!!
