
VLLM

LiteLLM supports all models on VLLM.

Description: vLLM is a fast and easy-to-use library for LLM inference and serving. Docs
Provider Route on LiteLLM: hosted_vllm/ (for OpenAI compatible server), vllm/ (for vLLM SDK usage)
Provider Doc: vLLM ↗
Supported Endpoints: /chat/completions, /embeddings, /completions, /rerank

Quick Start

Usage - litellm.completion (calling OpenAI compatible endpoint)

vLLM provides an OpenAI-compatible endpoint - here's how to call it with LiteLLM.

To use LiteLLM to call a hosted vLLM server, add the following to your completion call:

  • model="hosted_vllm/<your-vllm-model-name>"
  • api_base = "your-hosted-vllm-server"
import litellm

# example messages; replace with your own conversation
messages = [{"role": "user", "content": "Hello, how are you?"}]

response = litellm.completion(
    model="hosted_vllm/facebook/opt-125m",  # pass the vllm model name
    messages=messages,
    api_base="https://hosted-vllm-api.co",
    temperature=0.2,
    max_tokens=80,
)

print(response)
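
Streaming works the same way. A minimal sketch, assuming the same hosted_vllm/facebook/opt-125m model and placeholder api_base as above:

import litellm

# stream the response chunk by chunk
response = litellm.completion(
    model="hosted_vllm/facebook/opt-125m",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    api_base="https://hosted-vllm-api.co",
    stream=True,
)

for chunk in response:
    # chunks follow the OpenAI streaming format; content may be None on the final chunk
    print(chunk.choices[0].delta.content or "", end="")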

Usage - LiteLLM Proxy Server (calling OpenAI compatible endpoint)

Here's how to call an OpenAI-Compatible Endpoint with the LiteLLM Proxy Server

  1. Modify the config.yaml

model_list:
  - model_name: my-model
    litellm_params:
      model: hosted_vllm/facebook/opt-125m  # add hosted_vllm/ prefix to route as OpenAI provider
      api_base: https://hosted-vllm-api.co    # add api base for OpenAI compatible provider

  2. Start the proxy

$ litellm --config /path/to/config.yaml

  3. Send Request to LiteLLM Proxy Server
import openai

client = openai.OpenAI(
    api_key="sk-1234",             # pass litellm proxy key, if you're using virtual keys
    base_url="http://0.0.0.0:4000"  # litellm-proxy-base url
)

response = client.chat.completions.create(
    model="my-model",
    messages=[
        {
            "role": "user",
            "content": "what llm are you"
        }
    ],
)

print(response)
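
You can also point the LiteLLM Python SDK at the proxy instead of the openai client. A minimal sketch, assuming the proxy from step 2 is running on http://0.0.0.0:4000 with the virtual key sk-1234:

import litellm

# route through the proxy by treating it as an OpenAI-compatible endpoint
response = litellm.completion(
    model="openai/my-model",        # "openai/" prefix + the model_name from config.yaml
    api_key="sk-1234",              # litellm proxy key, if you're using virtual keys
    api_base="http://0.0.0.0:4000",  # litellm proxy base url
    messages=[{"role": "user", "content": "what llm are you"}],
)

print(response)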

Embeddings

from litellm import embedding
import os

os.environ["HOSTED_VLLM_API_BASE"] = "http://localhost:8000"

response = embedding(model="hosted_vllm/facebook/opt-125m", input=["Hello world"])

print(response)
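
An async variant is available via litellm.aembedding. A minimal sketch, assuming the same local vLLM server:

import asyncio
import os

from litellm import aembedding

os.environ["HOSTED_VLLM_API_BASE"] = "http://localhost:8000"

async def get_embedding():
    # same arguments as the sync call, awaited
    return await aembedding(model="hosted_vllm/facebook/opt-125m", input=["Hello world"])

print(asyncio.run(get_embedding()))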

Rerank

from litellm import rerank
import os

os.environ["HOSTED_VLLM_API_BASE"] = "http://localhost:8000"
os.environ["HOSTED_VLLM_API_KEY"] = ""  # [optional], if your VLLM server requires an API key

query = "What is the capital of the United States?"
documents = [
    "Carson City is the capital city of the American state of Nevada.",
    "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean. Its capital is Saipan.",
    "Washington, D.C. is the capital of the United States.",
    "Capital punishment has existed in the United States since before it was a country.",
]

response = rerank(
    model="hosted_vllm/your-rerank-model",
    query=query,
    documents=documents,
    top_n=3,
)
print(response)

Async Usage

from litellm import arerank
import os, asyncio

os.environ["HOSTED_VLLM_API_BASE"] = "http://localhost:8000"
os.environ["HOSTED_VLLM_API_KEY"] = ""  # [optional], if your VLLM server requires an API key

async def test_async_rerank():
    query = "What is the capital of the United States?"
    documents = [
        "Carson City is the capital city of the American state of Nevada.",
        "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean. Its capital is Saipan.",
        "Washington, D.C. is the capital of the United States.",
        "Capital punishment has existed in the United States since before it was a country.",
    ]

    response = await arerank(
        model="hosted_vllm/your-rerank-model",
        query=query,
        documents=documents,
        top_n=3,
    )
    print(response)

asyncio.run(test_async_rerank())

Send Video URL to VLLM

Example Implementation from VLLM here

Use this to send a video URL to VLLM + Gemini in the same format, using OpenAI's files message type.

There are two ways to send a video URL to VLLM:

  1. Pass the video URL directly

{"type": "file", "file": {"file_id": video_url}},

  2. Pass the video data as base64 (see the sketch after the example below)

{"type": "file", "file": {"file_data": f"data:video/mp4;base64,{video_data_base64}"}}
import os

from litellm import completion

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Summarize the following video"
            },
            {
                "type": "file",
                "file": {
                    "file_id": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
                }
            }
        ]
    }
]

# call vllm
os.environ["HOSTED_VLLM_API_BASE"] = "https://hosted-vllm-api.co"
os.environ["HOSTED_VLLM_API_KEY"] = ""  # [optional], if your VLLM server requires an API key
response = completion(
    model="hosted_vllm/qwen",  # pass the vllm model name
    messages=messages,
)

# call gemini
os.environ["GEMINI_API_KEY"] = "your-gemini-api-key"
response = completion(
    model="gemini/gemini-1.5-flash",  # pass the gemini model name
    messages=messages,
)

print(response)
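
For option 2, you can base64-encode a local video file and send it as file_data instead of a URL. A minimal sketch, assuming a hypothetical local file video.mp4 and the same hosted_vllm/qwen model:

import base64
import os

from litellm import completion

os.environ["HOSTED_VLLM_API_BASE"] = "https://hosted-vllm-api.co"

# read and base64-encode a local video file (hypothetical path)
with open("video.mp4", "rb") as f:
    video_data_base64 = base64.b64encode(f.read()).decode("utf-8")

response = completion(
    model="hosted_vllm/qwen",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the following video"},
                {"type": "file", "file": {"file_data": f"data:video/mp4;base64,{video_data_base64}"}},
            ],
        }
    ],
)

print(response)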

(Deprecated) for vllm pip package

Using - litellm.completion

pip install litellm vllm

import litellm

# example messages; replace with your own conversation
messages = [{"role": "user", "content": "Hello, how are you?"}]

response = litellm.completion(
    model="vllm/facebook/opt-125m",  # add a vllm prefix so litellm knows the custom_llm_provider==vllm
    messages=messages,
    temperature=0.2,
    max_tokens=80,
)

print(response)

Batch Completion

from litellm import batch_completion

model_name = "facebook/opt-125m"
provider = "vllm"
messages = [[{"role": "user", "content": "Hey, how's it going"}] for _ in range(5)]

response_list = batch_completion(
    model=model_name,
    custom_llm_provider=provider,  # can easily switch to huggingface, replicate, together ai, sagemaker, etc.
    messages=messages,
    temperature=0.2,
    max_tokens=80,
)
print(response_list)

Prompt Templates

For models with special prompt templates (e.g. Llama2), we format the prompt to fit their template.

What if we don't support a model you need? You can also specify your own custom prompt formatting, in case we don't have your model covered yet.

Does this mean you have to specify a prompt for all models? No. By default, we'll concatenate your message content to make a prompt (the expected format for Bloom, T5, Llama-2 base models, etc.).

Default Prompt Template

def default_pt(messages):
    return " ".join(message["content"] for message in messages)

Code for how prompt templates work in LiteLLM

Models we already have Prompt Templates for

  • meta-llama/Llama-2-7b-chat (all meta-llama llama2 chat models): completion(model='vllm/meta-llama/Llama-2-7b', messages=messages, api_base="your_api_endpoint")
  • tiiuae/falcon-7b-instruct (all falcon instruct models): completion(model='vllm/tiiuae/falcon-7b-instruct', messages=messages, api_base="your_api_endpoint")
  • mosaicml/mpt-7b-chat (all mpt chat models): completion(model='vllm/mosaicml/mpt-7b-chat', messages=messages, api_base="your_api_endpoint")
  • codellama/CodeLlama-34b-Instruct-hf (all codellama instruct models): completion(model='vllm/codellama/CodeLlama-34b-Instruct-hf', messages=messages, api_base="your_api_endpoint")
  • WizardLM/WizardCoder-Python-34B-V1.0 (all wizardcoder models): completion(model='vllm/WizardLM/WizardCoder-Python-34B-V1.0', messages=messages, api_base="your_api_endpoint")
  • Phind/Phind-CodeLlama-34B-v2 (all phind-codellama models): completion(model='vllm/Phind/Phind-CodeLlama-34B-v2', messages=messages, api_base="your_api_endpoint")

Custom prompt templates

# Create your own custom prompt template
import litellm
from litellm import completion

litellm.register_prompt_template(
    model="togethercomputer/LLaMA-2-7B-32K",
    roles={
        "system": {
            "pre_message": "[INST] <<SYS>>\n",
            "post_message": "\n<</SYS>>\n [/INST]\n"
        },
        "user": {
            "pre_message": "[INST] ",
            "post_message": " [/INST]\n"
        },
        "assistant": {
            "pre_message": "\n",
            "post_message": "\n",
        }
    }  # tell LiteLLM how you want to map the openai messages to this model
)

def test_vllm_custom_model():
    model = "vllm/togethercomputer/LLaMA-2-7B-32K"
    messages = [{"role": "user", "content": "Hello, how are you?"}]  # example messages
    response = completion(model=model, messages=messages)
    print(response['choices'][0]['message']['content'])
    return response

test_vllm_custom_model()

Implementation Code