Deploying Large Language Models with vLLM in a Docker Container

What is vLLM?

vLLM is a high-performance inference engine for large language models with optimized GPU memory management. Its goal is to maximize inference throughput while reducing memory consumption, so that LLMs run efficiently on single-GPU or multi-GPU servers.

Core advantages of vLLM:

  • High throughput: batched inference reduces token-generation latency
  • Efficient KV cache management: optimizes GPU memory and supports longer contexts
  • Multi-GPU support: tensor parallelism accelerates inference
  • OpenAI API compatibility: can run as a local API server

Key Features
High-performance inference: uses techniques such as PagedAttention and continuous batching to achieve high throughput and memory efficiency.

Strong compatibility: supports mainstream model architectures such as Hugging Face models and exposes an OpenAI-compatible API, which makes integration with other applications straightforward.

Flexible and easy to use: provides a clean interface, supports multiple decoding algorithms including parallel sampling and beam search, and can stream results.

Broad hardware support: runs on NVIDIA and AMD GPUs and makes full use of GPU compute to accelerate inference.

Rich optimizations: supports quantization methods such as GPTQ and AWQ, along with optimized CUDA kernels, for further performance gains.

System Requirements

vLLM ships precompiled C++ and CUDA (12.6) binaries, so the following are required:

  • Operating system: Linux
  • Python: 3.9 – 3.12
  • GPU: compute capability 7.0 or higher (e.g. V100, T4, RTX 20xx, A10, A100, L4, H100)

Note: compute capability defines the hardware features and instructions supported by each NVIDIA GPU architecture. It determines whether certain CUDA or Tensor Core features are available (such as Unified Memory, Tensor Cores, or dynamic parallelism); it is not a direct measure of a GPU's performance.
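A quick way to check the compute capability of the installed GPUs is to query nvidia-smi directly. This is a minimal sketch; the compute_cap query field is only available with reasonably recent drivers.

# Print the name and compute capability of every visible GPU
nvidia-smi --query-gpu=name,compute_cap --format=csv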

Environment Preparation

Before deploying vLLM, make sure the machine environment is ready, including the GPU driver, CUDA, Docker, and the other core components.

Install the NVIDIA GPU driver and the NVIDIA Container Toolkit, and configure the NVIDIA Container Runtime.
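A minimal sketch of the toolkit and runtime setup on Ubuntu, assuming the driver is already installed and the NVIDIA apt repository is configured; the CUDA base image tag in the last line is only an example.

# Install the container toolkit and register the NVIDIA runtime with Docker
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify that containers can see the GPU
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi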

curl http://127.0.0.1:11435/v1/chat/completions -H "Content-Type: application/json"  -d '{ "model": "qwen3-1.7b", "messages": [{"role": "system","content": "you are a helpful assistant"},{"role": "user","content": "你好"}],"max_new_tokens": 4096,"temperature": 0.4,"stream": false}'

vllm serve /media/p/zdp/dyna/Qwen3-1.7B --port 11435 --host 0.0.0.0 --served-model-name qwen3-1.7b --max_model_len 3000 --enable-reasoning --reasoning-parser  deepseek_r1  --gpu_memory_utilization 0.4

vllm serve /home/zxw/Documents/nllb/Qwen3-0.6B --port 11439 --host 0.0.0.0 --served-model-name qwen3-0.6b --max_model_len 3000 --enable-reasoning --reasoning-parser  deepseek_r1  --gpu_memory_utilization 0.6


docker exec -e CUDA_VISIBLE_DEVICES=1 -w /workspace/llm_weights/ -itd vllm_infer nohup vllm serve /workspace/llm_weights/QwQ-32B-AWQ --quantization awq --port 11435 --host 0.0.0.0 --served-model-name qwq --max_model_len 8000 --enable-reasoning --reasoning-parser  deepseek_r1  --gpu_memory_utilization 0.95 --dtype float16>> VL-3B_log.txt &
docker pull vllm/vllm-openai:v0.9.0.1


If access to Hugging Face is blocked, configure the Hugging Face mirror endpoint before downloading models:
# Temporary (current shell only)
export HF_ENDPOINT=https://hf-mirror.com
# Permanent
echo export HF_ENDPOINT=https://hf-mirror.com >> ~/.bashrc
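With the mirror set, model weights can then be fetched with the Hugging Face CLI. This is a sketch: the repository id Qwen/Qwen3-1.7B and the target directory are illustrative examples, not part of the original deployment.

pip install -U "huggingface_hub[cli]"
# Download the model into the directory that will later be mounted into the container
huggingface-cli download Qwen/Qwen3-1.7B --local-dir /mnt/llm_deploy/Qwen3-1.7B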


docker run --runtime nvidia --gpus all \
     -v /mnt/llm_deploy/:/home/llm_deploy \
     -p 9000:8000 \
     --ipc=host \
     -d \
     --name vllm_deepseek_qwen32b \
     vllm/vllm-openai:v0.9.0.1 \
     --model /home/llm_deploy/DeepSeek-R1-Distill-Qwen-32B \
     --tensor-parallel-size 4 \
     --max_model_len 60000


External port 9000 is mapped to internal port 8000, because the server inside the container listens on port 8000 by default.

The remaining command-line options are the same as when starting the vLLM service directly; add them as needed.
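Before sending chat requests, it can help to confirm that the container started correctly and to list the model id being served. A minimal sketch; /v1/models is part of the OpenAI-compatible API and the container name matches the docker run example above.

# Watch startup progress (loading a large checkpoint can take a while)
docker logs -f vllm_deepseek_qwen32b

# List the model(s) served by the container
curl http://localhost:9000/v1/models

A chat completion request can then be sent as follows: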



curl http://localhost:9000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/home/llm_deploy/DeepSeek-R1-Distill-Qwen-32B",
        "messages": [
            {"role": "user", "content": "你是谁?"}
        ]
    }'

In the vLLM framework, the --reasoning-parser option is the key setting that controls how output from reasoning models is parsed. Its core job is to separate the model's generated reasoning steps from the final answer and adapt the response to the OpenAI API format.

What the option does

  1. Extracting reasoning content
    When a model (such as DeepSeek R1 or Qwen3) produces output that contains reasoning steps, --reasoning-parser recognizes special markers in the output (such as <think> ... </think>), extracts the reasoning into the reasoning_content field, and places the final answer in the content field.

    Example response structure (streaming delta shown):
{
  "choices": [{
    "delta": {
      "reasoning_content": "步骤1: 分析问题... 步骤2: 计算...",
      "content": "最终答案"
    }
  }]
}
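For a non-streaming request, the two fields appear under message instead of delta. A minimal check against the qwen3-1.7b server started earlier; this assumes jq is installed and the field layout follows the structure above.

curl -s http://127.0.0.1:11435/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3-1.7b", "messages": [{"role": "user", "content": "9.11 和 9.8 哪个大?"}]}' \
    | jq '.choices[0].message | {reasoning_content, content}'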

If you hit out-of-memory (OOM) errors or slow inference with vLLM, try adjusting the following (a combined example follows this list):

  1. Lower --max-num-batched-tokens
  2. Tune --gpu-memory-utilization (typically 0.85–0.95)
  3. Use --dtype float16 to reduce GPU memory usage
  4. On multi-GPU servers, increase --tensor-parallel-size
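A sketch of a launch command combining these knobs; the model path and the specific values are placeholders to adapt to your hardware.

vllm serve /path/to/your/model --port 8000 \
    --gpu-memory-utilization 0.9 \
    --max-num-batched-tokens 8192 \
    --dtype float16 \
    --tensor-parallel-size 2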
Using vLLM as an OpenAI-Compatible API Server
  1. Start the service: run the following command to start vLLM, specifying the model path, port, and other options.

vllm serve /path/to/your/model --port 8000

2. Send requests: use the requests library in a Python script to send HTTP requests to the vLLM service and read the generated result.

import requests
 
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "your_model_name",
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
}
 
response = requests.post(url, headers=headers, json=data)
print(response.json())

For a quick command-line test (port 9000, OpenAI-style API), the same curl request shown earlier in the Docker section can be reused.

Model Output TPS

One metric for evaluating an LLM serving stack is TPS (tokens per second), i.e. the number of tokens the model produces per second. For example, if a request returns usage.completion_tokens = 512 and takes 4 seconds, the TPS is roughly 128.
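As a rough single-request estimate, the completion_tokens field of the usage block can be divided by the wall-clock time of the request. A sketch that assumes jq and bc are installed and reuses the port-9000 deployment from above.

# Time one request and estimate output tokens per second
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:9000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/home/llm_deploy/DeepSeek-R1-Distill-Qwen-32B", "messages": [{"role": "user", "content": "介绍一下你自己"}]}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | jq '.usage.completion_tokens')
echo "scale=2; $TOKENS / ($END - $START)" | bc

The following Gradio script sends a longer list of questions to the server and reports the cumulative TPS while it runs.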

# -*- coding: utf-8 -*-
# @place: Pudong, Shanghai
# @file: gradio_for_throughput.py
# @time: 2024/1/19 16:05
import gradio as gr
import requests
import time

questions = [
         # Coding questions
         "Implement a Python function to compute the Fibonacci numbers.",
         "Write a Rust function that performs binary exponentiation.",
         "How do I allocate memory in C?",
         "What are the differences between Javascript and Python?",
         "How do I find invalid indices in Postgres?",
         "How can you implement a LRU (Least Recently Used) cache in Python?",
         "What approach would you use to detect and prevent race conditions in a multithreaded application?",
         "Can you explain how a decision tree algorithm works in machine learning?",
         "How would you design a simple key-value store database from scratch?",
         "How do you handle deadlock situations in concurrent programming?",
         "What is the logic behind the A* search algorithm, and where is it used?",
         "How can you design an efficient autocomplete system?",
         "What approach would you take to design a secure session management system in a web application?",
         "How would you handle collision in a hash table?",
         "How can you implement a load balancer for a distributed system?",
         # Literature
         "What is the fable involving a fox and grapes?",
         "Write a story in the style of James Joyce about a trip to the Australian outback in 2083, to see robots in the beautiful desert.",
         "Who does Harry turn into a balloon?",
         "Write a tale about a time-traveling historian who's determined to witness the most significant events in human history.",
         "Describe a day in the life of a secret agent who's also a full-time parent.",
         "Create a story about a detective who can communicate with animals.",
         "What is the most unusual thing about living in a city floating in the clouds?",
         "In a world where dreams are shared, what happens when a nightmare invades a peaceful dream?",
         "Describe the adventure of a lifetime for a group of friends who found a map leading to a parallel universe.",
         "Tell a story about a musician who discovers that their music has magical powers.",
         "In a world where people age backwards, describe the life of a 5-year-old man.",
         "Create a tale about a painter whose artwork comes to life every night.",
         "What happens when a poet's verses start to predict future events?",
         "Imagine a world where books can talk. How does a librarian handle them?",
         "Tell a story about an astronaut who discovered a planet populated by plants.",
         "Describe the journey of a letter traveling through the most sophisticated postal service ever.",
         "Write a tale about a chef whose food can evoke memories from the eater's past.",
         # History
         "What were the major contributing factors to the fall of the Roman Empire?",
         "How did the invention of the printing press revolutionize European society?",
         "What are the effects of quantitative easing?",
         "How did the Greek philosophers influence economic thought in the ancient world?",
         "What were the economic and philosophical factors that led to the fall of the Soviet Union?",
         "How did decolonization in the 20th century change the geopolitical map?",
         "What was the influence of the Khmer Empire on Southeast Asia's history and culture?",
         # Thoughtfulness
         "Describe the city of the future, considering advances in technology, environmental changes, and societal shifts.",
         "In a dystopian future where water is the most valuable commodity, how would society function?",
         "If a scientist discovers immortality, how could this impact society, economy, and the environment?",
         "What could be the potential implications of contact with an advanced alien civilization?",
         # Math
         "What is the product of 9 and 8?",
         "If a train travels 120 kilometers in 2 hours, what is its average speed?",
         "Think through this step by step. If the sequence a_n is defined by a_1 = 3, a_2 = 5, and a_n = a_(n-1) + a_(n-2) for n > 2, find a_6.",
         "Think through this step by step. Calculate the sum of an arithmetic series with first term 3, last term 35, and total terms 11.",
         "Think through this step by step. What is the area of a triangle with vertices at the points (1,2), (3,-4), and (-2,5)?",
         "Think through this step by step. Solve the following system of linear equations: 3x + 2y = 14, 5x - y = 15.",
         # Facts
         "Who was Emperor Norton I, and what was his significance in San Francisco's history?",
         "What is the Voynich manuscript, and why has it perplexed scholars for centuries?",
         "What was Project A119 and what were its objectives?",
         "What is the 'Dyatlov Pass incident' and why does it remain a mystery?",
         "What is the 'Emu War' that took place in Australia in the 1930s?",
         "What is the 'Phantom Time Hypothesis' proposed by Heribert Illig?",
         "Who was the 'Green Children of Woolpit' as per 12th-century English legend?",
         "What are 'zombie stars' in the context of astronomy?",
         "Who were the 'Dog-Headed Saint' and the 'Lion-Faced Saint' in medieval Christian traditions?",
         "What is the story of the 'Globsters', unidentified organic masses washed up on the shores?",
]


def chat_completion(question):
    url = "http://localhost:50072/v1/chat/completions"

    headers = {'Content-Type': 'application/json'}

    json_data = {
        'model': "llama-2-13b-chat-hf",
        'messages': [
            {
                'role': 'system',
                'content': 'You are a helpful assistant.'
            },
            {
                'role': 'user',
                'content': question
            },
        ],
    }

    response = requests.post(url, headers=headers, json=json_data)
    answer = response.json()["choices"][0]["message"]["content"]
    output_tokens = response.json()["usage"]["completion_tokens"]
    return answer, output_tokens


def slowly_reverse(texts, progress=gr.Progress()):
    total_token_cnt = 0
    progress(0, desc="starting...")
    q_list = texts.split('\n')
    s_time = time.time()
    data_list = []
    for q in progress.tqdm(q_list, desc=f"generating..."):
        answer, output_token = chat_completion(q)
        total_token_cnt += output_token
        data_list.append([q, answer[:50], total_token_cnt/(time.time() - s_time)])
        print(f"{total_token_cnt/(time.time() - s_time)} TPS")

    return data_list


demo = gr.Interface(
    fn=slowly_reverse,
    # custom input textbox
    inputs=gr.Textbox(value='\n'.join(questions), label="questions"),
    # set up the output component
    outputs=gr.DataFrame(label='Table', headers=['question', 'answer', 'TPS'], interactive=True, wrap=True)
)

demo.queue().launch(server_name='0.0.0.0', share=True)

Reference for vLLM deployment parameter descriptions:

vllm 大模型部署 参数说明_vllm启动参数-CSDN博客

Deploying a Local Model with vLLM and Docker

Installing dependencies and configuring the environment

To make sure the server can run a Docker-based vLLM model, confirm that the necessary packages are installed and properly configured. This usually means updating the operating system, installing the NVIDIA CUDA toolchain, and verifying hardware compatibility.

On Ubuntu, the base environment can be prepared with:

sudo apt-get update && sudo apt-get upgrade -y
sudo apt install nvidia-driver-<version> cuda-toolkit -y
nvidia-smi

These commands install the latest NVIDIA driver and the corresponding CUDA toolkit; nvidia-smi then confirms that the GPU is recognized correctly.

Getting the vLLM Docker image

Next, obtain a prebuilt vLLM Docker image from Docker Hub or another trusted source. If network conditions allow, pull the official image directly; otherwise download it in advance and transfer it to the target machine offline.

Assuming a specific community-maintained tag is to be used, it can be pulled into the local registry like this:

docker pull registry.hub.docker.com/vllm_project:vllm_version_tag

Replace registry.hub.docker.com/vllm_project:vllm_version_tag with an actually available image address and version tag.

Starting a container instance of the service

With the preparation done, create a new Docker container to host the vLLM application. It is recommended to mount a volume so data is persisted, and to expose the ports needed for external access to the API.

An example launch with options:

docker run --gpus all \
    -p host_port:container_port \
    -v /path/to/local/model:/models \
    --name=vllm_container_name \
    -d vllm_image_name

In this command, -p maps a port on the host to the service's listening port inside the container, and -v defines a bidirectional bind mount so the host directory /path/to/local/model can be read and written by the application inside the container.

Testing the deployment

The last step is to verify that the whole pipeline works: send HTTP requests to the RESTful API to run inference or start a conversation. The exact method depends on the interface documentation of the chosen framework.

For example, when everything is running normally, you should be able to send GET/POST requests to http://localhost:host_port from a browser or another client.