Red Hat AI Inference Server (vLLM) - Getting Started

Tutorial Series Index

Red Hat AI Inference Server and vLLM

vLLM (Virtual Large Language Model) is a framework designed to accelerate inference for large language models. It was open-sourced in 2023 by a research team at UC Berkeley, and today UC Berkeley and Red Hat are the two largest code contributors to the vLLM open-source community.
Red Hat AI Inference Server (RHAIIS) is Red Hat's enterprise distribution of community vLLM. In addition to official Red Hat support and services, it is integrated with Red Hat's RHEL AI and OpenShift AI products.

Prerequisites

Confirm that the NVIDIA GPU environment (driver and CUDA) is installed and working.

$ nvidia-smi
Thu Aug 14 03:32:04 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   35C    P8             11W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Prepare the vLLM Runtime Environment

Prepare the Python Environment

  1. Install uv.
$ curl -LsSf https://astral.ac.cn/uv/install.sh | sh
$ PATH=$PATH:$HOME/.local/bin
  2. Use uv to create a Python 3.12 venv and activate it (the commands below assume they are run from your home directory, so the venv ends up at ~/myenv). A quick verification sketch follows this step.
$ uv venv myenv --python 3.12 --seed
$ source ~/myenv/bin/activate
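
To confirm that the venv is active and providing Python 3.12, a minimal check (a sketch; it only inspects the interpreter of the activated environment):

(myenv) $ python - << 'EOF'
# Verify the interpreter provided by the activated venv
import sys
print(sys.executable)   # should point into ~/myenv
print(sys.version)      # expect a 3.12.x version string
assert sys.version_info[:2] == (3, 12), "activate the Python 3.12 venv first"
EOF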

Method 1: Install and Run on RHEL

This method is suited to installing the community version of vLLM.

  1. Install vllm in the venv first, then install gcc (vLLM needs a C compiler when running models).
(myenv) $ uv pip install vllm --torch-backend=auto
(myenv) $ dnf install gcc
  2. Check the installed vllm version.
(myenv) $ pip show vllm
Name: vllm
Version: 0.10.0
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email:
License-Expression: Apache-2.0
Location: /root/myenv/lib/python3.12/site-packages
Requires: aiohttp, blake3, cachetools, cbor2, cloudpickle, compressed-tensors, depyf, diskcache, einops, fastapi, filelock, gguf, huggingface-hub, lark, llguidance, lm-format-enforcer, mistral_common, msgspec, ninja, numba, numpy, openai, opencv-python-headless, outlines_core, partial-json-parser, pillow, prometheus-fastapi-instrumentator, prometheus_client, protobuf, psutil, py-cpuinfo, pybase64, pydantic, python-json-logger, pyyaml, pyzmq, ray, regex, requests, scipy, sentencepiece, setuptools, six, tiktoken, tokenizers, torch, torchaudio, torchvision, tqdm, transformers, typing_extensions, watchfiles, xformers, xgrammar
Required-by:
  3. Start vLLM and serve a model (a quick check of the running server is sketched below).
(myenv) $ vllm serve Qwen/Qwen2.5-1.5B-Instruct
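
Once the server reports that it is listening (port 8000 by default), it can be checked from another terminal. A minimal sketch using the requests library (already pulled in as a vLLM dependency, per the pip show output above):

(myenv) $ python - << 'EOF'
# Query the OpenAI-compatible /v1/models endpoint of the local vLLM server
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])   # expect Qwen/Qwen2.5-1.5B-Instruct
EOF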

Method 2: Install and Run in a Container

This method works for both the Red Hat build (RHAIIS) and community vLLM; this article uses RHAIIS.

  1. Log in to registry.redhat.io.
(myenv) $ podman login registry.redhat.io
  2. Start the container image and run the model.
(myenv) $ mkdir -p ~/.cache/vllm && chmod g+rwX ~/.cache/vllm
(myenv) $ podman run --rm -it \
--name Llama-32-1B-Instruct-FP8 \
--device nvidia.com/gpu=all \
-e "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-e "HF_HUB_OFFLINE=0" \
-p 8000:8000 \
-v ~/.cache/vllm:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8

If you get the error "Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all", install the NVIDIA Container Toolkit and generate the CDI specification:

$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
$ dnf install -y nvidia-container-toolkit
$ nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Access the Model

curl Client

  1. List the running models.
(myenv) $ curl -s http://localhost:8000/v1/models | jq
{
  "object": "list",
  "data": [
    {
      "id": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
      "object": "model",
      "created": 1755079964,
      "owned_by": "vllm",
      "root": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
      "parent": null,
      "max_model_len": 131072,
      "permission": [
        {
          "id": "modelperm-bf987f6815494c1c99f809ed6ff83b33",
          "object": "model_permission",
          "created": 1755079964,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
  2. Query the model.
(myenv) $ curl -X POST -H "Content-Type: application/json" -d '{"prompt": "What is the capital of France?", "max_tokens": 50}' http://localhost:8000/v1/completions | jq
{
  "id": "cmpl-5906e41557ef403ead035c0a95cef0d0",
  "object": "text_completion",
  "created": 1755057051,
  "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
  "choices": [
    {
      "index": 0,
      "text": " Paris\nThe capital of France is Paris. Paris is the most populous city in France, known for its rich history, art, fashion, and cuisine. It is also home to the Eiffel Tower, the Louvre Museum, and Notre Dame",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 58,
    "completion_tokens": 50,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}

Python Client

  1. Install the openai library.
(myenv) $ uv pip install openai
  2. Create the Python client code.
(myenv) $ cat << 'EOF' > api.py
from openai import OpenAI

api_key = "llamastack"  # placeholder; vLLM only validates this if the server was started with --api-key

model = "RedHatAI/Llama-3.2-1B-Instruct-FP8"
base_url = "http://localhost:8000/v1/"

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
)

response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is Red Hat AI Inference Server a great fit for RHEL?"}
    ]
)
print(response.choices[0].message.content)
EOF
  3. Run the Python client (a streaming variant is sketched after this step).
(myenv) $ python api.py
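
The same client can also stream tokens as they are generated. A minimal sketch (an illustration only; it assumes the same base_url, api_key, and model as api.py):

(myenv) $ cat << 'EOF' > stream.py
from openai import OpenAI

# Same connection settings as api.py
client = OpenAI(base_url="http://localhost:8000/v1/", api_key="llamastack")

# stream=True yields chunks as the model generates them
stream = client.chat.completions.create(
    model="RedHatAI/Llama-3.2-1B-Instruct-FP8",
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
EOF
(myenv) $ python stream.py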

Check GPU Status

Run the following command to view GPU status and running processes.

$ nvtop

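GPU utilization can also be read programmatically when nvtop is not available. A minimal sketch using the NVML Python bindings (an assumption; install them with uv pip install nvidia-ml-py):

(myenv) $ python - << 'EOF'
# Print utilization and memory usage for GPU 0 via NVML
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU util: {util.gpu}%  memory: {mem.used / 2**20:.0f}/{mem.total / 2**20:.0f} MiB")
pynvml.nvmlShutdown()
EOF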

References

https://rhpds.github.io/rhaiis-on-rhel-showroom/modules/module-01.html
https://github.com/rh-aiservices-bu/rhaiis-demo/blob/main/README_NVIDIA_SECTION.md
https://mp.weixin.qq.com/s/uw45zUEFiDsj_VK84N0X9A
https://github.com/rh-aiservices-bu/rhaiis-demo
https://access.redhat.com/solutions/7120927
https://github.com/vllm-project/production-stack/blob/main/README.md
