Red Hat AI Inference Server (vLLM) - Getting Started

Tutorial Series Index

Red Hat AI Inference Server and vLLM

vLLM (Virtual Large Language Model) is a framework designed to accelerate inference for large language models. It was open-sourced in 2023 by a research team at UC Berkeley, and today UC Berkeley and Red Hat are the two largest code contributors to the vLLM open-source community.
Red Hat AI Inference Server (RHAIIS) is Red Hat's enterprise distribution of community vLLM. In addition to official Red Hat support and services, it is integrated with Red Hat's RHEL AI and OpenShift AI products.

Prerequisites

Confirm that the NVIDIA GPU environment (driver and CUDA) is installed and working.

$ nvidia-smi
Thu Aug 14 03:32:04 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:31:00.0 Off |                    0 |
| N/A   35C    P8             11W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Prepare the vLLM Runtime Environment

Prepare the Python Environment

  1. Install uv.
$ curl -LsSf https://astral.ac.cn/uv/install.sh | sh
$ PATH=$PATH:$HOME/.local/bin
  2. Use uv to create a Python 3.12 venv and activate it (the commands below assume they are run from your home directory, so the venv ends up at ~/myenv). A quick verification sketch follows this step.
$ uv venv myenv --python 3.12 --seed
$ source ~/myenv/bin/activate
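
To confirm that the venv is active and providing Python 3.12, a minimal check (a sketch; it only inspects the interpreter of the activated environment):

(myenv) $ python - << 'EOF'
# Verify the interpreter provided by the activated venv
import sys
print(sys.executable)   # should point into ~/myenv
print(sys.version)      # expect a 3.12.x version string
assert sys.version_info[:2] == (3, 12), "activate the Python 3.12 venv first"
EOF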

Method 1: Install and Run on RHEL

This method is suited to installing the community version of vLLM.

  1. Install vllm in the venv first, then install gcc (vLLM needs a C compiler when running models).
(myenv) $ uv pip install vllm --torch-backend=auto
(myenv) $ dnf install gcc
  2. Check the installed vllm version.
(myenv) $ pip show vllm
Name: vllm
Version: 0.10.0
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email:
License-Expression: Apache-2.0
Location: /root/myenv/lib/python3.12/site-packages
Requires: aiohttp, blake3, cachetools, cbor2, cloudpickle, compressed-tensors, depyf, diskcache, einops, fastapi, filelock, gguf, huggingface-hub, lark, llguidance, lm-format-enforcer, mistral_common, msgspec, ninja, numba, numpy, openai, opencv-python-headless, outlines_core, partial-json-parser, pillow, prometheus-fastapi-instrumentator, prometheus_client, protobuf, psutil, py-cpuinfo, pybase64, pydantic, python-json-logger, pyyaml, pyzmq, ray, regex, requests, scipy, sentencepiece, setuptools, six, tiktoken, tokenizers, torch, torchaudio, torchvision, tqdm, transformers, typing_extensions, watchfiles, xformers, xgrammar
Required-by:
  3. Start vLLM and serve a model (a quick check of the running server is sketched below).
(myenv) $ vllm serve Qwen/Qwen2.5-1.5B-Instruct
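
Once the server reports that it is listening (port 8000 by default), it can be checked from another terminal. A minimal sketch using the requests library (already pulled in as a vLLM dependency, per the pip show output above):

(myenv) $ python - << 'EOF'
# Query the OpenAI-compatible /v1/models endpoint of the local vLLM server
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])   # expect Qwen/Qwen2.5-1.5B-Instruct
EOF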

Method 2: Install and Run in a Container

This method works for both the Red Hat build (RHAIIS) and community vLLM; this article uses RHAIIS.

  1. Log in to registry.redhat.io.
(myenv) $ podman login registry.redhat.io
  2. Start the container image and run the model.
(myenv) $ mkdir -p ~/.cache/vllm && chmod g+rwX ~/.cache/vllm
(myenv) $ podman run --rm -it \
--name Llama-32-1B-Instruct-FP8 \
--device nvidia.com/gpu=all \
-e "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-e "HF_HUB_OFFLINE=0" \
-p 8000:8000 \
-v ~/.cache/vllm:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8

If you get the error "Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=all", install the NVIDIA Container Toolkit and generate the CDI specification:

$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
$ dnf install -y nvidia-container-toolkit
$ nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Access the Model

curl Client

  1. List the running models.
(myenv) $ curl -s http://localhost:8000/v1/models | jq
{
  "object": "list",
  "data": [
    {
      "id": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
      "object": "model",
      "created": 1755079964,
      "owned_by": "vllm",
      "root": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
      "parent": null,
      "max_model_len": 131072,
      "permission": [
        {
          "id": "modelperm-bf987f6815494c1c99f809ed6ff83b33",
          "object": "model_permission",
          "created": 1755079964,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
  2. Query the model.
(myenv) $ curl -X POST -H "Content-Type: application/json" -d '{"prompt": "What is the capital of France?", "max_tokens": 50}' http://localhost:8000/v1/completions | jq
{
  "id": "cmpl-5906e41557ef403ead035c0a95cef0d0",
  "object": "text_completion",
  "created": 1755057051,
  "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
  "choices": [
    {
      "index": 0,
      "text": " Paris\nThe capital of France is Paris. Paris is the most populous city in France, known for its rich history, art, fashion, and cuisine. It is also home to the Eiffel Tower, the Louvre Museum, and Notre Dame",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 58,
    "completion_tokens": 50,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}

Python Client

  1. Install the openai library.
(myenv) $ uv pip install openai
  2. Create the Python client code.
(myenv) $ cat << 'EOF' > api.py
from openai import OpenAI

api_key = "llamastack"  # placeholder; vLLM only validates this if the server was started with --api-key

model = "RedHatAI/Llama-3.2-1B-Instruct-FP8"
base_url = "http://localhost:8000/v1/"

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
)

response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is Red Hat AI Inference Server a great fit for RHEL?"}
    ]
)
print(response.choices[0].message.content)
EOF
  3. Run the Python client (a streaming variant is sketched after this step).
(myenv) $ python api.py
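
The same client can also stream tokens as they are generated. A minimal sketch (an illustration only; it assumes the same base_url, api_key, and model as api.py):

(myenv) $ cat << 'EOF' > stream.py
from openai import OpenAI

# Same connection settings as api.py
client = OpenAI(base_url="http://localhost:8000/v1/", api_key="llamastack")

# stream=True yields chunks as the model generates them
stream = client.chat.completions.create(
    model="RedHatAI/Llama-3.2-1B-Instruct-FP8",
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
EOF
(myenv) $ python stream.py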

Check GPU Status

Run the following command to view GPU status and running processes.

$ nvtop

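GPU utilization can also be read programmatically when nvtop is not available. A minimal sketch using the NVML Python bindings (an assumption; install them with uv pip install nvidia-ml-py):

(myenv) $ python - << 'EOF'
# Print utilization and memory usage for GPU 0 via NVML
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU util: {util.gpu}%  memory: {mem.used / 2**20:.0f}/{mem.total / 2**20:.0f} MiB")
pynvml.nvmlShutdown()
EOF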

References

https://rhpds.github.io/rhaiis-on-rhel-showroom/modules/module-01.html
https://github.com/rh-aiservices-bu/rhaiis-demo/blob/main/README_NVIDIA_SECTION.md
https://mp.weixin.qq.com/s/uw45zUEFiDsj_VK84N0X9A
https://github.com/rh-aiservices-bu/rhaiis-demo
https://access.redhat.com/solutions/7120927
https://github.com/vllm-project/production-stack/blob/main/README.md
