红帽 AI 推理服务（vLLM）- 运行本地的模型-CSDN博客

根据 vLLM 的运行方式的不同，使用 vLLM 运行已保存在本地的模型有以下两种方法：

在 RHEL 中的 vLLM 运行本地模型

设置环境变量 TRANSFORMERS_OFFLINE=1，不让 vLLM 在线获取本地还未有的模型。

$ export TRANSFORMERS_OFFLINE=1

启动 vLLM 并运行 Qwen/Qwen2.5-1.5B-Instruct 模型。由于该模型还未事先下载到本地，因此运行后会提示错误。

$ vllm serve Qwen/Qwen2.5-1.5B-Instruct

先下载 Qwen/Qwen2.5-1.5B-Instruct 模型到本地，然后查看保存到本地的模型文件。

$ python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen2.5-1.5B-Instruct')"
$ tree .cache/huggingface/hub/models--Qwen--Qwen2.5-1.5B-Instruct
.cache/huggingface/hub/models--Qwen--Qwen2.5-1.5B-Instruct
├── blobs
│   ├── 07bfe0640cb5a0037f9322287fbfc682806cf672
│   ├── 20024bfe7c83998e9aeaf98a0cd6a2ce6306c2f0
│   ├── 443909a61d429dff23010e5bddd28ff530edda00
│   ├── 4783fe10ac3adce15ac8f358ef5462739852c569
│   ├── 6634c8cc3133b3848ec74b9f275acaaa1ea618ab
│   ├── a6344aac8c09253b3b630fb776ae94478aa0275b
│   ├── b3327a17e2ffa52e0fd941a2810b18a9fd0e7d94
│   ├── dd924a11b4c220f385b51ffa522daea7c9f3d850e31b162bb5661df483c6d3ee
│   ├── dfc11073787daf1b0f9c0f1499487ab5f4c93738
│   └── f81ead14ab072d65a07817f83a3ee0e5a1890d10
├── refs
│   └── main
└── snapshots
    └── 989aa7980e4cf806f80c7fef2b1adb7bc71aa306
        ├── config.json -> ../../blobs/f81ead14ab072d65a07817f83a3ee0e5a1890d10
        ├── generation_config.json -> ../../blobs/dfc11073787daf1b0f9c0f1499487ab5f4c93738
        ├── LICENSE -> ../../blobs/6634c8cc3133b3848ec74b9f275acaaa1ea618ab
        ├── merges.txt -> ../../blobs/20024bfe7c83998e9aeaf98a0cd6a2ce6306c2f0
        ├── model.safetensors -> ../../blobs/dd924a11b4c220f385b51ffa522daea7c9f3d850e31b162bb5661df483c6d3ee
        ├── README.md -> ../../blobs/b3327a17e2ffa52e0fd941a2810b18a9fd0e7d94
        ├── tokenizer_config.json -> ../../blobs/07bfe0640cb5a0037f9322287fbfc682806cf672
        ├── tokenizer.json -> ../../blobs/443909a61d429dff23010e5bddd28ff530edda00
        └── vocab.json -> ../../blobs/4783fe10ac3adce15ac8f358ef5462739852c569

4 directories, 20 files

再次启动 vLLM 并运行 Qwen/Qwen2.5-1.5B-Instruct 模型，确认可以正常运行。

$ vllm serve Qwen/Qwen2.5-1.5B-Instruct

在容器中的 vLLM 运行本地模型

手工将 Qwen/Qwen3-1.7B 模型中的文件下载到本地 /root/models/Qwen/Qwen3-1.7B 目录中，然后查看保存模型的目录。

$ tree /root/models/Qwen/Qwen3-1.7B
/root/models/Qwen/Qwen3-1.7B
├── config.json
├── generation_config.json
├── merges.txt
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── tokenizer_config.json
├── tokenizer.json
└── vocab.json

0 directories, 9 files

使用 HF_HUB_OFFLINE=1 的参数运行 vLLM 镜像。并将本地 /root/models/Qwen 目录映射到容器 /root/models/Qwen 目录，然后指定从 /root/models/Qwen/Qwen3-1.7B 目录获取模型相关文件。

$ podman run --rm -it \
	--name Qwen3-1.7B \
    --device nvidia.com/gpu=all \
    -v /root/models/Qwen:/root/models/Qwen:z \
    -p 8000:8000 \
    -e HF_HUB_OFFLINE=1 \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 \
    --model /root/models/Qwen/Qwen3-1.7B