Complete local deployment of DeepSeek + vLLM + One API

Original article, last modified 2025-02-11 21:35:31

License: CC 4.0 BY-SA

Tags: #deepseek #LLM #linux #oneapi

First published 2025-02-11 01:29:36

The following is a step-by-step guide to deploying a DeepSeek model locally and serving it through vLLM + One API:

### 1. Environment preparation
- **Operating system**: Linux (Ubuntu 20.04+ recommended)
- **Hardware requirements**:
  - GPU: NVIDIA GPU with enough VRAM for the model (e.g. 16 GB or more)
  - CUDA 11.8+ and cuDNN 8+
- **Python**: 3.8+
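
Before installing anything, it can help to confirm that the basics listed above are in place. The following is a minimal sketch using only the Python standard library; the helper names `python_ok` and `tool_available` are made up for this illustration, and the check only looks for executables on `PATH` (it does not verify CUDA or cuDNN versions).

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version listed in the requirements above


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= tuple(minimum)


def tool_available(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    print("Python >= 3.8:", python_ok())
    # nvidia-smi ships with the NVIDIA driver; nvcc ships with the CUDA toolkit.
    for tool in ("nvidia-smi", "nvcc", "git"):
        print(f"{tool} on PATH:", tool_available(tool))
```

A missing `nvcc` is not necessarily a problem when PyTorch and vLLM are installed from prebuilt wheels, since those bundle their own CUDA runtime libraries.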

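```bash
# Install basic dependencies
sudo apt update && sudo apt install -y python3-pip git
```

I worked in a conda environment named `chatglm3` that I had created earlier and that already had torch installed, so I activated it before continuing:

```bash
conda activate chatglm3
```
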
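### 2. Install vLLM
vLLM is an efficient inference framework that supports multi-GPU parallel inference.
```bash
# Install vLLM from source (the option recommended here)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .  # editable install from source

# Or install the released package directly with pip
pip install vllm
```
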
### 3. Download the DeepSeek model
Make sure you have legitimate access to the model, then download the weights:
```bash
# Example: download the model from Hugging Face (access permission may be required)
git lfs install
git clone https://huggingface.co/deepseek-ai/deepseek-llm-7b-chat
```

### 4. Start the vLLM service
Start the API server with the following command:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/deepseek-model \
  --tokenizer deepseek-ai/deepseek-llm-7b-chat \
  --tensor-parallel-size 1 \
  --served-model-name deepseek-chat \
  --port 8000
```

On my machine the command was:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model /media/cys/c4e58bbe-a73a-4b02-ae9e-2b310ee884fb5/deepseek-llm-7b-chat \
  --tokenizer deepseek-ai/deepseek-llm-7b-chat \
  --tensor-parallel-size 1 \
  --served-model-name deepseek-chat \
  --port 8000
```

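**Parameters**:
- `--model`: local path to the model
- `--tokenizer`: tokenizer name or path
- `--tensor-parallel-size`: number of GPUs used for tensor parallelism
- `--served-model-name`: model name exposed through the API
- `--port`: port the service listens on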
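
If you launch the server from a script rather than by hand, the same flags can be assembled programmatically. The sketch below is one way to do that, assuming you want an argv list suitable for `subprocess`; the function name `build_vllm_command` and its keyword names are invented for this example, while the flags are the ones described above.

```python
import shlex


def build_vllm_command(model_path, tokenizer, served_name,
                       tensor_parallel_size=1, port=8000):
    """Assemble the argv list for vLLM's OpenAI-compatible API server."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", str(model_path),
        "--tokenizer", tokenizer,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--served-model-name", served_name,
        "--port", str(port),
    ]


cmd = build_vllm_command(
    model_path="/path/to/deepseek-model",
    tokenizer="deepseek-ai/deepseek-llm-7b-chat",
    served_name="deepseek-chat",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing the list to `subprocess.run(cmd)` would start the server; keeping the arguments as a list avoids shell-quoting problems with paths that contain spaces.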

---

### 5. Deploy One API
One API is a unified API management tool that supports routing across multiple models.

#### 5.1 Download One API
```bash
git clone https://github.com/songquanpeng/one-api.git
cd one-api
```

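#### 5.2 Configure the database
Edit `config.yaml` (One API uses SQLite by default, and its releases typically read database settings from environment variables such as `SQL_DSN`, so check the README of your version to see whether this file applies):
```yaml
database:
  type: sqlite
  path: one-api.db
```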

#### 5.3 Add a vLLM channel
Log in to the One API admin console (default account `root` / `123456`):
1. Go to **Channels** -> **Add channel**
2. Configure the parameters:
   - **Type**: `OpenAI`
   - **Name**: any name (e.g. `deepseek-vllm`)
   - **API address**: `http://localhost:8000/v1`
   - **Model**: `deepseek-chat` (must match vLLM's `--served-model-name`)

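#### 5.4 Start One API
Run the `one-api` binary. Cloning the repository does not produce one, so build it from source or download a release first: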
```bash
./one-api --port 3000
```

---

### 6. Test the API
#### Call through One API:
```bash
curl http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer ONE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-chat",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

#### Call vLLM directly:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-chat",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
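
The same two calls can be made from Python. This is a rough sketch using only the standard library, assuming the endpoints and model name used in the `curl` examples; `build_chat_request` and `send` are helper names invented here, and `send` assumes the usual OpenAI-style response shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, api_key=None):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    headers = {"Content-Type": "application/json"}
    if api_key:  # One API expects a token; vLLM needs none unless --api-key is set
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Direct call to vLLM; for One API use port 3000 and pass api_key="ONE_API_KEY".
req = build_chat_request("http://localhost:8000", "deepseek-chat", "Hello!")
```

With both services running, `print(send(req))` would print the model's reply. Building the request separately from sending it makes the payload easy to inspect when a call fails.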

---

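### 7. Troubleshooting
1. **CUDA version mismatch**:
   - Check the CUDA version with `nvcc --version`
   - Install a matching vLLM build. Note that `pip install vllm==0.3.3+cuda118` is unlikely to resolve from PyPI; CUDA 11.8 builds have been published as separate `+cu118` wheels on the vLLM GitHub releases page

2. **Out of GPU memory**:
   - Shard the model across more GPUs by increasing `--tensor-parallel-size` (reducing it raises per-GPU memory use), or lower `--max-model-len` to reduce the memory needed for long sequences
   - Use a quantized model: `--quantization awq` (requires AWQ-quantized weights)

3. **Model fails to load**:
   - Check the permissions on the model path
   - Make sure the model files are complete

4. **Port conflict**:
   - Change the `--port` argument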
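
For the port-conflict case, a quick check before launching can save a failed start. The following is a minimal sketch, assuming the services bind on the local machine; `port_is_free` and `first_free_port` are hypothetical helper names, and the test works by attempting to bind the port, so a port that is free now could still be taken a moment later.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is bound to host:port, by trying to bind it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(candidates):
    """Return the first free port from `candidates`, or None if all are taken."""
    for port in candidates:
        if port_is_free(port):
            return port
    return None
```

For example, `first_free_port([8000, 8001, 8002])` gives a value to pass to `--port`; remember to update the API address of the One API channel if the vLLM port changes.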

---

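### Advanced configuration
- **Batching**: add `--max-num-batched-tokens 4096` to tune throughput
- **Monitoring**: scrape the Prometheus metrics that the vLLM server exposes at `/metrics`
- **Authentication**: add `--api-key YOUR_KEY` when starting vLLM

With the steps above, you can complete a full local deployment of DeepSeek + vLLM + One API.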

-------------------------------------------------------

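Judging from the error log, the root cause is that your GPU (Tesla P40) does not support the `bfloat16` data type. The Tesla P40 has compute capability 6.1, whereas `bfloat16` requires at least compute capability 8.0, so the program raised an exception while initializing the device.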

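### Solution

To work around this, follow the suggestion in the error message and change the data type from `bfloat16` to `float16` by setting the `dtype` flag explicitly. The steps are as follows:

#### Setting it via a CLI argument

If you start the service from the command line, add `--dtype=half` to the command (`half` is another name for `float16`). For example: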

```bash
your_command --dtype=half
```

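Replace `your_command` with the command you actually use to start the service.

#### Setting it in code

If you start the service from a script, set the `dtype` parameter to `torch.float16` (or the shorthand `'half'`) where the model is configured. For example: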

```python
import torch

model_config = {
    # other configuration options...
    "dtype": torch.float16,  # or the string "half"
}
```

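### Verifying the change

After making the change, restart your application and check whether the problem is resolved. If everything is working, the error about `bfloat16` not being supported should no longer appear.

This change does not affect what the model can do, but it may affect precision and performance. For most use cases, replacing `bfloat16` with `float16` is a reasonable trade-off, especially when the hardware does not support `bfloat16`.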
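
The rule described in this section (use `bfloat16` only at compute capability 8.0 or higher, otherwise fall back to `float16`) can be written as a small helper. This is an illustrative sketch, not part of vLLM; `pick_dtype` is a made-up name, and the returned strings are meant to be passed to the `--dtype` flag.

```python
BF16_MIN_CAPABILITY = (8, 0)  # threshold stated in this section


def pick_dtype(compute_capability):
    """Return a --dtype value for a (major, minor) compute capability."""
    if tuple(compute_capability) >= BF16_MIN_CAPABILITY:
        return "bfloat16"
    return "half"  # alias for float16


print(pick_dtype((6, 1)))  # Tesla P40
print(pick_dtype((8, 0)))  # e.g. A100
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current GPU, so `pick_dtype(torch.cuda.get_device_capability())` would select the flag automatically.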

-------------------------------------

The error occurs due to a CUDA version mismatch between the installed PyTorch/vLLM and the CUDA libraries. Here's how to resolve it:

1. **Check CUDA Toolkit Version**:
   - Verify the CUDA version in your Conda environment:
     ```bash
     conda list cudatoolkit
     ```
   - If no CUDA toolkit is installed, install the correct version (e.g., 12.2):
     ```bash
     conda install -c "nvidia/label/cuda-12.2.0" cuda-toolkit
     ```

2. **Ensure NVIDIA Driver Compatibility**:
   - Check the driver version with `nvidia-smi`. Ensure it supports the CUDA toolkit version (e.g., CUDA 12.2 requires driver ≥ 535.54.03).

3. **Reinstall PyTorch with Correct CUDA Version**:
   - Uninstall existing PyTorch:
     ```bash
     pip uninstall torch
     ```
   - Install PyTorch compatible with your CUDA version (e.g., CUDA 12.1):
     ```bash
     pip install torch --index-url https://download.pytorch.org/whl/cu121
     ```

4. **Reinstall vLLM**:
   - After fixing PyTorch, reinstall vLLM to ensure compatibility:
     ```bash
     pip install vllm
     ```

5. **Verify Installation**:
   - Run a quick check in Python:
     ```python
     import torch
     print(torch.__version__)
     print(torch.cuda.is_available())
     ```

**Note**: If your system's CUDA toolkit is outdated, consider upgrading it or using a Conda environment with a compatible CUDA toolkit version. Aligning all components (NVIDIA driver, CUDA toolkit, PyTorch, and vLLM) to the same CUDA version resolves the missing symbol error.
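
As a rough aid for the alignment check described in the note, the sketch below compares two version strings: the highest CUDA version the driver supports (shown in the header of `nvidia-smi`) and the CUDA version PyTorch was built with (available as `torch.version.cuda`). The helper names are hypothetical, and the rule used here (driver version must be at least the build version) is a conservative simplification that ignores CUDA's minor-version compatibility.

```python
def parse_version(text):
    """Turn '12.2' or '12.1.105' into a (major, minor) tuple of ints."""
    parts = text.strip().split(".")
    return int(parts[0]), int(parts[1])


def driver_supports(driver_cuda, build_cuda):
    """True if the driver's max CUDA version is >= the CUDA build version."""
    return parse_version(driver_cuda) >= parse_version(build_cuda)


# Example values only: a driver reporting CUDA 12.2 and a cu121 PyTorch wheel.
print(driver_supports("12.2", "12.1"))
# An older driver reporting CUDA 11.8 with the same wheel.
print(driver_supports("11.8", "12.1"))
```

If the second case matches your machine, either upgrade the driver or install a PyTorch wheel built for the older CUDA version.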