Translation: "Converting a Hugging Face Model to a GGUF Model" (converting a native Hugging Face model to GGUF format)

Background: while deploying a vision model, deepseek-vl could not be found in LM Studio.

Issue: https://github.com/ggml-org/llama.cpp/discussions/2948

Solution: https://www.substratus.ai/blog/converting-hf-model-gguf-model/

Translation of the blog post "Converting a Hugging Face Model to a GGUF Model":

Converting a Hugging Face Model to a GGUF Model

GGUF is a lightweight model file format optimized for LLM inference, developed within the llama.cpp ecosystem. Together with quantization it significantly reduces model size while maintaining high inference speed. This guide demonstrates how to convert a Hugging Face model to GGUF format.

1. Prepare the Tools

  1. Install the tooling

    bash

    pip install llama-cpp-python

    llama-cpp-python provides the Python bindings and inference engine used later to run GGUF models. The conversion script itself (convert_hf_to_gguf.py, formerly convert.py) ships with the llama.cpp repository rather than with this package; see the setup sketch after this list.

  2. Verify dependencies

    • Linux/macOS: make sure cmake and g++ are installed.
    • Windows: use WSL or the precompiled binaries.
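The conversion script is not part of the pip package, so you also need a local checkout of llama.cpp. A minimal setup sketch (the repository URL and file names reflect the current llama.cpp layout; adjust them to your version):

bash

# Clone llama.cpp, which contains convert_hf_to_gguf.py at the repository root
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Install the Python dependencies used by the conversion script (torch, transformers, gguf, ...)
pip install -r requirements.txt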

2. Download the Hugging Face Model

Using llama-2-7b-chat as an example:

python

from huggingface_hub import snapshot_download

model_path = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="llama-2-7b-chat-hf"
)

Make sure the weight files (e.g., pytorch_model.bin or .safetensors shards) exist in the local directory.
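Note that the meta-llama repositories are gated: you must accept the license on Hugging Face and authenticate before snapshot_download will succeed. One way to do this, assuming the huggingface_hub CLI is installed:

bash

# Log in once with an access token from https://huggingface.co/settings/tokens;
# alternatively, pass token="hf_..." directly to snapshot_download()
huggingface-cli login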

3. Convert to GGUF Format

Run the conversion script from the root of your llama.cpp checkout (convert_hf_to_gguf.py in current releases; older releases shipped it as convert.py). --outfile sets the output file name and --outtype selects how the weights are stored at this stage (f32, f16, bf16, or q8_0); a --vocab-only flag is also available if you only need the tokenizer.

bash

# The positional argument is the path to the downloaded Hugging Face model directory
python convert_hf_to_gguf.py ./llama-2-7b-chat-hf \
  --outfile ./llama-2-7b-chat-f16.gguf \
  --outtype f16
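The K-quant formats listed below (q4_K_M, q5_K_M, ...) are produced in a second pass rather than by the conversion script itself. A minimal sketch, assuming you have built the llama.cpp binaries so that llama-quantize is available (older builds name it quantize):

bash

# Re-quantize the f16 GGUF down to 4-bit; the last argument selects the target format
./llama-quantize ./llama-2-7b-chat-f16.gguf ./llama-2-7b-chat-q4_K_M.gguf Q4_K_M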

Quantization Options

| Format | Description |
| --- | --- |
| q4_K_M | 4-bit quantization (recommended balance of speed and accuracy) |
| q5_K_M | 5-bit quantization (higher accuracy, larger file) |
| f16 | No quantization (requires ≥14 GB VRAM) |
| q8_0 | 8-bit quantization (suitable for older hardware) |

4. Verify the GGUF Model

  1. Check the file size

    • 7B model: 14 GB (original) → ~3.5 GB (q4_K_M)
    • 13B model: 26 GB (original) → ~6.5 GB (q4_K_M)

  2. Test inference (a command-line alternative is sketched after this list)

    python

    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-2-7b-chat-q4_K_M.gguf",
        n_ctx=2048  # context length
    )
    # The call returns a completion dict; the generated text is in response["choices"][0]["text"]
    response = llm("Write a Python function to calculate Fibonacci numbers:")
    print(response)
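If you also built the llama.cpp command-line tools, you can smoke-test the file without Python (the binary is llama-cli in current builds; older builds call it main):

bash

# Generate up to 64 tokens from a short prompt using the quantized model
./llama-cli -m ./llama-2-7b-chat-q4_K_M.gguf -p "Hello, how are you?" -n 64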
    

5. Advanced Tips

  1. Converting multi-modal models
    Vision-language models (e.g., DeepSeek-VL) can only be converted if llama.cpp supports their architecture, so check the llama.cpp repository (docs and issues) before attempting it. Supported vision models are typically exported as two GGUF files: the language model plus a separate multimodal projector (mmproj). Sharded checkpoints (multiple .bin/.safetensors files) are handled automatically by convert_hf_to_gguf.py.

  2. Speeding up the process
    The conversion itself is mostly disk-bound, but the quantization step is multi-threaded: llama-quantize accepts an optional thread count as its final argument (see the sketch after this list).
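As a concrete sketch of the threading note above, reusing the file names from section 3 (the optional final argument of llama-quantize is the number of worker threads):

bash

# Quantize with 8 threads
./llama-quantize ./llama-2-7b-chat-f16.gguf ./llama-2-7b-chat-q4_K_M.gguf Q4_K_M 8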
    

6. Common Issues

  1. Out of memory during conversion

    • Pass --use-temp-file to convert_hf_to_gguf.py so tensors are streamed through a temporary file instead of being held in RAM.
    • Converting directly with --outtype q8_0 avoids writing a large f16 intermediate file.
  2. Unsupported model architecture

    • Make sure the model uses an architecture llama.cpp supports (LLaMA, Mistral, etc.).
    • Update your llama.cpp checkout and llama-cpp-python to the latest versions.
  3. Quantization accuracy loss

    • Try a higher-precision format (e.g., q5_K_M).
    • Measure the degradation with llama.cpp's llama-perplexity tool (see the sketch after this list).
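A minimal sketch of the perplexity check mentioned above, assuming the llama.cpp binaries are built; wikitext-2's wiki.test.raw is a commonly used evaluation text, but any plain-text file works:

bash

# Lower perplexity is better; run the same command on the q4_K_M and q5_K_M files and compare
./llama-perplexity -m ./llama-2-7b-chat-q4_K_M.gguf -f wiki.test.raw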

By following these steps, you can convert Hugging Face models to GGUF format and run them efficiently on low-resource devices. Full code examples are available in the GitHub repository.

Formatting notes

  • The post above is a translation of the original blog; code blocks are kept in English.
  • Markdown tables are used where they aid readability.
  • Technical terms (e.g., q4_K_M) are kept in their original form.
