Translation: "Converting a Hugging Face Model to a GGUF Model" (converting a native Hugging Face model to GGUF format)

Background: while deploying a vision model, deepseek-vl could not be found in LM Studio.

Issue: https://github.com/ggml-org/llama.cpp/discussions/2948

Solution: https://www.substratus.ai/blog/converting-hf-model-gguf-model/

Translation of the blog post "Converting a Hugging Face Model to a GGUF Model":

Converting a Hugging Face Model to a GGUF Model

GGUF is a lightweight model file format optimized for LLM inference, developed within the llama.cpp ecosystem. Together with quantization it significantly reduces model size while maintaining high inference speed. This guide demonstrates how to convert a Hugging Face model to GGUF format.

1. Prepare the Tools

  1. Install the tooling

    bash

    pip install llama-cpp-python

    llama-cpp-python provides the Python bindings and inference engine used later to run GGUF models. The conversion script itself (convert_hf_to_gguf.py, formerly convert.py) ships with the llama.cpp repository rather than with this package; see the setup sketch after this list.

  2. Verify dependencies

    • Linux/macOS: make sure cmake and g++ are installed.
    • Windows: use WSL or the precompiled binaries.
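The conversion script is not part of the pip package, so you also need a local checkout of llama.cpp. A minimal setup sketch (the repository URL and file names reflect the current llama.cpp layout; adjust them to your version):

bash

# Clone llama.cpp, which contains convert_hf_to_gguf.py at the repository root
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Install the Python dependencies used by the conversion script (torch, transformers, gguf, ...)
pip install -r requirements.txt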

2. Download the Hugging Face Model

Using llama-2-7b-chat as an example:

python

from huggingface_hub import snapshot_download

model_path = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="llama-2-7b-chat-hf"
)

Make sure the weight files (e.g., pytorch_model.bin or .safetensors shards) exist in the local directory.
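Note that the meta-llama repositories are gated: you must accept the license on Hugging Face and authenticate before snapshot_download will succeed. One way to do this, assuming the huggingface_hub CLI is installed:

bash

# Log in once with an access token from https://huggingface.co/settings/tokens;
# alternatively, pass token="hf_..." directly to snapshot_download()
huggingface-cli login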

3. Convert to GGUF Format

Run the conversion script from the root of your llama.cpp checkout (convert_hf_to_gguf.py in current releases; older releases shipped it as convert.py). --outfile sets the output file name and --outtype selects how the weights are stored at this stage (f32, f16, bf16, or q8_0); a --vocab-only flag is also available if you only need the tokenizer.

bash

# The positional argument is the path to the downloaded Hugging Face model directory
python convert_hf_to_gguf.py ./llama-2-7b-chat-hf \
  --outfile ./llama-2-7b-chat-f16.gguf \
  --outtype f16
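The K-quant formats listed below (q4_K_M, q5_K_M, ...) are produced in a second pass rather than by the conversion script itself. A minimal sketch, assuming you have built the llama.cpp binaries so that llama-quantize is available (older builds name it quantize):

bash

# Re-quantize the f16 GGUF down to 4-bit; the last argument selects the target format
./llama-quantize ./llama-2-7b-chat-f16.gguf ./llama-2-7b-chat-q4_K_M.gguf Q4_K_M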

Quantization Options

| Format | Description |
| --- | --- |
| q4_K_M | 4-bit quantization (recommended balance of speed and accuracy) |
| q5_K_M | 5-bit quantization (higher accuracy, larger file) |
| f16 | No quantization (requires ≥14 GB VRAM) |
| q8_0 | 8-bit quantization (suitable for older hardware) |

4. Verify the GGUF Model

  1. Check the file size

    • 7B model: 14 GB (original) → ~3.5 GB (q4_K_M)
    • 13B model: 26 GB (original) → ~6.5 GB (q4_K_M)

  2. Test inference (a command-line alternative is sketched after this list)

    python

    from llama_cpp import Llama

    llm = Llama(
        model_path="./llama-2-7b-chat-q4_K_M.gguf",
        n_ctx=2048  # context length
    )
    # The call returns a completion dict; the generated text is in response["choices"][0]["text"]
    response = llm("Write a Python function to calculate Fibonacci numbers:")
    print(response)
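If you also built the llama.cpp command-line tools, you can smoke-test the file without Python (the binary is llama-cli in current builds; older builds call it main):

bash

# Generate up to 64 tokens from a short prompt using the quantized model
./llama-cli -m ./llama-2-7b-chat-q4_K_M.gguf -p "Hello, how are you?" -n 64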
    

5. Advanced Tips

  1. Converting multi-modal models
    Vision-language models (e.g., DeepSeek-VL) can only be converted if llama.cpp supports their architecture, so check the llama.cpp repository (docs and issues) before attempting it. Supported vision models are typically exported as two GGUF files: the language model plus a separate multimodal projector (mmproj). Sharded checkpoints (multiple .bin/.safetensors files) are handled automatically by convert_hf_to_gguf.py.

  2. Speeding up the process
    The conversion itself is mostly disk-bound, but the quantization step is multi-threaded: llama-quantize accepts an optional thread count as its final argument (see the sketch after this list).
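As a concrete sketch of the threading note above, reusing the file names from section 3 (the optional final argument of llama-quantize is the number of worker threads):

bash

# Quantize with 8 threads
./llama-quantize ./llama-2-7b-chat-f16.gguf ./llama-2-7b-chat-q4_K_M.gguf Q4_K_M 8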
    

6. Common Issues

  1. Out of memory during conversion

    • Pass --use-temp-file to convert_hf_to_gguf.py so tensors are streamed through a temporary file instead of being held in RAM.
    • Converting directly with --outtype q8_0 avoids writing a large f16 intermediate file.
  2. Unsupported model architecture

    • Make sure the model uses an architecture llama.cpp supports (LLaMA, Mistral, etc.).
    • Update your llama.cpp checkout and llama-cpp-python to the latest versions.
  3. Quantization accuracy loss

    • Try a higher-precision format (e.g., q5_K_M).
    • Measure the degradation with llama.cpp's llama-perplexity tool (see the sketch after this list).
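A minimal sketch of the perplexity check mentioned above, assuming the llama.cpp binaries are built; wikitext-2's wiki.test.raw is a commonly used evaluation text, but any plain-text file works:

bash

# Lower perplexity is better; run the same command on the q4_K_M and q5_K_M files and compare
./llama-perplexity -m ./llama-2-7b-chat-q4_K_M.gguf -f wiki.test.raw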

By following these steps, you can convert Hugging Face models to GGUF format and run them efficiently on low-resource devices. Full code examples are available in the GitHub repository.

Formatting notes

  • The post above is a translation of the original blog; code blocks are kept in English.
  • Markdown tables are used where they aid readability.
  • Technical terms (e.g., q4_K_M) are kept in their original form.
