[LLM] Fine-tuning Llama-3.1 8B with Unsloth: a code walkthrough

酒酿小圆子～

Last modified 2024-08-06 16:31:08

License: CC 4.0 BY-SA

Category: LLMs. Tags: artificial intelligence, language models

First published 2024-08-06 14:34:58

Article link: https://blog.csdn.net/u012856866/article/details/140955316

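Contents

1. Loading the model and tokenizer
2. LoRA adapter
3. Data preparation
4. Training the model
5. Model inference
- 5.1 Direct inference
- 5.2 Inference with TextStreamer
6. Saving/loading the LoRA model
- 6.1 Saving the LoRA adapter
- 6.2 Loading the LoRA adapter
7. Saving to float16 for VLLM
8. GGUF / llama.cpp Conversion
References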

Unsloth is an open-source project for accelerating large-model training. It rewrites the model's compute kernels in OpenAI's Triton, which substantially speeds up training and lowers GPU memory use during training.

Unsloth GitHub project: https://github.com/unslothai/unsloth
Official Colab notebook with the source code for fine-tuning Llama-3.1 8B with Unsloth: https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=2eSvM9zX_2d3
For how to install Unsloth, see the blog post: [LLM] Unsloth installation and usage tutorial

1. Loading the model and tokenizer

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",          # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

Output:

[Screenshot of the loading output in the original post]

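[Code walkthrough]:
(1) The code loads the model and the tokenizer in a single call to Unsloth's FastLanguageModel.from_pretrained(). With load_in_4bit = True the weights are loaded 4-bit quantized, which, per the comment in the code above, speeds up downloading and avoids out-of-memory errors: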

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
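
The `dtype = None` setting leaves the choice of precision to Unsloth. The rule stated in the code comment (float16 on Tesla T4 and V100, bfloat16 on Ampere and newer) can be sketched roughly as below. This is only an illustration: `pick_dtype` is a hypothetical helper, not an Unsloth function, and it assumes the GPU generation is identified by its CUDA compute capability major version (7 for V100 and T4, 8 or higher for Ampere and later). Unsloth's actual detection logic may differ.

```python
def pick_dtype(compute_capability_major: int) -> str:
    """Hypothetical helper: map a CUDA compute capability major version to a dtype name."""
    # Ampere (8.x) and newer GPUs support bfloat16 natively.
    if compute_capability_major >= 8:
        return "bfloat16"
    # Older GPUs such as V100 (7.0) and T4 (7.5) fall back to float16.
    return "float16"


# Compute capability major versions for a few example GPUs.
choices = {name: pick_dtype(major) for name, major in [("V100", 7), ("T4", 7), ("A100", 8)]}
print(choices)
```

On a machine with PyTorch and a CUDA GPU, the major version could be read with `torch.cuda.get_device_capability()[0]`.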

(2) For comparison, here is the conventional way to load the model and tokenizer with Hugging Face transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = './model/llama-3-8b'   # local path to the model
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
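
To see why `load_in_4bit` matters for memory, a back-of-the-envelope estimate of weight storage at different precisions may help. The sketch assumes roughly 8.03 billion parameters for Llama-3.1 8B and counts weights only. It ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so real usage will be higher. `weight_gib` and `NUM_PARAMS` are names made up for this example.

```python
def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 8.03e9  # assumed approximate parameter count of Llama-3.1 8B

for label, bits in [("4-bit", 4), ("16-bit (fp16/bf16)", 16), ("32-bit (fp32)", 32)]:
    print(f"{label:>20}: {weight_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights come to under 4 GiB, against about 15 GiB at 16-bit, which is why the 4-bit setting is attractive on smaller GPUs.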

2. LoRA adapter

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
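
To get a feel for how many parameters the adapters above actually train, the count can be worked out by hand. For a linear layer with input size d_in and output size d_out, LoRA of rank r adds two matrices, A (r x d_in) and B (d_out x r), that is r * (d_in + d_out) parameters. The sketch below assumes the published Llama-3.1 8B shapes (hidden size 4096, MLP intermediate size 14336, 32 decoder layers, key/value projection width 1024 from 8 KV heads of dimension 128) and about 8.03 billion base parameters. None of these numbers are stated in this article, so treat them as assumptions.

```python
R = 16                 # LoRA rank, as passed to get_peft_model above
LORA_ALPHA = 16        # as passed to get_peft_model above
HIDDEN = 4096          # assumed hidden size of Llama-3.1 8B
KV_DIM = 1024          # assumed k_proj/v_proj output width (8 KV heads x 128)
INTERMEDIATE = 14336   # assumed MLP intermediate size
NUM_LAYERS = 32        # assumed number of decoder layers
BASE_PARAMS = 8.03e9   # assumed approximate size of the base model

# (d_in, d_out) of each targeted linear layer inside one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # standard LoRA scales the adapter update by alpha / r

print(f"trainable LoRA parameters: {total:,}")
print(f"share of base model: {100 * total / BASE_PARAMS:.2f}%")
print(f"adapter scaling factor: {scaling}")
```

With these assumed shapes the adapters add about 42 million trainable parameters, roughly 0.5% of the base model, and the adapter update is scaled by lora_alpha / r = 1.0.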