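Hands-On Generative Models with a Smaller Memory Footprint and a Quantified Comparison of Inference Resource Usage, Part 2 - CSDN Blog

8.4.3 Quantifying inference resource usage of the generative model without a cache

After training finishes, we first run inference with the generative model without the cache enabled. This gives us a baseline and lets us check that the model produces sensible text. We can fix the generated sequence length at 48 and then inspect the text output. The code is as follows: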

import time

import torch

import config
import gpt2_cached

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

gpt2config = config.GPT2Config()
tokenizer = config.Tokenizer()

# Build the model without use_cache=True, then load the trained weights
model = gpt2_cached.GPT2().to(device)
model.load_state_dict(torch.load("./saver/model.pth", map_location=device), strict=False)
model.eval()

max_length = gpt2config.max_length  # number of tokens to generate per sequence

top_k = 5           # sample only from the 5 most likely tokens
temperature = 0.90  # values below 1 sharpen the distribution

start_time = time.time()
for _ in range(10):
    input_text = "酒店的位置"  # prompt: "the hotel's location"
    input_ids = torch.tensor([tokenizer.encode(input_text)]).long().to(device)

    for token_n in range(max_length):
        with torch.no_grad():
            # No cache: the whole sequence is fed to the model at every step
            next_token_logits, _ = model(input_ids)
            next_token_logits = next_token_logits[:, -1]

        # Apply the temperature to the logits before softmax; scaling the
        # probabilities afterwards is cancelled by the renormalization below
        probs = torch.nn.functional.softmax(next_token_logits / temperature, dim=-1)

        # Top-k filtering: zero out everything below the k-th largest probability
        values, _ = torch.topk(probs, k=top_k)
        probs[probs < values[:, -1, None]] = 0
        probs = probs / probs.sum(dim=1, keepdim=True)

        next_indices = torch.multinomial(probs, num_samples=1)

        input_ids = torch.cat([input_ids, next_indices], dim=1)

    input_ids = input_ids[0].cpu().numpy()
    text = tokenizer.decode(input_ids.tolist())
    text = text.split("<|end of sentence|>")[0]
    # print(text)
    # Memory currently held by tensors on the GPU, in bytes
    allocated_memory = torch.cuda.memory_allocated()
    # Prints "GPU memory occupied by tensors on the current device: N bytes"
    print(f'当前设备上张量所占用的GPU内存: {allocated_memory} 字节')
end_time = time.time()
# Prints the elapsed time in seconds
print("花费的时间为：", end_time - start_time)
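
The sampling step inside the generation loop can be looked at in isolation. The following is a minimal pure-Python sketch of temperature scaling followed by top-k filtering; the helper names (`softmax`, `top_k_probs`, `sample`) and the example logits are hypothetical and are not part of the original script. It also checks that multiplying already-normalized probabilities by a constant has no effect once they are renormalized, which is why temperature is conventionally applied to the logits.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_probs(logits, k, temperature=1.0):
    """Scale logits by 1/temperature, keep the k largest probabilities, renormalize."""
    probs = softmax([x / temperature for x in logits])
    threshold = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def sample(probs, rng):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for a 5-token vocabulary
logits = [2.0, 1.0, 0.5, -1.0, -3.0]

sharp = top_k_probs(logits, k=3, temperature=0.5)  # lower temperature: more peaked
flat = top_k_probs(logits, k=3, temperature=2.0)   # higher temperature: flatter
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(sample(sharp, random.Random(0)))

# Scaling probabilities after softmax is undone by renormalization
base = softmax(logits)
scaled = [p * 0.9 for p in base]
renormalized = [p / sum(scaled) for p in scaled]
print(all(math.isclose(a, b) for a, b in zip(renormalized, base)))
```

A lower temperature concentrates more probability on the most likely token, while the two tokens outside the top 3 always receive probability 0.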

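Readers can run the inference script themselves to inspect the generated text. Next, we look at the inference cost after increasing the generation length: with a much longer target length, the resource usage of the two models becomes easier to compare.

Here we set the generation length to 768 in the configuration class: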

class GPT2Config:
    hidden_size = 384
    vocab_size = 4000
    num_attention_heads = 6
    assert hidden_size % num_attention_heads == 0, 'hidden_size must be divisible by num_head'
    intermediate_size = hidden_size * 4
    dropout = 0.1

    layer_norm_eps = 1e-12
    n_layers = 6

    is_cause = True
    device = "cuda"

    max_length = 768

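Running the inference script with this configuration gives the following output (the first label reads "GPU memory occupied by tensors on the current device", in bytes; the last line is the elapsed time in seconds):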

当前设备上张量所占用的GPU内存: 74084352 字节
当前设备上张量所占用的GPU内存: 74084352 字节
当前设备上张量所占用的GPU内存: 74084352 字节
花费的时间为： 52.08263564109802

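With this configuration, the tensors on the GPU occupy 74084352 bytes, and the value is identical in each of the three iterations shown. This corresponds to roughly 70.7 MiB, or about 0.07 GB, of GPU memory. With the generation length set to 768, the whole run takes 52.08263564109802 seconds, or about 52 seconds.

Note that these numbers depend on the hardware, so readers will see different values on their own machines and should adjust the settings accordingly.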
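
The byte count in the log is easier to read after a unit conversion. A minimal sketch follows; `describe_bytes` is a hypothetical helper name, and the input is the value printed in the run above.

```python
def describe_bytes(n_bytes):
    """Return the size in MiB (2**20 bytes) and in GB (10**9 bytes)."""
    return n_bytes / 2**20, n_bytes / 10**9

no_cache_bytes = 74084352  # value printed by the run without a cache
mib, gb = describe_bytes(no_cache_bytes)
print(f"{no_cache_bytes} bytes = {mib:.2f} MiB = {gb:.3f} GB")
# prints: 74084352 bytes = 70.65 MiB = 0.074 GB
```

Keep in mind that `torch.cuda.memory_allocated()` reports the memory held by tensors at the moment it is called, here after a sequence has been fully generated. To capture the peak during generation, `torch.cuda.max_memory_allocated()` could be read instead.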

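8.4.4 Quantifying inference resource usage of the generative model with a cache

Next, we use the same generation length to measure the inference resource usage of the model with the cache enabled. Readers can first generate short text and compare the output quality, and then use long text to measure resource usage. As before, we use 768 as the generation length. The script for the cached model is as follows: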

import time

import torch

import config
import gpt2_cached

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

gpt2config = config.GPT2Config()
tokenizer = config.Tokenizer()

# Same model and weights as before, but with the KV cache enabled
model = gpt2_cached.GPT2(use_cache=True).to(device)
model.load_state_dict(torch.load("./saver/model.pth", map_location=device), strict=False)
model.eval()

max_length = gpt2config.max_length  # number of tokens to generate per sequence

top_k = 5           # sample only from the 5 most likely tokens
temperature = 0.90  # values below 1 sharpen the distribution

start_time = time.time()
for _ in range(10):
    model.reset_kv_cache()  # start every sequence with an empty cache
    input_text = "酒店的位置"  # prompt: "the hotel's location"
    input_ids = torch.tensor([tokenizer.encode(input_text)]).long().to(device)

    for token_n in range(max_length):
        with torch.no_grad():
            # The full sequence is still passed in, as in the original script;
            # how much of it is recomputed depends on the gpt2_cached implementation
            next_token_logits, _ = model(input_ids)
            next_token_logits = next_token_logits[:, -1]

        # Apply the temperature to the logits before softmax; scaling the
        # probabilities afterwards is cancelled by the renormalization below
        probs = torch.nn.functional.softmax(next_token_logits / temperature, dim=-1)

        # Top-k filtering: zero out everything below the k-th largest probability
        values, _ = torch.topk(probs, k=top_k)
        probs[probs < values[:, -1, None]] = 0
        probs = probs / probs.sum(dim=1, keepdim=True)

        next_indices = torch.multinomial(probs, num_samples=1)

        input_ids = torch.cat([input_ids, next_indices], dim=1)

    input_ids = input_ids[0].cpu().numpy()
    text = tokenizer.decode(input_ids.tolist())
    text = text.split("<|end of sentence|>")[0]
    # print(text)
    # Memory currently held by tensors on the GPU, in bytes
    allocated_memory = torch.cuda.memory_allocated()
    # Prints "GPU memory occupied by tensors on the current device: N bytes"
    print(f'当前设备上张量所占用的GPU内存: {allocated_memory} 字节')
end_time = time.time()
# Prints the elapsed time in seconds
print("花费的时间为：", end_time - start_time)
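
To see why a key/value cache can save work in a generation loop like this one, it helps to count how many token positions need fresh key/value projections. The sketch below is a simplified cost model and not the implementation of `gpt2_cached`: it assumes that without a cache every step re-projects the whole sequence, and that with a cache only the prompt (once) and each newly sampled token are projected. The function name `tokens_processed` and the prompt length of 5 tokens are assumptions made for illustration.

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions whose keys/values are computed over a whole generation."""
    total = 0
    seq_len = prompt_len
    for step in range(new_tokens):
        if use_cache and step > 0:
            total += 1        # only the newest token is projected
        else:
            total += seq_len  # the whole current sequence is projected
        seq_len += 1          # one sampled token is appended per step
    return total

prompt_len = 5    # assumption: the prompt is encoded as 5 tokens
new_tokens = 768  # max_length from the configuration

print(tokens_processed(prompt_len, new_tokens, use_cache=False))  # prints 298368
print(tokens_processed(prompt_len, new_tokens, use_cache=True))   # prints 772
```

The count grows quadratically with the generation length without a cache and linearly with one. Measured wall-clock time will not shrink by the same ratio, since it also includes work that this count does not cover.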

Running the cached inference script shows a different resource profile. It prints the following:

当前设备上张量所占用的GPU内存: 73377280 字节
当前设备上张量所占用的GPU内存: 73377280 字节
当前设备上张量所占用的GPU内存: 73377280 字节
花费的时间为： 40.05440592765808

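For the same task, the tensors on the GPU occupy 73377280 bytes (about 70.0 MiB), and this value is again identical in each of the three iterations shown. The elapsed time drops to 40.05440592765808 seconds. In other words, generating text of the same length takes about 40 seconds instead of 52, a reduction of roughly 23%.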
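
The two runs can be compared directly from the logged numbers. A small sketch of the arithmetic, using only the values printed above (the variable names are chosen here for illustration):

```python
t_no_cache = 52.08263564109802  # seconds, run without a cache
t_cache = 40.05440592765808     # seconds, run with a cache

mem_no_cache = 74084352         # bytes, run without a cache
mem_cache = 73377280            # bytes, run with a cache

speedup = t_no_cache / t_cache
time_saved = 1 - t_cache / t_no_cache
mem_diff = mem_no_cache - mem_cache

print(f"speedup: {speedup:.2f}x, time saved: {time_saved:.1%}")
# prints: speedup: 1.30x, time saved: 23.1%
print(f"memory difference: {mem_diff} bytes")
# prints: memory difference: 707072 bytes
```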

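8.4.5 Using half precision for model inference

Besides using the KV cache for inference, we can also run the model in half precision, that is, lower the numerical precision of the model while keeping the output as close as possible to the original. The code is as follows: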

model = gpt2_cached.GPT2(use_cache=True).half().to(device)

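As shown, adding a single call to the .half() method when the model is created is enough to switch it to half precision. Each value is then stored in 2 bytes (float16) instead of 4 (float32), so the memory taken by the weights and by the cache should drop substantially. Readers are encouraged to try this and measure the effect themselves.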
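
The effect of half precision on the cache can be estimated with simple arithmetic. The sketch below assumes a cache layout in which each layer stores one key vector and one value vector of size `hidden_size` per token, with batch size 1; the actual layout inside `gpt2_cached` may differ. `kv_cache_bytes` is a hypothetical helper, and the layer count, hidden size, and length are taken from the GPT2Config shown earlier.

```python
def kv_cache_bytes(n_layers, seq_len, hidden_size, bytes_per_value, batch_size=1):
    """Estimated size of a cache with one key and one value vector per token per layer."""
    return 2 * n_layers * batch_size * seq_len * hidden_size * bytes_per_value

# Values from GPT2Config: n_layers = 6, hidden_size = 384, max_length = 768
fp32 = kv_cache_bytes(n_layers=6, seq_len=768, hidden_size=384, bytes_per_value=4)
fp16 = kv_cache_bytes(n_layers=6, seq_len=768, hidden_size=384, bytes_per_value=2)

print(fp32)  # prints 14155776
print(fp16)  # prints 7077888
```

Under these assumptions the cache for 768 positions shrinks from about 13.5 MiB to about 6.75 MiB. The model weights are halved in the same way, since `.half()` converts the floating-point parameters to float16.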