Hugging Face notes: AutoTokenizer, AutoClass, AutoModel

Last modified: 2025-07-09 18:01:06

License: CC 4.0 BY-SA

First published: 2024-05-13 10:12:47

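An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to pick the appropriate AutoClass for your task, along with its associated preprocessing class.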
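To make the idea of "retrieving the architecture from a name or path" concrete, here is a minimal plain-Python sketch of that kind of lookup. It is not the Transformers implementation: `TOKENIZER_REGISTRY`, `resolve_tokenizer_class`, and the config dictionary are hypothetical names made up for illustration. The only thing assumed from the library is that a checkpoint's configuration records a `model_type` string.

```python
# Simplified illustration of "auto" dispatch: pick a class name from the
# model_type stored in a checkpoint's configuration.
# TOKENIZER_REGISTRY and resolve_tokenizer_class are hypothetical names.
TOKENIZER_REGISTRY = {
    "bert": "BertTokenizer",
    "gpt2": "GPT2Tokenizer",
    "t5": "T5Tokenizer",
}


def resolve_tokenizer_class(config: dict) -> str:
    """Return the tokenizer class name registered for config['model_type']."""
    model_type = config.get("model_type")
    if model_type not in TOKENIZER_REGISTRY:
        raise ValueError(f"Unrecognized model_type: {model_type!r}")
    return TOKENIZER_REGISTRY[model_type]


# A stand-in for the configuration that ships with a BERT checkpoint.
bert_config = {"model_type": "bert"}
resolved = resolve_tokenizer_class(bert_config)
print(resolved)
```

The point of the design is that the caller never names the concrete class; the checkpoint's own metadata decides it.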

1 AutoTokenizer

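A tokenizer preprocesses text into the arrays of numbers that a model takes as input. Several rules govern tokenization, including how words are split and at what level they are split.
- There are many tokenization algorithms, but they share one goal: split text into smaller words or subwords (tokens) according to some rules, then convert them to numbers (input ids).
- Transformers tokenizers also return an attention mask, which indicates which tokens the model should attend to.
Instantiate the tokenizer with the same model name as the model, so that the tokenization rules match those used during pretraining.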
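The pipeline described above (text, then tokens, then input ids plus an attention mask) can be sketched with a toy whitespace tokenizer. Everything here is an assumption for illustration: `TOY_VOCAB` and `toy_encode` are hypothetical, and a real tokenizer uses a learned subword vocabulary rather than a seven-entry dictionary.

```python
# Toy whitespace tokenizer: text -> tokens -> input ids + attention mask.
# TOY_VOCAB and toy_encode are hypothetical; real tokenizers use learned
# subword vocabularies with tens of thousands of entries.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "we": 4, "are": 5, "happy": 6,
}


def toy_encode(text: str) -> dict:
    """Lowercase, split on whitespace, wrap in [CLS]/[SEP], map to ids."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # Words missing from the vocabulary fall back to the [UNK] id.
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # Every real (non-padding) token gets attention mask 1.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


toy_encoding = toy_encode("We are very happy")
print(toy_encoding)
```

Because "very" is not in the toy vocabulary it maps to the `[UNK]` id, which is also why a tokenizer must be paired with the vocabulary its model was pretrained on.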

1.1 Loading a tokenizer with AutoTokenizer

from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
encoding
'''
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
'''

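input_ids are the vocabulary indices of the tokens in the sentence.
attention_mask indicates whether a token should be attended to.
token_type_ids identify which sequence a token belongs to when there is more than one sequence.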
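The single-sentence example above only ever shows token_type_ids of 0, so here is a small sketch of how they look for a sentence pair. It assumes the BERT-style layout `[CLS] A [SEP] B [SEP]` and reuses ids from the outputs in this post (101 for `[CLS]`, 102 for `[SEP]`); `build_pair_inputs` is a hypothetical helper, not a library function.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble a BERT-style pair: [CLS] A [SEP] B [SEP]."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


# "we are happy" and "we hope", using ids seen in this post's outputs.
pair = build_pair_inputs([11312, 10320, 19308], [11312, 18763])
print(pair)
```

With the real tokenizer, passing two texts as `tokenizer(text_a, text_b)` should produce the same kind of layout for a BERT checkpoint.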

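1.2 The tokenizer accepts a list of inputs

The tokenizer also accepts a list of inputs and can pad and truncate the text, returning a batch of uniform length.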

tokenizer(
    ["We are very happy to show you the Transformers library.",
     "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
# "pt" here means the result is returned as PyTorch tensors
'''
{'input_ids': tensor([[  101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 58263,
         13299,   119,   102],
        [  101, 11312, 18763, 10855, 11530,   112,   162, 39487, 10197,   119,
           102,     0,     0]]), 
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
'''
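In the call above neither sentence comes close to 512 tokens, so `truncation=True` changes nothing. To show what truncation does with a tighter limit, here is a minimal sketch. It assumes a BERT-style single sequence, truncation from the right, and that `[CLS]` and `[SEP]` count toward `max_length`; `truncate_single` is a hypothetical helper.

```python
def truncate_single(content_ids, max_length, cls_id=101, sep_id=102):
    """Keep at most max_length ids, counting [CLS] and [SEP]."""
    budget = max_length - 2  # two slots are reserved for [CLS] and [SEP]
    if budget < 0:
        raise ValueError("max_length must be at least 2")
    # Drop tokens from the right end of the content, then add specials.
    return [cls_id] + content_ids[:budget] + [sep_id]


# Content ids of the first sentence above, without [CLS]/[SEP].
content = [11312, 10320, 12495, 19308, 10114, 11391,
           10855, 10103, 58263, 13299, 119]

truncated = truncate_single(content, max_length=8)
untouched = truncate_single(content, max_length=512)
print(truncated)
print(len(untouched))
```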

1.3 Recovering the input with decode

tokenizer.decode(encoding["input_ids"])
#'[CLS] we are very happy to show you the [UNK] transformers library. [SEP]'

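1.3.1 decode and batch_decode

Shared parameters:

skip_special_tokens (bool, default False):
- Whether to skip special tokens (such as <pad> or <eos>) when decoding.
clean_up_tokenization_spaces (bool, optional):
- Whether to clean up extra spaces, for example turning "Hello , world !" into "Hello, world!".

batch_decode takes a list of token id sequences.

decode takes a single sequence of token ids (one sentence).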
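The difference between the two methods, and the effect of the two shared parameters, can be imitated in a few lines. This is a sketch, not the library's decoding logic: `ID_TO_TOKEN` is a tiny mapping inferred from the outputs shown earlier in this post, `toy_decode` and `toy_batch_decode` are hypothetical names, and subword merging is ignored.

```python
# Tiny id -> token table inferred from the outputs shown in this post.
ID_TO_TOKEN = {
    0: "[PAD]", 100: "[UNK]", 101: "[CLS]", 102: "[SEP]",
    11312: "we", 10320: "are", 12495: "very", 19308: "happy", 119: ".",
}
SPECIAL_IDS = {0, 100, 101, 102}


def toy_decode(ids, skip_special_tokens=False,
               clean_up_tokenization_spaces=True):
    """Decode ONE sequence of token ids into a string."""
    if skip_special_tokens:
        ids = [i for i in ids if i not in SPECIAL_IDS]
    text = " ".join(ID_TO_TOKEN[i] for i in ids)
    if clean_up_tokenization_spaces:
        # Remove the space that joining leaves before punctuation.
        for mark in (".", ",", "!", "?"):
            text = text.replace(" " + mark, mark)
    return text


def toy_batch_decode(batch, **kwargs):
    """Decode a LIST of sequences by decoding each one in turn."""
    return [toy_decode(ids, **kwargs) for ids in batch]


single = toy_decode([101, 11312, 10320, 19308, 119, 102])
skipped = toy_decode([101, 11312, 10320, 19308, 119, 102, 0, 0],
                     skip_special_tokens=True)
batch = toy_batch_decode(
    [[101, 11312, 10320, 19308, 119, 102],
     [101, 11312, 10320, 12495, 19308, 102, 0]],
    skip_special_tokens=True,
)
print(single)
print(skipped)
print(batch)
```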

1.4 pad

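Sentences are not always the same length, which is a problem because the tensors fed to a model need a uniform shape.
Padding is a strategy that makes the tensors rectangular by adding a special padding token to the shorter sentences.
Set the padding parameter to True to pad the shorter sequences in a batch to match the longest one:

Without padding (lengths differ):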

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences)
encoded_input
'''

{'input_ids': [[101, 10502, 11523, 10935, 10981, 61304, 136, 102], [101, 11530, 112, 162, 21506, 10191, 45864, 10935, 10981, 61304, 117, 16999, 10373, 119, 102], [101, 11523, 10935, 29577, 44682, 136, 102]], 
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]], 
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}
'''

With padding:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences,padding=True)
encoded_input
'''
{'input_ids': [[101, 10502, 11523, 10935, 10981, 61304, 136, 102, 0, 0, 0, 0, 0, 0, 0], [101, 11530, 112, 162, 21506, 10191, 45864, 10935, 10981, 61304, 117, 16999, 10373, 119, 102], [101, 11523, 10935, 29577, 44682, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
'''

The first and third sentences are now padded with 0 because they are shorter.
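As a cross-check of the output above, the following sketch pads the unpadded input_ids by hand. `pad_to_longest` is a hypothetical helper written for this post; it assumes padding on the right with pad id 0, which is what the printed output shows.

```python
def pad_to_longest(batch_ids, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The unpadded input_ids printed earlier in this section.
unpadded = [
    [101, 10502, 11523, 10935, 10981, 61304, 136, 102],
    [101, 11530, 112, 162, 21506, 10191, 45864, 10935, 10981, 61304,
     117, 16999, 10373, 119, 102],
    [101, 11523, 10935, 29577, 44682, 136, 102],
]

padded = pad_to_longest(unpadded)
print([len(ids) for ids in padded["input_ids"]])
print(padded["attention_mask"][0])
```

The attention mask is what lets the model tell the padding zeros apart from real tokens.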

1.4.1 Which side to pad on

By default, this tokenizer pads with 0 on the right.

What if we want to pad on the left instead?
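What "side" means for both the ids and the attention mask can be seen in a short sketch. `pad_one` is a hypothetical helper, not a Transformers function; it only illustrates where the pad ids land and how the mask follows them.

```python
def pad_one(ids, target_len, side="right", pad_id=0):
    """Pad one sequence to target_len on the chosen side."""
    n_pad = target_len - len(ids)
    if n_pad < 0:
        raise ValueError("sequence is longer than target_len")
    pads, real_mask, pad_mask = [pad_id] * n_pad, [1] * len(ids), [0] * n_pad
    if side == "right":
        return {"input_ids": ids + pads, "attention_mask": real_mask + pad_mask}
    if side == "left":
        return {"input_ids": pads + ids, "attention_mask": pad_mask + real_mask}
    raise ValueError("side must be 'right' or 'left'")


# Ids of "What about elevensies?" from the output above.
short = [101, 11523, 10935, 29577, 44682, 136, 102]

right_padded = pad_one(short, target_len=10, side="right")
left_padded = pad_one(short, target_len=10, side="left")
print(right_padded["input_ids"])
print(left_padded["input_ids"])
```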

from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name,