Switching to GPU execution
1. Check whether a GPU is available and set the device accordingly.
2. Move the model and the input tensors to the GPU with the `.to(device)` method.
3. Run the computation once the model and all relevant tensors are on the GPU.
```python
# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("run on", device)

# Move the model to the GPU
model.to(device)
```
My machine does not have CUDA set up, so I first tried running this on Colab.
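Step 2 applies to the input tensors as well as the model; a minimal sketch, where the `inputs` dict is a hypothetical stand-in for a real tokenizer output:

```python
import torch

# Pick the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical input batch; in the real pipeline this comes from the tokenizer
inputs = {
    "input_ids": torch.tensor([[0, 1, 2]]),
    "attention_mask": torch.tensor([[1, 1, 1]]),
}

# Move every input tensor to the same device as the model
inputs = {k: v.to(device) for k, v in inputs.items()}
print(inputs["input_ids"].device)
```

If the model is on the GPU but the inputs are not (or vice versa), the forward pass fails with a device-mismatch error, so both sides must be moved.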
Generating input token labels from text labels
A training sample from the earlier example looks like:
```
labels: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])
```
The source-code docstring states the expected format of the training labels:
```
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
    Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels - 1]``.
```
Constructing the token labels
1. Tokenize the training sentence.
2. Tokenize every label phrase of that sentence.
3. For each tokenized label, scan the tokenized training sentence; wherever the tokens match exactly, mark them as 1, and leave all other tokens as 0. If a sentence contains two label phrases, the sentence is scanned twice.
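The matching step alone can be sketched with plain Python lists; here whitespace splitting stands in for the real tokenizer, and all names are illustrative:

```python
def mark_label_spans(sentence_tokens, label_token_lists):
    """Return a 0/1 list marking every exact occurrence of each label token span."""
    labels = [0] * len(sentence_tokens)
    for label_tokens in label_token_lists:
        n = len(label_tokens)
        # Slide a window of the label's length over the sentence tokens
        for j in range(len(sentence_tokens) - n + 1):
            if sentence_tokens[j:j + n] == label_tokens:
                labels[j:j + n] = [1] * n
    return labels

tokens = "HuggingFace is a company based in Paris and New York .".split()
print(mark_label_spans(tokens, [["HuggingFace"], ["New", "York"]]))
# → [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
```

The real implementation below does the same thing, except that the spans come from the subword tokenizer rather than whitespace splitting.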
Example:

```python
sentences = ["HuggingFace is a company based in Paris and New York."]
labels_texts = [["HuggingFace", "York"]]
tokenized_sentence = ['ĠHug', 'ging', 'Face', 'Ġis', 'Ġa', 'Ġcompany', 'Ġbased', 'Ġin',
                      'ĠParis', 'Ġand', 'ĠNew', 'ĠYork', '.']
# labels_cls is tensor([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])
```
Code to generate the token labels:
```python
from transformers import AutoTokenizer
import torch

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("tmp/Longformer-finetuned-norm")

sentences = ["HuggingFace is a company based in Paris and New York.", "I am a little tiger."]
labels_texts = [["HuggingFace", "York"], ["tiger"]]

def tokenize_and_align_labels(sentences, labels_texts, tokenizer):
    tokenized_inputs = tokenizer(sentences, add_special_tokens=False,
                                 return_tensors="pt", padding=True, truncation=True)
    all_labels_cls = []
    for i, sentence in enumerate(sentences):
        labels_text = labels_texts[i]
        tokenized_sentence = tokenizer.convert_ids_to_tokens(tokenized_inputs["input_ids"][i])
        labels_cls = [0] * len(tokenized_sentence)
        for label_text in labels_text:
            tokenized_label = tokenizer.tokenize(label_text)
            label_length = len(tokenized_label)
            # Mark the span only where the sentence tokens exactly match the label tokens
            for j in range(len(tokenized_sentence) - label_length + 1):
                if tokenized_sentence[j:j + label_length] == tokenized_label:
                    labels_cls[j:j + label_length] = [1] * label_length
        all_labels_cls.append(labels_cls)
    return tokenized_inputs, torch.tensor(all_labels_cls)

inputs_id, labels_cls = tokenize_and_align_labels(sentences, labels_texts, tokenizer)
```
Text extracted from a PDF contains extra \n characters
Replace every \n with a space:
```python
paper_text = paper_text.replace('\n', ' ')
```
Removing the references section
```python
def remove_references(text):
    keywords = ["References", "REFERENCES"]
    for keyword in keywords:
        index = text.find(keyword)
        if index != -1:
            return text[:index].strip()
    return text

paper_text = remove_references(paper_text)
```
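A quick check of the behaviour on toy text (the sample string is illustrative only):

```python
def remove_references(text):
    # Cut the text at the first references heading that appears
    for keyword in ("References", "REFERENCES"):
        index = text.find(keyword)
        if index != -1:
            return text[:index].strip()
    return text

paper = "Intro. Methods. Results. References [1] Smith 2020."
print(remove_references(paper))  # → "Intro. Methods. Results."
```

One caveat: `find` returns the first occurrence, so a body sentence that merely contains the word "References" before the real section heading would truncate the text too early.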
One more problem: when I fed the paper samples in from JSON, the tokens came out at single-letter granularity. Comparing with the working case shows the input was missing an outer list:
```python
descriptions = [df['Dataset Description'].tolist()]
```
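Single-letter tokens are the classic symptom of handing the pipeline a bare string where a list of strings is expected, since iterating over a string yields characters; a minimal illustration:

```python
description = "A dataset of chest X-rays."

# Iterating over the string itself yields single characters
print(list(description)[:5])  # → ['A', ' ', 'd', 'a', 't']

# Wrapping it in a list yields whole strings, one per sample
print([description])          # → ['A dataset of chest X-rays.']
```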
Producing output text for multiple samples
Change the step that converts classification results to text into a loop over multiple samples.
The code that converts only a single element of `predicted_token_class_ids` looks like this:

```python
prediction_string = get_prediction_string(predicted_token_class_ids[0])
```
Now every element of `predicted_token_class_ids` should go through `get_prediction_string`, with the results collected in `prediction_strings`. A list comprehension handles this:
```python
# Call get_prediction_string on every element of predicted_token_class_ids
prediction_strings = [get_prediction_string(prediction, predicted_inputs_id)
                      for prediction, predicted_inputs_id in zip(predicted_token_class_ids, inputs["input_ids"])]
print("Prediction Strings:", prediction_strings)
```
```python
def get_prediction_string(predicted_token_class_id):
    predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_id]
    print("predicted_tokens_classes", predicted_tokens_classes)
```
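The snippet above only prints the class names. A self-contained sketch of the full idea, with toy stand-ins for `model.config.id2label` and the tokenized input (the label names and token values here are assumptions, not the real pipeline's):

```python
# Toy stand-ins for model.config.id2label and one tokenized sample
id2label = {0: "O", 1: "LABEL"}
tokens = ["Hug", "ging", "Face", "is", "a", "company"]
predicted_token_class_id = [1, 1, 1, 0, 0, 0]

def get_prediction_string(predicted_token_class_id, tokens):
    # Map each predicted class id to its name, then keep only the positive tokens
    predicted_tokens_classes = [id2label[t] for t in predicted_token_class_id]
    picked = [tok for tok, cls in zip(tokens, predicted_tokens_classes) if cls == "LABEL"]
    return " ".join(picked)

print(get_prediction_string(predicted_token_class_id, tokens))  # → "Hug ging Face"
```

In the real pipeline the second argument would be the sample's `input_ids`, decoded back to text with the tokenizer, which matches the two-argument call used in the list comprehension above.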