Extractive QA references:
https://juejin.cn/post/7180925054559977533
https://huggingface.co/learn/nlp-course/en/chapter7/7
Extractive QA: posing questions about a document and identifying the answers as spans of text in the document itself.
Common paradigm (for extractive summarization): a binary classification task over sentences (does this sentence belong to the summary?); the sentences predicted as "belonging" are concatenated to form the summary.
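A minimal sketch of this paradigm. score_sentence below is a hypothetical stand-in for a trained sentence classifier (e.g. a fine-tuned encoder with a classification head); the toy scoring rule and the 0.5 threshold are made up for illustration only.

# Sketch of "binary classification per sentence, then concatenate the positives".
def score_sentence(sentence: str) -> float:
    # Hypothetical classifier score in [0, 1]; here a toy length-based heuristic.
    return min(len(sentence.split()) / 20.0, 1.0)

def extractive_summary(sentences: list[str], threshold: float = 0.5) -> str:
    # Keep sentences predicted as "belongs to the summary", in original order.
    kept = [s for s in sentences if score_sentence(s) >= threshold]
    return " ".join(kept)

doc = [
    "Transformers rely on self-attention to model long-range dependencies in text.",
    "Nice weather today.",
    "Extractive summarization selects salient sentences directly from the source document.",
]
print(extractive_summary(doc))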
Encoder-only models like BERT tend to be good at extracting answers to factoid questions such as "Who invented the Transformer architecture?", but fare poorly on open-ended questions such as "Why is the sky blue?". In those more challenging cases, encoder-decoder models such as T5 and BART are typically used to synthesize the information, in a way similar to text summarization.
Example of a BERT model fine-tuned on the SQuAD dataset:
https://huggingface.co/huggingface-course/bert-finetuned-squad
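A minimal way to try this checkpoint is the question-answering pipeline; the question and context strings below are made up for illustration.

# Sketch: extractive QA with the checkpoint above via the pipeline API.
from transformers import pipeline

qa = pipeline("question-answering", model="huggingface-course/bert-finetuned-squad")

result = qa(
    question="Who is said to have appeared at Lourdes in 1858?",
    context="The grotto at Lourdes, France is where the Virgin Mary reputedly "
            "appeared to Saint Bernadette Soubirous in 1858.",
)
print(result)  # dict with 'score', 'start', 'end', 'answer'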
The SQuAD answer format used for evaluation contains two fields, each of which is a list:
Answer: {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
During evaluation, some questions have several acceptable answers; the evaluation script compares the predicted answer against all acceptable answers and keeps the highest score.
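A small sketch of that behavior with the evaluate library's "squad" metric; the example id and answer texts are made up. The reference lists two acceptable answers, and a prediction matching either one receives the maximum score.

# Compute SQuAD exact-match / F1, taking the best score over all acceptable answers.
import evaluate

squad_metric = evaluate.load("squad")

predictions = [{"id": "example-0", "prediction_text": "Saint Bernadette Soubirous"}]
references = [{
    "id": "example-0",
    "answers": {
        "text": ["Saint Bernadette Soubirous", "Bernadette Soubirous"],
        "answer_start": [515, 521],
    },
}]

print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}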
Classic papers on extractive summarization with deep-learning methods:
Friendly Topic Assistant for Transformer Based Abstractive Summarization
SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents
Extractive Summarization using Deep Learning
Neural Extractive Summarization with Side Information
Ranking Sentences for Extractive Summarization with Reinforcement Learning
Fine-tune BERT for Extractive Summarization
Extractive Summarization of Long Documents by Combining Global and Local Context
Extractive Summarization as Text Matching
Key model: Fine-tune BERT for Extractive Summarization (BertSum)
https://arxiv.org/abs/1903.10318
GitHub - nlpyang/BertSum: Code for paper Fine-tune BERT for Extractive Summarization
Sliding-window examples (for start/end span extraction)
The attention sliding window inside the model (Longformer)
Global attention can be configured via a parameter:
It is up to the user to decide which tokens attend "locally" and which attend "globally" by setting the tensor global_attention_mask appropriately at run time. All Longformer models use the following logic for global_attention_mask:
0: the token attends “locally”,
1: the token attends “globally”.
Longformer self-attention combines local (sliding-window) attention and global attention.
The documentation does not cover more sliding-window details; the innovation of this model likely lies in the attention window itself rather than sliding a window over whole passages, so passage-level windowing still needs to be implemented separately.
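A minimal sketch of how the global_attention_mask described above is passed to a Longformer model; the checkpoint name and input text are only illustrative.

# Pass a global_attention_mask to Longformer: 0 = local (sliding-window), 1 = global.
import torch
from transformers import AutoTokenizer, LongformerModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A long document ...", return_tensors="pt")

global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # e.g. let the first (<s>/[CLS]) token attend globally

outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
)
print(outputs.last_hidden_state.shape)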
Sliding window over the input (tokenizer level)
(see the Hugging Face course chapter linked above)
Long contexts are handled by creating several training features from a single sample of the dataset, with a sliding window between them.
To see how this works on the current example, we can limit the length to 100 tokens and use a sliding window of 50 tokens. As a reminder, we use:
max_length to set the maximum length (here 100)
truncation="only_second" to truncate the context (which sits in the second position) when the question plus its context is too long
stride to set the number of overlapping tokens between two successive chunks (here 50)
return_overflowing_tokens=True to let the tokenizer know we want the overflowing tokens
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basi [SEP]'
'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin [SEP]'
'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 [SEP]'
'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP]. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]'
To handle the case where a window contains only part of the answer: the dataset gives us the start character of the answer in the context, and by adding the length of the answer we find the end character. To map these to token indices we need the offset mappings studied in Chapter 6, which we can get the tokenizer to return by passing return_offsets_mapping=True:
inputs = tokenizer(
    question,
    context,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
inputs.keys()
This way, inputs will also contain the offset mappings:
dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])
The list inputs["overflow_to_sample_mapping"] indicates which original sample each window comes from.
For example, [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3] means samples 0-2 each produced 4 windows, while sample 3 produced 7 windows.
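In the course, a mapping like the example above is obtained by tokenizing several training samples at once (indices 2:6 of the SQuAD train split); roughly:

# Tokenize four training samples at once so overflow_to_sample_mapping shows
# which sample each windowed feature belongs to.
inputs = tokenizer(
    raw_datasets["train"][2:6]["question"],
    raw_datasets["train"][2:6]["context"],
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
print(f"The 4 examples gave {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")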
answers = raw_datasets["train"][2:6]["answers"]
start_positions = []
end_positions = []

for i, offset in enumerate(inputs["offset_mapping"]):
    sample_idx = inputs["overflow_to_sample_mapping"][i]
    answer = answers[sample_idx]
    start_char = answer["answer_start"][0]
    end_char = answer["answer_start"][0] + len(answer["text"][0])
    sequence_ids = inputs.sequence_ids(i)

    # Find the start and end of the context
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1

    # If the answer is not fully inside the context, label is (0, 0)
    if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
        start_positions.append(0)
        end_positions.append(0)
    else:
        # Otherwise it's the start and end token positions
        idx = context_start
        while idx <= context_end and offset[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)

        idx = context_end
        while idx >= context_start and offset[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)

start_positions, end_positions
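A quick sanity check in the spirit of the course: decode the tokens between the labeled start and end positions and compare them with the gold answer. Feature index 0 is used here for illustration, assuming that feature's window actually contains its answer.

# Decode the labeled token span back to text and compare it with the gold answer.
idx = 0
sample_idx = inputs["overflow_to_sample_mapping"][idx]
gold_answer = answers[sample_idx]["text"][0]

start = start_positions[idx]
end = end_positions[idx]
labeled_answer = tokenizer.decode(inputs["input_ids"][idx][start : end + 1])

print(f"Theoretical answer: {gold_answer}, labels give: {labeled_answer}")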