SST binary-classification fine-tuning data for natural language processing; see Hugging Face for the concrete training steps

11 files in total

txt: 8

tsv: 3

Text classification

NLP

Pretrained models

Uploaded 2023-02-14 10:47:55 · 7.09 MB ZIP

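Natural language processing (NLP) is a major branch of computer science concerned with enabling computers to understand, parse, and generate human language. Pretrained models have become a core tool in this field: they learn rich language representations by pretraining on large amounts of unlabeled text, and can then be fine-tuned on a specific task to improve performance on that task. SST (Stanford Sentiment Treebank) is a widely used sentiment analysis dataset. Its binary version, SST-2, is used to train and evaluate models that decide whether a piece of text expresses positive or negative sentiment.

The SST dataset was created by researchers at Stanford University. It contains sentences from movie reviews, and each sentence carries a sentiment score that maps to one of five classes, from very negative to very positive. To simplify the problem, it is usually converted to a binary task: very negative and negative count as negative, positive and very positive count as positive, and neutral examples are usually dropped. Researchers favor the dataset for its complexity and diversity, which make it suitable for checking how well a model handles different sentiment intensities and complex syntactic structures.

Pretrained models such as BERT, RoBERTa, ALBERT, and DistilBERT are all based on the Transformer architecture and are pretrained at scale with objectives such as masked language modeling (BERT additionally uses next sentence prediction). These models have already learned a great deal about the regularities of language. Fine-tuning starts from the pretrained model, adds one or more task-specific output layers for the target task (here, SST text classification), and trains the result on the SST data, typically updating the pretrained weights together with the new layers. This reuses the general language knowledge of the pretrained model while adapting it to the needs of the task.

The fine-tuning steps are roughly:

1. Prepare the data: split SST into training, validation, and test sets (this package already provides train.tsv, dev.tsv, and test.tsv).
2. Initialize the model: choose a pretrained model and load its pretrained weights.
3. Build the model: add a classification layer on top of the pretrained model, usually a fully connected layer that outputs class probabilities.
4. Train the model: train the whole model on the training set with backpropagation, adjusting all parameters, both the pretrained part and the new classification layer.
5. Evaluate the model: monitor performance on the validation set to guard against overfitting.
6. Tune: adjust hyperparameters such as learning rate and batch size according to validation performance.
7. Final test: evaluate the model's generalization on the unseen test set.

In practice you can use the Hugging Face Transformers library, which provides many pretrained models and convenient interfaces for loading, fine-tuning, and evaluating them. A short script is enough to load the SST data, build the model, and train it. For example, a basic SST fine-tuning flow with Hugging Face Transformers looks like this:

```python
import csv

from datasets import load_dataset
from transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    Trainer,
    TrainingArguments,
)

# Load the pretrained model and tokenizer (two labels: negative / positive)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load the dataset. Assumptions: the archive is unpacked to ./SST-2 and the TSV
# files start with a header row naming the columns "sentence" and "label".
raw = load_dataset(
    'csv',
    data_files={'train': 'SST-2/train.tsv', 'validation': 'SST-2/dev.tsv'},
    delimiter='\t',
    quoting=csv.QUOTE_NONE,  # sentences may contain unbalanced quote characters
)

def tokenize(batch):
    # Pad to a fixed length so the default data collator can stack the examples
    return tokenizer(batch['sentence'], truncation=True, padding='max_length', max_length=128)

encoded = raw.map(tokenize, batched=True)
train_dataset, eval_dataset = encoded['train'], encoded['validation']

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Start training
trainer.train()
```

With this flow you can fine-tune a pretrained model on SST and build a strong model for the text classification task. The same fine-tuning approach is not limited to sentiment analysis; it can be extended to similar NLP tasks such as opinion extraction and emotion recognition.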
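Step 5 above calls for monitoring validation performance, but the training snippet does not define a metric. The following is a minimal sketch, assuming plain accuracy is the metric of interest. The names `accuracy_from_logits` and `compute_metrics` are illustrative and not part of the original code. It uses only the standard library, so it works on nested lists as well as on the arrays that `Trainer` passes in.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the gold label."""
    # argmax over each row of class scores
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


def compute_metrics(eval_pred):
    """Callback in the shape Trainer expects: (logits, labels) -> dict of metrics."""
    logits, labels = eval_pred
    return {"accuracy": accuracy_from_logits(logits, labels)}


if __name__ == "__main__":
    # Three toy examples with two classes (0 = negative, 1 = positive)
    toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.2]]
    toy_labels = [1, 0, 1]
    print(compute_metrics((toy_logits, toy_labels)))
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should make `trainer.evaluate()` report this accuracy on `eval_dataset` alongside the loss.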


SST-2.zip (11 files)

SST-2
    train.tsv 3.63MB
    dev.tsv 93KB
    original
        datasetSentences.txt 1.23MB
        SOStr.txt 1.17MB
        README.txt 2KB
        sentiment_labels.txt 3.11MB
        original_rt_snippets.txt 1.14MB
        dictionary.txt 11.45MB
        STree.txt 1.25MB
        datasetSplit.txt 82KB
    test.tsv 193KB

Stanford Sentiment Treebank V1.0

This is the dataset of the paper:

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts
Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)

If you use this dataset in your research, please cite the above paper.

@incollection{SocherEtAl2013:RNTN,
title = {{Parsing With Compositional Vector Grammars}},
author = {Richard Socher and Alex Perelygin and Jean Wu and Jason Chuang and Christopher Manning and Andrew Ng and Christopher Potts},
booktitle = {{EMNLP}},
year = {2013}
}

This file includes:

1. original_rt_snippets.txt contains 10,605 processed snippets from the original pool of Rotten Tomatoes HTML files. Please note that some snippet may contain multiple sentences.
2. dictionary.txt contains all phrases and their IDs, separated by a vertical line |
3. sentiment_labels.txt contains all phrase ids and the corresponding sentiment labels, separated by a vertical line. Note that you can recover the 5 classes by mapping the positivity probability using the following cut-offs: [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0] for very negative, negative, neutral, positive, very positive, respectively. Please note that phrase ids and sentence ids are not the same.
4. SOStr.txt and STree.txt encode the structure of the parse trees. STree encodes the trees in a parent pointer format. Each line corresponds to each sentence in the datasetSentences.txt file. The Matlab code of this paper will show you how to read this format if you are not familiar with it.
5. datasetSentences.txt contains the sentence index, followed by the sentence string separated by a tab. These are the sentences of the train/dev/test sets.
6. datasetSplit.txt contains the sentence index (corresponding to the index in datasetSentences.txt file) followed by the set label separated by a comma:
    1 = train
    2 = test
    3 = dev

Please note that the datasetSentences.txt file has more sentences/lines than the original_rt_snippet.txt. Each row in the latter represents a snippet as shown on RT, whereas the former is each sub sentence as determined by the Stanford parser.

For comparing research and training models, please use the provided train/dev/test splits.
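The cut-offs listed in the README for sentiment_labels.txt can be turned into code fairly directly. The sketch below is one possible reading, not the authors' own tooling: `fine_class`, `binary_label`, and `parse_label_line` are hypothetical helper names, and the binary mapping (drop neutral, merge the two negative and the two positive classes) is the usual SST-2 convention rather than something the README itself specifies.

```python
FINE_LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

# Inclusive upper bounds of [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
UPPER_BOUNDS = (0.2, 0.4, 0.6, 0.8, 1.0)


def fine_class(p):
    """Map a positivity probability in [0, 1] to a class index 0..4."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"positivity probability out of range: {p}")
    for idx, upper in enumerate(UPPER_BOUNDS):
        if p <= upper:
            return idx
    return len(UPPER_BOUNDS) - 1  # not reached for valid input


def binary_label(p):
    """Return 0 (negative), 1 (positive), or None for neutral phrases."""
    cls = fine_class(p)
    if cls == 2:
        return None
    return 0 if cls < 2 else 1


def parse_label_line(line):
    """Split one 'phrase id|sentiment value' line into (int, float)."""
    phrase_id, value = line.strip().split("|")
    return int(phrase_id), float(value)


if __name__ == "__main__":
    pid, prob = parse_label_line("7|0.875")
    print(pid, FINE_LABELS[fine_class(prob)], binary_label(prob))
```

Because the intervals are closed on the right, a value of exactly 0.2 falls in "very negative" and exactly 0.6 in "neutral". Any non-numeric header row in the file would need to be skipped before calling `parse_label_line`.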
