Python: Implementation and Comparison of Classic Vectorization Methods for Text Classification

Uploaded 2019-08-12 04:11:49 · ZIP, 409 KB

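In natural language processing (NLP), text vectorization is the key step that converts unstructured text into numerical representations that machine learning algorithms can consume. The topic "Python: Implementation and Comparison of Classic Vectorization Methods for Text Classification" focuses on how to vectorize text efficiently in Python and on comparing methods to find the ones that perform better on classification tasks. The most common text vectorization techniques, and how to implement them in Python, are:

1. **Bag-of-Words (BoW)**: The most basic vectorization method. It ignores word order and records only which words occur in a document and how often. In Python, `sklearn.feature_extraction.text.CountVectorizer` implements BoW.
2. **TF-IDF (Term Frequency-Inverse Document Frequency)**: TF-IDF builds on BoW by accounting for term importance: a term receives a higher weight when it is frequent within a document but rare across the corpus. The `sklearn.feature_extraction.text.TfidfVectorizer` class implements it.
3. **N-gram**: N-gram models consider combinations of adjacent words, such as bigrams or trigrams, and so capture local word-order information. Both `CountVectorizer` and `TfidfVectorizer` support n-grams through the `ngram_range` parameter.
4. **Word embeddings**, such as **Word2Vec** and **GloVe**: These methods map each word to a vector in a continuous space that captures semantic relationships between words. In Python, the `gensim` library can train Word2Vec models, and pretrained GloVe vectors can be loaded through `gensim.downloader`.
5. **TF-IDF combined with word embeddings**: TF-IDF and embeddings are sometimes combined, for example by using TF-IDF-weighted word embeddings, to balance local frequency information with global semantics.
6. **Pretrained models such as BERT**: In recent years, Transformer-based pretrained models such as BERT, RoBERTa, and ALBERT have shown excellent performance on NLP tasks. The `transformers` library (Hugging Face) makes it straightforward to use these models for text classification.

To compare these methods, first preprocess the dataset (for example, remove stop words and punctuation), then build a feature matrix with each vectorization method. Feed each feature matrix into the same classifier (logistic regression, SVM, random forest, and so on) and compare classification performance across vectorization methods. Common evaluation metrics include accuracy, precision, recall, and F1 score.

The `textvec-master` archive may contain a Python project that implements and compares the vectorization methods above. The project may be organized into the following parts:

1. **Data preprocessing**: scripts for reading, cleaning, and tokenizing the data.
2. **Vectorization**: code implementing BoW, TF-IDF, N-gram, word embeddings, and other methods.
3. **Model training and evaluation**: classifier definitions, model training, and performance evaluation with cross-validation.
4. **Result visualization**: possibly charts showing the performance differences between methods.

Working through the project gives a closer look at the strengths and weaknesses of the different vectorization methods and at how to choose between them in practice. Tune parameters to the specific task and the characteristics of the data to get the best performance.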
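The comparison workflow described above (several vectorizers, one shared classifier, one metric) can be sketched with scikit-learn alone. This is a minimal illustration, not code from the archive: the toy corpus, the `compare_vectorizers` helper, and the choice of logistic regression with 4-fold cross-validation are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical data, for illustration only).
texts = [
    "great movie with a wonderful cast",
    "wonderful story and great acting",
    "i loved this film, truly great",
    "a great and moving experience",
    "terrible plot and awful acting",
    "awful movie, a boring mess",
    "i hated this boring film",
    "a terrible and dull experience",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizers under comparison: BoW, TF-IDF, and TF-IDF with bigrams.
vectorizers = {
    "bow": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
    "tfidf_1-2gram": TfidfVectorizer(ngram_range=(1, 2)),
}


def compare_vectorizers(texts, labels, vectorizers, cv=4):
    """Return the mean cross-validated accuracy for each vectorizer."""
    scores = {}
    for name, vec in vectorizers.items():
        # Same classifier for every vectorizer, so only the features differ.
        pipe = make_pipeline(vec, LogisticRegression(max_iter=1000))
        fold_scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
        scores[name] = fold_scores.mean()
    return scores


scores = compare_vectorizers(texts, labels, vectorizers)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Wrapping each vectorizer in a pipeline means it is refitted on the training folds only, so vocabulary and IDF statistics do not leak from the held-out fold. On a corpus this small the scores carry no meaning; the point is the structure of the comparison.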

Python-面向文本分类的经典向量化方法实现与比较.zip (15 files)

textvec-master
  requirements.txt 62B
  examples
    binary_multiclass_quality_comparison.ipynb 49KB
    images
      rt.png 61KB
      airsent_bin.png 60KB
      logo.png 3KB
      imdb_bin.png 61KB
    binary_classification_quality_comparison.ipynb 263KB
    basic_usage.ipynb 4KB
  LICENSE 1KB
  textvec
    __init__.py 0B
    vectorizers.py 20KB
  setup.cfg 62B
  setup.py 840B
  README.md 4KB
  .gitignore 107B

![textvec logo](https://github.com/zveryansky/textvec/blob/master/examples/images/logo.png)

## WHAT: Supervised text vectorization tool

Textvec is a text vectorization tool that aims to implement all the "classic" text vectorization NLP methods in Python. The main idea of this project is to show alternatives to the excellent TFIDF method, which is highly overused for supervised tasks. All interfaces are similar to [scikit-learn](https://github.com/scikit-learn/scikit-learn), so you should be able to test the performance of these supervised methods with just a few changes.

Textvec is compatible with: __Python 2.7-3.7__.

------------------

## WHY: Comparison with TFIDF

As reported in several articles<sup>1,2</sup>, supervised methods outperform unsupervised ones on almost every dataset. Yet most text classification examples on the internet ignore that fact.

|          | IMDB_bin   | RT_bin     | Airlines Sentiment_bin | Airlines Sentiment_multiclass | 20news_multiclass |
|----------|------------|------------|------------------------|-------------------------------|-------------------|
| TFOR     | __0.9088__ | __0.7820__ | 0.9173                 | NA                            | NA                |
| TFICF    | 0.8992     | 0.7661     | 0.9220                 | __0.8067__                    | __0.8552__        |
| TFBINICF | 0.8978     | 0.7628     | __0.9238__             | NA                            | NA                |
| TFRF     | 0.8977     | 0.7609     | 0.9207                 | NA                            | NA                |
| TFIDF    | 0.8923     | 0.7539     | 0.8939                 | 0.7763                        | 0.8335            |
| TFPF     | 0.8949     | 0.7464     | 0.9164                 | NA                            | NA                |
| TF       | 0.8786     | 0.7286     | 0.9017                 | 0.7865                        | 0.7796            |
| TFIR     | 0.8361     | 0.7159     | 0.9017                 | NA                            | NA                |
| TFCHI2   | 0.8734     | 0.6990     | 0.8900                 | NA                            | NA                |
| TFGR     | 0.8581     | 0.6793     | 0.8883                 | NA                            | NA                |

Here is a comparison for binary classification on the IMDB sentiment data set. Labels are sorted by accuracy score, and the heatmap shows the correlation between different approaches. As you can see, some methods are good candidates for ensembling models or for performing feature selection.

![Binary comparison](https://github.com/zveryansky/textvec/blob/master/examples/images/imdb_bin.png)

For more dataset benchmarks (Rotten Tomatoes, airline sentiment) see [Binary classification quality comparison](https://github.com/zveryansky/textvec/blob/master/examples/binary_classification_quality_comparison.ipynb)

------------------

## Install:

Usage:
```
pip install textvec
```

Source code:
```
git clone https://github.com/zveryansky/textvec
cd textvec
pip install .
```

------------------

## HOW: Examples

The usage is similar to scikit-learn:
```python
from sklearn.feature_extraction.text import CountVectorizer
from textvec.vectorizers import TfBinIcfVectorizer

# train_data is assumed to hold the raw documents in `.text`; y holds the class labels
cvec = CountVectorizer().fit(train_data.text)

# Supervised vectorizers are fitted on the count matrix together with the labels
tficf_vec = TfBinIcfVectorizer(sublinear_tf=True)
tficf_vec.fit(cvec.transform(train_data.text), y)
```

For more detailed examples see [Basic example](https://github.com/zveryansky/textvec/blob/master/examples/basic_usage.ipynb) and other notebooks in [Examples](https://github.com/zveryansky/textvec/blob/master/examples)

### Currently implemented methods:

- TfIcfVectorizer
- TforVectorizer
- TfgrVectorizer
- TfigVectorizer
- Tfchi2Vectorizer
- TfrfVectorizer
- TfrrfVectorizer
- TfBinIcfVectorizer
- TfpfVectorizer

Most of the vectorization techniques can be found in articles<sup>1,2</sup>. If you see any method with a wrong name or reference, please contribute a fix!

------------------

## TODO

- [ ] Add TFBNS
- [ ] Remove dependence of sklearn
- [ ] Tests
- [ ] Docs
- [ ] GridSearch for benchmark

------------------

## REFERENCE

- [1] https://arxiv.org/pdf/1012.2609.pdf
- [2] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and traditional term weighting methods for automatic text categorization"
- [3] Thanks [aysent](https://aysent.github.io/2015/10/21/supervised-term-weighting.html#motivation-for-text-classification-tasks) for an inspiration
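
To make the idea of supervised term weighting concrete, here is a minimal sketch of one such scheme, TF-RF (relevance frequency), using the formula rf = log2(2 + a / max(1, c)) from Lan et al. [2], where a and c count the positive-class and negative-class documents that contain a term. This is not textvec's `TfrfVectorizer`; the `relevance_frequency` helper and the four-document corpus are hypothetical and written only for illustration, assuming binary labels.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer


def relevance_frequency(counts, y):
    """Per-term rf = log2(2 + a / max(1, c)) for binary labels (1 = positive)."""
    y = np.asarray(y)
    present = (counts > 0).astype(int)  # 1 where a term occurs in a document
    # a: positive documents containing the term; c: negative documents containing it
    a = np.asarray(present[y == 1].sum(axis=0)).ravel()
    c = np.asarray(present[y == 0].sum(axis=0)).ravel()
    return np.log2(2.0 + a / np.maximum(1, c))


# Hypothetical toy corpus: two positive and two negative documents.
docs = ["good good film", "good plot", "bad film", "bad plot"]
y = [1, 1, 0, 0]

cvec = CountVectorizer()
counts = cvec.fit_transform(docs)  # raw term frequencies (TF)
rf = relevance_frequency(counts, y)

# TF-RF: scale each term's column of the count matrix by its rf weight.
tfrf = counts @ sp.diags(rf)

for term, idx in sorted(cvec.vocabulary_.items()):
    print(f"{term}: rf={rf[idx]:.3f}")
```

Unlike IDF, the weight depends on the labels: a term concentrated in the positive class ("good") gets a larger weight than one spread evenly across both classes ("film"), which is why the weights must be fitted on training data together with `y`.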
