Keyword Extraction with TF-IDF

License: CC 4.0 BY-SA

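This article introduces the principle behind TF-IDF for keyword extraction: by combining term frequency with inverse document frequency, it filters out common words and extracts the key information of an article. It also mentions a complex-network approach, which builds a network from word co-occurrence and analyzes its topological features to determine keyword importance, addressing the problems of frequency-based computation. Code is included.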

Introduction to TF-IDF

TF-IDF is the product of two parts: TF and IDF. Each term is explained below.

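TF: term frequency. Put simply, it is how often a word appears in a document. It is straightforward to compute:

$$tf_{ij} = \frac{n_{ij}}{\sum_{k} n_{ik}}$$

That is, the term frequency of word j in document i equals the number of times $n_{ij}$ that word j occurs in document i, divided by the total number of words in document i.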
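The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequencies` and the sample sentence are made up for this example, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each word to (its count in the document) / (total words in the document)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


# Hypothetical sample document, tokenized by whitespace.
doc = "keyword extraction is a basic task in text mining and text retrieval".split()
tf = term_frequencies(doc)

# "text" occurs twice among 12 tokens, so its term frequency is 2/12.
print(tf["text"])
```

Because every count is divided by the same total, the term frequencies of one document always sum to 1.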

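IDF: inverse document frequency. Start with the document frequency DF, the fraction of all documents in which a word appears: if there are 100 documents and the word appears in 10 of them, DF is 0.1. IDF is built on the reciprocal of DF, which is 10 here. In practice, 1 is added to the denominator to avoid division by zero, and the logarithm of the result is taken. IDF keeps common words (prepositions, for example) from dominating the frequency ranking, which would otherwise make the extracted keywords meaningless.

$$idf_{j} = \log\frac{|D|}{1 + |\{d \in D : t_j \in d\}|}$$

That is, the inverse document frequency of word j equals the total number of documents $|D|$ divided by one plus the number of documents that contain word j, with the logarithm taken of the result.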
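As a rough sketch of this calculation, the snippet below reproduces the 100-document example. The function name `inverse_document_frequency` and the toy corpus are assumptions made for illustration, each document is assumed to be a set of tokens, and the natural logarithm is used because the text does not specify a base.

```python
import math


def inverse_document_frequency(word, documents):
    """idf = log(N / (1 + number of documents that contain the word))."""
    n_docs = len(documents)
    # Each document is a set of tokens, so "in" tests token membership.
    n_containing = sum(1 for doc in documents if word in doc)
    return math.log(n_docs / (1 + n_containing))


# Hypothetical corpus: 100 documents, 10 of which contain "network".
documents = [{"network", "the"}] * 10 + [{"the"}] * 90

idf_rare = inverse_document_frequency("network", documents)   # log(100 / 11)
idf_common = inverse_document_frequency("the", documents)     # log(100 / 101)
print(idf_rare, idf_common)
```

With this particular smoothing, a word that appears in every document gets a slightly negative IDF, since the denominator exceeds the number of documents.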

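The final TF-IDF score multiplies the term frequency by the inverse document frequency. This down-weights common words, so the highest-scoring words can be taken as the keywords of the document.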
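To make the combined score concrete, here is a minimal from-scratch sketch that ranks the words of one document in a tiny corpus. It is not how jieba computes its weights; `tfidf_keywords` and the four sample documents are hypothetical, and whitespace tokenization is assumed.

```python
import math
from collections import Counter


def tfidf_keywords(documents, doc_index, top_k=3):
    """Rank the words of documents[doc_index] by tf * idf, highest first."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in documents for word in set(doc))
    tokens = documents[doc_index]
    counts = Counter(tokens)
    scores = {
        word: (n / len(tokens)) * math.log(n_docs / (1 + df[word]))
        for word, n in counts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical corpus; "the" appears in every document.
corpus = [
    "the keyword extraction method ranks the keyword candidates".split(),
    "the network model links the words".split(),
    "the summary of the text".split(),
    "the retrieval of the document".split(),
]

for word, score in tfidf_keywords(corpus, 0, top_k=3):
    print(word, round(score, 4))
```

In the first document, "the" and "keyword" both occur twice, but "the" appears in all four documents while "keyword" appears in only one, so "keyword" ranks first and "the" ranks last.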

Implementation

string = "Automatic keyword extraction is to extract topical and important words or phrases from document or document set. It is a basic and necessary work in text mining tasks such as text retrieval and text summarization. This paper discusses the connotation of keyword extraction and automatic keyword extraction. In the light of linguistics, cognitive science, complexity science, psychology and social science, this paper studies the theoretical basis of automatic keyword extraction. From macro, meso and micro perspectives, the development, techniques and methods of automatic keyword extraction are reviewed and analyzed. This paper summarizes the current key technologies and research progress of automatic keyword extraction methods, including statistical methods, topic based methods, and network based methods. The evaluation approach of automatic keyword extraction is analyzed, and the challenges and trends of automatic keyword extraction are also predicted."

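# jieba.analyse provides a TF-IDF based keyword extractor
from jieba.analyse import extract_tags

# extract_tags returns the top 20 keywords by default (topK=20);
# withWeight=True yields (keyword, weight) pairs instead of bare keywords
for keyword, weight in extract_tags(string, withWeight=True):
    print('%s %s' % (keyword, weight))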

Keyword Extraction Based on Complex Networks