A simple example with TF-IDF and logistic regression

Latest recommended article published on 2024-09-23 15:48:11


Licensed under CC 4.0 BY-SA

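This post walks through a small Python project for filtering SMS spam. It uses pandas and sklearn: the message text is vectorized with TF-IDF, and a logistic regression classifier is trained on the result. It shows how to split the dataset, build the model, and make predictions.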
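To make the TF-IDF step concrete, here is a minimal sketch of what `TfidfVectorizer` produces. The three messages are invented for illustration and are not taken from the SMS dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages, not from SMSSpamCollection
docs = ["free prize now", "call me now", "free call"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per message

# vocabulary_ maps each learned token to its column index
print(sorted(vectorizer.vocabulary_))
print(matrix.shape)

# With the default norm='l2', every row has unit Euclidean length
row_norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
print(row_norms)
```

Each column corresponds to one token learned during `fit_transform`, which is why the same fitted vectorizer has to be reused later when transforming new text.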


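import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# Import from sklearn.linear_model: the sklearn.linear_model.logistic module path
# was deprecated in scikit-learn 0.22 and later removed
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Tab-separated file with no header row: column 0 is the label, column 1 is the message text
df = pd.read_csv('e:\\SMSSpamCollection', delimiter='\t', header=None)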

In [10]:

df.head()

Out[10]:

	0	1
0	ham	Go until jurong point, crazy.. Available only ...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives aro...

In [12]:

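# Column 1 holds the message text (features), column 0 the ham/spam label
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])

# Learn the vocabulary and IDF weights from the training messages only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Transform new messages with the SAME fitted vectorizer, then predict
X_test = vectorizer.transform(['URGENT! Your Mobile No 1234 was awarded a Prize', 'Hey honey, whats up?'])
predictions = classifier.predict(X_test)
print(predictions)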

['spam' 'ham']
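The split above also returns `X_test_raw` and `y_test`, which are not used afterwards. A rough sketch of scoring the classifier on the held-out portion could look like the following. The small corpus here is made up so the snippet runs without the data file, and `test_size`, `random_state`, and `stratify` are my own choices rather than settings from the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for SMSSpamCollection
texts = [
    "URGENT you have won a free prize, call now",
    "Free entry to win cash, text WIN now",
    "Congratulations, claim your free mobile prize",
    "Win a free holiday, reply to claim",
    "Are we still meeting for lunch today?",
    "Ok, see you at home later",
    "Can you call me when you are free?",
    "I will be late, start dinner without me",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # fit on training text only
X_test = vectorizer.transform(X_test_raw)        # reuse the fitted vocabulary

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# score() returns mean accuracy on the held-out messages
accuracy = classifier.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

With the real dataset, the same two lines (`vectorizer.transform(X_test_raw)` and `classifier.score(X_test, y_test)`) can be applied directly to the variables from the split above.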

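Creating a new TfidfVectorizer instance for the test text raises an error, because the new instance has not been fitted and has no vocabulary. The transform must be done with the original vectorizer that was fitted on the training data. The following code fails: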

new_vectorizer = TfidfVectorizer()
# Fails here: new_vectorizer was never fitted, so transform() raises NotFittedError
new_test = new_vectorizer.transform(['URGENT! Your Mobile No 1234 was awarded a Prize', 'Hey honey, whats up?'])
predictions = classifier.predict(new_test)
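One way to avoid this mistake is to bundle the vectorizer and the classifier into a single scikit-learn `Pipeline`, so the fitted vectorizer always travels with the model. The sketch below first reproduces the failure and then shows the pipeline variant. It is an alternative arrangement rather than the approach used in the post, and the training messages are again invented for illustration.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for SMSSpamCollection
texts = [
    "URGENT you have won a free prize, call now",
    "Free entry to win cash, text WIN now",
    "Are we still meeting for lunch today?",
    "Ok, see you at home later",
]
labels = ["spam", "spam", "ham", "ham"]
new_messages = ["URGENT! Your Mobile No 1234 was awarded a Prize", "Hey honey, whats up?"]

# An unfitted vectorizer has no vocabulary, so transform() fails
raised = False
try:
    TfidfVectorizer().transform(new_messages)
except NotFittedError as exc:
    raised = True
    print("transform failed:", exc)

# The pipeline fits the vectorizer and classifier together,
# and predict() applies the fitted vectorizer to raw text automatically
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)
predictions = pipeline.predict(new_messages)
print(predictions)
```

Note that fitting a fresh vectorizer on the new messages would not help either: it would learn a different vocabulary, so its columns would not line up with the features the classifier was trained on.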