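How the Algorithm Works
To process the training data, the K-means algorithm first picks an initial set of centroids at random, one per cluster, as starting points. It then iterates: each point is assigned to its nearest centroid by Euclidean distance, and each centroid is moved to the mean of the points assigned to it, so the centroid positions improve with every pass.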
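It stops creating and refining clusters when either of the following happens:
- The centroids have stabilized: their values no longer change between iterations, which means the clustering has converged.
- The configured maximum number of iterations has been reached.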
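The loop and the two stopping conditions above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the tolerance `tol`, the fixed seeds, and the demo data are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=300, tol=1e-4, seed=0):
    """Illustrative K-means loop; returns (centers, labels, iterations used)."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    n_iter = 0
    for n_iter in range(1, max_iter + 1):  # stop condition 2: iteration cap
        # Euclidean distance from every point to every center, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points; keep it if its cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift <= tol:  # stop condition 1: centers are (nearly) unchanged
            break
    # Final assignment so the returned labels match the returned centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1), n_iter


# Demo data (assumed): two groups of 50 two-dimensional points.
rng = np.random.default_rng(42)
X_demo = np.vstack([-2 * rng.random((50, 2)), 1 + 2 * rng.random((50, 2))])
centers, labels, n_iter = kmeans_sketch(X_demo, k=2)
print("centers:\n", centers)
print("iterations used:", n_iter)
```

The tolerance replaces an exact "no change" test because floating-point centers rarely match bit for bit; setting `tol=0` would recover the strict version of the first condition.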
Implementation
Random data
import numpy as np
import pandas as pd
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
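# Create random two-dimensional data
X = -2 * np.random.rand(100, 2)      # 100 points with coordinates in (-2, 0]
X1 = 1 + 2 * np.random.rand(50, 2)   # 50 points with coordinates in [1, 3)
X[50:100, :] = X1                    # overwrite the last 50 rows to get two separate groups
plt.scatter(X[:, 0], X[:, 1], s=50, c='brown', marker="p")
plt.show()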
# Build the model
model = KMeans(n_clusters=2, init="k-means++", n_init=10, max_iter=300)
"""
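Parameters:
n_clusters=2       number of clusters
init="k-means++"   method used to initialize the cluster centers
n_init=10          number of runs with different initial centers; the run with the lowest inertia is kept
max_iter=300       maximum number of iterations per run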
"""
model.fit(X)
result = model.predict(X) # result equals model.labels_
centers = model.cluster_centers_ # coordinates of the cluster centers
# Plot
plt.figure(figsize=(15,5))
plt.subplot(121)
plt.scatter(X[ : , 0], X[ :, 1], s = 80, c = result, marker="+",cmap="rainbow")
plt.subplot(122)
plt.scatter(X[ : , 0], X[ :, 1], s = 80, c = result, marker="p")
plt.scatter(centers[:,0], centers[:,1], s = 80, c = "green", marker="o")
# The cluster centers are drawn as green circles
plt.show()
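As a complement to the plots, the fitted model also exposes a few attributes that tie back to the stopping conditions, and the otherwise unused `metrics` import can score the result. The sketch below is one possible way to inspect them; the fixed seeds (`default_rng(0)`, `random_state=0`) and the choice of the silhouette score are assumptions made for reproducibility and illustration, not part of the original code.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

# Same two-group layout as above, but seeded so the run is repeatable (assumption).
rng = np.random.default_rng(0)
X = np.vstack([-2 * rng.random((50, 2)), 1 + 2 * rng.random((50, 2))])

model = KMeans(n_clusters=2, init="k-means++", n_init=10, max_iter=300, random_state=0)
model.fit(X)

# predict() on the training data returns the same assignment as labels_.
same = np.array_equal(model.predict(X), model.labels_)
print("predict(X) matches labels_:", same)

# n_iter_ is the number of iterations used by the best of the n_init runs.
print("iterations used:", model.n_iter_)

# inertia_ is the sum of squared distances from each point to its closest center.
print("inertia:", model.inertia_)

# Silhouette score ranges from -1 to 1; higher means better-separated clusters.
score = metrics.silhouette_score(X, model.labels_)
print("silhouette score:", score)
```

If `n_iter_` is well below `max_iter`, the run stopped because the centers stabilized rather than because the iteration cap was hit.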
A different dataset
# No further explanation needed; here is the complete code
data = pd.read_excel("K_Means.xlsx")
X, y = data["X"], data["Y"]
model = KMeans(n_clusters=4, init="k-means++", max_iter=300, n_init=10)
model.fit(data)
prediction = model.predict(data)
center_clusters = model.cluster_centers_
plt.figure(figsize=(15,5))
plt.subplot(121)
plt.scatter(X,y,c=prediction,marker="p",s=80)
plt.subplot(122)
plt.scatter(X,y,c=prediction,marker="p",s=80)
plt.scatter(center_clusters[:,0],center_clusters[:,1],color='red',s=100)
plt.show()
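The contents of K_Means.xlsx are not shown, so the following is only a sketch with a synthetic stand-in. It assumes the spreadsheet might carry columns other than X and Y (the `id` column here is hypothetical) and selects the two feature columns explicitly, so that the columns of `cluster_centers_` line up with the plotted axes.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for K_Means.xlsx (assumption: four groups of 2-D points).
points, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=0)
data = pd.DataFrame(points, columns=["X", "Y"])
data["id"] = range(len(data))  # hypothetical extra column that should not affect clustering

# Cluster on the feature columns only, not on the whole DataFrame.
features = data[["X", "Y"]]
model = KMeans(n_clusters=4, init="k-means++", n_init=10, max_iter=300, random_state=0)
prediction = model.fit_predict(features)  # fit and assign labels in one call
center_clusters = model.cluster_centers_

print(center_clusters.shape)  # (4, 2): one row per cluster, one column per feature
```

When the file really contains only the X and Y columns, fitting on `data` directly, as in the code above, gives the same result.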
References
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1
https://scikit-learn.org/stable/modules/clustering.html#k-means