1. KNN Principle
We start with a collection of samples, called the training set, in which every sample carries a label: we know which class each sample in the set belongs to. When new, unlabeled data comes in, we compare each feature of the new data against the corresponding features of the samples in the training set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors).
2. KNN Workflow
- Compute the distance between the current point and every point in the known-label data set;
- Sort the distances in ascending order;
- Take the k points closest to the current point;
- Count how often each class appears among those k points;
- Return the most frequent class among the k points as the predicted class of the current point.
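The distance step above is plain Euclidean distance between feature vectors. As a quick standalone check (two made-up 2-D points, not part of the data set used later):

```python
import numpy as np

# Two hypothetical 2-D points, used only to illustrate the distance step
a = np.array([1.0, 1.1])
b = np.array([0.0, 0.1])

# Square the per-feature differences, sum them, take the square root
dist = np.sqrt(((a - b) ** 2).sum())
print(dist)  # sqrt(1.0 + 1.0) = sqrt(2) ≈ 1.4142
```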
3. KNN Code Example
import numpy as np
import operator

def createDataSet():
    # Four 2-D training points and their class labels
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def classify0(inX, dataSet, labels, k):
    dataRowSize = dataSet.shape[0]
    # Repeat the input vector so it can be subtracted row-by-row from the data set
    inputRow = np.tile(inX, (dataRowSize, 1))
    subMat = inputRow - dataSet
    squareMat = subMat ** 2
    squareDis = squareMat.sum(axis=1)
    distance = squareDis ** 0.5        # Euclidean distance to every sample
    sortedIndex = distance.argsort()   # indices sorted by ascending distance
    classCount = {}
    for i in range(k):                 # tally the labels of the k nearest points
        label = labels[sortedIndex[i]]
        classCount[label] = classCount.get(label, 0) + 1
    # Sort the tally by count, descending, and return the winning label
    sortClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortClassCount[0][0]
We create four points and give each one a label. The parameters of classify0 are the input vector inX to be classified, the training set dataSet, the label list labels, and k, the number of nearest neighbors to consider. Inside the method, the Euclidean distance between the input vector and every point in the training set is computed:
d=\sqrt{\sum\limits_{i=0}^{n}(x_i-\hat{x}_i)^2}
The k points with the smallest distances are then selected, and the label that appears most frequently among them is returned as the classification result.
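With the four training points above, the classifier can be exercised end to end. The sketch below restates classify0 compactly (using collections.Counter for the tally) so it runs on its own; a query near the two 'B' samples comes back as 'B':

```python
import numpy as np
from collections import Counter

# Compact restatement of classify0, for a self-contained demo
def knn(inX, dataSet, labels, k):
    dists = np.sqrt(((dataSet - np.asarray(inX)) ** 2).sum(axis=1))
    nearest = dists.argsort()[:k]          # indices of the k closest points
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(knn([0, 0], group, labels, 3))       # the two 'B' points are nearest -> 'B'
print(knn([1.0, 1.2], group, labels, 3))   # the two 'A' points are nearest -> 'A'
```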
4. Testing on the Dating Data
def file2matrix(filename):
    file = open(filename)
    arrayLine = file.readlines()
    numberOfLine = len(arrayLine)
    # One row per line, three feature columns
    returnMat = np.zeros([numberOfLine, 3])
    classLabel = []
    index = 0
    for line in arrayLine:
        line = line.strip()
        listFromLine = line.split('\t')        # fields are tab-separated
        returnMat[index, :] = listFromLine[0:3]
        # Last field is the class label (plain int; np.int was removed from NumPy)
        classLabel.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabel
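file2matrix expects a tab-separated file with three numeric features per row followed by an integer class label (the datingTestSet2.txt layout used in *Machine Learning in Action*). A minimal sketch of the same parsing logic on a synthetic two-line file (the values are illustrative only):

```python
import numpy as np
import tempfile, os

# Two synthetic rows in the expected format: feat1<TAB>feat2<TAB>feat3<TAB>label
sample = "40920\t8.326976\t0.953952\t3\n14488\t7.153469\t1.673904\t2\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

with open(path) as fh:
    rows = [line.strip().split("\t") for line in fh]
mat = np.array([r[:3] for r in rows], dtype=float)  # feature matrix
labels = [int(r[-1]) for r in rows]                 # class labels
os.remove(path)

print(mat.shape)  # (2, 3)
print(labels)     # [3, 2]
```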
def normalization(dataset):
    maxVal = dataset.max(0)    # column-wise maxima
    minVal = dataset.min(0)    # column-wise minima
    ranges = maxVal - minVal   # column-wise range (max - min)
    rowsNum = dataset.shape[0]
    normData = dataset - np.tile(minVal, (rowsNum, 1))
    normData = normData / np.tile(ranges, (rowsNum, 1))
    return normData, ranges, minVal
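normalization applies min-max scaling: each column is mapped onto [0, 1] via (x − min) / (max − min), so features with large numeric ranges (such as frequent-flier miles) do not dominate the distance computation. A small check on a toy matrix; note that NumPy broadcasting makes the np.tile calls unnecessary:

```python
import numpy as np

data = np.array([[1.0, 200.0],
                 [3.0, 400.0],
                 [5.0, 600.0]])

min_val = data.min(0)             # column minima: [1, 200]
ranges = data.max(0) - min_val    # column ranges: [4, 400]
norm = (data - min_val) / ranges  # broadcasting replaces np.tile

print(norm)  # each column scaled to [0, 1]: rows 0.0, 0.5, 1.0
```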
The classify0 function is the same as in Section 3 and is reused unchanged.
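In *Machine Learning in Action* these pieces are tied together by a test routine that normalizes the dating data, holds out the first 10% of the rows as test points, and reports an error rate. A self-contained sketch of that loop on synthetic two-class data (the real version would load the dating file with file2matrix instead of generating blobs):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Synthetic stand-in for the dating data: two well-separated Gaussian blobs
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]

# Min-max normalization, as in normalization() above
X = (X - X.min(0)) / (X.max(0) - X.min(0))

def knn(inX, dataSet, labels, k):
    nearest = np.sqrt(((dataSet - inX) ** 2).sum(axis=1)).argsort()[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Hold out the first 10% as test points, classify against the rest
m = len(y) // 10
errors = sum(knn(X[i], X[m:], y[m:], 3) != y[i] for i in range(m))
print(f"error rate: {errors / m:.2f}")
```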
References
[1] Peter Harrington. Machine Learning in Action [M]. Beijing: Posts & Telecom Press, 2013.