I am trying to implement a K-Nearest Neighbors classification algorithm from scratch in Python against this dataset, but I am running into a problem during the validation phase. Specifically, here is my plain, straightforward implementation:
```python
def KNNClassificationPredictOnValidation(row, norm_type, k):
    distances = np.linalg.norm(x_validation_train_np - row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, k)[:k]
    values = [y_th_validation_train_np[indexes[i]] for i in range(k)]
    return np.argmax(np.bincount(values))
```
It runs like this:
```python
y_pred = []
for row in x_validation_np:
    y_pred.append(KNNClassificationPredictOnValidation(row, 2, 169))
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")
```
which gives an accuracy of 58.600031530821376%, and I am confident that is correct, because I get the same result with scikit-learn using the following code:
```python
neigh = KNeighborsClassifier(n_neighbors=169)
neigh.fit(x_validation_train_np, y_th_validation_train_np)
y_pred = neigh.predict(x_validation_np)
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")
```
However, my implementation is very slow. I wanted to speed up the validation phase without implementing more complex data structures such as k-d trees or ball trees, and I had an idea.
During validation I check the accuracy for values of `k` going from `left_end` to `right_end` with a step of 2; in my plain implementation this means recomputing `indexes` and `distances` every time, and those are the heavy operations. But besides being heavy, this is also a waste of resources! Essentially, suppose I pass the function the value `right_end`. In that case I can compute all `right_end` nearest neighbors once and return a list of classification results, each of which considers only the subset of the already-computed neighbors needed for each `left_end <= k < right_end`:
```python
# Version tweaked for fast validation
def KNNClassificationValidationPredict(row, norm_type, start, end, step):
    distances = np.linalg.norm(x_validation_train_np - row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, end)[:end+1]
    return [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end, step)]
```
I tested it like this:
```python
# My tweaked version for validation
left_end = 167
right_end = 171
y_pred = []
for row in x_validation_np:
    y_pred.append(KNNClassificationValidationPredict(row, 2, left_end, right_end+1, 2))
results = []
y_pred = np.array([np.array(y) for y in y_pred])
for i in range(len(y_pred[0])):
    y = y_pred[:, i]
    accuracy = metrics.accuracy_score(y_th_validation_np, y)
    results.append((left_end + i*2, accuracy*100))
print(results)
```
But this is the output:
```
[(167, 58.3793157811761), (169, 58.473908245309794), (171, 58.48967365599874)]
```
So for k=169 I get an accuracy of 58.473908245309794%, which is different, and I do not understand what I am doing wrong: the implementation is the same, I am just testing more cases at once.
Below is a minimal reproducible example:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('OnlineNewsPopularity/OnlineNewsPopularity.csv')
df = df.rename(columns=lambda x: x.strip())
df = df.iloc[:, 2:]

# non-thresholded shares
y = df.pop('shares')
# thresholded shares
y_th = y.copy(deep=True)
y_th = y_th.apply(lambda x: 1 if x >= 1400 else 0)
# renaming the variable
x = df

# thresholded version
x_train, x_test, y_th_train, y_th_test = train_test_split(
    x, y_th,
    test_size=0.20,
    random_state=1
)
x_train_np = x_train.to_numpy()
x_test_np = x_test.to_numpy()
y_th_train_np = y_th_train.to_numpy()
y_th_test_np = y_th_test.to_numpy()

# Creating validation set
x_validation_train, x_validation, y_th_validation_train, y_th_validation = train_test_split(
    x_train, y_th_train,
    test_size=0.20,
    random_state=1
)
x_validation_train_np = x_validation_train.to_numpy()
x_validation_np = x_validation.to_numpy()
y_th_validation_train_np = y_th_validation_train.to_numpy()
y_th_validation_np = y_th_validation.to_numpy()

def KNNClassificationPredict(row, norm_type, k):
    distances = np.linalg.norm(x_train_np - row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, k)[:k]
    values = [y_th_train_np[indexes[i]] for i in range(k)]
    return np.argmax(np.bincount(values))

# Version tweaked for fast validation
def KNNClassificationValidationPredict(row, norm_type, start, end, step):
    distances = np.linalg.norm(x_validation_train_np - row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, end)[:end+1]
    return [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end, step)]

def KNNClassificationPredictOnValidation(row, norm_type, k):
    distances = np.linalg.norm(x_validation_train_np - row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, k)[:k]
    values = [y_th_validation_train_np[indexes[i]] for i in range(k)]
    return np.argmax(np.bincount(values))

# Sklearn implementation against validation set
neigh = KNeighborsClassifier(n_neighbors=169)
neigh.fit(x_validation_train_np, y_th_validation_train_np)
y_pred = neigh.predict(x_validation_np)
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")

# My normal knn against validation set
y_pred = []
for row in x_validation_np:
    y_pred.append(KNNClassificationPredictOnValidation(row, 2, 169))
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")

# My tweaked version for validation
left_end = 167
right_end = 171
y_pred = []
for row in x_validation_np:
    y_pred.append(KNNClassificationValidationPredict(row, 2, left_end, right_end+1, 2))
results = []
y_pred = np.array([np.array(y) for y in y_pred])
for i in range(len(y_pred[0])):
    y = y_pred[:, i]
    accuracy = metrics.accuracy_score(y_th_validation_np, y)
    results.append((left_end + i*2, accuracy*100))
print(results)
```
1 Answer
The problem

Your problem lies in how `np.argpartition` works. The documentation states that `np.argpartition(array, kth)` returns the `indices` of `array` ordered so that the element `array[indices][kth]` is in its final sorted position, all elements before it (`array[indices][:kth]`) are smaller, and all elements after it (`array[indices][kth+1:]`) are larger. However, the *order* of the elements within `array[indices][:kth]` and within `array[indices][kth+1:]` is not guaranteed and can be essentially random.

So what happens when you increase `kth` from `169` to `172`? `array[indices][169]` is no longer pinned to its position; it lands at some random place inside `array[indices][:172]`. Moreover, some values previously guaranteed to be inside `array[indices][:169]` are now only guaranteed to be inside `array[indices][:172]`: some of them may fall into the range `array[indices][169:172]` and be replaced by values previously guaranteed to be inside `array[indices][169+1:]`. The important point is that when you call with `end=172`, some of the `indexes[i]` will be wrong for `k < end`.
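This is easy to see in isolation (a small illustrative snippet, not part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(1000)

# Both calls put the correct *set* of smallest elements in front, but
# only the element at position kth is pinned; the order of the prefix
# is arbitrary and can differ between the two calls.
front_5 = np.argpartition(a, 5)[:5]
front_of_10 = np.argpartition(a, 10)[:5]

smallest_5 = set(np.argsort(a)[:5])
smallest_10 = set(np.argsort(a)[:10])

print(set(front_5) == smallest_5)       # True: these ARE the 5 smallest
print(set(front_of_10) <= smallest_10)  # True: within the 10 smallest...
print(set(front_of_10) == smallest_5)   # ...but not necessarily THE 5 smallest
```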
The solution

`np.argpartition` accepts `kth` as an array. When `kth` is an array, the returned `indices` guarantee that all of the elements `array[indices][kth]` are in their final sorted positions. We can do this for the entire problematic range `left_end:right_end` without much of a performance hit:
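The answer's original snippet is not reproduced above, but a sketch of the fix it describes, reusing the question's argument layout (the function name and the synthetic data here are hypothetical; the only NumPy behaviour relied on is the documented array-valued `kth`), could look like this:

```python
import numpy as np

def knn_validation_predict_fixed(row, x_train, y_train, norm_type, start, end, step):
    """Classify `row` for every k in range(start, end, step) in one pass.

    Passing a range of kth values to np.argpartition pins every
    position in start:end of the returned indices, so indexes[:k]
    really is the set of the k nearest neighbours for each such k.
    """
    distances = np.linalg.norm(x_train - row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, range(start, end))[:end]
    return [np.argmax(np.bincount(y_train[indexes[:k]]))
            for k in range(start, end, step)]

# Quick check against a full sort on synthetic data (hypothetical data,
# not the question's dataset).
rng = np.random.default_rng(1)
X = rng.random((500, 8))
y = rng.integers(0, 2, 500)
row = rng.random(8)

fast = knn_validation_predict_fixed(row, X, y, 2, 5, 12, 2)
order = np.argsort(np.linalg.norm(X - row, ord=2, axis=1))
slow = [np.argmax(np.bincount(y[order[:k]])) for k in range(5, 12, 2)]
print(fast == slow)
```

With distinct distances (the case for continuous features), the `k` smallest neighbours form a unique set for every `k` in the range, so the predictions agree with the full-sort version while still avoiding an `O(n log n)` sort per row.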