Inconsistent results from two nearly identical numpy K-NN implementations

moiiocjp · posted 2023-04-30 in Other

I am trying to implement a K-Nearest Neighbors classification algorithm from scratch in Python against this dataset, but I am running into a problem in the validation phase. In particular, this is my plain, straightforward implementation:

def KNNClassificationPredictOnValidation(row, norm_type, k):
    distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, k)[:k]
    values = [y_th_validation_train_np[indexes[i]] for i in range(k)]
    return np.argmax(np.bincount(values))
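The last two lines of the function are a majority vote: `np.bincount` counts how often each label occurs among the k nearest neighbors, and `np.argmax` picks the most frequent one. A tiny standalone illustration with made-up labels:

```python
import numpy as np

# Hypothetical labels of the 5 nearest neighbors (binary classes 0/1)
neighbor_labels = np.array([1, 0, 1, 1, 0])

counts = np.bincount(neighbor_labels)  # counts[c] = occurrences of class c
print(counts)                          # [2 3] -> two 0s, three 1s
print(np.argmax(counts))               # 1, the majority class
```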

It is run like this:

y_pred = []
for row in x_validation_np:
  y_pred.append(KNNClassificationPredictOnValidation(row, 2, 169))

print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")

and I get 58.600031530821376% as the accuracy, which I am confident is correct because I get the same result with Scikit-learn, using the following code:

neigh = KNeighborsClassifier(n_neighbors=169)
neigh.fit(x_validation_train_np, y_th_validation_train_np)
y_pred = neigh.predict(x_validation_np)
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")

However, my implementation is very slow. I wanted to speed up the validation phase without resorting to more complex data structures such as k-d trees or ball trees, and I had an idea.
During validation I check the accuracy for values of k going from `left_end` to `right_end` with a step of 2; in my plain implementation this means recomputing `distances` and `indexes` every time, and those are the heavy operations.
Besides being heavy, though, this is also a waste of resources! Essentially, suppose I pass the value `right_end` to the function. In that case I can compute all `right_end` nearest neighbors once and return a list of classification results, each one considering only the required subset of the already-computed neighbors, for every `left_end <= k < right_end`:

# Version tweaked for fast validation
def KNNClassificationValidationPredict(row, norm_type, start, end, step):
    distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, end)[:end+1]
    return [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end, step)]

I tested it like this:

# My tweaked version for validation
left_end = 167
right_end = 171
y_pred = []
for row in x_validation_np:
  y_pred.append(KNNClassificationValidationPredict(row, 2, left_end, right_end+1, 2))

results = []
y_pred = np.array([np.array(y) for y in y_pred])
for i in range(len(y_pred[0])):
  y = y_pred[:, i]
  accuracy = metrics.accuracy_score(y_th_validation_np, y)
  results.append((left_end+i*2, accuracy*100))
print(results)

But this is the output:

[(167, 58.3793157811761), (169, 58.473908245309794), (171, 58.48967365599874)]

So for k=169 I get an accuracy of 58.473908245309794%, which is different, and I cannot figure out what I am doing wrong: the implementation is the same, I am just testing more cases at once.
Below is a minimal reproducible example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


df = pd.read_csv('OnlineNewsPopularity/OnlineNewsPopularity.csv')
df = df.rename(columns=lambda x: x.strip())
df = df.iloc[: , 2:]

# non-thresholded shares
y = df.pop('shares')

# thresholded shares
y_th = y.copy(deep=True)
y_th = y_th.apply(lambda x: 1 if x >= 1400 else 0)

# renaming the variable
x = df

# thresholded version
x_train, x_test, y_th_train, y_th_test = train_test_split(
    x, y_th,
    test_size=0.20,
    random_state=1
)

x_train_np = x_train.to_numpy()
x_test_np = x_test.to_numpy()

y_th_train_np = y_th_train.to_numpy()
y_th_test_np = y_th_test.to_numpy()

# Creating validation set
x_validation_train, x_validation, y_th_validation_train, y_th_validation = train_test_split(
    x_train, y_th_train,
    test_size=0.20,
    random_state=1
)

x_validation_train_np = x_validation_train.to_numpy()
x_validation_np = x_validation.to_numpy()

y_th_validation_train_np = y_th_validation_train.to_numpy()
y_th_validation_np = y_th_validation.to_numpy()

def KNNClassificationPredict(row, norm_type, k):
    distances = np.linalg.norm(x_train_np-row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, k)[:k]
    values = [y_th_train_np[indexes[i]] for i in range(k)]
    return np.argmax(np.bincount(values))

# Version tweaked for fast validation
def KNNClassificationValidationPredict(row, norm_type, start, end, step):
    distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, end)[:end+1]
    return [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end, step)]

def KNNClassificationPredictOnValidation(row, norm_type, k):
    distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, k)[:k]
    values = [y_th_validation_train_np[indexes[i]] for i in range(k)]
    return np.argmax(np.bincount(values))

# Sklearn implementation against validation set
neigh = KNeighborsClassifier(n_neighbors=169)
neigh.fit(x_validation_train_np, y_th_validation_train_np)
y_pred = neigh.predict(x_validation_np)
print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")

# My normal knn against validation set
y_pred = []
for row in x_validation_np:
  y_pred.append(KNNClassificationPredictOnValidation(row, 2, 169))

print(f"{metrics.accuracy_score(y_th_validation_np, y_pred)*100}%")

# My tweaked version for validation
left_end = 167
right_end = 171
y_pred = []
for row in x_validation_np:
  y_pred.append(KNNClassificationValidationPredict(row, 2, left_end, right_end+1, 2))

results = []
y_pred = np.array([np.array(y) for y in y_pred])
for i in range(len(y_pred[0])):
  y = y_pred[:, i]
  accuracy = metrics.accuracy_score(y_th_validation_np, y)
  results.append((left_end+i*2, accuracy*100))
print(results)

Answer 1 (by t1qtbnec)

The problem

Your issue is with how `np.argpartition` works.
The documentation states that `np.argpartition(array, kth)` returns `indices` into `array` ordered so that the element `array[indices][kth]` is in its final sorted position, every element before it (`array[indices][:kth]`) is smaller, and every element after it (`array[indices][kth+1:]`) is larger. But the *order* of the elements within `array[indices][:kth]` and within `array[indices][kth+1:]` is not guaranteed; it can be essentially random.
So what happens when you increase `kth` from 169 to 172? `array[indices][169]` is no longer pinned to its sorted position; it lands somewhere random inside `array[indices][:172]`.
Moreover, some values that were previously guaranteed to be contained in `array[indices][:169]` are now only guaranteed to be contained in `array[indices][:172]`. Some of them may fall into the range `array[indices][169:172]` and be replaced by values that were previously guaranteed to be in `array[indices][169+1:]`.
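The difference between a scalar `kth` and an array `kth` can be seen on a small example (the array and positions here are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.permutation(100)  # distinct values 0..99 in random order

# Scalar kth: only position 50 is guaranteed to be in its sorted place,
# with everything before it smaller and everything after it larger.
idx = np.argpartition(a, 50)
assert a[idx][50] == np.sort(a)[50]
assert (a[idx][:50] < a[idx][50]).all()
assert (a[idx][51:] > a[idx][50]).all()
# In particular, for k < 50 the first k entries are generally NOT the
# k smallest values: the internal order of the prefix is unspecified.

# Array kth: every position in the range is in its final sorted place,
# so any prefix a[idx2][:k] with 10 <= k <= 50 is exactly the k smallest.
idx2 = np.argpartition(a, range(10, 51))
for k in range(10, 51):
    assert set(a[idx2][:k]) == set(np.sort(a)[:k])
```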
The important part is that when you call

indexes = np.argpartition(distances, end)[:end+1]

with `end=172`, then in

[np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end+1, step)]

some of the `indexes[i]` will be wrong whenever `k < end`.

The solution

`np.argpartition` accepts `kth` as an array. When `kth` is an array, the returned `indices` guarantee that all of the elements `array[indices][kth]` are in their final sorted positions. We can do this for the whole problematic range `left_end:right_end` without much impact on performance:

def KNNClassificationValidationPredict(row, norm_type, start, end, step):
    distances = np.linalg.norm(x_validation_train_np-row, ord=norm_type, axis=1)
    indexes = np.argpartition(distances, range(start, end+1))[:end+1]
    return [np.argmax(np.bincount([y_th_validation_train_np[indexes[i]] for i in range(k)])) for k in range(start, end+1, step)]

left_end = 167
right_end = 171
y_pred = []
for row in x_validation_np:
    y_pred.append(KNNClassificationValidationPredict(row, 2, left_end, right_end, 2))

results = []
y_pred = np.array([np.array(y) for y in y_pred])
for i in range(len(y_pred[0])):
    y = y_pred[:, i]
    accuracy = metrics.accuracy_score(y_th_validation_np, y)
    results.append((left_end+i*2, accuracy*100))
print(results)

which outputs:

[(167, 58.48967365599874), (169, 58.600031530821376), (171, 58.44237742393189)]
