python-3.x Sklearn KNN Imputer leaves some values missing

jhiyze9q  posted on 2022-11-19  in Python

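I am trying to impute a column of NaNs with scikit-learn's KNNImputer. Everything appears to work, but I noticed that the imputed column still contains some NaNs. What could be the cause? I counted the NaNs before and after the imputation.
Note: I have updated the code to include the cleaning code that runs before the imputation.
Input: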

# Create row for both the singer and track name
train.insert(2,'Artist Track',(train['Artist Name']+ " " + train['Track Name']))

# Remove duplicates for same Artist, Song, and Class
# Sort values by Artist Track then columns with NaNs to possibly drop duplicates with NaNs
train.sort_values(by=['Artist Track','Popularity','key','instrumentalness'], inplace=True)
train.drop_duplicates(subset=['Artist Track', 'Class'], keep='first', inplace=True)

# Remove duplicates of tracks if instrumentalness duplicate is NaN
train.sort_values(by=['Artist Track','instrumentalness'], inplace=True)
dups_ins = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
ins_nans = np.where(train['instrumentalness'].isna())[0]
drop_ins = set(dups_ins).intersection(ins_nans)
train.drop(drop_ins, inplace=True)

# Remove duplicates of tracks if key duplicate is NaN
train.sort_values(by=['Artist Track','key'], inplace=True)
dups_key = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
key_nans = np.where(train['key'].isna())[0]
drop_key = set(dups_key).intersection(key_nans)
train.drop(drop_key, inplace=True)

# Remove duplicates of tracks if popularity duplicate is NaN
train.sort_values(by=['Artist Track','Popularity'], inplace=True)
dups_pop = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
pop_nans = np.where(train['Popularity'].isna())[0]
drop_pop = set(dups_pop).intersection(pop_nans)
train.drop(drop_pop, inplace=True)

train['instrumentalness'].isna().sum()

Output:

3452

Input:

from sklearn.impute import KNNImputer 
fea_transformer = KNNImputer(n_neighbors=3)
values = fea_transformer.fit_transform(train[['instrumentalness']])
train['instrumentalness'] = pd.DataFrame(values)
train['instrumentalness'].isna().sum()

Output:

472

fwzugrvs1#

First remark: you are fitting the KNN imputer on the series itself:

values = fea_transformer.fit_transform(train[['instrumentalness']])
This wastes all the information in the other features, which you could use to get a better imputation.
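To illustrate the difference, here is a minimal sketch on toy data. The columns `energy` and `loudness` are assumptions made up for the example, not columns confirmed by the question. With a single column, a row whose only feature is missing has no defined distance to any other row, so KNNImputer falls back to the column mean; with extra features it can average over genuinely similar rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data; 'energy' and 'loudness' are hypothetical helper features.
toy = pd.DataFrame({
    "instrumentalness": [0.10, np.nan, 0.80, 0.75, np.nan, 0.05],
    "energy":           [0.90, 0.85, 0.20, 0.25, 0.22, 0.95],
    "loudness":         [-4.0, -5.0, -15.0, -14.0, -16.0, -3.5],
})

# Single column: no distances can be computed for the NaN rows,
# so every gap is filled with the mean of the observed values.
single = KNNImputer(n_neighbors=2).fit_transform(toy[["instrumentalness"]])

# Several columns: neighbours are found via 'energy' and 'loudness',
# and each gap is filled from its 2 nearest rows.
multi = KNNImputer(n_neighbors=2).fit_transform(toy)

print(single[:, 0])
print(multi[:, 0])
```

In practice the features would probably need scaling first, since KNNImputer uses a Euclidean-style distance and a wide-ranging column such as the hypothetical `loudness` would otherwise dominate.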

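Second remark: your problem is not with KNNImputer but with how you assign values back to the DataFrame. When you wrap values in its own DataFrame, you create a new default index (0, 1, 2, ...) that does not align with the original index of train, which has gaps after the rows you dropped. Assignment aligns on index labels, so every original label with no match in the new index receives NaN. If you count the NaNs in values itself, you will see that values actually has no missing values.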
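The misalignment can be reproduced on a small frame. This is only a sketch: the data and the gapped index `[0, 3, 4, 7, 9]` are invented to mimic what `drop_duplicates()` and `drop()` leave behind.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with gaps in its index, as after dropping rows.
df = pd.DataFrame(
    {"instrumentalness": [0.1, np.nan, 0.8, np.nan, 0.3]},
    index=[0, 3, 4, 7, 9],
)

values = KNNImputer(n_neighbors=3).fit_transform(df[["instrumentalness"]])
nan_in_values = int(np.isnan(values).sum())   # NaNs in the imputer output

wrapped = pd.DataFrame(values)                # gets a fresh index 0..4
print(wrapped.index.tolist(), df.index.tolist())

# Assignment aligns on labels: 7 and 9 do not exist in `wrapped`,
# so those rows receive NaN even though `values` is complete.
df["instrumentalness"] = wrapped
nan_after_assign = int(df["instrumentalness"].isna().sum())

print(nan_in_values, nan_after_assign)
```

Note that the rows whose labels do match (here 3 and 4) silently receive values that belong to other positions, so the damage is not limited to the new NaNs.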
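A complete working version keeps the original index when assigning the imputed values back, so that the assignment lines up row by row.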
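One possible shape of such a version, as a hedged sketch rather than the answerer's original code: the frame below stands in for the real `train`, and the `energy` column and the `num_cols` list are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Stand-in for the cleaned training data (index has gaps after drops).
train = pd.DataFrame(
    {
        "instrumentalness": [0.1, np.nan, 0.8, np.nan, 0.3],
        "energy": [0.9, 0.8, 0.2, 0.3, 0.7],  # hypothetical extra feature
    },
    index=[0, 3, 4, 7, 9],
)
num_cols = ["instrumentalness", "energy"]

# Impute using all numeric features rather than the single column.
fea_transformer = KNNImputer(n_neighbors=3)
values = fea_transformer.fit_transform(train[num_cols])

# Rebuild the frame with the ORIGINAL index so labels line up.
train[num_cols] = pd.DataFrame(values, columns=num_cols, index=train.index)

print(train["instrumentalness"].isna().sum())
```

Assigning the raw array instead, for example `train["instrumentalness"] = values[:, 0]`, should also work, because a NumPy array carries no index and is therefore assigned by position.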
