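I am trying to impute a column of NaNs with scikit-learn's KNNImputer. Everything seemed to work, but I noticed that the imputed column still contains some NaNs. What could be the reason? I counted the NaNs before and after imputation.
Note: I have updated the code to include the cleaning code that runs before imputation.
Input: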
# Create row for both the singer and track name
train.insert(2,'Artist Track',(train['Artist Name']+ " " + train['Track Name']))
# Remove duplicates for same Artist, Song, and Class
# Sort values by Artist Track then columns with NaNs to possibly drop duplicates with NaNs
train.sort_values(by=['Artist Track','Popularity','key','instrumentalness'], inplace=True)
train.drop_duplicates(subset=['Artist Track', 'Class'], keep='first', inplace=True)
# Remove duplicates of tracks if instrumentalness duplicate is NaN
train.sort_values(by=['Artist Track','instrumentalness'], inplace=True)
dups_ins = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
ins_nans = np.where(train['instrumentalness'].isna())[0]
drop_ins = set(dups_ins).intersection(ins_nans)
train.drop(drop_ins, inplace=True)
# Remove duplicates of tracks if key duplicate is NaN
train.sort_values(by=['Artist Track','key'], inplace=True)
dups_key = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
key_nans = np.where(train['key'].isna())[0]
drop_key = set(dups_key).intersection(key_nans)
train.drop(drop_key, inplace=True)
# Remove duplicates of tracks if popularity duplicate is NaN
train.sort_values(by=['Artist Track','Popularity'], inplace=True)
dups_pop = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
pop_nans = np.where(train['Popularity'].isna())[0]
drop_pop = set(dups_pop).intersection(pop_nans)
train.drop(drop_pop, inplace=True)
train['instrumentalness'].isna().sum()
Output:
3452
Input:
from sklearn.impute import KNNImputer
fea_transformer = KNNImputer(n_neighbors=3)
values = fea_transformer.fit_transform(train[['instrumentalness']])
train['instrumentalness'] = pd.DataFrame(values)
train['instrumentalness'].isna().sum()
Output:
472
1 Answer
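First remark: you are fitting the KNN imputer on the series by itself:
values = fea_transformer.fit_transform(train[['instrumentalness']])
This wastes all the information in the other features, which you could use to get a better imputation.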
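As a rough sketch of that idea, the imputer can be fit on several numeric columns at once so that neighbours are found using the other features. The toy frame and the energy and loudness columns below are made up for illustration; they are not taken from the question's dataset, and in practice the features would probably need scaling first.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data; 'energy' and 'loudness' are hypothetical stand-ins for other numeric features.
toy = pd.DataFrame({
    "instrumentalness": [0.10, np.nan, 0.80, 0.75, np.nan],
    "energy": [0.90, 0.85, 0.20, 0.25, 0.22],
    "loudness": [-5.0, -5.5, -12.0, -11.0, -11.5],
})

imputer = KNNImputer(n_neighbors=2)

# Neighbours are chosen by distance over the columns present in both rows,
# so the other features now inform which rows donate a value.
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,  # keep the original row labels
)
```

Here the last row is filled from the two rows whose energy and loudness are closest to it, rather than from a value that ignores those features.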
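Second remark: your problem is not with KNNImputer but with how you assign values back to the DataFrame. When you wrap it in its own DataFrame, you create a new index that is not aligned with the original index, and the rows that do not line up become new NaNs. If you inspect values in your code, you will see that it actually contains no missing values.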
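The misalignment can be reproduced without the imputer at all. The snippet below is a minimal sketch on a made-up three-row frame whose index has gaps, as it would after drop() or drop_duplicates(); the array named values is a hand-written stand-in for the imputer output.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as happens after dropping rows.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0]}, index=[0, 5, 9])

# Stand-in for the imputer output: a plain array with no NaN in it.
values = np.array([[1.0], [2.0], [3.0]])
nans_in_values = int(np.isnan(values).sum())

# pd.DataFrame(values) gets a fresh 0..2 index, so only label 0 lines up with toy;
# labels 5 and 9 find no match and are filled with NaN.
toy["x"] = pd.DataFrame(values)
nans_after_assign = int(toy["x"].isna().sum())
```

The array itself is complete; the NaNs appear only at assignment time, because pandas aligns on index labels rather than on row position.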
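A complete working version keeps the original index when assigning the imputed values back, for example by building the new DataFrame with index=train.index.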
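The answer's original code did not survive on this page, so what follows is only a sketch of such a fix, under the assumption that train has a non-contiguous index after the cleaning steps. The four-row train frame here is a made-up stand-in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for `train` with a non-contiguous index.
train = pd.DataFrame(
    {"instrumentalness": [0.1, np.nan, 0.5, np.nan]},
    index=[2, 7, 11, 30],
)

fea_transformer = KNNImputer(n_neighbors=3)
values = fea_transformer.fit_transform(train[["instrumentalness"]])

# Reuse the original index so the assignment aligns row for row.
train["instrumentalness"] = pd.DataFrame(
    values, index=train.index, columns=["instrumentalness"]
)
remaining_nans = int(train["instrumentalness"].isna().sum())
```

Assigning the raw array, as in train['instrumentalness'] = values.ravel(), should work equally well, since a NumPy array is assigned by position and involves no index alignment.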