pandas: KNN imputation of missing categorical string values in a specific DataFrame column in Python, returning the imputed values as a DataFrame

k97glaaz, posted 2023-03-16 in Python

The Gender column has some missing values, and I want to fill them in with KNN imputation. But the result is not filled in! Can anyone help?

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

data = {'ID': [1, 2, 3, 4, 5],
        'Age': [20, 25, 30, 35, 40],
        'Gender': ['M', 'F', np.nan, 'F', np.nan]}
df = pd.DataFrame(data)
imputer = KNNImputer(n_neighbors=2)
df['Gendermap'] = pd.factorize(df['Gender'])[0]
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
df['Gender_imputed'] = pd.unique(df['Gender'])[df['Gender_imputed_factorized'].astype(int)]
df

Output:

   ID  Age Gender  Gendermap  Gender_imputed_factorized Gender_imputed
0   1   20      M          0                        0.0              M
1   2   25      F          1                        1.0              F
2   3   30    NaN         -1                       -1.0            NaN
3   4   35      F          1                        1.0              F
4   5   40    NaN         -1                       -1.0            NaN

The Gender_imputed column should not contain any NaN values.


xwmevbvl  #1

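I think the problem comes from the factorize function you are using. It does not keep the missing values as NaN; it encodes them as the sentinel -1, so by the time fit_transform runs the column has nothing left to impute.
Try converting Gender to a numeric column with map instead, which leaves NaN as NaN: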

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

data = {'ID': [1, 2, 3, 4, 5],
        'Age': [20, 25, 30, 35, 40],
        'Gender': ['M', 'F', np.nan, 'F', np.nan]}
df = pd.DataFrame(data)
imputer = KNNImputer(n_neighbors=2)
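df['Gendermap'] = df['Gender'].map({'M': 0, 'F': 1})  # map keeps NaN as NaN, so the imputer can see it
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
# pd.unique returns labels in order of first appearance ('M', 'F'), which matches the codes in the map above
df['Gender_imputed'] = pd.unique(df['Gender'])[df['Gender_imputed_factorized'].astype(int)]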
df
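
The difference between factorize and map described in this answer can be checked directly. The snippet below is a minimal sketch on the same Gender values; the variable names (gender, mapped, n_missing_after_factorize, n_missing_after_map) are only illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

gender = pd.Series(['M', 'F', np.nan, 'F', np.nan], name='Gender')

# factorize returns integer codes plus the unique labels; NaN is coded as -1
codes, labels = pd.factorize(gender)
print(codes.tolist())   # [0, 1, -1, 1, -1]
print(list(labels))     # ['M', 'F']

# The coded column holds no NaN any more, so an imputer finds nothing to fill
n_missing_after_factorize = int(np.isnan(codes.astype(float)).sum())

# map leaves entries without a match (including NaN) as NaN
mapped = gender.map({'M': 0, 'F': 1})
n_missing_after_map = int(mapped.isna().sum())

print(n_missing_after_factorize, n_missing_after_map)   # 0 2
```

Because -1 is an ordinary number, KNNImputer treats it as an observed value and leaves it alone, which is why the question's output still shows -1.0 and NaN.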

d6kp6zgx  #2

Found a solution, thanks.

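# factorize encodes NaN as -1; turn the sentinel back into NaN so the imputer sees it as missing
codes, labels = pd.factorize(df['Gender'])
df['Gendermap'] = np.where(codes == -1, np.nan, codes)
imputer = KNNImputer(n_neighbors=2)
# Age supplies the distances between rows; column 1 of the output is the imputed Gendermap
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Age', 'Gendermap']])[:, 1]
# neighbours' codes are averaged, so round to the nearest label index before the lookup
idx = df['Gender_imputed_factorized'].round().astype(int).to_numpy()
df['Gender_imputed'] = labels.to_numpy()[idx]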

print(df)
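
KNNImputer averages the numeric codes of the neighbours, which only has a sensible meaning for a two-valued column like this one. As a swapped-in alternative (not something either answer uses), a KNeighborsClassifier can predict the missing label by majority vote among the nearest rows. The sketch below assumes Age is the only predictor; the names features, known and clf are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Age': [20, 25, 30, 35, 40],
                   'Gender': ['M', 'F', np.nan, 'F', np.nan]})

features = ['Age']              # assumption: Age is the only predictor used
known = df['Gender'].notna()    # rows whose label is observed

# Fit on the labelled rows, then predict a label for each unlabelled row
clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(df.loc[known, features], df.loc[known, 'Gender'])

# Keep the observed labels and fill only the gaps with the predictions
df['Gender_imputed'] = df['Gender']
df.loc[~known, 'Gender_imputed'] = clf.predict(df.loc[~known, features])

print(df)
```

With n_neighbors=2 a vote can tie when the two neighbours disagree; an odd value avoids that for a two-class column. The value 2 is kept here only to mirror the question.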
