Forward-fill or backfill NaN values in Pandas columns, grouped by other columns

7kqas0il  posted on 2023-01-11  in Other

I have a DataFrame like the following:

import pandas as pd

df = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],
                   'Region':['Americas','NaN','NaN','Asia','Europe','NaN','NaN'],
                   'Flower':['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                   'Animal':['Bison','NaN','Golden Eagle','Tiger','Lion','Lion','NaN'],
                   'Game':['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})

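I want to group by Country and Flower and forward-fill or backfill the missing values in the Region and Animal columns, while leaving the Game column unchanged.
I tried this, but it did not work: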

df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())

And also:

df.groupby(['Country','Flower'])['Animal', 'Region'].isna().bfill()

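How should I go about this?
One attempt did work, but it dropped the Game column.
If I use transform instead, I get a length mismatch. Also note that this is a sample DataFrame: I typed "NaN" as a string here, but in my original frame the values are np.nan.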


6ojccjat  #1

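If you change the DataFrame code so that it actually contains np.nan values, the code you provided works. Although pandas displays missing values as the text "NaN", you cannot create them by typing that text yourself: it is interpreted as a string, not as an actual missing value.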

import pandas as pd
import numpy as np

df = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],
                   'Region':['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
                   'Flower':['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                   'Animal':['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
                   'Game':['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})

Then this:

df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())

actually produces this result:

         Animal Country     Flower      Game    Region
0         Bison     USA       Rose  Baseball  Americas
1           NaN     USA       Rose  Baseball  Americas
2  Golden Eagle     MEX       Lily    soccer       NaN
3         Tiger     IND     Orchid    hockey      Asia
4          Lion      UK  Dandelion   cricket    Europe
5          Lion      UK  Dandelion   cricket    Europe
6           NaN      UK  Dandelion   cricket    Europe
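
The difference between the string 'NaN' and a real missing value can be checked directly with isna(). The following is a minimal sketch on a small made-up Series (the variable names s and cleaned are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# One real value, one string that merely looks like NaN, one real missing value.
s = pd.Series(['Americas', 'NaN', np.nan])

# Only the np.nan entry is treated as missing.
print(s.isna().tolist())        # [False, False, True]

# Converting the placeholder string into a real missing value.
cleaned = s.replace('NaN', np.nan)
print(cleaned.isna().tolist())  # [False, True, True]
```

Fill methods such as ffill and bfill only act on entries for which isna() is True, which is why the string version of the frame appears to be left untouched.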

tcomlyy6  #2

First, you need to know that 'NaN' is not NaN:

df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Region'].ffill()
Out[109]: 
0    Americas
1    Americas
2         NaN  # this group has only a single row, so it stays NaN
3        Asia
4      Europe
5      Europe
6      Europe
Name: Region, dtype: object
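
To see why row 2 stays NaN, it may help to compare a plain ffill with a grouped one. This is a small sketch using only the Country, Flower and Region columns of the sample data; the names plain and grouped are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
    'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
    'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
})

# A plain ffill ignores the groups, so the MEX row inherits the USA value.
plain = df['Region'].ffill()

# A grouped ffill only looks at earlier rows of the same (Country, Flower) group.
grouped = df.groupby(['Country', 'Flower'])['Region'].ffill()

print(plain.tolist())
print(grouped.tolist())
```

In the grouped result the MEX / Lily row has no other row in its group to copy from, so it remains missing.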

Second, if you need to chain two fill functions (bfill and ffill) within each group, you need apply:

df.update(df.groupby(['Country','Flower'], group_keys=False)[['Animal', 'Region']].apply(lambda x : x.bfill().ffill()))
df
Out[119]: 
         Animal Country     Flower      Game    Region
0         Bison     USA       Rose  Baseball  Americas
1         Bison     USA       Rose  Baseball  Americas
2  Golden Eagle     MEX       Lily    soccer       NaN
3         Tiger     IND     Orchid    hockey      Asia
4          Lion      UK  Dandelion   cricket    Europe
5          Lion      UK  Dandelion   cricket    Europe
6          Lion      UK  Dandelion   cricket    Europe
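
As an alternative to apply plus update, the same per-group bfill and ffill can probably be written with transform and assigned back to just the two columns, which leaves Game untouched. This is a sketch under the assumption that the frame holds real np.nan values; cols, keys and filled are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
    'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
    'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
    'Animal': ['Bison', np.nan, 'Golden Eagle', 'Tiger', 'Lion', 'Lion', np.nan],
    'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket'],
})

keys = ['Country', 'Flower']   # grouping columns
cols = ['Animal', 'Region']    # columns to fill

# transform returns a frame with the original index, so it can be assigned back
# directly; each column is backfilled and then forward-filled within its group.
filled = df.groupby(keys)[cols].transform(lambda s: s.bfill().ffill())
df[cols] = filled

print(df)
```

Because only the selected columns are overwritten, Country, Flower and Game keep their original values, and the single-row MEX group still has no Region to copy.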

deyfvvtc  #3

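Because the MEX / Lily group has only a single row and its Region value is NaN, fillna cannot find a suitable value within that group. If you catch the exception raised while filling with the group mode, values in groups that have nothing to fill from stay as they are. You can then apply ffill and bfill to cover the rows that had no suitable group value: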

import numpy as np
import pandas as pd

df_stack = pd.DataFrame({'Country': ['USA','USA','MEX','IND','UK','UK','UK'],
                         'Region': ['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
                         'Flower': ['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                         'Animal': ['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
                         'Game': ['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
print("-------Before imputation------")
print(df_stack)

def fillna_Region(grp):
    # Fill NaN with the group's most frequent value. For an all-NaN group,
    # mode() is empty and [0] raises, so the group is returned unchanged.
    try:
        return grp.fillna(grp.mode()[0])
    except (KeyError, IndexError) as e:
        print('Error as no corresponding group: ' + repr(e))
        return grp

df_stack["Region"] = df_stack["Region"].fillna(
    df_stack.groupby(['Country','Flower'])['Region'].transform(fillna_Region))
df_stack["Animal"] = df_stack["Animal"].fillna(
    df_stack.groupby(['Country','Flower'])['Animal'].transform(fillna_Region))
# Fallback for values that are still missing: fill from neighbouring rows of
# the whole frame. Note that this step crosses group boundaries.
df_stack = df_stack.ffill(axis=0)
df_stack = df_stack.bfill(axis=0)

print("-------After imputation------")
print(df_stack)
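
The group-mode fill can also be sketched without exception handling by checking whether mode() returned anything. This is one possible variant, not the answer's own code; fill_with_group_mode is a hypothetical helper name, and the final frame-wide ffill/bfill fallback is deliberately left out so that groups with no known value stay NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'MEX', 'IND', 'UK', 'UK', 'UK'],
    'Region': ['Americas', np.nan, np.nan, 'Asia', 'Europe', np.nan, np.nan],
    'Flower': ['Rose', 'Rose', 'Lily', 'Orchid', 'Dandelion', 'Dandelion', 'Dandelion'],
    'Animal': ['Bison', np.nan, 'Golden Eagle', 'Tiger', 'Lion', 'Lion', np.nan],
    'Game': ['Baseball', 'Baseball', 'soccer', 'hockey', 'cricket', 'cricket', 'cricket'],
})

def fill_with_group_mode(s):
    """Hypothetical helper: fill NaN in one group's column with that group's mode."""
    modes = s.mode()          # NaN is excluded by default (dropna=True)
    if modes.empty:           # all values in this group are missing
        return s              # nothing to fill from, leave the group as it is
    return s.fillna(modes.iloc[0])

keys = ['Country', 'Flower']
for col in ['Region', 'Animal']:
    df[col] = df.groupby(keys)[col].transform(fill_with_group_mode)

print(df)
```

Whether to add a frame-wide ffill/bfill afterwards depends on whether borrowing a value from a different group (for example, giving the MEX row the USA region) is acceptable for the data at hand.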
