How do I fill null values in a Pandas DataFrame based on the values in a different column?

vxf3dgd4 posted on 2023-08-01 in Other

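I'm currently working on a portfolio project, and I have a large DataFrame that (among many other columns) has two columns: one for "neighbourhood group" and one for "neighbourhood". The problem is that the first column contains some NaN values. I need to figure out how to fill them in with their respective city areas.
Example: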

resident   neighbourhood group    neighbourhood
John       Brooklyn               Clinton Hill
Mae        Brooklyn               Clinton Hill
Richard    Manhattan              East Harlem
Clark      Manhattan              Upper West Side
...
Claire          NaN               Clinton Hill
Susan           NaN               East Harlem

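This is just a simple example. I have hundreds of cases like this, far too many to change by hand. Do you know how to assign the right neighbourhood group to each neighbourhood?
I have tried writing several different functions to solve this, but none of them worked.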

qnzebej0 1#

I suggest building a reference dictionary that maps each neighbourhood to its neighbourhood group. Suppose this is the original DataFrame:

import pandas as pd

data = {'resident': {0: 'John', 1: 'Mae', 2: 'Richard', 3: 'Clark', 4: 'Claire', 5: 'Susan'}, 'neighbourhoodgroup': {0: 'Brooklyn', 1: 'Brooklyn', 2: 'Manhattan', 3: 'Manhattan', 4: None, 5: None}, 'neighbourhood': {0: 'ClintonHill', 1: 'ClintonHill', 2: 'EastHarlem', 3: 'UpperWestSide', 4: 'ClintonHill', 5: 'EastHarlem'}}
df = pd.DataFrame(data)

'''
  resident neighbourhoodgroup  neighbourhood
0     John           Brooklyn    ClintonHill
1      Mae           Brooklyn    ClintonHill
2  Richard          Manhattan     EastHarlem
3    Clark          Manhattan  UpperWestSide
4   Claire               None    ClintonHill
5    Susan               None     EastHarlem
'''

First, create a reference dictionary, reference, whose keys come from the neighbourhood column and whose values come from the neighbourhoodgroup column.

df_ref = df.dropna().drop_duplicates(['neighbourhoodgroup', 'neighbourhood'])
reference = {}
for k, v in list(zip(df_ref.neighbourhood, df_ref.neighbourhoodgroup)):
    reference[k] = v

'''
{'ClintonHill': 'Brooklyn',
 'EastHarlem': 'Manhattan',
 'UpperWestSide': 'Manhattan'}
'''
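The loop above can also be written as a single chained expression. The following is only a sketch of that idea, assuming the same example data as in this answer (rebuilt here so the snippet runs on its own):

```python
import pandas as pd

# Same example data as in the answer above.
df = pd.DataFrame({
    "resident": ["John", "Mae", "Richard", "Clark", "Claire", "Susan"],
    "neighbourhoodgroup": ["Brooklyn", "Brooklyn", "Manhattan", "Manhattan", None, None],
    "neighbourhood": ["ClintonHill", "ClintonHill", "EastHarlem",
                      "UpperWestSide", "ClintonHill", "EastHarlem"],
})

# Keep only rows where the group is known, keep one row per neighbourhood,
# then turn the neighbourhood -> group pairs into a plain dict.
reference = (
    df.dropna(subset=["neighbourhoodgroup"])
      .drop_duplicates("neighbourhood")
      .set_index("neighbourhood")["neighbourhoodgroup"]
      .to_dict()
)
print(reference)
```

Restricting dropna to the neighbourhoodgroup column matters on a wider DataFrame: a plain dropna() would also discard rows that have missing values in unrelated columns.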


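Next, apply the reference dictionary to the DataFrame. Note that this writes the looked-up group into a new result column and leaves the neighbourhoodgroup column itself unchanged: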

df['result'] = df.neighbourhood.apply(lambda x: reference[x])
print(df)

'''
  resident neighbourhoodgroup  neighbourhood     result
0     John           Brooklyn    ClintonHill   Brooklyn
1      Mae           Brooklyn    ClintonHill   Brooklyn
2  Richard          Manhattan     EastHarlem  Manhattan
3    Clark          Manhattan  UpperWestSide  Manhattan
4   Claire               None    ClintonHill   Brooklyn
5    Susan               None     EastHarlem  Manhattan
'''
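If the goal is to fill the missing values in neighbourhoodgroup itself rather than add a result column, one option is to combine Series.map with fillna. This is a minimal sketch, not part of the original answer. The extra row for "Zoe" in "Astoria" is a hypothetical addition, included only to show what happens when a neighbourhood never appears with a known group:

```python
import pandas as pd

df = pd.DataFrame({
    "resident": ["John", "Mae", "Richard", "Clark", "Claire", "Susan", "Zoe"],
    "neighbourhoodgroup": ["Brooklyn", "Brooklyn", "Manhattan", "Manhattan",
                           None, None, None],
    "neighbourhood": ["ClintonHill", "ClintonHill", "EastHarlem", "UpperWestSide",
                      "ClintonHill", "EastHarlem", "Astoria"],  # "Astoria" is hypothetical
})
reference = {"ClintonHill": "Brooklyn",
             "EastHarlem": "Manhattan",
             "UpperWestSide": "Manhattan"}

# Series.map yields NaN for keys missing from the dict,
# whereas reference[x] inside apply would raise a KeyError.
mapped = df["neighbourhood"].map(reference)

# fillna with a Series aligns on the index, so only missing groups are replaced.
df["neighbourhoodgroup"] = df["neighbourhoodgroup"].fillna(mapped)
print(df)
```

Rows whose neighbourhood is not in the dictionary simply stay missing, which makes them easy to find afterwards with isna().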

rsl1atfo 2#

  resident neighbourhoodgroup  neighbourhood
0     John           Brooklyn    ClintonHill
1      Mae           Brooklyn    ClintonHill
2  Richard          Manhattan     EastHarlem
3    Clark          Manhattan  UpperWestSide
4   Claire               <NA>    ClintonHill
5    Susan               <NA>     EastHarlem

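We can build a lookup from neighbourhood to neighbourhood group by dropping duplicates and NaN values and converting the result to a dictionary: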

lookup_df = df[['neighbourhoodgroup','neighbourhood']].drop_duplicates().dropna()
lookup_dict = {x.neighbourhood: x.neighbourhoodgroup for x in lookup_df.itertuples()}

{'ClintonHill': 'Brooklyn', 'EastHarlem': 'Manhattan', 'UpperWestSide': 'Manhattan'}
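A lookup dictionary like this keeps only one group per neighbourhood, so it quietly assumes that every neighbourhood belongs to exactly one neighbourhood group. On real data it may be worth checking that assumption first. The sketch below is one possible check, using the example data from this answer; the variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "resident": ["John", "Mae", "Richard", "Clark", "Claire", "Susan"],
    "neighbourhoodgroup": ["Brooklyn", "Brooklyn", "Manhattan", "Manhattan", None, None],
    "neighbourhood": ["ClintonHill", "ClintonHill", "EastHarlem",
                      "UpperWestSide", "ClintonHill", "EastHarlem"],
})

# Count distinct known groups per neighbourhood (nunique ignores NaN by default).
groups_per_neighbourhood = df.groupby("neighbourhood")["neighbourhoodgroup"].nunique()

# More than one group: the dict would silently keep only one of them.
ambiguous = groups_per_neighbourhood[groups_per_neighbourhood > 1]
# Zero groups: no known value exists to fill from.
unmapped = groups_per_neighbourhood[groups_per_neighbourhood == 0]

print(ambiguous)
print(unmapped)
```

If both results are empty, as they are for this example, the lookup is unambiguous and covers every neighbourhood.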


Then we use Series.replace() to map each neighbourhood to its neighbourhood group:

neighbourhood_fill = df.neighbourhood.replace(lookup_dict)


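Finally, we use fillna to replace the NaN values in the neighbourhoodgroup column with these looked-up values:

# Assign the result back to the column. Calling fillna(..., inplace=True) on
# df.neighbourhoodgroup may not update df when pandas Copy-on-Write is enabled.
df['neighbourhoodgroup'] = df['neighbourhoodgroup'].fillna(neighbourhood_fill)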

  resident neighbourhoodgroup  neighbourhood
0     John           Brooklyn    ClintonHill
1      Mae           Brooklyn    ClintonHill
2  Richard          Manhattan     EastHarlem
3    Clark          Manhattan  UpperWestSide
4   Claire           Brooklyn    ClintonHill
5    Susan          Manhattan     EastHarlem

yqkkidmi 3#

Another possible solution:

d = df.iloc[:, 1:].dropna().drop_duplicates()

(df.update(
    df['neighbourhood'][df['neighbourhood group'].isna()]
    .map(dict(zip(d.iloc[:,1], d.iloc[:,0])))
    .rename('neighbourhood group')))

Output:

  resident neighbourhood group    neighbourhood
0     John            Brooklyn     Clinton Hill
1      Mae            Brooklyn     Clinton Hill
2  Richard           Manhattan      East Harlem
3    Clark           Manhattan  Upper West Side
4   Claire            Brooklyn     Clinton Hill
5    Susan           Manhattan      East Harlem
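The chained expression above does several things at once. The following sketch splits it into named steps to show what each part contributes. It assumes the three-column layout from the question, with a space in the "neighbourhood group" column name; the intermediate names lookup, missing and patch are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "resident": ["John", "Mae", "Richard", "Clark", "Claire", "Susan"],
    "neighbourhood group": ["Brooklyn", "Brooklyn", "Manhattan", "Manhattan", None, None],
    "neighbourhood": ["Clinton Hill", "Clinton Hill", "East Harlem",
                      "Upper West Side", "Clinton Hill", "East Harlem"],
})

# 1. Known (group, neighbourhood) pairs, one row each.
d = df.iloc[:, 1:].dropna().drop_duplicates()

# 2. neighbourhood -> group lookup, built by column position.
lookup = dict(zip(d.iloc[:, 1], d.iloc[:, 0]))

# 3. Map only the rows whose group is missing, and name the result
#    after the column that DataFrame.update should overwrite.
missing = df["neighbourhood group"].isna()
patch = df.loc[missing, "neighbourhood"].map(lookup).rename("neighbourhood group")

# 4. update() aligns on index and column name and modifies df in place.
df.update(patch)
print(df)
```

Because steps 1 and 2 select columns by position, they depend on the two columns sitting right after resident. On a DataFrame with many other columns, selecting them by name, for example df[["neighbourhood group", "neighbourhood"]], is likely the safer choice.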
