pandas 重复数据消除和合并数据

v2g6jxz6  于 2023-08-01  发布在  其他
关注(0)|答案(2)|浏览(96)

我有一个数据集,重复,三次和更多,我希望只保留每个唯一的一个记录,合并数据,例如:

id   name     address    age      city
1     Alex    123,blv    
1     Alex               13     
3     Alex               24     Florida
1     Alex                      Miami

字符串
使用id字段合并数据:输出量:

id   name     address    age      city
1     Alex    123,blv    13       Miami
3     Alex               24     Florida

gdx19jrr

gdx19jrr1#

我已经改变了一点代码从this答案。
创建初始数据框的代码:

import pandas as pd
import numpy as np

d = {'id': [1,1,3,1], 
     'name': ["Alex", "Alex", "Alex", "Alex"], 
     'address': ["123,blv" , None, None, None], 
     'age': [None, 13, 24, None], 
     'city': [None, None, "Florida", "Miami"]
}
df = pd.DataFrame(data=d, index=d["id"])
print(df)

字符串
输出量:

id  name  address   age     city
1   1  Alex  123,blv   NaN     None
1   1  Alex     None  13.0     None
3   3  Alex     None  24.0  Florida
1   1  Alex     None   NaN    Miami


聚合代码:

def get_notnull(x):
    if x.notnull().any():
        return  x[x.notnull()]
    else:
        return np.nan

aggregation_functions = {'name': 'first', 
                         'address': get_notnull, 
                         'age': get_notnull, 
                         'city': get_notnull
}
df = df.groupby(df['id']).aggregate(aggregation_functions)
print(df)


输出量:

name  address   age     city
id                              
1   Alex  123,blv  13.0    Miami
3   Alex      NaN  24.0  Florida

rsl1atfo

rsl1atfo2#

(
    df
    .reset_index(drop=True)      # set unique index for eash record
    .drop('id', axis=1)          # exclude 'id' column from processing
    .groupby(df['id'])           # group by 'id'
    .agg(
        # return first non-NA/None value for each column
        lambda s: s.get(s.first_valid_index())
    )
    .reset_index()               # get back the 'id' value for each record
)

字符串
ps.作为选项:

df.replace([None, ''], pd.NA).groupby('id').first().reset_index()

相关问题