numpy: Fastest way to assign rows to an object in a pandas groupby loop

Posted by dauxcl2d on 2023-10-19 in Other

I have two DataFrames:

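import numpy as np
import pandas as pd

# Build an empty frame whose columns are the reference breeds
df = pd.DataFrame({'A':['German Shepherd','Border Collie','Golden Retriever','Beagle','Daschund']})
df = df.T
df.columns = df.iloc[0]
df = df.drop(df.index[0])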

A   German Shepherd     Border Collie   Golden Retriever    Beagle  Daschund

df2 = pd.DataFrame({'ID':['A','A','A','B','C','C','C','C','C'],
                   'Breed':['German Shepherd','Beagle','Dashung','Border Collie',
                           'German Shepherd','Border Collie','Golden Retriever','Beagle','Daschund']})

ID  Breed
0   A   German Shepherd
1   A   Beagle
2   A   Dashung
3   B   Border Collie
4   C   German Shepherd
5   C   Border Collie
6   C   Golden Retriever
7   C   Beagle
8   C   Daschund

For each ID in df2, I want to look up its dog breeds and then update df, marking every breed that is present for that ID:

all_dogs = list(df.columns)  # reference breeds taken from df
dogs_grouped = df2.groupby('ID')
missing_dogs = []
vals = [np.nan] * len(df.columns)
for group_name, df_group in dogs_grouped:
    print(f'Cluster: {group_name}')
    cluster_dogs = sorted(set(df_group['Breed']))
    # collect unknown breeds before filtering, otherwise this list is always empty
    weird_dogs = [i for i in cluster_dogs if i not in all_dogs]
    cluster_dogs = [i for i in cluster_dogs if i in all_dogs]
    missing_dogs.append(weird_dogs)
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the new row instead
    df = pd.concat([df, pd.DataFrame([vals], index=[group_name], columns=df.columns)])
    df.loc[group_name, cluster_dogs] = 1
df = df.fillna(0)

My code works, but it is very slow on large datasets. I am iterating over a dataset of 500,000 rows, and building the resulting 4,000 x 30,000 matrix takes hours.

A   German Shepherd     Border Collie   Golden Retriever    Beagle      Daschund
A        1                    0               0                1           0
B        0                    1               0                0           0
C        1                    1               1                1           1

Surely there is a more Pythonic/pandas way to handle this?

jhdbpxl9 #1

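I think you just want pd.crosstab (if some values (columns) are missing, you can reindex the columns using the values from the first DataFrame, df):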

x = pd.crosstab(df2["ID"], df2["Breed"])
print(x)

Prints:

Breed  Beagle  Border Collie  Daschund  Dashung  German Shepherd  Golden Retriever
ID                                                                                
A           1              0         0        1                1                 0
B           0              1         0        0                0                 0
C           1              1         1        0                1                 1
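
One caveat worth noting: pd.crosstab counts occurrences, so if the same (ID, Breed) pair shows up more than once in df2, the cell holds that count rather than 1. A minimal sketch of turning counts into 0/1 flags, assuming duplicates can occur in the real data (the sample rows and the names df2_dup, counts and flags are made up for illustration):

```python
import pandas as pd

# Hypothetical sample with a duplicated (ID, Breed) pair: 'A'/'Beagle' appears twice.
df2_dup = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B'],
    'Breed': ['Beagle', 'Beagle', 'German Shepherd', 'Border Collie'],
})

counts = pd.crosstab(df2_dup['ID'], df2_dup['Breed'])  # cell = number of occurrences
flags = counts.clip(upper=1)                           # cell = 1 if present, else 0

print(counts)
print(flags)
```

If every (ID, Breed) pair is already unique, as in the sample df2 above, the clip step changes nothing and can be dropped.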

Then use .reindex:

x = x.reindex(
    columns=[
        "Some New Breed",
        "German Shepherd",
        "Border Collie",
        "Golden Retriever",
        "Beagle",
        "Daschund",
    ],
    fill_value=0,
)
print(x)

Prints:

Breed  Some New Breed  German Shepherd  Border Collie  Golden Retriever  Beagle  Daschund
ID                                                                                       
A                   0                1              0                 0       1         0
B                   0                0              1                 0       0         0
C                   0                1              1                 1       1         1
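
Rather than typing the breed list by hand, the column labels of the first DataFrame can presumably be passed straight to reindex, and the breeds that are absent from it (what the question collects in missing_dogs) can be gathered without a Python loop. A sketch under those assumptions; all_dogs stands in for df.columns, and the name unknown is chosen here for illustration:

```python
import pandas as pd

# Reference breeds; in the question these are the columns of df.
all_dogs = pd.Index(['German Shepherd', 'Border Collie', 'Golden Retriever',
                     'Beagle', 'Daschund'])

df2 = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'C', 'C'],
    'Breed': ['German Shepherd', 'Beagle', 'Dashung', 'Border Collie',
              'German Shepherd', 'Border Collie', 'Golden Retriever',
              'Beagle', 'Daschund'],
})

# Count matrix restricted to (and ordered by) the reference breeds.
x = pd.crosstab(df2['ID'], df2['Breed']).reindex(columns=all_dogs, fill_value=0)

# Breeds in df2 that are not in the reference list, grouped per ID.
unknown = df2[~df2['Breed'].isin(all_dogs)].drop_duplicates()
missing_dogs = unknown.groupby('ID')['Breed'].agg(list).to_dict()

print(x)
print(missing_dogs)
```

Here missing_dogs is a dict keyed by ID instead of the list of lists built in the question, which keeps track of which ID each unknown breed came from; IDs with no unknown breeds simply do not appear in it.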
