OK, I have two dataframes:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':['German Shepherd','Border Collie','Golden Retriever','Beagle','Daschund']})
df = df.T
df.columns = df.iloc[0]
df = df.drop(df.index[0])
A German Shepherd Border Collie Golden Retriever Beagle Daschund
df2 = pd.DataFrame({'ID':['A','A','A','B','C','C','C','C','C'],
'Breed':['German Shepherd','Beagle','Dashung','Border Collie',
'German Shepherd','Border Collie','Golden Retriever','Beagle','Daschund']})
  ID             Breed
0  A   German Shepherd
1  A            Beagle
2  A           Dashung
3  B     Border Collie
4  C   German Shepherd
5  C     Border Collie
6  C  Golden Retriever
7  C            Beagle
8  C          Daschund
I want to look up the dog breeds for each ID in df2, then update df to mark every breed that exists for that ID:
all_dogs = list(df.columns)  # assumed definition: all_dogs is not defined in the original snippet
dogs_grouped = df2.groupby('ID')
missing_dogs = []
vals = [np.nan for i in df.columns]
for group_name, df_group in dogs_grouped:
    print(f'Cluster: {group_name}')
    cluster_dogs = sorted(set(df_group['Breed']))
    # Collect unknown breeds before filtering them out of cluster_dogs
    weird_dogs = [i for i in cluster_dogs if i not in all_dogs]
    cluster_dogs = [i for i in cluster_dogs if i in all_dogs]
    missing_dogs.append(weird_dogs)
    # Note: DataFrame.append was removed in pandas 2.0, so this line needs an older pandas
    df = df.append(pd.Series(vals, index=df.columns, name=group_name))
    df.loc[group_name, cluster_dogs] = 1
df = df.fillna(0)
My code works, but it is very slow on large datasets. I am iterating over a dataset of 500,000 rows, and building a 4000 x 30,000 matrix takes hours.
A  German Shepherd  Border Collie  Golden Retriever  Beagle  Daschund
A                1              0                 0       1         0
B                0              1                 0       0         0
C                1              1                 1       1         1
There must be a more Pythonic/pandas way to handle this?
1 Answer
I think you just want pd.crosstab. If some values (columns) are missing, you can then use .reindex to reindex the columns with the values from the first frame, df.
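The answer's original code did not survive on this page, so the following is only a minimal sketch of the crosstab-then-reindex idea, not the answerer's exact code. It assumes the breed list from the question's first frame is available as a plain list (the name `breeds` is hypothetical) and that a 0/1 indicator is wanted even when an ID lists the same breed twice, which is why the counts are clipped.

```python
import pandas as pd

# Breeds of interest, spelled exactly as in the question's first frame
# (hypothetical name; in the question these are the columns of df).
breeds = ['German Shepherd', 'Border Collie', 'Golden Retriever', 'Beagle', 'Daschund']

df2 = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'C', 'C'],
    'Breed': ['German Shepherd', 'Beagle', 'Dashung', 'Border Collie',
              'German Shepherd', 'Border Collie', 'Golden Retriever',
              'Beagle', 'Daschund'],
})

# Step 1: count every (ID, Breed) pair in a single vectorized pass.
counts = pd.crosstab(df2['ID'], df2['Breed'])

# Step 2: keep only the known breeds, in the original column order.
# Breeds that never occur in df2 become all-zero columns.
result = counts.reindex(columns=breeds, fill_value=0)

# Assumption: a presence flag is wanted, so cap repeated pairs at 1.
result = result.clip(upper=1)

# Breeds in df2 that are not in the known list (the loop's "weird dogs").
unknown = sorted(set(df2['Breed']) - set(breeds))

print(result)
print(unknown)
```

On the sample data this gives the same 0/1 values as the table in the question, and it reports 'Dashung' as the only unknown breed. Because both steps are vectorized, there is no per-group loop and no repeated row appending.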