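I have a DataFrame of stores and their invoice numbers, and I need to find the invoice numbers missing from each store's sequence. For example, I need a DataFrame like this:

   Store  MissInvoice
0      A            3
1      A            4
2      A            7
3      B           21
4      B           22
5      B           25
6      B           26
7      B           27
8      B           28
9      B           29
10     C          201
11     D          205

Thanks in advance!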
wlsrxk511#
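You can use groupby.apply to compute, for each store, the set difference between the range from the min to the max invoice number and the invoices actually present. Then explode the result: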
(
    df1.astype({'Invoice': int})
       .groupby('Store')['Invoice']
       .apply(lambda s: set(range(s.min(), s.max())).difference(s))
       .explode()
       .reset_index()
)
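Note: a set has no guaranteed order. To make sure the values come out sorted, use this lambda instead:

lambda s: sorted(set(range(s.min(), s.max())).difference(s))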
Output:

   Store  Invoice
0      A        3
1      A        4
2      A        7
3      B       21
4      B       22
5      B       25
6      B       26
7      B       27
8      B       28
9      B       29
10     C      201
11     D      205
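For reference, here is a self-contained sketch of the set-difference approach. The question's input data is not shown in the thread, so this assumes the sample `df1` from the other answer (which has no store D). The helper name `missing_invoices`, the `dropna()` step, and the renamed `MissInvoice` column are illustrative additions, not part of the original answer.

```python
import pandas as pd

# Sample data borrowed from the other answer (an assumption; the
# question's own input is not shown).
df1 = pd.DataFrame({
    'Store': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Invoice': ['1', '2', '5', '6', '8', '20', '23', '24', '30', '200', '202', '203'],
})


def missing_invoices(df):  # hypothetical helper name
    """Return one row per invoice number missing between each store's min and max."""
    return (
        df.astype({'Invoice': int})
          .groupby('Store')['Invoice']
          # range() excludes s.max(), which is fine: the max is present by definition.
          .apply(lambda s: sorted(set(range(s.min(), s.max())).difference(s)))
          .explode()
          # A store with no gaps explodes to a single NaN row; drop it.
          .dropna()
          .reset_index(name='MissInvoice')
    )


result = missing_invoices(df1)
print(result)
```

Using `sorted(...)` rather than a bare set keeps the row order deterministic, and `dropna()` removes the placeholder row that `explode` would otherwise leave for a store with nothing missing.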
z9smfwbn2#
Here is one approach:
import pandas as pd
import numpy as np

# Sample data
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203']
df1['Invoice'] = df1['Invoice'].astype(int)

# Lowest and highest invoice number per store
df2 = df1.groupby('Store')['Invoice'].agg(['min','max'])

# Object-dtype placeholder column so each cell can hold an array
df2['MissInvoice'] = [[]]*len(df2)

# For each store, keep the numbers in [min, max] that are not among its invoices
for store,row in df2.iterrows():
    df2.at[store,'MissInvoice'] = np.setdiff1d(np.arange(row['min'],row['max']+1), df1.loc[df1['Store'] == store, 'Invoice'])

# One row per missing invoice
df2 = df2.explode('MissInvoice').drop(columns = ['min','max']).reset_index()
Resulting DataFrame df2:

   Store  MissInvoice
0      A            3
1      A            4
2      A            7
3      B           21
4      B           22
5      B           25
6      B           26
7      B           27
8      B           28
9      B           29
10     C          201
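As a possible alternative to the explicit loop, the same result can be sketched with an anti-join: build every candidate number between each store's min and max, then keep the candidates that have no match in the original data. This is a different technique from the one in the answer above, shown only for comparison. The names `bounds`, `full`, `merged`, and `missing` are illustrative.

```python
import pandas as pd

# Same sample data as in the answer above.
df1 = pd.DataFrame({
    'Store': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Invoice': [1, 2, 5, 6, 8, 20, 23, 24, 30, 200, 202, 203],
})

# Lowest and highest invoice number per store.
bounds = df1.groupby('Store')['Invoice'].agg(['min', 'max'])

# Every candidate (store, invoice) pair between min and max inclusive.
full = pd.DataFrame(
    [(store, n) for store, lo, hi in bounds.itertuples() for n in range(lo, hi + 1)],
    columns=['Store', 'Invoice'],
)

# Left merge with indicator: 'left_only' rows are candidates absent from df1.
merged = full.merge(df1, on=['Store', 'Invoice'], how='left', indicator=True)
missing = (
    merged.loc[merged['_merge'] == 'left_only', ['Store', 'Invoice']]
          .rename(columns={'Invoice': 'MissInvoice'})
          .reset_index(drop=True)
)
print(missing)
```

Unlike the `explode`-based versions, a store with no gaps simply contributes no rows here, so no NaN cleanup is needed. The trade-off is that the full candidate table is materialized in memory.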