I have a DataFrame of employee expense claims:
import pandas as pd
data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
        'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
        'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges', 'Phone Charges', 'Transport', 'Transport'],
        'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}
df = pd.DataFrame(data)
Output:
Claim ID User Category Amount
1 John Meal 12.0
2 John Meal 13.0
3 Jake Stationary 20.0
4 Bob Phone Charges 30.0
5 Bob Phone Charges 30.0
6 Tom Transport 60.0
7 Tom Transport 60.0
I used the following code to find duplicate claims based on User, Category, and Amount, and to assign a unique group number to each set of duplicates found:
# Tag each duplicate set with a unique number
conditions = ['User', 'Amount', 'Category']
df['Group'] = df.groupby(conditions).ngroup().add(1)
# Then remove groups with only one row
df = df[df.groupby('Group')['Group'].transform('count') > 1]
Output:
Claim ID User Category Amount Group
4 Bob Phone Charges 30.0 1
5 Bob Phone Charges 30.0 1
6 Tom Transport 60.0 5
7 Tom Transport 60.0 5
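Now, my problem is that I want to find duplicates that share the same User and Category but do not necessarily have exactly the same Amount. I would like to allow a tolerance of a few dollars on the claim amount, say about $1. Using the sample DataFrame above, the expected output would look like this: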
Claim ID User Category Amount Group
1 John Meal 12.0 1
2 John Meal 13.0 1
4 Bob Phone Charges 30.0 2
5 Bob Phone Charges 30.0 2
6 Tom Transport 60.0 3
7 Tom Transport 60.0 3
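To make the requirement concrete, here is a minimal sketch of one possible reading of it, not a definitive implementation. It assumes that claims are sorted by Amount within each (User, Category) pair and that a new group starts whenever the gap to the previous amount exceeds the tolerance. The names TOLERANCE, keys, gap, new_cluster, and out are hypothetical and do not appear in the original code.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Claim ID': [1, 2, 3, 4, 5, 6, 7],
    'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
    'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges',
                 'Phone Charges', 'Transport', 'Transport'],
    'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60],
})

TOLERANCE = 1.0  # assumed tolerance in dollars (hypothetical name)
keys = ['User', 'Category']

# Sort so that similar amounts are adjacent within each (User, Category) pair.
out = df.sort_values(keys + ['Amount']).copy()

# Gap to the previous amount within the same pair; NaN on the first row of a pair.
gap = out.groupby(keys)['Amount'].diff()

# A new cluster starts on the first row of a pair, or when the gap exceeds the tolerance.
new_cluster = gap.isna() | (gap > TOLERANCE)
out['Group'] = new_cluster.cumsum()

# Keep only clusters with more than one row.
out = out[out.groupby('Group')['Group'].transform('count') > 1].copy()

# Restore Claim ID order and renumber the groups 1..n in order of appearance.
out = out.sort_values('Claim ID')
out['Group'] = pd.factorize(out['Group'])[0] + 1

print(out)
```

Note that this chaining is transitive: amounts of 12, 13, and 14 would all land in one group even though 12 and 14 differ by more than the tolerance. Whether that is acceptable depends on how "within $1" is meant to be interpreted.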
1 Answer
zzlelutf #1
I don't know whether this is the fastest approach, but it works, and it handles fuzzy conditions such as a tolerance well: