pandas: finding duplicates in a DataFrame with a tolerance on one column instead of an exact match

Posted by yvfmudvl on 2023-02-14 in Other

I have a DataFrame of employee expense claims:

import pandas as pd

data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
        'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
        'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges', 'Phone Charges', 'Transport', 'Transport'],
        'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}

df = pd.DataFrame(data)

Output:
     Claim ID  User       Category  Amount
            1  John           Meal    12.0
            2  John           Meal    13.0
            3  Jake     Stationary    20.0
            4   Bob  Phone Charges    30.0
            5   Bob  Phone Charges    30.0
            6   Tom      Transport    60.0
            7   Tom      Transport    60.0

I used the following code to find duplicate claims based on User, Category, and Amount, and to assign a unique group number to each set of duplicates found:

# Tag each duplicate set with a unique number
conditions = ['User', 'Amount', 'Category']
df['Group'] = df.groupby(conditions).ngroup().add(1)

# Then remove groups with only one row
df = df[df.groupby('Group')['Group'].transform('count') > 1]

Output:
 Claim ID User       Category  Amount  Group
        4  Bob  Phone Charges    30.0      1
        5  Bob  Phone Charges    30.0      1
        6  Tom      Transport    60.0      5
        7  Tom      Transport    60.0      5
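
As a side note, the exact-match step can also be written with `DataFrame.duplicated`. The snippet below is a minimal sketch rather than the code used above: the names `claims`, `keys`, and `exact_dupes` are illustrative, and it filters first and numbers the groups afterwards, so the group numbers run 1, 2 instead of 1, 5.

```python
import pandas as pd

data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
        'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
        'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges',
                     'Phone Charges', 'Transport', 'Transport'],
        'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}
claims = pd.DataFrame(data)

keys = ['User', 'Category', 'Amount']

# keep=False flags every member of a duplicate set, not only the repeats
exact_dupes = claims[claims.duplicated(subset=keys, keep=False)].copy()

# Number only the surviving sets, so the group IDs have no gaps
exact_dupes['Group'] = exact_dupes.groupby(keys).ngroup().add(1)

print(exact_dupes)
```

Numbering after filtering is a design choice; if the original group numbers matter, keep the `ngroup` call before the filter as in the code above.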

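Now my problem: I want to find duplicates that share the same User and Category, but whose Amount does not have to match exactly. I want to allow a tolerance of a few dollars on the claim amount, say about $1. With the sample DataFrame above, the expected output would look like this: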

 Claim ID  User       Category  Amount  Group
        1  John           Meal    12.0      1
        2  John           Meal    13.0      1
        4   Bob  Phone Charges    30.0      2
        5   Bob  Phone Charges    30.0      2
        6   Tom      Transport    60.0      3
        7   Tom      Transport    60.0      3
Answer #1, by zzlelutf

I don't know whether this is the fastest approach, but it works, and it handles fuzzy conditions such as a tolerance well:

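import numpy as np

# Run on the original 7-row df (before the exact-match filter in the question).
# For each row, build a mask of the rows with the same User and Category whose
# Amount is within 1 of that row's Amount. np.piecewise writes the row's Claim ID
# into every masked position; masks from later rows overwrite earlier ones.
df['group'] = np.piecewise(
    np.zeros(len(df)),
    [list((df.User.values == user) & (df.Category.values == category) & (df.Amount.values >= amount-1) & (df.Amount.values <= amount+1)) \
     for user, category, amount in zip(df.User.values, df.Category.values, df.Amount.values)],
    df['Claim ID'].values
)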

df[df.groupby('group')['group'].transform('count') > 1]

# Result:
   Claim ID  User       Category  Amount  group
0         1  John           Meal    12.0    2.0
1         2  John           Meal    13.0    2.0
3         4   Bob  Phone Charges    30.0    5.0
4         5   Bob  Phone Charges    30.0    5.0
5         6   Tom      Transport    60.0    7.0
6         7   Tom      Transport    60.0    7.0
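
The approach above builds one mask per row, so the work grows quadratically with the number of claims. Because later masks overwrite earlier ones, amounts that chain (for example 12, 13, 14) may also not end up in a single group. A possible alternative, sketched here under assumptions and not part of the answer, sorts by Amount within each User and Category and starts a new cluster wherever the gap to the previous amount exceeds the tolerance. The names `claims`, `tolerance`, `ordered`, and `near_dupes` are illustrative.

```python
import pandas as pd

data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
        'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
        'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges',
                     'Phone Charges', 'Transport', 'Transport'],
        'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}
claims = pd.DataFrame(data)

tolerance = 1.0  # assumed tolerance in dollars

# Put claims of the same User/Category next to each other, ordered by Amount
ordered = claims.sort_values(['User', 'Category', 'Amount'])

# Gap to the previous Amount within the same User/Category (NaN for the first row)
gap = ordered.groupby(['User', 'Category'])['Amount'].diff()

# A new cluster starts at the first row of each User/Category
# or where the gap exceeds the tolerance
new_cluster = gap.isna() | (gap > tolerance)
ordered['Group'] = new_cluster.cumsum()

# Keep clusters with more than one claim, restore Claim ID order,
# then renumber the groups 1, 2, ... in order of first appearance
near_dupes = ordered[ordered.groupby('Group')['Group'].transform('count') > 1]
near_dupes = near_dupes.sort_values('Claim ID').copy()
near_dupes['Group'] = pd.factorize(near_dupes['Group'])[0] + 1

print(near_dupes)
```

Note the design choice: this sketch clusters by chaining, so amounts of 12, 13, and 14 would land in one group even though 12 and 14 differ by 2. If every pair in a group must be within the tolerance, a different rule is needed. For non-integer amounts, comparing float gaps against the tolerance may also need a small epsilon or rounding.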
