Find duplicates in dataframe with tolerance in one column instead of exact value

Question

I have a dataframe of expense claims made by staff:

import pandas as pd

data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
        'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
        'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges', 'Phone Charges', 'Transport', 'Transport'],
        'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}

df = pd.DataFrame(data)

Output:
     Claim ID  User       Category  Amount
            1  John           Meal    12.0
            2  John           Meal    13.0
            3  Jake     Stationary    20.0
            4   Bob  Phone Charges    30.0
            5   Bob  Phone Charges    30.0
            6   Tom      Transport    60.0
            7   Tom      Transport    60.0

I used the following code to find duplicate claims based on User, Category and Amount, and gave each set of duplicates a unique group number:

# Tag each duplicate set with a unique number
conditions = ['User', 'Amount', 'Category']
df['Group'] = df.groupby(conditions).ngroup().add(1)

# Then remove groups with only one row
df = df[df.groupby('Group')['Group'].transform('count') > 1]

Output:
 Claim ID User       Category  Amount  Group
        4  Bob  Phone Charges    30.0      1
        5  Bob  Phone Charges    30.0      1
        6  Tom      Transport    60.0      5
        7  Tom      Transport    60.0      5
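
The group numbers above jump from 1 to 5 because ngroup() numbers every group before the single-row groups are dropped. A minimal sketch of the same exact-match step with consecutive numbering, assuming that is wanted: filter with duplicated(keep=False) first, then number what remains. The names keys and exact are illustrative and not part of the original code.

```python
import pandas as pd

data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
        'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
        'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges', 'Phone Charges', 'Transport', 'Transport'],
        'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}
df = pd.DataFrame(data)

keys = ['User', 'Category', 'Amount']

# keep=False flags every member of a duplicate set, not just the repeats
exact = df[df.duplicated(subset=keys, keep=False)].copy()

# Number only the surviving sets, so the group IDs come out as 1, 2, ...
exact['Group'] = exact.groupby(keys).ngroup().add(1)
print(exact)
```

This only handles exact matches on Amount; it does not address the tolerance requirement.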

Now I want to find duplicates that share the same User and Category but, instead of requiring exactly the same Amount, allow a tolerance of a few dollars in the amount claimed, say $1. Using the sample dataframe above, the expected output would be:

 Claim ID  User       Category  Amount  Group
        1  John           Meal    12.0      1
        2  John           Meal    13.0      1
        4   Bob  Phone Charges    30.0      2
        5   Bob  Phone Charges    30.0      2
        6   Tom      Transport    60.0      3
        7   Tom      Transport    60.0      3

Answer 1

Score: 1

I don't know whether this is the fastest way, but it works, and it handles fuzzy conditions such as a tolerance well:

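import numpy as np

# One boolean mask per claim: same User and Category, Amount within $1.
# np.piecewise writes that claim's ID into every masked row; later masks
# overwrite earlier ones, so each row keeps the ID of the last matching claim.
df['group'] = np.piecewise(
    np.zeros(len(df)),
    [list((df.User.values == user) & (df.Category.values == category) & (df.Amount.values >= amount-1) & (df.Amount.values <= amount+1))
     for user, category, amount in zip(df.User.values, df.Category.values, df.Amount.values)],
    df['Claim ID'].values
)

# Keep only groups with more than one row
df[df.groupby('group')['group'].transform('count') > 1]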

# Result:
   Claim ID  User       Category  Amount  group
0         1  John           Meal    12.0    2.0
1         2  John           Meal    13.0    2.0
3         4   Bob  Phone Charges    30.0    5.0
4         5   Bob  Phone Charges    30.0    5.0
5         6   Tom      Transport    60.0    7.0
6         7   Tom      Transport    60.0    7.0
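
The approach above builds one mask per row, so it compares every claim with every other claim. As a possible alternative (a sketch, not the answerer's method), the rows can be sorted by User, Category and Amount, with a new cluster starting whenever the gap to the previous amount exceeds the tolerance. TOLERANCE, ordered, gap, new_cluster and dupes are names assumed for illustration. Note the differing semantics: neighbours are chained, so amounts 12, 13 and 14 would all land in one cluster even though 12 and 14 differ by 2.

```python
import pandas as pd

data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
        'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
        'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges', 'Phone Charges', 'Transport', 'Transport'],
        'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}
df = pd.DataFrame(data)

TOLERANCE = 1.0  # assumed tolerance in dollars

# Sort so that nearby amounts for the same user/category sit next to each other
ordered = df.sort_values(['User', 'Category', 'Amount'])

# Gap to the previous amount within each (User, Category); NaN on the first row
gap = ordered.groupby(['User', 'Category'])['Amount'].diff()

# A new cluster starts at the first row of each (User, Category)
# or wherever the gap exceeds the tolerance
new_cluster = gap.isna() | (gap > TOLERANCE)
ordered['Group'] = new_cluster.cumsum()

# Keep clusters with more than one row and restore the original row order
dupes = ordered[ordered.groupby('Group')['Group'].transform('count') > 1].copy()
dupes = dupes.sort_index()

# Renumber the clusters 1, 2, 3, ... in order of first appearance
dupes['Group'] = pd.factorize(dupes['Group'])[0] + 1
print(dupes)
```

On the sample data this keeps claims 1, 2, 4, 5, 6 and 7 and numbers the John, Bob and Tom sets 1, 2 and 3. With amounts in cents, floating-point differences can fall marginally above or below the tolerance, so rounding the amounts or comparing integer cents may be safer.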

</details>



Posted by huangapple on 2023-02-14 01:58:43
Please keep this link when reposting: https://go.coder-hub.com/75439595.html