Find duplicates in dataframe with tolerance in one column instead of exact value
Question
I have a dataframe of expense claims made by staff:
import pandas as pd
data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
        'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
        'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges', 'Phone Charges', 'Transport', 'Transport'],
        'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}
df = pd.DataFrame(data)
Output:
Claim ID  User  Category       Amount
1         John  Meal           12.0
2         John  Meal           13.0
3         Jake  Stationary     20.0
4         Bob   Phone Charges  30.0
5         Bob   Phone Charges  30.0
6         Tom   Transport      60.0
7         Tom   Transport      60.0
I used the following code to find duplicate claims based on User, Category and Amount, and gave a unique group number to each set of duplicates found:
# Tag each duplicate set with a unique number
conditions = ['User', 'Amount', 'Category']
df['Group'] = df.groupby(conditions).ngroup().add(1)
# Then remove groups with only one row
df = df[df.groupby('Group')['Group'].transform('count') > 1]
Output:
Claim ID  User  Category       Amount  Group
4         Bob   Phone Charges  30.0    1
5         Bob   Phone Charges  30.0    1
6         Tom   Transport      60.0    5
7         Tom   Transport      60.0    5
Now my question: I want to find duplicates with the same User and Category, but instead of requiring exactly the same Amount, I want to allow a tolerance of a few dollars in the amount claimed, say about $1. Using the sample dataframe above, the expected output would look like this:
Claim ID  User  Category       Amount  Group
1         John  Meal           12.0    1
2         John  Meal           13.0    1
4         Bob   Phone Charges  30.0    2
5         Bob   Phone Charges  30.0    2
6         Tom   Transport      60.0    3
7         Tom   Transport      60.0    3
Answer 1
Score: 1
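I don't know if it is the fastest way, but it works, and it handles fuzzy conditions such as a tolerance well:
import numpy as np

# One condition per row: same User and Category, and Amount within $1 of that row's Amount.
# Rows matching a condition receive that row's Claim ID; later conditions overwrite earlier ones.
df['group'] = np.piecewise(
    np.zeros(len(df)),
    [list((df.User.values == user) & (df.Category.values == category) & (df.Amount.values >= amount-1) & (df.Amount.values <= amount+1))
     for user, category, amount in zip(df.User.values, df.Category.values, df.Amount.values)],
    df['Claim ID'].values
)
# Keep only the groups that contain more than one row
df[df.groupby('group')['group'].transform('count') > 1]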
# Result:
   Claim ID  User  Category       Amount  group
0  1         John  Meal           12.0    2.0
1  2         John  Meal           13.0    2.0
3  4         Bob   Phone Charges  30.0    5.0
4  5         Bob   Phone Charges  30.0    5.0
5  6         Tom   Transport      60.0    7.0
6  7         Tom   Transport      60.0    7.0
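As a possible alternative that avoids building one condition per row, the sketch below sorts the claims within each User and Category and starts a new group whenever the gap to the previous amount exceeds the tolerance. This is not the method from the answer above. The function name tag_near_duplicates and the TOLERANCE constant are made up for illustration, and the sketch assumes that chained amounts (for example 12, 13, 14) should all land in one group, which may or may not be what you want.

```python
import pandas as pd

data = {
    "Claim ID": [1, 2, 3, 4, 5, 6, 7],
    "User": ["John", "John", "Jake", "Bob", "Bob", "Tom", "Tom"],
    "Category": ["Meal", "Meal", "Stationary", "Phone Charges",
                 "Phone Charges", "Transport", "Transport"],
    "Amount": [12.00, 13.00, 20.00, 30, 30, 60, 60],
}
df = pd.DataFrame(data)

TOLERANCE = 1.0  # assumed tolerance in dollars (hypothetical name)


def tag_near_duplicates(frame, tolerance=TOLERANCE):
    """Group claims with the same User and Category whose sorted
    amounts differ from their neighbour by at most `tolerance`."""
    keys = ["User", "Category"]
    out = frame.sort_values(keys + ["Amount"]).copy()
    # Gap to the previous amount within the same (User, Category);
    # NaN marks the first row of each pair.
    gap = out.groupby(keys)["Amount"].diff()
    # A new cluster starts at the first row of each pair, or whenever
    # the gap to the previous amount exceeds the tolerance.
    new_cluster = gap.isna() | (gap > tolerance)
    out["Group"] = new_cluster.cumsum()
    # Keep only clusters with more than one claim.
    out = out[out.groupby("Group")["Group"].transform("count") > 1].copy()
    # Renumber the surviving clusters consecutively from 1.
    out["Group"] = out.groupby("Group").ngroup().add(1)
    return out.sort_values("Claim ID")


result = tag_near_duplicates(df)
print(result)
```

On the sample data this keeps claims 1, 2, 4, 5, 6 and 7 and drops Jake's single claim. Because the comparison is on floating-point differences, amounts with cents may need rounding, or a slightly larger tolerance, before the gap test.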