Find duplicates in dataframe with tolerance in one column instead of exact value

Question

I have a dataframe of expense claims made by staff:

    import pandas as pd
    data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
            'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
            'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges', 'Phone Charges', 'Transport', 'Transport'],
            'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}
    df = pd.DataFrame(data)

Output:

    Claim ID  User  Category       Amount
    1         John  Meal           12.0
    2         John  Meal           13.0
    3         Jake  Stationary     20.0
    4         Bob   Phone Charges  30.0
    5         Bob   Phone Charges  30.0
    6         Tom   Transport      60.0
    7         Tom   Transport      60.0

I used the following code to find duplicate claims based on User, Category and Amount and gave a unique group number to each set of duplicates found:

    # Tag each duplicate set with a unique number
    conditions = ['User', 'Amount', 'Category']
    df['Group'] = df.groupby(conditions).ngroup().add(1)
    # Then remove groups with only one row
    df = df[df.groupby('Group')['Group'].transform('count') > 1]

Output:

    Claim ID  User  Category       Amount  Group
    4         Bob   Phone Charges  30.0    1
    5         Bob   Phone Charges  30.0    1
    6         Tom   Transport      60.0    5
    7         Tom   Transport      60.0    5
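As a side note (my suggestion, not part of the original question): for the exact-match case, the same "keep only rows that belong to a duplicate set" filtering can also be sketched with `DataFrame.duplicated` and `keep=False`, which marks every member of each duplicate set:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Claim ID': [1, 2, 3, 4, 5, 6, 7],
                   'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
                   'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges',
                                'Phone Charges', 'Transport', 'Transport'],
                   'Amount': [12.0, 13.0, 20.0, 30.0, 30.0, 60.0, 60.0]})

conditions = ['User', 'Amount', 'Category']
# keep=False flags all rows of each duplicate set, not just the later ones
dupes = df[df.duplicated(subset=conditions, keep=False)]
print(dupes['Claim ID'].tolist())  # [4, 5, 6, 7]
```

This reproduces the filtering step above without the group counter, so it only helps when you do not need the `Group` column.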

Now, my question: I want to find duplicates with the same User and Category, but instead of requiring the exact same Amount, I want to allow a tolerance of a few dollars in the amount claimed, say around $1. Using the sample dataframe given, the expected output would look like this:

    Claim ID  User  Category       Amount  Group
    1         John  Meal           12.0    1
    2         John  Meal           13.0    1
    4         Bob   Phone Charges  30.0    2
    5         Bob   Phone Charges  30.0    2
    6         Tom   Transport      60.0    3
    7         Tom   Transport      60.0    3

Answer 1

Score: 1

I don't know if this is the fastest way, but it works, and it works well for fuzzy conditions like a tolerance:

    import numpy as np

    df['group'] = np.piecewise(
        np.zeros(len(df)),
        [list((df.User.values == user) & (df.Category.values == category) &
              (df.Amount.values >= amount - 1) & (df.Amount.values <= amount + 1))
         for user, category, amount in zip(df.User.values, df.Category.values, df.Amount.values)],
        df['Claim ID'].values
    )
    df[df.groupby('group')['group'].transform('count') > 1]

    # Result:
       Claim ID  User       Category  Amount  group
    0         1  John           Meal    12.0     2.0
    1         2  John           Meal    13.0     2.0
    3         4   Bob  Phone Charges    30.0     5.0
    4         5   Bob  Phone Charges    30.0     5.0
    5         6   Tom      Transport    60.0     7.0
    6         7   Tom      Transport    60.0     7.0

huangapple
  • Published on 2023-02-14 01:58:43
  • Please retain this link when reposting: https://go.coder-hub.com/75439595.html