Find duplicates in dataframe with tolerance in one column instead of exact value

Question

I have a dataframe of expense claims made by staff:

    import pandas as pd
    data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
            'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
            'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges', 'Phone Charges', 'Transport', 'Transport'],
            'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}
    df = pd.DataFrame(data)

Output:

    Claim ID  User  Category       Amount
    1         John  Meal           12.0
    2         John  Meal           13.0
    3         Jake  Stationary     20.0
    4         Bob   Phone Charges  30.0
    5         Bob   Phone Charges  30.0
    6         Tom   Transport      60.0
    7         Tom   Transport      60.0

I used the following code to find duplicate claims based on User, Category and Amount and gave a unique group number to each set of duplicates found:

    # Tag each duplicate set with a unique number
    conditions = ['User', 'Amount', 'Category']
    df['Group'] = df.groupby(conditions).ngroup().add(1)
    # Then remove groups with only one row
    df = df[df.groupby('Group')['Group'].transform('count') > 1]

Output:

    Claim ID  User  Category       Amount  Group
    4         Bob   Phone Charges  30.0    1
    5         Bob   Phone Charges  30.0    1
    6         Tom   Transport      60.0    5
    7         Tom   Transport      60.0    5
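As a side note (my suggestion, not part of the original question): for the exact-match case, the same "keep only rows that belong to a duplicate set" filtering can also be sketched with `DataFrame.duplicated` and `keep=False`, which marks every member of each duplicate set:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Claim ID': [1, 2, 3, 4, 5, 6, 7],
                   'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
                   'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges',
                                'Phone Charges', 'Transport', 'Transport'],
                   'Amount': [12.0, 13.0, 20.0, 30.0, 30.0, 60.0, 60.0]})

conditions = ['User', 'Amount', 'Category']
# keep=False flags all rows of each duplicate set, not just the later ones
dupes = df[df.duplicated(subset=conditions, keep=False)]
print(dupes['Claim ID'].tolist())  # [4, 5, 6, 7]
```

This reproduces the filtering step above without the group counter, so it only helps when you do not need the `Group` column.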

Now, my question: I want to find duplicates with the same User and Category, but instead of requiring the exact same Amount, I want to allow a tolerance of a few dollars in the amount claimed, say around $1. Using the sample dataframe given, the expected output would look like this:

    Claim ID  User  Category       Amount  Group
    1         John  Meal           12.0    1
    2         John  Meal           13.0    1
    4         Bob   Phone Charges  30.0    2
    5         Bob   Phone Charges  30.0    2
    6         Tom   Transport      60.0    3
    7         Tom   Transport      60.0    3

Answer 1

Score: 1

I don't know if this is the fastest way, but it works, and it works well for fuzzy conditions like a tolerance:

    import numpy as np

    df['group'] = np.piecewise(
        np.zeros(len(df)),
        [list((df.User.values == user) & (df.Category.values == category) &
              (df.Amount.values >= amount - 1) & (df.Amount.values <= amount + 1))
         for user, category, amount in zip(df.User.values, df.Category.values, df.Amount.values)],
        df['Claim ID'].values
    )
    df[df.groupby('group')['group'].transform('count') > 1]

    # Result:
       Claim ID  User       Category  Amount  group
    0         1  John           Meal    12.0     2.0
    1         2  John           Meal    13.0     2.0
    3         4   Bob  Phone Charges    30.0     5.0
    4         5   Bob  Phone Charges    30.0     5.0
    5         6   Tom      Transport    60.0     7.0
    6         7   Tom      Transport    60.0     7.0

huangapple
  • Published on 2023-02-14 01:58:43
  • Please retain this link when reposting: https://go.coder-hub.com/75439595.html