July 12, 2023 21:39:54

Find First and Second Occurrence of Value Across Two Columns by Group

Question

I have a df that looks like the one below. It is sorted by Ref1 and Seq.

Ref1	EvnNo	P1	P2	Seq	PP1	PP2
aaaa	0	xxx	yyy	1	0	1
aaaa	0	xxx	yyy	2	0	0
aaaa	0	xxx	yyy	3	1	0
aaaa	0	xxx	yyy	4	0	0
aaaa	1	xxx	yyy	5	0	0
aaaa	1	xxx	yyy	6	1	0
aaaa	1	xxx	yyy	7	1	0
aaaa	1	xxx	yyy	8	0	1
bbbb	0	xxx	yyy	1	0	0
bbbb	0	xxx	yyy	2	0	0
bbbb	0	xxx	yyy	3	0	0
bbbb	0	xxx	yyy	4	0	0
bbbb	1	xxx	yyy	5	0	0
bbbb	1	xxx	yyy	6	0	0
bbbb	1	xxx	yyy	7	1	0
bbbb	1	xxx	yyy	8	0	1
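For anyone who wants to reproduce the problem, the table above can be rebuilt as a DataFrame. This is only a sketch: the values are copied from the table, and the intermediate lists `pp1` and `pp2` are names introduced here for readability.

```python
import pandas as pd

# PP1/PP2 flags copied from the table above, in row order.
pp1 = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
pp2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]

df = pd.DataFrame({
    "Ref1": ["aaaa"] * 8 + ["bbbb"] * 8,
    "EvnNo": [0, 0, 0, 0, 1, 1, 1, 1] * 2,
    "P1": ["xxx"] * 16,
    "P2": ["yyy"] * 16,
    "Seq": list(range(1, 9)) * 2,
    "PP1": pp1,
    "PP2": pp2,
})

# Already sorted by Ref1 and Seq, as stated in the question.
print(df)
```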

I am trying to work out how to do two things:

1. Count the first occurrences of a 1 in either PP1 or PP2, grouped by Ref1 and EvnNo. There may be no occurrences or there may be multiple occurrences, but there will never be a 1 in both columns on the same row.
2. After the first occurrence (if any), count whether there is a 1 in the other of PP1 or PP2 in the same group. E.g. if the first 1 in a group was in PP1, count it if the next occurrence of 1 is in PP2. If the next 1 is also in PP1, it should not be counted. There may be no further occurrences of a 1 in either column.

Expected output:

P1 First Occ	P2 First Occ	P1 Second Occ	P2 Second Occ
2	1	0	1

Answer 1

Score: 1

I managed to get the result by applying a separate function to groups of the dataframe, turning the result into a new dataframe and summarizing it.

The main trick in the function is np.where. I applied it to the row-wise sum of columns PP1 and PP2 to find the positions of all occurrences, and then checked which column provided each occurrence by testing whether the value in the PP1 column is 1 (if yes, the occurrence is in PP1; if not, it is in PP2, since you said the two cannot occur on the same row).
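As a small illustration of that trick on its own, here is a sketch applied to a single group. The four rows are the PP1/PP2 flags of the group Ref1 == "aaaa", EvnNo == 1 from the question; the names `row_has_one`, `positions` and `columns` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Flags for the group Ref1 == "aaaa", EvnNo == 1 from the question's table.
group = pd.DataFrame({"PP1": [0, 1, 1, 0], "PP2": [0, 0, 0, 1]})

# The row sum is 1 wherever either column holds a 1 (never both, per the question).
row_has_one = group.sum(axis=1) == 1

# np.where on a 1-D condition returns a 1-tuple; [0] is the array of positions.
positions = np.where(row_has_one)[0]

# The PP1 value at each position tells which column supplied the 1.
columns = ["PP1" if group.iloc[i]["PP1"] == 1 else "PP2" for i in positions]

print(positions.tolist(), columns)
```

Here the first two occurrences are both in PP1, so by the rules in the question this group adds to P1 First Occ but not to either second-occurrence count.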

That said, I am not sure why your expected output has no 1 for P1 Second Occ: the first group (Ref1 == "aaaa" and EvnNo == 0) shows exactly that case, if I understood the question correctly.

import numpy as np
import pandas as pd

def count_occurences(group):
    # result = [P1 first, P2 first, P1 second, P2 second]
    result = [0] * 4
    # positions of the rows where either PP1 or PP2 is 1
    occurences = np.where(group.sum(axis=1) == 1)[0]

    # track first occurrence
    if len(occurences) > 0 and group.iloc[occurences[0]]["PP1"] == 1:
        result[0] += 1
    elif len(occurences) > 0:
        result[1] += 1

    # track second occurrence; count it only if it is in the other column
    if len(occurences) > 1 and group.iloc[occurences[1]]["PP1"] == 1:
        if result[0] != 1:
            result[2] += 1
    elif len(occurences) > 1:
        if result[1] != 1:
            result[3] += 1

    return result

# df is the dataframe from the question, sorted by Ref1 and Seq
occurences_df = pd.DataFrame(
    df
        .groupby(["Ref1", "EvnNo"])
        [["PP1", "PP2"]]
        .apply(count_occurences)
        .to_list(),
    columns=["P1 First Occ", "P2 First Occ", "P1 Second Occ", "P2 Second Occ"]
)
print(occurences_df.sum())

Output:

P1 First Occ     2
P2 First Occ     1
P1 Second Occ    1
P2 Second Occ    1
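To see where each total comes from, it may help to look at the groups one by one. The following is a rough sketch rather than part of the original answer: `first_two_columns` and `per_group` are hypothetical names, and the DataFrame is rebuilt from the question's table with only the columns needed here.

```python
import numpy as np
import pandas as pd

# Only the columns needed for the check, copied from the question's table.
df = pd.DataFrame({
    "Ref1": ["aaaa"] * 8 + ["bbbb"] * 8,
    "EvnNo": [0, 0, 0, 0, 1, 1, 1, 1] * 2,
    "PP1": [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    "PP2": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

def first_two_columns(group):
    """Return the columns holding the first and second 1 (None if absent)."""
    positions = np.where(group[["PP1", "PP2"]].sum(axis=1) == 1)[0]
    cols = ["PP1" if group["PP1"].iloc[i] == 1 else "PP2" for i in positions[:2]]
    return cols + [None] * (2 - len(cols))

# Map (Ref1, EvnNo) to [column of first 1, column of second 1].
per_group = {
    (ref, int(evn)): first_two_columns(group)
    for (ref, evn), group in df.groupby(["Ref1", "EvnNo"])
}

for key, (first, second) in per_group.items():
    print(key, first, second)
```

For the group ("aaaa", 0) the first 1 is in PP2 and the next one is in PP1, which is the case counted under P1 Second Occ above.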
