在每个分组中查找两列中值的第一次和第二次出现。

huangapple go评论64阅读模式
英文:

Find First and Second Occurrence of Value Across Two Columns by Group

问题

P1 First Occ P2 First Occ P1 Second Occ P2 Second Occ
2 1 0 1
英文:

I have a df that looks like the one below. It is sorted by Ref1 and Seq.

Ref1 EvnNo P1 P2 Seq PP1 PP2
aaaa 0 xxx yyy 1 0 1
aaaa 0 xxx yyy 2 0 0
aaaa 0 xxx yyy 3 1 0
aaaa 0 xxx yyy 4 0 0
aaaa 1 xxx yyy 5 0 0
aaaa 1 xxx yyy 6 1 0
aaaa 1 xxx yyy 7 1 0
aaaa 1 xxx yyy 8 0 1
bbbb 0 xxx yyy 1 0 0
bbbb 0 xxx yyy 2 0 0
bbbb 0 xxx yyy 3 0 0
bbbb 0 xxx yyy 4 0 0
bbbb 1 xxx yyy 5 0 0
bbbb 1 xxx yyy 6 0 0
bbbb 1 xxx yyy 7 1 0
bbbb 1 xxx yyy 8 0 1

I am trying to work out how to do two things:

  1. count the first occurrences of a 1 in either PP1 or PP2 grouped by Ref1 and EvNo. There may be no occurrences or there may be multiple occurrences but there will never be a 1 in both columns on the same row.

  2. after the first occurrence (if any) count if there is a 1 in the other of PP1 or PP2 in the same group. Eg if the first 1 in a group was in PP1 count if the next occurrence of 1 is in PP2. If the next 1 is also in PP1 it should not be counted. There may be no further occurrences of a 1 in either column.

Output:

P1 First Occ P2 First Occ P1 Second Occ P2 Second Occ
2 1 0 1

答案1

得分: 1

我成功地通过对数据框的分组应用单独的函数来获得结果,将结果转化为一个新的数据框并进行汇总。

在函数中的主要技巧是使用 np.where。我将它用于列 PP1PP2 的总和,以找到任何出现,并且通过检查 PP1 列中的值是否为 1(如果是,则出现在 PP1 中,否则出现在 PP2 中,因为您说出现不能同时发生)来检查哪列提供了出现。

尽管如此,我不确定为什么您的输出中对于 P1 Second Occ 没有 1,因为第一组(Ref1 == "aaaa"EvnNo == 0)确实显示了这一点,如果我正确理解了问题。

import numpy as np

def count_occurences(group):
    result = [0] * 4

    occurences = np.where(group.sum(axis=1) == 1)[0]
    
    # 跟踪第一次出现
    if len(occurences) > 0 and group.iloc[occurences[0]]["PP1"] == 1:
        result[0] += 1
    elif len(occurences) > 0:
        result[1] += 1
    
    # 跟踪第二次出现
    if len(occurences) > 1 and group.iloc[occurences[1]]["PP1"] == 1:
        if result[0] != 1:
            result[2] += 1
    elif len(occurences) > 1:
        if result[1] != 1:
            result[3] += 1
        
    return result


occurences_df = pd.DataFrame(
    df \
        .groupby(["Ref1", "EvnNo"]) \
        [["PP1", "PP2"]] \
        .apply(count_occurences) \
        .to_list(),
    columns = ["P1 First Occ", "P2 First Occ", "P1 Second Occ", "P2 Second Occ"]
)

print(occurences_df.sum())

输出:

P1 First Occ     2
P2 First Occ     1
P1 Second Occ    1
P2 Second Occ    1
英文:

I managed to get the result by applying a separate function to groups of the dataframe, turning the result into a new dataframe and summarizing it.

In the function the main trick is to use np.where. I used it on a sum of columns PP1 and PP2 to find just any occurences, and then checked which column provided the occurence just by checking if the value in PP1 column is 1 (if yes - then the occurence is in PP1, if not - in PP2, as you said that the occurences can not happen simultaneously).

Although, I am not sure why your output doesn't have 1 for P1 Second Occ, because the first group (Ref1 == "aaaa" and EvnNo == 0) shows exactly that, if I understood the question correctly.

import numpy as np

def count_occurences(group):
    result = [0] * 4

    occurences = np.where(group.sum(axis=1) == 1)[0]
    
    # track first occurence
    if len(occurences) > 0 and group.iloc[occurences[0]]["PP1"] == 1:
        result[0] += 1
    elif len(occurences) > 0:
        result[1] += 1
    
    # track second occurence
    if len(occurences) > 1 and group.iloc[occurences[1]]["PP1"] == 1:
        if result[0] != 1:
            result[2] += 1
    elif len(occurences) > 1:
        if result[1] != 1:
            result[3] += 1
        
    return result


occurences_df = pd.DataFrame(
    df \
        .groupby(["Ref1", "EvnNo"]) \
        [["PP1", "PP2"]] \
        .apply(count_occurences) \
        .to_list(),
    columns = ["P1 First Occ", "P2 First Occ", "P1 Second Occ", "P2 Second Occ"]
)

print(occurences_df.sum())

Output:

P1 First Occ     2
P2 First Occ     1
P1 Second Occ    1
P2 Second Occ    1

huangapple
  • 本文由 发表于 2023年7月12日 21:39:54
  • 转载请务必保留本文链接:https://go.coder-hub.com/76671228.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定