英文:
Find First and Second Occurrence of Value Across Two Columns by Group
问题
P1 First Occ | P2 First Occ | P1 Second Occ | P2 Second Occ |
---|---|---|---|
2 | 1 | 0 | 1 |
英文:
I have a df that looks like the one below. It is sorted by Ref1 and Seq.
Ref1 | EvnNo | P1 | P2 | Seq | PP1 | PP2 |
---|---|---|---|---|---|---|
aaaa | 0 | xxx | yyy | 1 | 0 | 1 |
aaaa | 0 | xxx | yyy | 2 | 0 | 0 |
aaaa | 0 | xxx | yyy | 3 | 1 | 0 |
aaaa | 0 | xxx | yyy | 4 | 0 | 0 |
aaaa | 1 | xxx | yyy | 5 | 0 | 0 |
aaaa | 1 | xxx | yyy | 6 | 1 | 0 |
aaaa | 1 | xxx | yyy | 7 | 1 | 0 |
aaaa | 1 | xxx | yyy | 8 | 0 | 1 |
bbbb | 0 | xxx | yyy | 1 | 0 | 0 |
bbbb | 0 | xxx | yyy | 2 | 0 | 0 |
bbbb | 0 | xxx | yyy | 3 | 0 | 0 |
bbbb | 0 | xxx | yyy | 4 | 0 | 0 |
bbbb | 1 | xxx | yyy | 5 | 0 | 0 |
bbbb | 1 | xxx | yyy | 6 | 0 | 0 |
bbbb | 1 | xxx | yyy | 7 | 1 | 0 |
bbbb | 1 | xxx | yyy | 8 | 0 | 1 |
I am trying to work out how to do two things:
-
count the first occurrences of a 1 in either PP1 or PP2 grouped by Ref1 and EvNo. There may be no occurrences or there may be multiple occurrences but there will never be a 1 in both columns on the same row.
-
after the first occurrence (if any) count if there is a 1 in the other of PP1 or PP2 in the same group. Eg if the first 1 in a group was in PP1 count if the next occurrence of 1 is in PP2. If the next 1 is also in PP1 it should not be counted. There may be no further occurrences of a 1 in either column.
Output:
P1 First Occ | P2 First Occ | P1 Second Occ | P2 Second Occ |
---|---|---|---|
2 | 1 | 0 | 1 |
答案1
得分: 1
我成功地通过对数据框的分组应用单独的函数来获得结果,将结果转化为一个新的数据框并进行汇总。
在函数中的主要技巧是使用 np.where
。我将它用于列 PP1
和 PP2
的总和,以找到任何出现,并且通过检查 PP1
列中的值是否为 1
(如果是,则出现在 PP1
中,否则出现在 PP2
中,因为您说出现不能同时发生)来检查哪列提供了出现。
尽管如此,我不确定为什么您的输出中对于 P1 Second Occ
没有 1
,因为第一组(Ref1 == "aaaa"
和 EvnNo == 0
)确实显示了这一点,如果我正确理解了问题。
import numpy as np
def count_occurences(group):
result = [0] * 4
occurences = np.where(group.sum(axis=1) == 1)[0]
# 跟踪第一次出现
if len(occurences) > 0 and group.iloc[occurences[0]]["PP1"] == 1:
result[0] += 1
elif len(occurences) > 0:
result[1] += 1
# 跟踪第二次出现
if len(occurences) > 1 and group.iloc[occurences[1]]["PP1"] == 1:
if result[0] != 1:
result[2] += 1
elif len(occurences) > 1:
if result[1] != 1:
result[3] += 1
return result
occurences_df = pd.DataFrame(
df \
.groupby(["Ref1", "EvnNo"]) \
[["PP1", "PP2"]] \
.apply(count_occurences) \
.to_list(),
columns = ["P1 First Occ", "P2 First Occ", "P1 Second Occ", "P2 Second Occ"]
)
print(occurences_df.sum())
输出:
P1 First Occ 2
P2 First Occ 1
P1 Second Occ 1
P2 Second Occ 1
英文:
I managed to get the result by applying a separate function to groups of the dataframe, turning the result into a new dataframe and summarizing it.
In the function the main trick is to use np.where
. I used it on a sum of columns PP1
and PP2
to find just any occurences, and then checked which column provided the occurence just by checking if the value in PP1
column is 1
(if yes - then the occurence is in PP1
, if not - in PP2
, as you said that the occurences can not happen simultaneously).
Although, I am not sure why your output doesn't have 1
for P1 Second Occ
, because the first group (Ref1 == "aaaa" and EvnNo == 0
) shows exactly that, if I understood the question correctly.
import numpy as np
def count_occurences(group):
result = [0] * 4
occurences = np.where(group.sum(axis=1) == 1)[0]
# track first occurence
if len(occurences) > 0 and group.iloc[occurences[0]]["PP1"] == 1:
result[0] += 1
elif len(occurences) > 0:
result[1] += 1
# track second occurence
if len(occurences) > 1 and group.iloc[occurences[1]]["PP1"] == 1:
if result[0] != 1:
result[2] += 1
elif len(occurences) > 1:
if result[1] != 1:
result[3] += 1
return result
occurences_df = pd.DataFrame(
df \
.groupby(["Ref1", "EvnNo"]) \
[["PP1", "PP2"]] \
.apply(count_occurences) \
.to_list(),
columns = ["P1 First Occ", "P2 First Occ", "P1 Second Occ", "P2 Second Occ"]
)
print(occurences_df.sum())
Output:
P1 First Occ 2
P2 First Occ 1
P1 Second Occ 1
P2 Second Occ 1
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论