Return when last event occurred in Dataframe by Group.

huangapple go评论62阅读模式
英文:

Return when last event occurred in Dataframe by Group

问题

这是我的数据框架:

set.seed(1)
minimalsample <- data.frame(ID=rep(1:5, each= 20), round_number= rep(1:20, 5), event_occurred=rbinom(100, size=1, prob=0.2))

我想要一个新的列lagEventOccurred,告诉我事件上次发生是多少回合前,即:

lagEventOccurred <- c(Inf, Inf, Inf, Inf, Inf, 1,  2, 1, 1, 2, 3, 4, 5, 6, 7)

对于前几个条目以此类推(请注意,如果事件在同一回合发生,它不会被考虑,我只想查看以前的回合),所有这些都应该按ID分组。

我尝试了多种方法,包括类似于以下内容的内容:

minimalsample %>%
  dplyr::mutate(firststep = ifelse(event_occurred==1, round_number, Inf) )  %>%
  dplyr::mutate(secondstep = ifelse((lag(firststep) < round_number), lag(firststep), firststep )) %>%
  dplyr::mutate(thirdstep = ifelse((lag(firststep) < round_number), round_number- lag(firststep), round_number-firststep ))

但似乎我需要迭代"lag"过程20次才能得到所需的结果。我想知道是否有一个函数,它不返回向量的滞后值,而是返回满足条件的条件是多少个滞后值前发生的。

感谢您提供如何处理这个问题的任何想法。

英文:

This is my data frame:

set.seed(1)
minimalsample &lt;- data.frame(ID=rep(1:5, each= 20), round_number= rep(1:20, 5), event_occurred=rbinom(100, size=1, prob=0.2))
ID round_number B event_occurred
1 1 0
1 2 0
1 3 0
1 4 1
1 5 0
1 6 1
1 7 1
1 8 0
1 9 0
1 10 0
1 11 0

I want a new column lagEventOccurred which tells me how many rounds ago the event has last occurred, i.e.

lagEventOccurred &lt;- c(Inf, Inf, Inf, Inf, Inf, 1,  2, 1, 1, 2, 3, 4, 5, 6, 7)

for the first few entries and so on (note that if the event occurred in the same round, it is not considered, I only want to look at previous rounds), all of this should be grouped by ID.

ID round_number B event_occurred lagEventOccurred
1 1 0 Inf
1 2 0 Inf
1 3 0 Inf
1 4 1 Inf
1 5 0 1
1 6 1 2
1 7 1 1
1 8 0 1
1 9 0 2
1 10 0 3
1 11 0 4

I tried multiple things, including something like

minimalsample %&gt;% dplyr::mutate(firststep = ifelse(event_occurred==1, round_number, Inf) )  %&gt;%
  dplyr::mutate(secondstep = ifelse((lag(firststep) &lt; round_number), lag(firststep), firststep )) %&gt;%
  dplyr::mutate(thirdstep = ifelse((lag(firststep) &lt; round_number), round_number- lag(firststep), round_number-firststep ))

but it seems that I would have to iterate over the "lag" procedure 20 times to get the desired results. I wonder if there is a function that does not return the lagged value of a vector, but returns how many lags ago a condition was satisfied.

Thank you for any ideas how to approach this.

答案1

得分: 1

以下是翻译好的内容:

"I couldn't think of a more straightforward method, but this works. I have assumed that if the beginning of each 'ID' group is a sequence of zeros, it should be treated as Inf. Let me know if this is incorrect. Using dplyr() and tidyr(), the workflow is:

  1. Create copy of 'event_occurred' called 'lagEventOccurred'
  2. Within each 'ID' group, create a new 'id' value for every event using cumsum()
  3. Use fill() to define start and end of each event 'group'
  4. Lag the 'lagEventOccurred' column and use seq_along() to return lagged length since the last event occurred
library(dplyr)
library(tidyr)

set.seed(1)
minimalsample <- data.frame(ID=rep(1:5, each= 20), 
                            round_number= rep(1:20, 5), 
                            event_occurred=rbinom(100, size=1, prob=0.2))


minimalsample <- minimalsample %>%
  mutate(lagEventOccurred = ifelse(event_occurred == 0, NA, 1)) %>%
  group_by(ID, lagEventOccurred) %>%
  mutate(lagEventOccurred = cumsum(lagEventOccurred)) %>%
  ungroup() %>%
  group_by(ID) %>%
  fill(lagEventOccurred) %>%
  mutate(lagEventOccurred = lag(lagEventOccurred)) %>%
  group_by(ID, lagEventOccurred) %>%
  mutate(lagEventOccurred = ifelse(is.na(lagEventOccurred), 
                                   Inf, seq_along(lagEventOccurred))) %>%
  ungroup()


head(minimalsample, 10)
# A tibble: 10 × 4
     ID round_number event_occurred lagEventOccurred
  <int>        <int>          <int>            <dbl>
1     1            1              0              Inf
2     1            2              0              Inf
3     1            3              0              Inf
4     1            4              1              Inf
5     1            5              0                1
6     1            6              1                2
7     1            7              1                1
8     1            8              0                1
9     1            9              0                2
10    1           10              0                3

tail(minimalsample, 10)
# A tibble: 10 × 4
     ID round_number event_occurred lagEventOccurred
  <int>        <int>          <int>            <dbl>
1     5           11              0              Inf
2     5           12              0              Inf
3     5           13              0              Inf
4     5           14              1              Inf
5     5           15              0                1
6     5           16              0                2
7     5           17              0                3
8     5           18              0                4
9     5           19              1                5
10    5           20              0                1

"

英文:

I couldn't think of a more straightforward method, but this works. I have assumed that if the beginning of each "ID" group is a sequence of zeros, it should be treated as Inf. Let me know if this is incorrect. Using dplyr() and tidyr(), the workflow is:

  1. Create copy of "event_occurred" called "lagEventOccurred"
  2. Within each "ID" group, create a new 'id' value for every event using cumsum()
  3. Use fill() to define start and end of each event 'group'
  4. Lag the "lagEventOccurred" column and use seq_along() to return lagged length since last event occurred
library(dplyr)
library(tidyr)
set.seed(1)
minimalsample &lt;- data.frame(ID=rep(1:5, each= 20), 
round_number= rep(1:20, 5), 
event_occurred=rbinom(100, size=1, prob=0.2))
minimalsample &lt;- minimalsample %&gt;%
mutate(lagEventOccurred = ifelse(event_occurred == 0, NA, 1)) %&gt;%
group_by(ID, lagEventOccurred) %&gt;%
mutate(lagEventOccurred = cumsum(lagEventOccurred)) %&gt;%
ungroup() %&gt;%
group_by(ID) %&gt;%
fill(lagEventOccurred) %&gt;%
mutate(lagEventOccurred = lag(lagEventOccurred)) %&gt;%
group_by(ID, lagEventOccurred) %&gt;%
mutate(lagEventOccurred = ifelse(is.na(lagEventOccurred), 
Inf, seq_along(lagEventOccurred))) %&gt;%
ungroup()
head(minimalsample, 10)
# A tibble: 10 &#215; 4
ID round_number event_occurred lagEventOccurred
&lt;int&gt;        &lt;int&gt;          &lt;int&gt;            &lt;dbl&gt;
1     1            1              0              Inf
2     1            2              0              Inf
3     1            3              0              Inf
4     1            4              1              Inf
5     1            5              0                1
6     1            6              1                2
7     1            7              1                1
8     1            8              0                1
9     1            9              0                2
10    1           10              0                3
tail(minimalsample, 10)
# A tibble: 10 &#215; 4
ID round_number event_occurred lagEventOccurred
&lt;int&gt;        &lt;int&gt;          &lt;int&gt;            &lt;dbl&gt;
1     5           11              0              Inf
2     5           12              0              Inf
3     5           13              0              Inf
4     5           14              1              Inf
5     5           15              0                1
6     5           16              0                2
7     5           17              0                3
8     5           18              0                4
9     5           19              1                5
10    5           20              0                1

答案2

得分: 0

以下是翻译好的内容:

创建一个类似的解决方案,涉及为每个新事件创建一个组ID,为捕获该组ID中的更改创建一个连续ID,为每个连续ID创建一个行号,最后使用if_else() + lag()语句来获得所需的结果。

minimalsample %>%
mutate(cum = cumsum(event_occurred),
new_id = consecutive_id(cum), .by = ID) %>%
mutate(row_id = row_number(), .by = c(ID, new_id)) %>%
mutate(lagEventOccurred = if_else(cum == 0 | (cum == 1 & event_occurred == 1), 
NA_integer_, 
lag(row_id)), .by = ID) %>%
select(ID, round_number, event_occurred, lagEventOccurred)
# 一个 tibble: 100 x 4
ID round_number event_occurred lagEventOccurred
<int>        <int>          <int>            <int>
1     1            1              0               NA
2     1            2              0               NA
3     1            3              0               NA
4     1            4              1               NA
5     1            5              0                1
6     1            6              1                2
7     1            7              1                1
8     1            8              0                1
9     1            9              0                2
10     1           10              0                3
# 还有更多的行,请使用 `print(n = ...)` 查看更多行

如果您需要任何其他帮助,请随时告诉我。

英文:

A similar solution that involves creating a group ID for when each new event occurs, a consecutive ID to capture changes in that group ID, a row number for each consecutive ID, and finally an if_else() + lag() statement to get the desired result.

minimalsample %&gt;%
mutate(cum = cumsum(event_occurred),
new_id = consecutive_id(cum), .by = ID) %&gt;%
mutate(row_id = row_number(), .by = c(ID, new_id)) %&gt;%
mutate(lagEventOccurred = if_else(cum == 0 | (cum == 1 &amp; event_occurred == 1), 
NA_integer_, 
lag(row_id)), .by = ID) %&gt;%
select(ID, round_number, event_occurred, lagEventOccurred)
# A tibble: 100 x 4
ID round_number event_occurred lagEventOccurred
&lt;int&gt;        &lt;int&gt;          &lt;int&gt;            &lt;int&gt;
1     1            1              0               NA
2     1            2              0               NA
3     1            3              0               NA
4     1            4              1               NA
5     1            5              0                1
6     1            6              1                2
7     1            7              1                1
8     1            8              0                1
9     1            9              0                2
10     1           10              0                3
# i 90 more rows
# i Use `print(n = ...)` to see more rows

huangapple
  • 本文由 发表于 2023年5月10日 20:41:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/76218561.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定