英文:
Return when last event occurred in Dataframe by Group
问题
这是我的数据框架:
set.seed(1)
minimalsample <- data.frame(ID=rep(1:5, each= 20), round_number= rep(1:20, 5), event_occurred=rbinom(100, size=1, prob=0.2))
我想要一个新的列lagEventOccurred
,告诉我事件上次发生是多少回合前,即:
lagEventOccurred <- c(Inf, Inf, Inf, Inf, Inf, 1, 2, 1, 1, 2, 3, 4, 5, 6, 7)
对于前几个条目以此类推(请注意,如果事件在同一回合发生,它不会被考虑,我只想查看以前的回合),所有这些都应该按ID分组。
我尝试了多种方法,包括类似于以下内容的内容:
minimalsample %>%
dplyr::mutate(firststep = ifelse(event_occurred==1, round_number, Inf) ) %>%
dplyr::mutate(secondstep = ifelse((lag(firststep) < round_number), lag(firststep), firststep )) %>%
dplyr::mutate(thirdstep = ifelse((lag(firststep) < round_number), round_number- lag(firststep), round_number-firststep ))
但似乎我需要迭代"lag"过程20次才能得到所需的结果。我想知道是否有一个函数,它不返回向量的滞后值,而是返回满足条件的条件是多少个滞后值前发生的。
感谢您提供如何处理这个问题的任何想法。
英文:
This is my data frame:
set.seed(1)
minimalsample <- data.frame(ID=rep(1:5, each= 20), round_number= rep(1:20, 5), event_occurred=rbinom(100, size=1, prob=0.2))
ID | round_number B | event_occurred |
---|---|---|
1 | 1 | 0 |
1 | 2 | 0 |
1 | 3 | 0 |
1 | 4 | 1 |
1 | 5 | 0 |
1 | 6 | 1 |
1 | 7 | 1 |
1 | 8 | 0 |
1 | 9 | 0 |
1 | 10 | 0 |
1 | 11 | 0 |
I want a new column lagEventOccurred which tells me how many rounds ago the event has last occurred, i.e.
lagEventOccurred <- c(Inf, Inf, Inf, Inf, Inf, 1, 2, 1, 1, 2, 3, 4, 5, 6, 7)
for the first few entries and so on (note that if the event occurred in the same round, it is not considered, I only want to look at previous rounds), all of this should be grouped by ID.
ID | round_number B | event_occurred | lagEventOccurred |
---|---|---|---|
1 | 1 | 0 | Inf |
1 | 2 | 0 | Inf |
1 | 3 | 0 | Inf |
1 | 4 | 1 | Inf |
1 | 5 | 0 | 1 |
1 | 6 | 1 | 2 |
1 | 7 | 1 | 1 |
1 | 8 | 0 | 1 |
1 | 9 | 0 | 2 |
1 | 10 | 0 | 3 |
1 | 11 | 0 | 4 |
I tried multiple things, including something like
minimalsample %>% dplyr::mutate(firststep = ifelse(event_occurred==1, round_number, Inf) ) %>%
dplyr::mutate(secondstep = ifelse((lag(firststep) < round_number), lag(firststep), firststep )) %>%
dplyr::mutate(thirdstep = ifelse((lag(firststep) < round_number), round_number- lag(firststep), round_number-firststep ))
but it seems that I would have to iterate over the "lag" procedure 20 times to get the desired results. I wonder if there is a function that does not return the lagged value of a vector, but returns how many lags ago a condition was satisfied.
Thank you for any ideas how to approach this.
答案1
得分: 1
以下是翻译好的内容:
"I couldn't think of a more straightforward method, but this works. I have assumed that if the beginning of each 'ID' group is a sequence of zeros, it should be treated as Inf. Let me know if this is incorrect. Using dplyr()
and tidyr()
, the workflow is:
- Create copy of 'event_occurred' called 'lagEventOccurred'
- Within each 'ID' group, create a new 'id' value for every event using
cumsum()
- Use
fill()
to define start and end of each event 'group' - Lag the 'lagEventOccurred' column and use
seq_along()
to return lagged length since the last event occurred
library(dplyr)
library(tidyr)
set.seed(1)
minimalsample <- data.frame(ID=rep(1:5, each= 20),
round_number= rep(1:20, 5),
event_occurred=rbinom(100, size=1, prob=0.2))
minimalsample <- minimalsample %>%
mutate(lagEventOccurred = ifelse(event_occurred == 0, NA, 1)) %>%
group_by(ID, lagEventOccurred) %>%
mutate(lagEventOccurred = cumsum(lagEventOccurred)) %>%
ungroup() %>%
group_by(ID) %>%
fill(lagEventOccurred) %>%
mutate(lagEventOccurred = lag(lagEventOccurred)) %>%
group_by(ID, lagEventOccurred) %>%
mutate(lagEventOccurred = ifelse(is.na(lagEventOccurred),
Inf, seq_along(lagEventOccurred))) %>%
ungroup()
head(minimalsample, 10)
# A tibble: 10 × 4
ID round_number event_occurred lagEventOccurred
<int> <int> <int> <dbl>
1 1 1 0 Inf
2 1 2 0 Inf
3 1 3 0 Inf
4 1 4 1 Inf
5 1 5 0 1
6 1 6 1 2
7 1 7 1 1
8 1 8 0 1
9 1 9 0 2
10 1 10 0 3
tail(minimalsample, 10)
# A tibble: 10 × 4
ID round_number event_occurred lagEventOccurred
<int> <int> <int> <dbl>
1 5 11 0 Inf
2 5 12 0 Inf
3 5 13 0 Inf
4 5 14 1 Inf
5 5 15 0 1
6 5 16 0 2
7 5 17 0 3
8 5 18 0 4
9 5 19 1 5
10 5 20 0 1
"
英文:
I couldn't think of a more straightforward method, but this works. I have assumed that if the beginning of each "ID" group is a sequence of zeros, it should be treated as Inf. Let me know if this is incorrect. Using dplyr()
and tidyr()
, the workflow is:
- Create copy of "event_occurred" called "lagEventOccurred"
- Within each "ID" group, create a new 'id' value for every event using
cumsum()
- Use
fill()
to define start and end of each event 'group' - Lag the "lagEventOccurred" column and use
seq_along()
to return lagged length since last event occurred
library(dplyr)
library(tidyr)
set.seed(1)
minimalsample <- data.frame(ID=rep(1:5, each= 20),
round_number= rep(1:20, 5),
event_occurred=rbinom(100, size=1, prob=0.2))
minimalsample <- minimalsample %>%
mutate(lagEventOccurred = ifelse(event_occurred == 0, NA, 1)) %>%
group_by(ID, lagEventOccurred) %>%
mutate(lagEventOccurred = cumsum(lagEventOccurred)) %>%
ungroup() %>%
group_by(ID) %>%
fill(lagEventOccurred) %>%
mutate(lagEventOccurred = lag(lagEventOccurred)) %>%
group_by(ID, lagEventOccurred) %>%
mutate(lagEventOccurred = ifelse(is.na(lagEventOccurred),
Inf, seq_along(lagEventOccurred))) %>%
ungroup()
head(minimalsample, 10)
# A tibble: 10 × 4
ID round_number event_occurred lagEventOccurred
<int> <int> <int> <dbl>
1 1 1 0 Inf
2 1 2 0 Inf
3 1 3 0 Inf
4 1 4 1 Inf
5 1 5 0 1
6 1 6 1 2
7 1 7 1 1
8 1 8 0 1
9 1 9 0 2
10 1 10 0 3
tail(minimalsample, 10)
# A tibble: 10 × 4
ID round_number event_occurred lagEventOccurred
<int> <int> <int> <dbl>
1 5 11 0 Inf
2 5 12 0 Inf
3 5 13 0 Inf
4 5 14 1 Inf
5 5 15 0 1
6 5 16 0 2
7 5 17 0 3
8 5 18 0 4
9 5 19 1 5
10 5 20 0 1
答案2
得分: 0
以下是翻译好的内容:
创建一个类似的解决方案,涉及为每个新事件创建一个组ID,为捕获该组ID中的更改创建一个连续ID,为每个连续ID创建一个行号,最后使用if_else()
+ lag()
语句来获得所需的结果。
minimalsample %>%
mutate(cum = cumsum(event_occurred),
new_id = consecutive_id(cum), .by = ID) %>%
mutate(row_id = row_number(), .by = c(ID, new_id)) %>%
mutate(lagEventOccurred = if_else(cum == 0 | (cum == 1 & event_occurred == 1),
NA_integer_,
lag(row_id)), .by = ID) %>%
select(ID, round_number, event_occurred, lagEventOccurred)
# 一个 tibble: 100 x 4
ID round_number event_occurred lagEventOccurred
<int> <int> <int> <int>
1 1 1 0 NA
2 1 2 0 NA
3 1 3 0 NA
4 1 4 1 NA
5 1 5 0 1
6 1 6 1 2
7 1 7 1 1
8 1 8 0 1
9 1 9 0 2
10 1 10 0 3
# 还有更多的行,请使用 `print(n = ...)` 查看更多行
如果您需要任何其他帮助,请随时告诉我。
英文:
A similar solution that involves creating a group ID for when each new event occurs, a consecutive ID to capture changes in that group ID, a row number for each consecutive ID, and finally an if_else()
+ lag()
statement to get the desired result.
minimalsample %>%
mutate(cum = cumsum(event_occurred),
new_id = consecutive_id(cum), .by = ID) %>%
mutate(row_id = row_number(), .by = c(ID, new_id)) %>%
mutate(lagEventOccurred = if_else(cum == 0 | (cum == 1 & event_occurred == 1),
NA_integer_,
lag(row_id)), .by = ID) %>%
select(ID, round_number, event_occurred, lagEventOccurred)
# A tibble: 100 x 4
ID round_number event_occurred lagEventOccurred
<int> <int> <int> <int>
1 1 1 0 NA
2 1 2 0 NA
3 1 3 0 NA
4 1 4 1 NA
5 1 5 0 1
6 1 6 1 2
7 1 7 1 1
8 1 8 0 1
9 1 9 0 2
10 1 10 0 3
# i 90 more rows
# i Use `print(n = ...)` to see more rows
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论