Return when last event occurred in Dataframe by Group.

huangapple go评论82阅读模式
英文:

Return when last event occurred in Dataframe by Group

问题

这是我的数据框架:

  1. set.seed(1)
  2. minimalsample <- data.frame(ID=rep(1:5, each= 20), round_number= rep(1:20, 5), event_occurred=rbinom(100, size=1, prob=0.2))

我想要一个新的列lagEventOccurred,告诉我事件上次发生是多少回合前,即:

  1. lagEventOccurred <- c(Inf, Inf, Inf, Inf, Inf, 1, 2, 1, 1, 2, 3, 4, 5, 6, 7)

对于前几个条目以此类推(请注意,如果事件在同一回合发生,它不会被考虑,我只想查看以前的回合),所有这些都应该按ID分组。

我尝试了多种方法,包括类似于以下内容的内容:

  1. minimalsample %>%
  2. dplyr::mutate(firststep = ifelse(event_occurred==1, round_number, Inf) ) %>%
  3. dplyr::mutate(secondstep = ifelse((lag(firststep) < round_number), lag(firststep), firststep )) %>%
  4. dplyr::mutate(thirdstep = ifelse((lag(firststep) < round_number), round_number- lag(firststep), round_number-firststep ))

但似乎我需要迭代"lag"过程20次才能得到所需的结果。我想知道是否有一个函数,它不返回向量的滞后值,而是返回满足条件的条件是多少个滞后值前发生的。

感谢您提供如何处理这个问题的任何想法。

英文:

This is my data frame:

  1. set.seed(1)
  2. minimalsample &lt;- data.frame(ID=rep(1:5, each= 20), round_number= rep(1:20, 5), event_occurred=rbinom(100, size=1, prob=0.2))
ID round_number B event_occurred
1 1 0
1 2 0
1 3 0
1 4 1
1 5 0
1 6 1
1 7 1
1 8 0
1 9 0
1 10 0
1 11 0

I want a new column lagEventOccurred which tells me how many rounds ago the event has last occurred, i.e.

  1. lagEventOccurred &lt;- c(Inf, Inf, Inf, Inf, Inf, 1, 2, 1, 1, 2, 3, 4, 5, 6, 7)

for the first few entries and so on (note that if the event occurred in the same round, it is not considered, I only want to look at previous rounds), all of this should be grouped by ID.

ID round_number B event_occurred lagEventOccurred
1 1 0 Inf
1 2 0 Inf
1 3 0 Inf
1 4 1 Inf
1 5 0 1
1 6 1 2
1 7 1 1
1 8 0 1
1 9 0 2
1 10 0 3
1 11 0 4

I tried multiple things, including something like

  1. minimalsample %&gt;% dplyr::mutate(firststep = ifelse(event_occurred==1, round_number, Inf) ) %&gt;%
  2. dplyr::mutate(secondstep = ifelse((lag(firststep) &lt; round_number), lag(firststep), firststep )) %&gt;%
  3. dplyr::mutate(thirdstep = ifelse((lag(firststep) &lt; round_number), round_number- lag(firststep), round_number-firststep ))

but it seems that I would have to iterate over the "lag" procedure 20 times to get the desired results. I wonder if there is a function that does not return the lagged value of a vector, but returns how many lags ago a condition was satisfied.

Thank you for any ideas how to approach this.

答案1

得分: 1

以下是翻译好的内容:

"I couldn't think of a more straightforward method, but this works. I have assumed that if the beginning of each 'ID' group is a sequence of zeros, it should be treated as Inf. Let me know if this is incorrect. Using dplyr() and tidyr(), the workflow is:

  1. Create copy of 'event_occurred' called 'lagEventOccurred'
  2. Within each 'ID' group, create a new 'id' value for every event using cumsum()
  3. Use fill() to define start and end of each event 'group'
  4. Lag the 'lagEventOccurred' column and use seq_along() to return lagged length since the last event occurred
  1. library(dplyr)
  2. library(tidyr)
  3. set.seed(1)
  4. minimalsample <- data.frame(ID=rep(1:5, each= 20),
  5. round_number= rep(1:20, 5),
  6. event_occurred=rbinom(100, size=1, prob=0.2))
  7. minimalsample <- minimalsample %>%
  8. mutate(lagEventOccurred = ifelse(event_occurred == 0, NA, 1)) %>%
  9. group_by(ID, lagEventOccurred) %>%
  10. mutate(lagEventOccurred = cumsum(lagEventOccurred)) %>%
  11. ungroup() %>%
  12. group_by(ID) %>%
  13. fill(lagEventOccurred) %>%
  14. mutate(lagEventOccurred = lag(lagEventOccurred)) %>%
  15. group_by(ID, lagEventOccurred) %>%
  16. mutate(lagEventOccurred = ifelse(is.na(lagEventOccurred),
  17. Inf, seq_along(lagEventOccurred))) %>%
  18. ungroup()
  19. head(minimalsample, 10)
  20. # A tibble: 10 × 4
  21. ID round_number event_occurred lagEventOccurred
  22. <int> <int> <int> <dbl>
  23. 1 1 1 0 Inf
  24. 2 1 2 0 Inf
  25. 3 1 3 0 Inf
  26. 4 1 4 1 Inf
  27. 5 1 5 0 1
  28. 6 1 6 1 2
  29. 7 1 7 1 1
  30. 8 1 8 0 1
  31. 9 1 9 0 2
  32. 10 1 10 0 3
  33. tail(minimalsample, 10)
  34. # A tibble: 10 × 4
  35. ID round_number event_occurred lagEventOccurred
  36. <int> <int> <int> <dbl>
  37. 1 5 11 0 Inf
  38. 2 5 12 0 Inf
  39. 3 5 13 0 Inf
  40. 4 5 14 1 Inf
  41. 5 5 15 0 1
  42. 6 5 16 0 2
  43. 7 5 17 0 3
  44. 8 5 18 0 4
  45. 9 5 19 1 5
  46. 10 5 20 0 1

"

英文:

I couldn't think of a more straightforward method, but this works. I have assumed that if the beginning of each "ID" group is a sequence of zeros, it should be treated as Inf. Let me know if this is incorrect. Using dplyr() and tidyr(), the workflow is:

  1. Create copy of "event_occurred" called "lagEventOccurred"
  2. Within each "ID" group, create a new 'id' value for every event using cumsum()
  3. Use fill() to define start and end of each event 'group'
  4. Lag the "lagEventOccurred" column and use seq_along() to return lagged length since last event occurred
  1. library(dplyr)
  2. library(tidyr)
  3. set.seed(1)
  4. minimalsample &lt;- data.frame(ID=rep(1:5, each= 20),
  5. round_number= rep(1:20, 5),
  6. event_occurred=rbinom(100, size=1, prob=0.2))
  7. minimalsample &lt;- minimalsample %&gt;%
  8. mutate(lagEventOccurred = ifelse(event_occurred == 0, NA, 1)) %&gt;%
  9. group_by(ID, lagEventOccurred) %&gt;%
  10. mutate(lagEventOccurred = cumsum(lagEventOccurred)) %&gt;%
  11. ungroup() %&gt;%
  12. group_by(ID) %&gt;%
  13. fill(lagEventOccurred) %&gt;%
  14. mutate(lagEventOccurred = lag(lagEventOccurred)) %&gt;%
  15. group_by(ID, lagEventOccurred) %&gt;%
  16. mutate(lagEventOccurred = ifelse(is.na(lagEventOccurred),
  17. Inf, seq_along(lagEventOccurred))) %&gt;%
  18. ungroup()
  19. head(minimalsample, 10)
  20. # A tibble: 10 &#215; 4
  21. ID round_number event_occurred lagEventOccurred
  22. &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
  23. 1 1 1 0 Inf
  24. 2 1 2 0 Inf
  25. 3 1 3 0 Inf
  26. 4 1 4 1 Inf
  27. 5 1 5 0 1
  28. 6 1 6 1 2
  29. 7 1 7 1 1
  30. 8 1 8 0 1
  31. 9 1 9 0 2
  32. 10 1 10 0 3
  33. tail(minimalsample, 10)
  34. # A tibble: 10 &#215; 4
  35. ID round_number event_occurred lagEventOccurred
  36. &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
  37. 1 5 11 0 Inf
  38. 2 5 12 0 Inf
  39. 3 5 13 0 Inf
  40. 4 5 14 1 Inf
  41. 5 5 15 0 1
  42. 6 5 16 0 2
  43. 7 5 17 0 3
  44. 8 5 18 0 4
  45. 9 5 19 1 5
  46. 10 5 20 0 1

答案2

得分: 0

以下是翻译好的内容:

创建一个类似的解决方案,涉及为每个新事件创建一个组ID,为捕获该组ID中的更改创建一个连续ID,为每个连续ID创建一个行号,最后使用if_else() + lag()语句来获得所需的结果。

  1. minimalsample %>%
  2. mutate(cum = cumsum(event_occurred),
  3. new_id = consecutive_id(cum), .by = ID) %>%
  4. mutate(row_id = row_number(), .by = c(ID, new_id)) %>%
  5. mutate(lagEventOccurred = if_else(cum == 0 | (cum == 1 & event_occurred == 1),
  6. NA_integer_,
  7. lag(row_id)), .by = ID) %>%
  8. select(ID, round_number, event_occurred, lagEventOccurred)
  9. # 一个 tibble: 100 x 4
  10. ID round_number event_occurred lagEventOccurred
  11. <int> <int> <int> <int>
  12. 1 1 1 0 NA
  13. 2 1 2 0 NA
  14. 3 1 3 0 NA
  15. 4 1 4 1 NA
  16. 5 1 5 0 1
  17. 6 1 6 1 2
  18. 7 1 7 1 1
  19. 8 1 8 0 1
  20. 9 1 9 0 2
  21. 10 1 10 0 3
  22. # 还有更多的行,请使用 `print(n = ...)` 查看更多行

如果您需要任何其他帮助,请随时告诉我。

英文:

A similar solution that involves creating a group ID for when each new event occurs, a consecutive ID to capture changes in that group ID, a row number for each consecutive ID, and finally an if_else() + lag() statement to get the desired result.

  1. minimalsample %&gt;%
  2. mutate(cum = cumsum(event_occurred),
  3. new_id = consecutive_id(cum), .by = ID) %&gt;%
  4. mutate(row_id = row_number(), .by = c(ID, new_id)) %&gt;%
  5. mutate(lagEventOccurred = if_else(cum == 0 | (cum == 1 &amp; event_occurred == 1),
  6. NA_integer_,
  7. lag(row_id)), .by = ID) %&gt;%
  8. select(ID, round_number, event_occurred, lagEventOccurred)
  9. # A tibble: 100 x 4
  10. ID round_number event_occurred lagEventOccurred
  11. &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
  12. 1 1 1 0 NA
  13. 2 1 2 0 NA
  14. 3 1 3 0 NA
  15. 4 1 4 1 NA
  16. 5 1 5 0 1
  17. 6 1 6 1 2
  18. 7 1 7 1 1
  19. 8 1 8 0 1
  20. 9 1 9 0 2
  21. 10 1 10 0 3
  22. # i 90 more rows
  23. # i Use `print(n = ...)` to see more rows

huangapple
  • 本文由 发表于 2023年5月10日 20:41:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/76218561.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定