2023年5月10日 20:41:53go评论82阅读模式

英文:

Return when last event occurred in Dataframe by Group

问题

这是我的数据框架：

set.seed(1)
minimalsample <- data.frame(ID=rep(1:5, each= 20), round_number= rep(1:20, 5), event_occurred=rbinom(100, size=1, prob=0.2))

我想要一个新的列lagEventOccurred，告诉我事件上次发生是多少回合前，即：

lagEventOccurred <- c(Inf, Inf, Inf, Inf, Inf, 1,  2, 1, 1, 2, 3, 4, 5, 6, 7)

对于前几个条目以此类推（请注意，如果事件在同一回合发生，它不会被考虑，我只想查看以前的回合），所有这些都应该按ID分组。

我尝试了多种方法，包括类似于以下内容的内容：

minimalsample %>%
  dplyr::mutate(firststep = ifelse(event_occurred==1, round_number, Inf) )  %>%
  dplyr::mutate(secondstep = ifelse((lag(firststep) < round_number), lag(firststep), firststep )) %>%
  dplyr::mutate(thirdstep = ifelse((lag(firststep) < round_number), round_number- lag(firststep), round_number-firststep ))

但似乎我需要迭代"lag"过程20次才能得到所需的结果。我想知道是否有一个函数，它不返回向量的滞后值，而是返回满足条件的条件是多少个滞后值前发生的。

感谢您提供如何处理这个问题的任何想法。

英文:

This is my data frame:

set.seed(1)
minimalsample &lt;- data.frame(ID=rep(1:5, each= 20), round_number= rep(1:20, 5), event_occurred=rbinom(100, size=1, prob=0.2))

ID	round_number B	event_occurred
1	1	0
1	2	0
1	3	0
1	4	1
1	5	0
1	6	1
1	7	1
1	8	0
1	9	0
1	10	0
1	11	0

I want a new column lagEventOccurred which tells me how many rounds ago the event has last occurred, i.e.

lagEventOccurred &lt;- c(Inf, Inf, Inf, Inf, Inf, 1,  2, 1, 1, 2, 3, 4, 5, 6, 7)

for the first few entries and so on (note that if the event occurred in the same round, it is not considered, I only want to look at previous rounds), all of this should be grouped by ID.

ID	round_number B	event_occurred	lagEventOccurred
1	1	0	Inf
1	2	0	Inf
1	3	0	Inf
1	4	1	Inf
1	5	0	1
1	6	1	2
1	7	1	1
1	8	0	1
1	9	0	2
1	10	0	3
1	11	0	4

I tried multiple things, including something like

minimalsample %&gt;% dplyr::mutate(firststep = ifelse(event_occurred==1, round_number, Inf) )  %&gt;%
  dplyr::mutate(secondstep = ifelse((lag(firststep) &lt; round_number), lag(firststep), firststep )) %&gt;%
  dplyr::mutate(thirdstep = ifelse((lag(firststep) &lt; round_number), round_number- lag(firststep), round_number-firststep ))

but it seems that I would have to iterate over the "lag" procedure 20 times to get the desired results. I wonder if there is a function that does not return the lagged value of a vector, but returns how many lags ago a condition was satisfied.

Thank you for any ideas how to approach this.

答案1

得分: 1

以下是翻译好的内容：

"I couldn't think of a more straightforward method, but this works. I have assumed that if the beginning of each 'ID' group is a sequence of zeros, it should be treated as Inf. Let me know if this is incorrect. Using dplyr() and tidyr(), the workflow is:

Create copy of 'event_occurred' called 'lagEventOccurred'
Within each 'ID' group, create a new 'id' value for every event using cumsum()
Use fill() to define start and end of each event 'group'
Lag the 'lagEventOccurred' column and use seq_along() to return lagged length since the last event occurred

library(dplyr)
library(tidyr)
set.seed(1)
minimalsample <- data.frame(ID=rep(1:5, each= 20), 
                            round_number= rep(1:20, 5), 
                            event_occurred=rbinom(100, size=1, prob=0.2))
minimalsample <- minimalsample %>%
  mutate(lagEventOccurred = ifelse(event_occurred == 0, NA, 1)) %>%
  group_by(ID, lagEventOccurred) %>%
  mutate(lagEventOccurred = cumsum(lagEventOccurred)) %>%
  ungroup() %>%
  group_by(ID) %>%
  fill(lagEventOccurred) %>%
  mutate(lagEventOccurred = lag(lagEventOccurred)) %>%
  group_by(ID, lagEventOccurred) %>%
  mutate(lagEventOccurred = ifelse(is.na(lagEventOccurred), 
                                   Inf, seq_along(lagEventOccurred))) %>%
  ungroup()
head(minimalsample, 10)
# A tibble: 10 × 4
     ID round_number event_occurred lagEventOccurred
  <int>        <int>          <int>            <dbl>
1     1            1              0              Inf
2     1            2              0              Inf
3     1            3              0              Inf
4     1            4              1              Inf
5     1            5              0                1
6     1            6              1                2
7     1            7              1                1
8     1            8              0                1
9     1            9              0                2
10    1           10              0                3
tail(minimalsample, 10)
# A tibble: 10 × 4
     ID round_number event_occurred lagEventOccurred
  <int>        <int>          <int>            <dbl>
1     5           11              0              Inf
2     5           12              0              Inf
3     5           13              0              Inf
4     5           14              1              Inf
5     5           15              0                1
6     5           16              0                2
7     5           17              0                3
8     5           18              0                4
9     5           19              1                5
10    5           20              0                1

英文:

I couldn't think of a more straightforward method, but this works. I have assumed that if the beginning of each "ID" group is a sequence of zeros, it should be treated as Inf. Let me know if this is incorrect. Using dplyr() and tidyr(), the workflow is:

Create copy of "event_occurred" called "lagEventOccurred"
Within each "ID" group, create a new 'id' value for every event using cumsum()
Use fill() to define start and end of each event 'group'
Lag the "lagEventOccurred" column and use seq_along() to return lagged length since last event occurred

library(dplyr)
library(tidyr)
set.seed(1)
minimalsample &lt;- data.frame(ID=rep(1:5, each= 20), 
round_number= rep(1:20, 5), 
event_occurred=rbinom(100, size=1, prob=0.2))
minimalsample &lt;- minimalsample %&gt;%
mutate(lagEventOccurred = ifelse(event_occurred == 0, NA, 1)) %&gt;%
group_by(ID, lagEventOccurred) %&gt;%
mutate(lagEventOccurred = cumsum(lagEventOccurred)) %&gt;%
ungroup() %&gt;%
group_by(ID) %&gt;%
fill(lagEventOccurred) %&gt;%
mutate(lagEventOccurred = lag(lagEventOccurred)) %&gt;%
group_by(ID, lagEventOccurred) %&gt;%
mutate(lagEventOccurred = ifelse(is.na(lagEventOccurred), 
Inf, seq_along(lagEventOccurred))) %&gt;%
ungroup()
head(minimalsample, 10)
# A tibble: 10 &#215; 4
ID round_number event_occurred lagEventOccurred
&lt;int&gt;        &lt;int&gt;          &lt;int&gt;            &lt;dbl&gt;
1     1            1              0              Inf
2     1            2              0              Inf
3     1            3              0              Inf
4     1            4              1              Inf
5     1            5              0                1
6     1            6              1                2
7     1            7              1                1
8     1            8              0                1
9     1            9              0                2
10    1           10              0                3
tail(minimalsample, 10)
# A tibble: 10 &#215; 4
ID round_number event_occurred lagEventOccurred
&lt;int&gt;        &lt;int&gt;          &lt;int&gt;            &lt;dbl&gt;
1     5           11              0              Inf
2     5           12              0              Inf
3     5           13              0              Inf
4     5           14              1              Inf
5     5           15              0                1
6     5           16              0                2
7     5           17              0                3
8     5           18              0                4
9     5           19              1                5
10    5           20              0                1

答案2

得分: 0

以下是翻译好的内容：

创建一个类似的解决方案，涉及为每个新事件创建一个组ID，为捕获该组ID中的更改创建一个连续ID，为每个连续ID创建一个行号，最后使用if_else() + lag()语句来获得所需的结果。

minimalsample %>%
mutate(cum = cumsum(event_occurred),
new_id = consecutive_id(cum), .by = ID) %>%
mutate(row_id = row_number(), .by = c(ID, new_id)) %>%
mutate(lagEventOccurred = if_else(cum == 0 | (cum == 1 & event_occurred == 1), 
NA_integer_, 
lag(row_id)), .by = ID) %>%
select(ID, round_number, event_occurred, lagEventOccurred)
# 一个 tibble: 100 x 4
ID round_number event_occurred lagEventOccurred
<int>        <int>          <int>            <int>
1     1            1              0               NA
2     1            2              0               NA
3     1            3              0               NA
4     1            4              1               NA
5     1            5              0                1
6     1            6              1                2
7     1            7              1                1
8     1            8              0                1
9     1            9              0                2
10     1           10              0                3
# 还有更多的行，请使用 `print(n = ...)` 查看更多行

如果您需要任何其他帮助，请随时告诉我。

英文:

A similar solution that involves creating a group ID for when each new event occurs, a consecutive ID to capture changes in that group ID, a row number for each consecutive ID, and finally an if_else() + lag() statement to get the desired result.

minimalsample %&gt;%
mutate(cum = cumsum(event_occurred),
new_id = consecutive_id(cum), .by = ID) %&gt;%
mutate(row_id = row_number(), .by = c(ID, new_id)) %&gt;%
mutate(lagEventOccurred = if_else(cum == 0 | (cum == 1 &amp; event_occurred == 1), 
NA_integer_, 
lag(row_id)), .by = ID) %&gt;%
select(ID, round_number, event_occurred, lagEventOccurred)
# A tibble: 100 x 4
ID round_number event_occurred lagEventOccurred
&lt;int&gt;        &lt;int&gt;          &lt;int&gt;            &lt;int&gt;
1     1            1              0               NA
2     1            2              0               NA
3     1            3              0               NA
4     1            4              1               NA
5     1            5              0                1
6     1            6              1                2
7     1            7              1                1
8     1            8              0                1
9     1            9              0                2
10     1           10              0                3
# i 90 more rows
# i Use `print(n = ...)` to see more rows

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Return when last event occurred in Dataframe by Group.

问题

答案1

答案2

using mutate, paste0 and case_when in iteration; problem with mutating existing variables with paste0

在另一个geom_sf层之上插入一个geom_sf层。

在R中如何为矩阵分配多个值？

SQL查询在Access中有效，但在R中使用RJDBC包时无效。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。