How to increment a new column based on values in a specified range [a moving window] of another column in R
Question
I would like to create new columns based on values in a moving-window range of another column. One of the new columns would be based on a moving window over previous rows of another column, while the other would be based on a moving window over subsequent rows of another column.
Let's take the following dataset as an example:
dt <- data.frame(A_LM = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                          0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
                 B_LM = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                          0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1))
I would like to create:
- a column called "A_LM_corrected" that detects 1s in A_LM and keeps only those with no other 1 in the 5 preceding rows of A_LM. It should be 1 when the condition is met and 0 otherwise,
- a column called "B_LM_corrected" with the same principle but for values in column B_LM,
- a column called "A_LM_foll" detecting when 1s are present in A_LM_corrected and only keeping those for which 1s are also present in the next 5 subsequent rows of B_LM_corrected [detecting when 1s are present in A_LM_corrected and in at least 1 of the 5 subsequent rows of B_LM_corrected], and
- a column called "B_LM_foll" with the same principle [detecting when 1s are present in B_LM_corrected and in at least 1 of the 5 subsequent rows of A_LM_corrected].
Ideally, the final data frame would be this one:
dt_aim <- data.frame(A_LM = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
                     B_LM = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                              0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1),
                     A_LM_corrected = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                                        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0),
                     B_LM_corrected = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1),
                     A_LM_foll = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                   0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0),
                     B_LM_foll = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                   0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0))
I tried to use the `tidyverse` and `data.table` to do so, but I have not managed to get what I need [I am not even close].
I know that this is quite easy to do in Excel:
- Column A [A_LM]: first value of interest in cell A2 (0 in this example)
- Column B [B_LM]: first value of interest in cell B2 (1 in this example)
- Column C [A_LM_corrected] (the range has to be adjusted by hand for the first 5 rows; afterwards the formula can simply be copied/pasted):
  - C2 = IF(A2 = 1, IF(SUM(A1:A1) > 0, 0, A2), A2)
  - C3 = IF(A3 = 1, IF(SUM(A1:A2) > 0, 0, A3), A3)
  - C4 = IF(A4 = 1, IF(SUM(A1:A3) > 0, 0, A4), A4)
  - C5 = IF(A5 = 1, IF(SUM(A1:A4) > 0, 0, A5), A5)
  - C6 = IF(A6 = 1, IF(SUM(A1:A5) > 0, 0, A6), A6)
  - C7 = IF(A7 = 1, IF(SUM(A2:A6) > 0, 0, A7), A7)
  - C8 = IF(A8 = 1, IF(SUM(A3:A7) > 0, 0, A8), A8)
- Column D [B_LM_corrected] (the range has to be adjusted by hand for the first 5 rows; afterwards the formula can simply be copied/pasted):
  - D2 = IF(B2 = 1, IF(SUM(B1:B1) > 0, 0, B2), B2)
  - D3 = IF(B3 = 1, IF(SUM(B1:B2) > 0, 0, B3), B3)
  - D4 = IF(B4 = 1, IF(SUM(B1:B3) > 0, 0, B4), B4)
  - D5 = IF(B5 = 1, IF(SUM(B1:B4) > 0, 0, B5), B5)
  - D6 = IF(B6 = 1, IF(SUM(B1:B5) > 0, 0, B6), B6)
  - D7 = IF(B7 = 1, IF(SUM(B2:B6) > 0, 0, B7), B7)
  - D8 = IF(B8 = 1, IF(SUM(B3:B7) > 0, 0, B8), B8)
- Column E [A_LM_foll] (can be copied/pasted directly, without the initial manual adjustment):
  - E2 = IF(C2 = 1, IF(SUM(D3:D7) > 0, 1, 0), 0)
- Column F [B_LM_foll] (can be copied/pasted directly, without the initial manual adjustment):
  - F2 = IF(D2 = 1, IF(SUM(C3:C7) > 0, 1, 0), 0)
But I need this to work in R.
Any help is welcome.
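For reference, the rules described above can be written as two small base-R helpers. This is only a sketch under stated assumptions: `keep_isolated()` and `followed_by()` are hypothetical names (not from any package), the window size `n = 5` is taken from the description, and the row-by-row loops favour readability over speed.

```r
# Keep a 1 only if none of the (up to) n preceding rows contains a 1.
# keep_isolated() is a hypothetical helper name.
keep_isolated <- function(x, n = 5) {
  vapply(seq_along(x), function(i) {
    prev <- utils::tail(x[seq_len(i - 1)], n)  # at most the last n earlier rows
    as.numeric(x[i] == 1 && sum(prev) == 0)
  }, numeric(1))
}

# Keep a 1 in x only if y has at least one 1 in the (up to) n following rows.
# followed_by() is a hypothetical helper name.
followed_by <- function(x, y, n = 5) {
  len <- length(y)
  vapply(seq_along(x), function(i) {
    nxt <- if (i < len) y[(i + 1):min(i + n, len)] else numeric(0)
    as.numeric(x[i] == 1 && sum(nxt) > 0)
  }, numeric(1))
}

dt <- data.frame(A_LM = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                          0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
                 B_LM = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                          0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1))

dt$A_LM_corrected <- keep_isolated(dt$A_LM)
dt$B_LM_corrected <- keep_isolated(dt$B_LM)
dt$A_LM_foll <- followed_by(dt$A_LM_corrected, dt$B_LM_corrected)
dt$B_LM_foll <- followed_by(dt$B_LM_corrected, dt$A_LM_corrected)

which(dt$A_LM_corrected == 1)  # 4 11 21
which(dt$B_LM_corrected == 1)  # 1 17 25
which(dt$A_LM_foll == 1)       # 21
which(dt$B_LM_foll == 1)       # 1 17
```

Note that the "corrected" columns look at the original column, not at the already corrected one, which is what the spreadsheet formulas do as well.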
Answer 1
Score: 4
An alternative `dplyr` method using `zoo`:
library(dplyr)
dt %>%
  mutate(
    across(
      ends_with("_LM"),
      ~ +(. > 0 & zoo::rollapplyr(. < 1, list(-(1:4)), all, partial = TRUE, fill = TRUE)),
      .names = "{.col}_corrected"),
    A_LM_foll = +(A_LM_corrected & zoo::rollapply(B_LM_corrected > 0, list(-1:4), any, partial = TRUE, align = "left")),
    B_LM_foll = +(B_LM_corrected & zoo::rollapply(A_LM_corrected > 0, list(-1:4), any, partial = TRUE, align = "left"))
  )
# A_LM B_LM A_LM_corrected B_LM_corrected A_LM_foll B_LM_foll
# 1 0 1 0 1 0 1
# 2 0 1 0 0 0 0
# 3 0 0 0 0 0 0
# 4 1 1 1 0 0 0
# 5 0 0 0 0 0 0
# 6 0 0 0 0 0 0
# 7 0 0 0 0 0 0
# 8 0 1 0 0 0 0
# 9 0 0 0 0 0 0
# 10 0 0 0 0 0 0
# 11 1 0 1 0 0 0
# 12 0 0 0 0 0 0
# 13 0 0 0 0 0 0
# 14 0 0 0 0 0 0
# 15 0 0 0 0 0 0
# 16 0 0 0 0 0 0
# 17 0 1 0 1 0 1
# 18 0 0 0 0 0 0
# 19 0 0 0 0 0 0
# 20 0 0 0 0 0 0
# 21 1 0 1 0 1 0
# 22 0 0 0 0 0 0
# 23 1 0 0 0 0 0
# 24 0 0 0 0 0 0
# 25 0 1 0 1 0 0
Many thanks to @G.Grothendieck for improving the latter two `rollapply` calls and, even more, for helping me understand the use of `list(-(1:4))` in the first `rollapplyr`. For future me: if `width=` is a numeric vector, then its widths apply to each of the `by=` elements, and if it is shorter than the data, it is recycled. However, if `width=` is a `list`, then its values are treated as offsets from the current time, so while `width=5, align="right"` means offsets of `c(-4, -3, -2, -1, 0)` from "here", `width=list(-(1:4)), align="right"` means offsets of `c(-4, -3, -2, -1)` from "here", ignoring the 0th ("here"th) element. Mind blown.
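To make the idea of offsets concrete, here is a small base-R illustration. It is not how `zoo` is implemented; `window_at()` is a hypothetical helper that only shows which elements a set of offsets selects around position `i`, assuming out-of-range positions are simply dropped.

```r
# Return the elements of x at positions i + offsets, dropping any position
# outside the vector. window_at() is a hypothetical helper for illustration.
window_at <- function(x, i, offsets) {
  pos <- i + offsets
  x[pos[pos >= 1 & pos <= length(x)]]
}

x <- 11:20

window_at(x, 6, -(4:0))  # 12 13 14 15 16: four previous values plus the current one
window_at(x, 6, -(4:1))  # 12 13 14 15: previous values only, "here" is excluded
window_at(x, 2, -(4:1))  # 11: only one previous value exists
window_at(x, 1, -(4:1))  # integer(0): nothing precedes the first element
```

In base R, `all()` of an empty logical vector is `TRUE` and `any()` of one is `FALSE`, which is worth keeping in mind whenever a window can end up empty at the edges.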
Answer 2
Score: 1
Here is one possibility using the `tidyverse`.
In particular, I use the function `map_dbl` to create helper variables `A_LM_back` and `B_LM_back` to check the previous five rows.
library(dplyr)
library(purrr)

dt %>%
  mutate(
    # 1 if any of the 5 previous rows is 1; indices clamped to 0 are dropped by R
    A_LM_back = map_dbl(1:n(), \(k) max(0, A_LM[pmax(k - 1:5, 0)])),
    B_LM_back = map_dbl(1:n(), \(k) max(0, B_LM[pmax(k - 1:5, 0)])),
  ) %>%
  mutate(
    A_LM_cor = if_else(A_LM == 1 & A_LM_back == 0, 1, 0),
    B_LM_cor = if_else(B_LM == 1 & B_LM_back == 0, 1, 0),
  ) %>%
  mutate(
    # 1 if any of the 5 following rows is 1; indices past the last row give NA
    # and are removed, so the last row never looks at itself
    A_LM_next = map_dbl(1:n(), \(k) max(0, A_LM_cor[k + 1:5], na.rm = TRUE)),
    B_LM_next = map_dbl(1:n(), \(k) max(0, B_LM_cor[k + 1:5], na.rm = TRUE)),
  ) %>%
  mutate(
    A_LM_foll = if_else(A_LM_cor == 1 & B_LM_next == 1, 1, 0),
    B_LM_foll = if_else(B_LM_cor == 1 & A_LM_next == 1, 1, 0),
  )
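The `pmax(k - 1:5, 0)` indexing relies on two base-R subsetting rules that are easy to overlook, so here is a short stand-alone illustration. The vector `x` is made-up example data, not taken from the question.

```r
x <- c(5, 0, 1, 0)

# Looking back from row 2: positions before the start are clamped to 0.
k <- 2
idx <- pmax(k - 1:5, 0)      # 1 0 0 0 0
x[idx]                       # 5: zero indices are silently dropped
max(0, x[idx])               # 5

# Looking back from row 1: every index is 0, so nothing is selected.
k <- 1
x[pmax(k - 1:5, 0)]          # numeric(0)
max(0, x[pmax(k - 1:5, 0)])  # 0: the leading 0 guards against an empty max()

# Indices past the end behave differently: they yield NA instead of vanishing.
x[c(4, 5)]                        # 0 NA
max(0, x[c(4, 5)], na.rm = TRUE)  # 0
```

So a zero index is a convenient way to skip rows before the start, while rows past the end need `na.rm = TRUE` (or clamping) to be ignored.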
Answer 3
Score: 1
Here's a `data.table` approach. [`frollsum()`](https://rdrr.io/cran/data.table/man/froll.html) takes the rolling sum of a vector. It works in a slightly surprising way at the start of the vector:
library(data.table)
x <- 1:10
SHIFT <- 5
frollsum(x, n = SHIFT, fill = 0)
# [1] 0 0 0 0 15 20 25 30 35 40
I would expect this to return triangle numbers for the first `n` values, and then increase at a constant rate of `n`. However, the first `n - 1` values, where the window is incomplete, take the value of `fill`. If you don't supply a `fill` then they will be `NA`. To get triangle numbers starting from the first index, you need to do this, which I find a little awkward:
frollsum(c(rep(0, SHIFT), x), SHIFT)[(SHIFT + 1):(length(x) + SHIFT)]
# [1] 1 3 6 10 15 20 25 30 35 40
Still, I think this should be relatively fast.
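library(data.table)
setDT(dt)
sd_cols <- c("A_LM", "B_LM")
SHIFT <- 5
# Keep a 1 only if the SHIFT previous rows (zero-padded at the start) sum to 0
dt[,
  (sprintf("%s_corrected", sd_cols)) := lapply(
    .SD,
    \(x) as.numeric(x == 1 & frollsum(c(rep(0, SHIFT), shift(x)), SHIFT, na.rm = TRUE)[(SHIFT + 1):(.N + SHIFT)] == 0)
  ),
  .SDcols = sd_cols
]
# Keep a 1 only if the other corrected column has a 1 in the SHIFT following rows;
# dropping the first element makes the left-aligned window start at the next row
dt[, A_LM_foll := as.numeric(
  A_LM_corrected == 1 & frollsum(c(B_LM_corrected[-1], rep(0, SHIFT)), SHIFT, align = "left", na.rm = TRUE)[1:.N] > 0
)]
dt[, B_LM_foll := as.numeric(
  B_LM_corrected == 1 & frollsum(c(A_LM_corrected[-1], rep(0, SHIFT)), SHIFT, align = "left", na.rm = TRUE)[1:.N] > 0
)]
# dt is now a data.table, so compare it to dt_aim as a plain data frame
all.equal(as.data.frame(dt), dt_aim) # TRUE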
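The zero-padding idea can be cross-checked without `data.table`. The sketch below uses `cumsum()` to compute the same kind of rolling sum, treating positions before the start of the vector as 0; `roll_sum_padded()` is a hypothetical helper name and is not part of any package.

```r
# Rolling sum over the current value and the n - 1 values before it,
# with positions before the start of the vector counted as 0.
# roll_sum_padded() is a hypothetical helper for illustration.
roll_sum_padded <- function(x, n) {
  cs <- cumsum(c(rep(0, n), x))                   # cumulative sum of the padded vector
  cs[(n + 1):(n + length(x))] - cs[seq_along(x)]  # difference of sums n positions apart
}

roll_sum_padded(1:10, 5)
# [1]  1  3  6 10 15 20 25 30 35 40
```

The first five values are the triangle numbers and the rest grow by 5 each step, matching the padded `frollsum()` output shown above.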