How to increment a new column based on values in a specified range [a moving window] of another column in R

Question

I would like to create new columns based on values in a moving-window range of another column. One of the new columns would be based on a moving window over previous rows of another column, while the other would be based on a moving window over subsequent rows of another column.

Let's take the following dataset as an example:

dt <- data.frame(A_LM = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                          0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
                 B_LM = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                          0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1))

I would like to create:

  • a column called "A_LM_corrected" detecting when 1s are in A_LM and only keeping those for which no other 1s were in the last 5 previous rows of A_LM. It should consist of 1s when the condition is met and 0 when the condition is not met,
  • a column called "B_LM_corrected" with the same principle but for values in column B_LM,
  • a column called "A_LM_foll" detecting when 1s are present in A_LM_corrected and only keeping those for which 1s are also present in the next 5 subsequent rows of B_LM_corrected [detecting when 1s are present in A_LM_corrected and in at least 1 of the 5 subsequent rows of B_LM_corrected], and
  • a column called "B_LM_foll" with the same principle [detecting when 1s are present in B_LM_corrected and in at least 1 of the 5 subsequent rows of A_LM_corrected].

Ideally, the final data frame would be this one:

dt_aim <- data.frame(A_LM = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
                     B_LM = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                              0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1),
                     A_LM_corrected = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                                        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0),
                     B_LM_corrected = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1),
                     A_LM_foll = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                   0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0),
                     B_LM_foll = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                   0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0))

I tried to do this with the tidyverse and data.table, but I could not get what I need [I am not even close].

I know that this is quite easy to do in Excel:

  • Column A [A_LM]: first value of interest in cell A2 (0 in this example)

  • Column B [B_LM]: first value of interest in cell B2 (1 in this example)

  • Column C [A_LM_corrected] (it has to be incremented for the first 5 rows, afterwards it can just be copied/pasted):

    • C2 = IF(A2 = 1, IF(SUM(A1:A1) > 0, 0, A2), A2)
    • C3 = IF(A3 = 1, IF(SUM(A1:A2) > 0, 0, A3), A3)
    • C4 = IF(A4 = 1, IF(SUM(A1:A3) > 0, 0, A4), A4)
    • C5 = IF(A5 = 1, IF(SUM(A1:A4) > 0, 0, A5), A5)
    • C6 = IF(A6 = 1, IF(SUM(A1:A5) > 0, 0, A6), A6)
    • C7 = IF(A7 = 1, IF(SUM(A2:A6) > 0, 0, A7), A7)
    • C8 = IF(A8 = 1, IF(SUM(A3:A7) > 0, 0, A8), A8)
  • Column D [B_LM_corrected] (it has to be incremented for the first 5 rows, afterwards it can just be copied/pasted):

    • D2 = IF(B2 = 1, IF(SUM(B1:B1) > 0, 0, B2), B2)
    • D3 = IF(B3 = 1, IF(SUM(B1:B2) > 0, 0, B3), B3)
    • D4 = IF(B4 = 1, IF(SUM(B1:B3) > 0, 0, B4), B4)
    • D5 = IF(B5 = 1, IF(SUM(B1:B4) > 0, 0, B5), B5)
    • D6 = IF(B6 = 1, IF(SUM(B1:B5) > 0, 0, B6), B6)
    • D7 = IF(B7 = 1, IF(SUM(B2:B6) > 0, 0, B7), B7)
    • D8 = IF(B8 = 1, IF(SUM(B3:B7) > 0, 0, B8), B8)
  • Column E [A_LM_foll] (can be copied/pasted directly, without initial manual incrementation):

    • E2 = IF(C2 = 1, IF(SUM(D3:D7) > 0, 1, 0), 0)
  • Column F [B_LM_foll] (can be copied/pasted directly, without initial manual incrementation):

    • F2 = IF(D2 = 1, IF(SUM(C3:C7) > 0, 1, 0), 0)

But I need to have it work in R.

Any help is welcome.
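
The Excel formulas above map fairly directly onto base R. The following is a minimal sketch rather than a definitive solution: the helper name `any_one_at()` is hypothetical (it does not come from any package), and the sketch assumes the columns contain only 0s and 1s with no missing values.

```r
# Example data from the question
dt <- data.frame(A_LM = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                          0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
                 B_LM = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                          0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1))

# Hypothetical helper: for each row i, is there a 1 at any of the rows
# i + offsets? Offsets that fall outside the vector are ignored.
any_one_at <- function(x, offsets) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    idx <- i + offsets
    idx <- idx[idx >= 1 & idx <= n]
    any(x[idx] == 1)
  }, logical(1))
}

# Keep a 1 only if none of the 5 previous rows of the same column has a 1
dt$A_LM_corrected <- as.numeric(dt$A_LM == 1 & !any_one_at(dt$A_LM, -(1:5)))
dt$B_LM_corrected <- as.numeric(dt$B_LM == 1 & !any_one_at(dt$B_LM, -(1:5)))

# Keep a corrected 1 only if the other corrected column has a 1 in the next 5 rows
dt$A_LM_foll <- as.numeric(dt$A_LM_corrected == 1 & any_one_at(dt$B_LM_corrected, 1:5))
dt$B_LM_foll <- as.numeric(dt$B_LM_corrected == 1 & any_one_at(dt$A_LM_corrected, 1:5))
```

On the example data this marks rows 4, 11 and 21 in A_LM_corrected, rows 1, 17 and 25 in B_LM_corrected, row 21 in A_LM_foll, and rows 1 and 17 in B_LM_foll, which is what dt_aim contains. The loop does work proportional to the number of rows times the window size, so for very long columns a dedicated rolling-function package is likely to be faster.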

Answer 1

Score: 4

An alternative dplyr method using zoo:

library(dplyr)
dt %>%
  mutate(
    across(
      ends_with("_LM"),
      ~ +(. > 0 & zoo::rollapplyr(. < 1, list(-(1:4)), all, partial = TRUE, fill = TRUE)),
      .names = "{.col}_corrected"),
    A_LM_foll = +(A_LM_corrected & zoo::rollapply(B_LM_corrected > 0, list(-1:4), any, partial = TRUE, align = "left")),
    B_LM_foll = +(B_LM_corrected & zoo::rollapply(A_LM_corrected > 0, list(-1:4), any, partial = TRUE, align = "left"))
  )
#    A_LM B_LM A_LM_corrected B_LM_corrected A_LM_foll B_LM_foll
# 1     0    1              0              1         0         1
# 2     0    1              0              0         0         0
# 3     0    0              0              0         0         0
# 4     1    1              1              0         0         0
# 5     0    0              0              0         0         0
# 6     0    0              0              0         0         0
# 7     0    0              0              0         0         0
# 8     0    1              0              0         0         0
# 9     0    0              0              0         0         0
# 10    0    0              0              0         0         0
# 11    1    0              1              0         0         0
# 12    0    0              0              0         0         0
# 13    0    0              0              0         0         0
# 14    0    0              0              0         0         0
# 15    0    0              0              0         0         0
# 16    0    0              0              0         0         0
# 17    0    1              0              1         0         1
# 18    0    0              0              0         0         0
# 19    0    0              0              0         0         0
# 20    0    0              0              0         0         0
# 21    1    0              1              0         1         0
# 22    0    0              0              0         0         0
# 23    1    0              0              0         0         0
# 24    0    0              0              0         0         0
# 25    0    1              0              1         0         0

Many thanks to @G.Grothendieck for improving the latter two rollapply calls and, even more, for helping me understand the use of list(-(1:4)) in the first rollapplyr. For future me: if width= is a numeric vector, its widths apply to each of the by= elements, and it is recycled if it is shorter than the data. However, if width= is a list, its values are treated as offsets from the current position, so while width=5, align="right" means offsets c(-4, -3, -2, -1, 0) from "here", width=list(-(1:4)), align="right" means c(-4, -3, -2, -1) from "here", ignoring the 0th ("here"th) element. Mind blown.
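
To make the offset idea concrete, here is a small base R illustration of which rows each kind of window touches. It does not call zoo: `rows_touched()` is a hypothetical helper written only for this sketch. The offset vectors are the ones discussed above plus, for comparison, the ones that the question's "5 previous rows" and "5 subsequent rows" wording would correspond to literally.

```r
# Hypothetical helper (not part of zoo): rows visited from row i for a set of
# offsets, dropping positions outside 1..n (which is how I understand
# partial = TRUE to treat out-of-range positions).
rows_touched <- function(i, offsets, n) {
  idx <- i + offsets
  idx[idx >= 1 & idx <= n]
}

n <- 25

# width = 5, align = "right" corresponds to offsets -4..0 (includes row i itself)
rows_touched(10, -4:0, n)      # 6 7 8 9 10

# list(-(1:4)): the four previous rows, excluding row i
rows_touched(10, -(1:4), n)    # 9 8 7 6

# list(-1:4): -1:4 parses as (-1):4, i.e. rows i-1 through i+4
rows_touched(10, -1:4, n)      # 9 10 11 12 13 14

# Offsets matching "5 previous rows" and "5 subsequent rows" literally
rows_touched(10, -(1:5), n)    # 9 8 7 6 5
rows_touched(10, 1:5, n)       # 11 12 13 14 15

# Near the edges only the in-range rows remain
rows_touched(2, -(1:4), n)     # 1
rows_touched(24, 1:5, n)       # 25
```

On the example data the offsets used in the answer and the literal 5-row offsets lead to the same final columns, but on other data they can differ, so it may be worth choosing the offsets that match your own definition of the window.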

Answer 2

Score: 1

Here is one possibility using the tidyverse.

In particular, I use the function map_dbl to create the helper variables A_LM_back and B_LM_back, which check the previous five rows.

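library(dplyr)
library(purrr)  # provides map_dbl()

dt %>%
  mutate(
    # 1 if any of the previous five rows holds a 1 (index 0 is silently dropped)
    A_LM_back = map_dbl(1:n(), \(k) max(0, A_LM[pmax(k - 1:5, 0)])),
    B_LM_back = map_dbl(1:n(), \(k) max(0, B_LM[pmax(k - 1:5, 0)])),
  ) %>%
  mutate(
    # keep a 1 only when no 1 occurred in the previous five rows
    A_LM_cor = if_else(A_LM == 1 & A_LM_back == 0, 1, 0),
    B_LM_cor = if_else(B_LM == 1 & B_LM_back == 0, 1, 0),
  ) %>%
  mutate(
    # 1 if any of the next five rows holds a corrected 1; indices past the end
    # yield NA and are dropped by na.rm, so the last row never looks at itself
    A_LM_next = map_dbl(1:n(), \(k) max(0, A_LM_cor[k + 1:5], na.rm = TRUE)),
    B_LM_next = map_dbl(1:n(), \(k) max(0, B_LM_cor[k + 1:5], na.rm = TRUE)),
  ) %>%
  mutate(
    A_LM_foll = if_else(A_LM_cor == 1 & B_LM_next == 1, 1, 0),
    B_LM_foll = if_else(B_LM_cor == 1 & A_LM_next == 1, 1, 0),
  )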

Answer 3

Score: 1

Here's a data.table approach. frollsum() takes the rolling sum of a vector. It works in a slightly surprising way at the start of the vector:

library(data.table)

x <- 1:10
SHIFT <- 5
frollsum(x, n = SHIFT, fill = 0)
#  [1]  0  0  0  0 15 20 25 30 35 40

I would expect this to return triangle numbers for the first n values and then increase at a constant rate of n. However, the first n - 1 values take the value of fill. If you don't supply fill, they will be NA. To get triangle numbers starting from the first index, you need to do this, which I find a little awkward:

frollsum(c(rep(0, SHIFT), x), SHIFT)[(SHIFT + 1):(length(x) + SHIFT)]
#  [1]  1  3  6 10 15 20 25 30 35 40

Still, I think this should be relatively fast.

library(data.table)
setDT(dt)
sd_cols <- c("A_LM", "B_LM")
SHIFT <- 5

# Keep a 1 only if the previous SHIFT rows of the same column contain no 1
dt[,
    (sprintf("%s_corrected", sd_cols)) := lapply(
        .SD,
        \(x) as.numeric(x == 1 & frollsum(c(rep(0, SHIFT), shift(x)), SHIFT, na.rm = TRUE)[(SHIFT + 1):(.N + SHIFT)] == 0)
    ),
    .SDcols = sd_cols
]

# Keep a corrected 1 only if the other corrected column has a 1 in the
# left-aligned window (the current row and the next SHIFT - 1 rows)
dt[, A_LM_foll := as.numeric(
    A_LM_corrected == 1 & frollsum(c(B_LM_corrected, rep(0, SHIFT)), SHIFT, align = "left", na.rm = TRUE)[1:.N] > 0
)]

dt[, B_LM_foll := as.numeric(
    B_LM_corrected == 1 & frollsum(c(A_LM_corrected, rep(0, SHIFT)), SHIFT, align = "left", na.rm = TRUE)[1:.N] > 0
)]

identical(dt, dt_aim) # TRUE
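
The zero-padding trick used above can be cross-checked without data.table. This is only a sketch in base R using cumsum(), assuming a numeric vector without missing values. It is not how frollsum() is implemented, and the variable names are illustrative.

```r
x <- 1:10
SHIFT <- 5

# Pad with SHIFT zeros in front so that every position has a full window
padded <- c(rep(0, SHIFT), x)
cs <- cumsum(padded)

# Rolling sum of width SHIFT ending at each original position:
# cumulative sum up to that position minus the cumulative sum SHIFT steps earlier
partial_sum <- cs[(SHIFT + 1):length(cs)] - cs[seq_along(x)]
partial_sum
#  [1]  1  3  6 10 15 20 25 30 35 40
```

The first SHIFT - 1 values are sums over partial windows, which matches the output of the padded frollsum() call shown earlier.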

huangapple
  • Posted on July 12, 2023 at 22:44:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76671818.html