How to increment a new column based on values in a specified range [a moving window] of another column in R

Question

I would like to create new columns based on values in a moving-window range of another column. One of the new columns would be based on a moving window over previous rows of another column, while the other would be based on a moving window over subsequent rows of another column.

Let's take the following dataset as an example:

    dt <- data.frame(A_LM = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
                     B_LM = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                              0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1))

I would like to create:

  • a column called "A_LM_corrected" detecting when 1s are in A_LM and only keeping those for which no other 1s were in the last 5 previous rows of A_LM. It should consist of 1s when the condition is met and 0 when the condition is not met,
  • a column called "B_LM_corrected" with the same principle but for values in column B_LM,
  • a column called "A_LM_foll" detecting when 1s are present in A_LM_corrected and only keeping those for which 1s are also present in the next 5 subsequent rows of B_LM_corrected [detecting when 1s are present in A_LM_corrected and in at least 1 of the 5 subsequent rows of B_LM_corrected], and
  • a column called "B_LM_foll" with the same principle [detecting when 1s are present in B_LM_corrected and in at least 1 of the 5 subsequent rows of A_LM_corrected].

Ideally, the final data frame would be this one:

    dt_aim <- data.frame(A_LM = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                                  0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
                         B_LM = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                                  0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1),
                         A_LM_corrected = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                                            0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0),
                         B_LM_corrected = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                            0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1),
                         A_LM_foll = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0),
                         B_LM_foll = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0))

I tried to use the tidyverse and data.table to do so, but I have not managed to get what I need (I am not even close).

I know that this is quite easy to do in Excel:

  • Column A [A_LM]: first value of interest in cell A2 (0 in this example)

  • Column B [B_LM]: first value of interest in cell B2 (1 in this example)

  • Column C [A_LM_corrected] (it has to be incremented for the first 5 rows, afterwards it can just be copied/pasted):

    • C2 = IF(A2 = 1, IF(SUM(A1:A1) > 0, 0, A2), A2)
    • C3 = IF(A3 = 1, IF(SUM(A1:A2) > 0, 0, A3), A3)
    • C4 = IF(A4 = 1, IF(SUM(A1:A3) > 0, 0, A4), A4)
    • C5 = IF(A5 = 1, IF(SUM(A1:A4) > 0, 0, A5), A5)
    • C6 = IF(A6 = 1, IF(SUM(A1:A5) > 0, 0, A6), A6)
    • C7 = IF(A7 = 1, IF(SUM(A2:A6) > 0, 0, A7), A7)
    • C8 = IF(A8 = 1, IF(SUM(A3:A7) > 0, 0, A8), A8)
  • Column D [B_LM_corrected] (it has to be incremented for the first 5 rows, afterwards it can just be copied/pasted):

    • D2 = IF(B2 = 1, IF(SUM(B1:B1) > 0, 0, B2), B2)
    • D3 = IF(B3 = 1, IF(SUM(B1:B2) > 0, 0, B3), B3)
    • D4 = IF(B4 = 1, IF(SUM(B1:B3) > 0, 0, B4), B4)
    • D5 = IF(B5 = 1, IF(SUM(B1:B4) > 0, 0, B5), B5)
    • D6 = IF(B6 = 1, IF(SUM(B1:B5) > 0, 0, B6), B6)
    • D7 = IF(B7 = 1, IF(SUM(B2:B6) > 0, 0, B7), B7)
    • D8 = IF(B8 = 1, IF(SUM(B3:B7) > 0, 0, B8), B8)
  • Column E [A_LM_foll] (can be copied/pasted directly, without the initial manual adjustment):

    • E2 = IF(C2 = 1, IF(SUM(D3:D7) > 0, 1, 0), 0)
  • Column F [B_LM_foll] (can be copied/pasted directly, without the initial manual adjustment):

    • F2 = IF(D2 = 1, IF(SUM(C3:C7) > 0, 1, 0), 0)

But I need to have it work in R.

Any help is welcome.

Answer 1

Score: 4

An alternative dplyr method using zoo:

    library(dplyr)

    dt %>%
      mutate(
        across(
          ends_with("_LM"),
          ~ +(. > 0 & zoo::rollapplyr(. < 1, list(-(1:4)), all, partial = TRUE, fill = TRUE)),
          .names = "{.col}_corrected"),
        A_LM_foll = +(A_LM_corrected & zoo::rollapply(B_LM_corrected > 0, list(-1:4), any, partial = TRUE, align = "left")),
        B_LM_foll = +(B_LM_corrected & zoo::rollapply(A_LM_corrected > 0, list(-1:4), any, partial = TRUE, align = "left"))
      )
    #    A_LM B_LM A_LM_corrected B_LM_corrected A_LM_foll B_LM_foll
    # 1     0    1              0              1         0         1
    # 2     0    1              0              0         0         0
    # 3     0    0              0              0         0         0
    # 4     1    1              1              0         0         0
    # 5     0    0              0              0         0         0
    # 6     0    0              0              0         0         0
    # 7     0    0              0              0         0         0
    # 8     0    1              0              0         0         0
    # 9     0    0              0              0         0         0
    # 10    0    0              0              0         0         0
    # 11    1    0              1              0         0         0
    # 12    0    0              0              0         0         0
    # 13    0    0              0              0         0         0
    # 14    0    0              0              0         0         0
    # 15    0    0              0              0         0         0
    # 16    0    0              0              0         0         0
    # 17    0    1              0              1         0         1
    # 18    0    0              0              0         0         0
    # 19    0    0              0              0         0         0
    # 20    0    0              0              0         0         0
    # 21    1    0              1              0         1         0
    # 22    0    0              0              0         0         0
    # 23    1    0              0              0         0         0
    # 24    0    0              0              0         0         0
    # 25    0    1              0              1         0         0

Many thanks to @G.Grothendieck for improving the latter two rollapply calls and, even more, for helping me understand the use of list(-(1:4)) in the first rollapplyr. For future me: if width= is a numeric vector, then its widths apply to each of the by= elements, and if it is shorter than the data, it is recycled. However, if width= is a list, then its values are treated as offsets from the current position, so while width=5, align="right" means offsets c(-4, -3, -2, -1, 0) from "here", width=list(-(1:4)), align="right" means offsets c(-4, -3, -2, -1) from "here", ignoring the 0th ("here") element. Mind blown.
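
To make the "offsets from here" idea concrete, here is a minimal base R sketch. The helper roll_offsets() is hypothetical (it is not part of zoo and does not reproduce zoo's partial= or fill= handling); it simply drops offsets that fall outside the vector. The last lines apply it to the A_LM column from the question, assuming the 5-previous-rows rule stated there.

```r
# Hypothetical helper (not zoo's implementation): apply `f` to the elements
# of `x` found at the given offsets from each position, silently dropping
# offsets that fall before the first or after the last element.
roll_offsets <- function(x, offsets, f) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    idx <- i + offsets
    idx <- idx[idx >= 1 & idx <= n]
    f(x[idx])
  }, numeric(1))
}

x <- 1:10
roll_offsets(x, -4:0, sum)   # window includes the current element
# 1 3 6 10 15 20 25 30 35 40
roll_offsets(x, -4:-1, sum)  # window excludes the current element
# 0 1 3 6 10 14 18 22 26 30

# Applied to the question's data: keep a 1 only if none of the
# five previous rows contains a 1.
A_LM <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0)
A_LM_corrected <- +(A_LM == 1 & roll_offsets(A_LM, -5:-1, sum) == 0)
which(A_LM_corrected == 1)
# 4 11 21
```

The only difference between the two windows is whether offset 0 is listed, which is the distinction the note above draws between a numeric width and a list of offsets.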

Answer 2

Score: 1

Here is one possibility using the tidyverse.

In particular, I use the function map_dbl to create helper variables A_LM_back and B_LM_back to check the previous five rows.

    library(dplyr)
    library(purrr)  # provides map_dbl()

    dt %>%
      mutate(
        # largest value among the (up to) five previous rows; index 0 is dropped
        A_LM_back = map_dbl(1:n(), \(k) max(0, A_LM[pmax(k-1:5, 0)])),
        B_LM_back = map_dbl(1:n(), \(k) max(0, B_LM[pmax(k-1:5, 0)])),
      ) %>%
      mutate(
        # keep a 1 only when no 1 occurred in the previous rows
        A_LM_cor = if_else(A_LM == 1 & A_LM_back == 0, 1, 0),
        B_LM_cor = if_else(B_LM == 1 & B_LM_back == 0, 1, 0),
      ) %>%
      mutate(
        # largest value among the following rows; indices are clamped to nrow(dt)
        A_LM_next = map_dbl(1:n(), \(k) max(A_LM_cor[pmin(k+1:5, nrow(dt))])),
        B_LM_next = map_dbl(1:n(), \(k) max(B_LM_cor[pmin(k+1:5, nrow(dt))])),
      ) %>%
      mutate(
        A_LM_foll = if_else(A_LM_cor == 1 & B_LM_next == 1, 1, 0),
        B_LM_foll = if_else(B_LM_cor == 1 & A_LM_next == 1, 1, 0),
      )
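
The index arithmetic in this answer relies on two base R behaviours that are easy to miss: a zero index selects nothing, and pmin() clamps indices past the end to the last row. A small sketch on a made-up vector x (the values are illustrative only):

```r
x <- c(10, 20, 30, 40, 50, 60, 70)
n <- length(x)

# Looking back from row k = 3: k - 1:5 is k - (1:5)
k <- 3
k - 1:5               # 2 1 0 -1 -2
pmax(k - 1:5, 0)      # 2 1 0 0 0
x[pmax(k - 1:5, 0)]   # 20 10  (zero indices are dropped)

# At the first row nothing is selected, hence the max(0, ...) guard
max(0, x[pmax(1 - 1:5, 0)])   # 0

# Looking forward from the second-to-last row: indices are clamped to n
k <- 6
pmin(k + 1:5, n)      # 7 7 7 7 7
x[pmin(k + 1:5, n)]   # 70 70 70 70 70

# At the last row the clamp points at the row itself
unique(pmin(n + 1:5, n))      # 7
```

One consequence worth keeping in mind: at the very last row the forward window contains that row itself rather than being empty. This does not affect the example data, but it could matter if both corrected columns had a 1 in the final row.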

Answer 3

Score: 1

Here's a data.table approach. frollsum() (https://rdrr.io/cran/data.table/man/froll.html) takes the rolling sum of a vector. It works in a slightly surprising way near the start of the vector:

    library(data.table)

    x <- 1:10
    SHIFT <- 5
    frollsum(x, n = SHIFT, fill = 0)
    # [1]  0  0  0  0 15 20 25 30 35 40

I would expect this to return triangular numbers for the first n values and then increase at a constant rate of n. However, the first n - 1 values take the value of fill; if you don't supply fill, they will be NA. To get triangular numbers starting from the first index, you need to pad the input with zeros, which I find a little awkward:

    frollsum(c(rep(0, SHIFT), x), SHIFT)[(SHIFT + 1):(length(x) + SHIFT)]
    # [1]  1  3  6 10 15 20 25 30 35 40

Still, I think this should be relatively fast.

    library(data.table)

    setDT(dt)
    sd_cols <- c("A_LM", "B_LM")
    SHIFT <- 5

    dt[,
      (sprintf("%s_corrected", sd_cols)) := lapply(
        .SD,
        \(x) as.numeric(x == 1 & frollsum(c(rep(0, SHIFT), shift(x)), SHIFT, na.rm = TRUE)[(SHIFT + 1):(.N + SHIFT)] == 0)
      ),
      .SDcols = sd_cols
    ]

    dt[, A_LM_foll := as.numeric(
      A_LM_corrected == 1 & frollsum(c(B_LM_corrected, rep(0, SHIFT)), SHIFT, align = "left", na.rm = TRUE)[1:.N] > 0
    )]

    dt[, B_LM_foll := as.numeric(
      B_LM_corrected == 1 & frollsum(c(A_LM_corrected, rep(0, SHIFT)), SHIFT, align = "left", na.rm = TRUE)[1:.N] > 0
    )]

    identical(dt, dt_aim) # TRUE
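
The zero-padding trick above does not depend on data.table. As a rough illustration only, the same padded trailing sum can be sketched in base R with cumsum(); roll_sum_padded() is a hypothetical helper name, not a function from any package, and it assumes a non-empty numeric vector without NAs.

```r
# Hypothetical helper: trailing rolling sum over the last `n` values,
# treating positions before the start of `x` as zeros.
roll_sum_padded <- function(x, n) {
  cs <- cumsum(c(rep(0, n), x))               # running total of the padded vector
  cs[(n + 1):length(cs)] - cs[seq_along(x)]   # total up to i minus total up to i - n
}

roll_sum_padded(1:10, 5)
# 1 3 6 10 15 20 25 30 35 40
```

The result matches the padded frollsum() call shown earlier: triangular numbers for the first five positions, then a constant step of 5.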

huangapple
  • Posted on July 12, 2023 at 22:44:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76671818.html