2023-07-12 22:44:57

How to increment a new column based on values in a specified range [a moving window] of another column in R

Question

I would like to create new columns based on values in a moving-window range of another column. One of the new columns would be based on a moving window over previous rows of another column, while the other would be based on a moving window over subsequent rows of another column.

Let's take the following dataset as an example:

dt <- data.frame(A_LM = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                          0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
                 B_LM = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                          0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1))

I would like to create:

a column called "A_LM_corrected" detecting when 1s are in A_LM and only keeping those for which no other 1s were in the last 5 previous rows of A_LM. It should consist of 1s when the condition is met and 0 when the condition is not met,
a column called "B_LM_corrected" with the same principle but for values in column B_LM,
a column called "A_LM_foll" detecting when 1s are present in A_LM_corrected and only keeping those for which 1s are also present in the next 5 subsequent rows of B_LM_corrected [detecting when 1s are present in A_LM_corrected and in at least 1 of the 5 subsequent rows of B_LM_corrected], and
a column called "B_LM_foll" with the same principle [detecting when 1s are present in B_LM_corrected and in at least 1 of the 5 subsequent rows of A_LM_corrected].

Ideally, the final data frame would be this one:

dt_aim <- data.frame(A_LM = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                              0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
                     B_LM = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                              0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1),
                     A_LM_corrected = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                                        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0),
                     B_LM_corrected = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1),
                     A_LM_foll = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                   0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0),
                     B_LM_foll = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                   0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0))

I tried using the tidyverse and data.table, but I could not get what I need [I am not even close].

I know that this is quite easy to do in Excel:

Column A [A_LM]: first value of interest in cell A2 (0 in this example)
Column B [B_LM]: first value of interest in cell B2 (1 in this example)
Column C [A_LM_corrected] (it has to be incremented for the first 5 rows, afterwards it can just be copied/pasted):
- C2 = IF(A2 = 1, IF(SUM(A1:A1) > 0, 0, A2), A2)
- C3 = IF(A3 = 1, IF(SUM(A1:A2) > 0, 0, A3), A3)
- C4 = IF(A4 = 1, IF(SUM(A1:A3) > 0, 0, A4), A4)
- C5 = IF(A5 = 1, IF(SUM(A1:A4) > 0, 0, A5), A5)
- C6 = IF(A6 = 1, IF(SUM(A1:A5) > 0, 0, A6), A6)
- C7 = IF(A7 = 1, IF(SUM(A2:A6) > 0, 0, A7), A7)
- C8 = IF(A8 = 1, IF(SUM(A3:A7) > 0, 0, A8), A8)
Column D [B_LM_corrected] (it has to be incremented for the first 5 rows, afterwards it can just be copied/pasted):
- D2 = IF(B2 = 1, IF(SUM(B1:B1) > 0, 0, B2), B2)
- D3 = IF(B3 = 1, IF(SUM(B1:B2) > 0, 0, B3), B3)
- D4 = IF(B4 = 1, IF(SUM(B1:B3) > 0, 0, B4), B4)
- D5 = IF(B5 = 1, IF(SUM(B1:B4) > 0, 0, B5), B5)
- D6 = IF(B6 = 1, IF(SUM(B1:B5) > 0, 0, B6), B6)
- D7 = IF(B7 = 1, IF(SUM(B2:B6) > 0, 0, B7), B7)
- D8 = IF(B8 = 1, IF(SUM(B3:B7) > 0, 0, B8), B8)
Column E [A_LM_foll] (can be copied/pasted directly, without initial manual incrementation):
- E2 = IF(C2 = 1, IF(SUM(D3:D7) > 0, 1, 0), 0)
Column F [B_LM_foll] (can be copied/pasted directly, without initial manual incrementation):
- F2 = IF(D2 = 1, IF(SUM(C3:C7) > 0, 1, 0), 0)

But I need this to work in R.

Any help is welcome.

Answer 1

Score: 4

An alternative dplyr method using zoo:

library(dplyr)
dt %>%
  mutate(
    across(
      ends_with("_LM"),
      ~ +(. > 0 & zoo::rollapplyr(. < 1, list(-(1:4)), all, partial = TRUE, fill = TRUE)),
      .names = "{.col}_corrected"),
    A_LM_foll = +(A_LM_corrected & zoo::rollapply(B_LM_corrected > 0, list(-1:4), any, partial = TRUE, align = "left")),
    B_LM_foll = +(B_LM_corrected & zoo::rollapply(A_LM_corrected > 0, list(-1:4), any, partial = TRUE, align = "left"))
  )
#    A_LM B_LM A_LM_corrected B_LM_corrected A_LM_foll B_LM_foll
# 1     0    1              0              1         0         1
# 2     0    1              0              0         0         0
# 3     0    0              0              0         0         0
# 4     1    1              1              0         0         0
# 5     0    0              0              0         0         0
# 6     0    0              0              0         0         0
# 7     0    0              0              0         0         0
# 8     0    1              0              0         0         0
# 9     0    0              0              0         0         0
# 10    0    0              0              0         0         0
# 11    1    0              1              0         0         0
# 12    0    0              0              0         0         0
# 13    0    0              0              0         0         0
# 14    0    0              0              0         0         0
# 15    0    0              0              0         0         0
# 16    0    0              0              0         0         0
# 17    0    1              0              1         0         1
# 18    0    0              0              0         0         0
# 19    0    0              0              0         0         0
# 20    0    0              0              0         0         0
# 21    1    0              1              0         1         0
# 22    0    0              0              0         0         0
# 23    1    0              0              0         0         0
# 24    0    0              0              0         0         0
# 25    0    1              0              1         0         0

Many thanks to @G.Grothendieck for improving the latter two rollapply calls, and even more for helping me understand the use of list(-(1:4)) in the first rollapplyr. For future me: if width= is a numeric vector, its widths apply to each of the by= elements, and it is recycled if it is shorter than the data. However, if width= is a list, its values are treated as offsets from the current position. So while width=5, align="right" means offsets c(-4, -3, -2, -1, 0) from "here", width=list(-(1:4)), align="right" means offsets c(-4, -3, -2, -1) from "here", ignoring the 0th ("here") element. Mind blown.
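
The index arithmetic in the note above can be checked in base R without zoo. The sketch below only illustrates which positions a set of offsets selects; x and here are made-up values, and it does not reproduce how zoo handles partial windows or fill. It also shows that unary minus binds tighter than the colon operator, so -(1:4) and -1:4 are different offset sets.

```r
x <- seq(10, 100, by = 10)  # illustrative data: 10, 20, ..., 100
here <- 6                   # hypothetical "current" position

# A right-aligned numeric width of 5 corresponds to offsets -4..0
# (the current row is included).
offsets_width5 <- seq(-4, 0)
x[here + offsets_width5]    # 20 30 40 50 60

# Offsets -(1:4) select the four rows before `here`
# (the current row is excluded).
offsets_prev4 <- -(1:4)     # -1 -2 -3 -4
x[here + offsets_prev4]     # 50 40 30 20

# Operator precedence: -1:4 parses as (-1):4, not as -(1:4).
offsets_mixed <- -1:4       # -1 0 1 2 3 4
x[here + offsets_mixed]     # 50 60 70 80 90 100
```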

Answer 2

Score: 1

Here is one possibility using the tidyverse.

In particular, I use map_dbl to create the helper variables A_LM_back and B_LM_back, which check the previous five rows.

library(dplyr)
library(purrr)

dt %>%
  mutate(
    # 1 if any of the previous five rows holds a 1 (index 0 is dropped by R)
    A_LM_back = map_dbl(1:n(), \(k) max(0, A_LM[pmax(k-1:5, 0)])),
    B_LM_back = map_dbl(1:n(), \(k) max(0, B_LM[pmax(k-1:5, 0)])),
  ) %>%
  mutate(
    # keep a 1 only when no 1 occurred in the previous five rows
    A_LM_cor = if_else(A_LM == 1 & A_LM_back == 0, 1, 0),
    B_LM_cor = if_else(B_LM == 1 & B_LM_back == 0, 1, 0),
  ) %>%
  mutate(
    # 1 if any of the next five rows of the corrected column holds a 1
    A_LM_next = map_dbl(1:n(), \(k) max(A_LM_cor[pmin(k+1:5, nrow(dt))])),
    B_LM_next = map_dbl(1:n(), \(k) max(B_LM_cor[pmin(k+1:5, nrow(dt))])),
  ) %>%
  mutate(
    A_LM_foll = if_else(A_LM_cor == 1 & B_LM_next == 1, 1, 0),
    B_LM_foll = if_else(B_LM_cor == 1 & A_LM_next == 1, 1, 0),
  )
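
The indexing in this answer relies on R dropping index 0 when subsetting. A minimal base R sketch with a made-up vector x shows what pmax(k - 1:5, 0) and pmin(k + 1:5, n) select; it illustrates the index arithmetic only and is not part of the answer's code.

```r
x <- c(10, 20, 30, 40, 50, 60, 70)  # illustrative values
n <- length(x)

# Backward window at k = 3: k - 1:5 is 2 1 0 -1 -2; pmax() turns negatives into 0.
back_idx <- pmax(3 - 1:5, 0)   # 2 1 0 0 0
back_vals <- x[back_idx]       # 20 10 (index 0 selects nothing)

# At k = 1 every index is 0, so the window is empty and max(0, ...) returns 0.
first_row <- max(0, x[pmax(1 - 1:5, 0)])  # 0

# Forward window at k = 5: indices past the end are clamped to n.
fwd_idx <- pmin(5 + 1:5, n)    # 6 7 7 7 7

# At k = n the clamped window contains only the last row itself.
last_idx <- pmin(n + 1:5, n)   # 7 7 7 7 7
```

One consequence that may be worth checking on real data: because of the clamping, the forward window of the last row is that row itself, so a final row with 1s in both corrected columns would be flagged.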

Answer 3

Score: 1

Here's a data.table approach. frollsum() (https://rdrr.io/cran/data.table/man/froll.html) takes the rolling sum of a vector. It behaves in a slightly surprising way at the start of the vector:

library(data.table)
x <- 1:10
SHIFT <- 5
frollsum(x, n = SHIFT, fill = 0)
#  [1]  0  0  0  0 15 20 25 30 35 40

I would expect this to return triangle numbers for the first n values, and then increase at a constant rate of n. However, the first n - 1 values take the value of fill, because their windows are incomplete. If you don't supply a fill, they will be NA. To get triangle numbers starting from the first index, you need to pad the vector with zeros and drop the padding afterwards, which I find a little awkward:

frollsum(c(rep(0, SHIFT), x), SHIFT)[(SHIFT + 1):(length(x) + SHIFT)]
#  [1]  1  3  6 10 15 20 25 30 35 40

Still, I think this should be relatively fast.
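
For comparison, the same partial-window sums can be sketched in base R with cumulative sums instead of padding. This is an alternative illustration, not the approach used in this answer; the variable names are arbitrary, and it assumes the vector is longer than SHIFT.

```r
x <- 1:10
SHIFT <- 5

cs <- cumsum(x)
# Subtract the cumulative sum from SHIFT positions earlier (0 where none exists).
lagged_cs <- c(rep(0, SHIFT), head(cs, -SHIFT))
partial_sums <- cs - lagged_cs
partial_sums
#  [1]  1  3  6 10 15 20 25 30 35 40
```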

library(data.table)
setDT(dt)
sd_cols <- c("A_LM", "B_LM")
SHIFT <- 5
# Keep a 1 only if the SHIFT previous rows contain no 1.
# shift(x) lags by one row, so the current row is excluded from the window.
dt[,
    (sprintf("%s_corrected", sd_cols)) := lapply(
        .SD,
        \(x) as.numeric(x == 1 & frollsum(c(rep(0, SHIFT), shift(x)), SHIFT, na.rm = TRUE)[(SHIFT + 1):(.N + SHIFT)] == 0)
    ),
    .SDcols = sd_cols
]
# Keep a 1 only if the SHIFT subsequent rows of the other corrected column contain a 1.
# shift(..., type = "lead") excludes the current row, so the window covers rows i + 1 to i + SHIFT.
dt[, A_LM_foll := as.numeric(
    A_LM_corrected == 1 & frollsum(c(shift(B_LM_corrected, type = "lead"), rep(0, SHIFT)), SHIFT, align = "left", na.rm = TRUE)[1:.N] > 0
)]
dt[, B_LM_foll := as.numeric(
    B_LM_corrected == 1 & frollsum(c(shift(A_LM_corrected, type = "lead"), rep(0, SHIFT)), SHIFT, align = "left", na.rm = TRUE)[1:.N] > 0
)]
identical(dt, dt_aim) # TRUE
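
As a cross-check on the answers, the Excel formulas from the question can be transcribed fairly directly into base R. The sketch below is a plain, unoptimized reading of the specification (5 previous rows for the corrected columns, 5 subsequent rows for the foll columns); the helpers any_in_window, keep_first and followed_by are hypothetical names introduced here, not functions from any package.

```r
# Example data from the question.
dt <- data.frame(
  A_LM = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0),
  B_LM = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
           0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1)
)

# Hypothetical helper: TRUE at row i if any row at i + offsets
# (restricted to valid positions) holds a 1.
any_in_window <- function(x, offsets) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    idx <- i + offsets
    idx <- idx[idx >= 1 & idx <= n]
    any(x[idx] == 1)  # any() of an empty window is FALSE
  }, logical(1))
}

# A 1 is kept only if none of the k previous rows holds a 1.
keep_first <- function(x, k = 5) {
  as.numeric(x == 1 & !any_in_window(x, -(1:k)))
}

# A 1 in x is kept only if y holds a 1 in at least one of the k next rows.
followed_by <- function(x, y, k = 5) {
  as.numeric(x == 1 & any_in_window(y, 1:k))
}

res <- dt
res$A_LM_corrected <- keep_first(res$A_LM)
res$B_LM_corrected <- keep_first(res$B_LM)
res$A_LM_foll <- followed_by(res$A_LM_corrected, res$B_LM_corrected)
res$B_LM_foll <- followed_by(res$B_LM_corrected, res$A_LM_corrected)

which(res$A_LM_corrected == 1)  # 4 11 21
which(res$B_LM_corrected == 1)  # 1 17 25
which(res$A_LM_foll == 1)       # 21
which(res$B_LM_foll == 1)       # 1 17
```

The row-by-row loop is slower than the rolling-window approaches above, but it keeps the window boundaries explicit, which makes it easier to compare against the expected data frame from the question.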
