2023-04-17 18:09:35

Take n number of rows after value for each column

Question

I have a data frame dt

         V1       V2       V3       V4       V5       V6
52  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
53  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
54 23.41610 27.74736  0.00000  0.00000  0.00000  0.00000
55 46.25229 26.80305 12.08680  0.00000  0.00000  0.00000
56 16.93179  0.00000 12.76963 12.21179  0.00000  0.00000
57  0.00000 24.35663  0.00000 15.47197 11.55125  0.00000
58 46.11487 14.91367  0.00000  0.00000 16.51914 12.40029
59 35.93963  0.00000  0.00000  0.00000 15.10201 13.44208
60  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000

For each column, I want to find the first value greater than 1 and select the (n-1:n+25) rows around it, and put them into a new data table.

I've tried using data.table with

for (i in 1:ncol(df)) {df[i > 1 | shift(i > 1, n=1L, type = "lead") | shift(i > 1, n=25L, type = "lag")]}

but I am apparently calling my columns wrong.

I've tried using seq_along as well in the same for loop structure just to get the 25 "after" rows:

output <- seq(min(which(df[i] > 1)), length.out = 25)

Which gave me the series of row numbers for the first column only where the threshold was met.

Thanks in advance for the help!

Answer 1

Score: 2

Based on your sample table, do you want something like this? For demonstration purposes I only used target_row + 2 instead of + 25.

As suggested in the comment, replacing min(which(df[, x] > 1)) with which(df[, x] > 1)[1L] or which.max(df[, x] > 1) might be more efficient.

sapply(1:ncol(df), \(x) {
  target_row <- min(which(df[, x] > 1))
  df[(target_row - 1):(target_row + 2), x]
  })
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
[1,]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
[2,] 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
[3,] 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
[4,] 16.93179  0.00000  0.00000  0.00000 15.10201  0.00000
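
The three ways of locating the first match mentioned above agree when a column contains a value above 1, but they behave differently when it does not. The following is a small sketch of that difference in base R; the vectors x and none are made-up examples, not part of the original data.

```r
# Illustrative vectors (assumptions, not taken from the question's data)
x    <- c(0, 0, 23.4161, 46.25229, 16.93179)  # has a value above 1
none <- c(0, 0, 0)                            # has no value above 1

# When a match exists, all three expressions give the same position
first_min   <- min(which(x > 1))
first_which <- which(x > 1)[1L]
first_wmax  <- which.max(x > 1)

# When nothing matches, the results differ:
no_match_which <- which(none > 1)[1L]                    # NA
no_match_wmax  <- which.max(none > 1)                    # 1, which looks like a real match
no_match_min   <- suppressWarnings(min(which(none > 1))) # Inf (min() also warns)
```

If some columns may never exceed the threshold, which(...)[1L] is probably the easiest of the three to check for, since it returns NA rather than a plausible-looking index.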

Input

df <- structure(list(V1 = c(0, 0, 23.4161, 46.25229, 16.93179, 0, 46.11487, 
35.93963, 0), V2 = c(0, 0, 27.74736, 26.80305, 0, 24.35663, 14.91367, 
0, 0), V3 = c(0, 0, 0, 12.0868, 12.76963, 0, 0, 0, 0), V4 = c(0, 
0, 0, 0, 12.21179, 15.47197, 0, 0, 0), V5 = c(0, 0, 0, 0, 0, 
11.55125, 16.51914, 15.10201, 0), V6 = c(0, 0, 0, 0, 0, 0, 12.40029, 
13.44208, 0)), class = "data.frame", row.names = c("52", "53", 
"54", "55", "56", "57", "58", "59", "60"))

Answer 2

Score: 1

Using data.table (essentially the same as benson's):

dt[, lapply(.SD, function(i){ 
  x <- min(which(i > 1))
  i[ (x - 1):(x + 2) ]
  })]
#          V1       V2       V3       V4       V5       V6
# 1:  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
# 2: 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
# 3: 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
# 4: 16.93179  0.00000  0.00000  0.00000 15.10201  0.00000
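
With the real window of + 25 rows, the index range can run past the end of a column (or start before row 1 if the first value already matches). Below is a minimal base R sketch of one way to handle that by padding with NA. The helper take_window and its arguments before, after and threshold are hypothetical names introduced here, and the two-column df is a cut-down copy of the sample data.

```r
# Hypothetical helper: return `before` values ahead of and `after` values
# following the first element above `threshold`, padded with NA wherever the
# window falls outside the vector.
take_window <- function(x, before = 1L, after = 25L, threshold = 1) {
  first <- which(x > threshold)[1L]            # NA if nothing exceeds threshold
  if (is.na(first)) return(rep(NA_real_, before + after + 1L))
  idx <- (first - before):(first + after)
  idx[idx < 1L | idx > length(x)] <- NA        # out-of-range positions become NA
  x[idx]
}

# Two columns from the sample data, enough to show the padding
df <- data.frame(
  V1 = c(0, 0, 23.4161, 46.25229, 16.93179, 0, 46.11487, 35.93963, 0),
  V6 = c(0, 0, 0, 0, 0, 0, 12.40029, 13.44208, 0)
)

# Every column yields the same length, so the pieces combine cleanly
result <- as.data.frame(lapply(df, take_window, before = 1L, after = 3L))
```

Because every column returns exactly before + after + 1 values, the same function should also work inside dt[, lapply(.SD, take_window)] without recycling warnings.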

Answer 3

Score: 0

If there are many rows and the row we are looking for is generally found early on, a loop would be more efficient than comparing the whole vector:

find_first <- function(x) {
  i <- 1L
  n <- length(x)
  # check the bound first so x[i] is never read past the end of the vector
  while (i <= n && x[i] <= 1) i <- i + 1L
  i
}
dt[, lapply(.SD, \(x) x[find_first(x) + (-1:2)])]
#          V1       V2       V3       V4       V5       V6
#       <num>    <num>    <num>    <num>    <num>    <num>
# 1:  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
# 2: 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
# 3: 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
# 4: 16.93179  0.00000  0.00000  0.00000 15.10201  0.00000

Data:

dt <- data.table(
  V1 = c(0, 0, 23.4161, 46.25229, 16.93179, 0, 46.11487, 35.93963, 0),
  V2 = c(0, 0, 27.74736, 26.80305, 0, 24.35663, 14.91367, 0, 0),
  V3 = rep(c(0, 12.0868, 12.76963, 0), c(3L, 1L, 1L, 4L)),
  V4 = rep(c(0, 12.21179, 15.47197, 0), c(4L, 1L, 1L, 3L)),
  V5 = c(0, 0, 0, 0, 0, 11.55125, 16.51914, 15.10201, 0),
  V6 = rep(c(0, 12.40029, 13.44208, 0), c(6L, 1L, 1L, 1L))
)

Answer 4

Score: 0

Another option, assuming data.table:

df[, lapply(.SD, function(z) z[fcoalesce(cumsum(z > 1 | shift(z, type="lead") > 1) > 0, TRUE)][1:5])]
#          V1       V2       V3       V4       V5       V6
#       <num>    <num>    <num>    <num>    <num>    <num>
# 1:  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
# 2: 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
# 3: 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
# 4: 16.93179  0.00000  0.00000  0.00000 15.10201  0.00000
# 5:  0.00000 24.35663  0.00000  0.00000  0.00000       NA

z > 1 | shift(z, type="lead") > 1: the z > 1 | part covers the perhaps-unlikely case that row 1 has a value over 1; if you know that the first number will never match, you can remove z > 1 | from the expression.
cumsum(..) > 0 is akin to dplyr::cumany.
fcoalesce is needed because shift(z, type="lead") is NA in the last position, which would otherwise incorrectly convert the last 0 to NA in V5 and V6.
.[1:5] guarantees that we'll get length 5, filling the trailing elements with NA (as in V6); in contrast, head(., 5) can return fewer than 5 elements, causing data.table to complain with "has 4 rows but longest item has 5; recycled with remainder", after which the short column is recycled (filled with 0).
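
The difference between .[1:5] and head(., 5) can be seen on a plain vector. This is a small base R sketch; the vector x is an illustrative stand-in for the four values that remain in V6 after filtering.

```r
# Four remaining values, as for V6 in the output above
x <- c(0, 12.40029, 13.44208, 0)

padded <- x[1:5]      # indexing past the end pads with NA, so length is 5
short  <- head(x, 5)  # head() stops at the end of the vector, so length is 4
```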

Data

df <- data.table::as.data.table(structure(list(V1 = c(0, 0, 23.4161, 46.25229, 16.93179, 0, 46.11487, 35.93963, 0), V2 = c(0, 0, 27.74736, 26.80305, 0, 24.35663, 14.91367, 0, 0), V3 = c(0, 0, 0, 12.0868, 12.76963, 0, 0, 0, 0), V4 = c(0, 0, 0, 0, 12.21179, 15.47197, 0, 0, 0), V5 = c(0, 0, 0, 0, 0, 11.55125, 16.51914, 15.10201, 0), V6 = c(0, 0, 0, 0, 0, 0, 12.40029, 13.44208, 0)), class = c("data.table", "data.frame"), row.names = c(NA, -9L)))
