Take n number of rows after value for each column

Question

I have a data frame dt

         V1       V2       V3       V4       V5       V6
52  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
53  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
54 23.41610 27.74736  0.00000  0.00000  0.00000  0.00000
55 46.25229 26.80305 12.08680  0.00000  0.00000  0.00000
56 16.93179  0.00000 12.76963 12.21179  0.00000  0.00000
57  0.00000 24.35663  0.00000 15.47197 11.55125  0.00000
58 46.11487 14.91367  0.00000  0.00000 16.51914 12.40029
59 35.93963  0.00000  0.00000  0.00000 15.10201 13.44208
60  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 

For each column, I want to find the first value greater than 1, select the rows from one before it to 25 after it (rows n-1 through n+25, where n is the row of that first value), and put them into a new data table.

I've tried using data.table with

for (i in 1:ncol(df)) {df[i > 1 | shift(i > 1, n=1L, type = "lead") | shift(i > 1, n=25L, type = "lag")]}

but I am apparently calling my columns wrong.

I've tried using seq_along as well in the same for loop structure just to get the 25 "after" rows:

output <- seq(min(which(df[i] > 1)), length.out = 25)

This gave me a series of row numbers for the first column only, starting where the threshold was met.

Thanks in advance for the help!

Answer 1

Score: 2

Based on your sample table, do you want something like this? For demonstration purposes, I only used target_row + 2 instead of + 25.

As suggested in the comment, replacing min(which(df[, x] > 1)) with which(df[, x] > 1)[1L] or which.max(df[, x] > 1) might be more efficient.

sapply(1:ncol(df), \(x) {
  target_row <- min(which(df[, x] > 1))
  df[(target_row - 1):(target_row + 2), x]
  })

         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
[1,]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
[2,] 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
[3,] 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
[4,] 16.93179  0.00000  0.00000  0.00000 15.10201  0.00000
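
The three expressions mentioned above for finding the first match agree when a match exists, but they behave differently when no value exceeds the threshold. The following is a minimal sketch in base R; the toy vectors and the helper names first_min, first_index and first_max are made up for illustration and are not part of the original answer.

```r
# Toy vectors (hypothetical, not taken from the question's data)
has_match <- c(0, 0, 23.4, 46.3, 0)
no_match  <- c(0, 0, 0.5, 0, 0)

# Hypothetical wrappers around the three expressions from the answer
first_min   <- function(v) suppressWarnings(min(which(v > 1)))
first_index <- function(v) which(v > 1)[1L]
first_max   <- function(v) which.max(v > 1)

# All three return position 3 when a value above the threshold exists
c(first_min(has_match), first_index(has_match), first_max(has_match))

# They differ when nothing exceeds the threshold
first_min(no_match)    # Inf (min() also warns; the warning is suppressed here)
first_index(no_match)  # NA
first_max(no_match)    # 1, which looks like a real match but is not
```

If columns without any value above the threshold are possible, which(...)[1L] is probably the easiest of the three to check afterwards, since is.na() identifies the no-match case.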

Input

df <- structure(list(V1 = c(0, 0, 23.4161, 46.25229, 16.93179, 0, 46.11487,
35.93963, 0), V2 = c(0, 0, 27.74736, 26.80305, 0, 24.35663, 14.91367,
0, 0), V3 = c(0, 0, 0, 12.0868, 12.76963, 0, 0, 0, 0), V4 = c(0,
0, 0, 0, 12.21179, 15.47197, 0, 0, 0), V5 = c(0, 0, 0, 0, 0,
11.55125, 16.51914, 15.10201, 0), V6 = c(0, 0, 0, 0, 0, 0, 12.40029,
13.44208, 0)), class = "data.frame", row.names = c("52", "53",
"54", "55", "56", "57", "58", "59", "60"))
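
With the full window from the question (one row before to 25 rows after), the window can run past the end of the data, or start before row 1 if the first match is in the first row. The sketch below shows one possible way to keep every column the same length by padding with NA. The helper take_window and its parameters threshold, before and after are hypothetical names introduced here, and only the first three columns of the sample data are used.

```r
# Hypothetical helper: values at positions (first - before):(first + after)
# of one column, with NA wherever the window runs past either end
take_window <- function(v, threshold = 1, before = 1L, after = 25L) {
  first <- which(v > threshold)[1L]        # NA if nothing exceeds the threshold
  if (is.na(first)) return(rep(NA_real_, before + after + 1L))
  idx <- (first - before):(first + after)
  idx[idx < 1L | idx > length(v)] <- NA    # out-of-range positions become NA
  v[idx]
}

# Stand-in for the question's data (first three columns only)
df <- data.frame(
  V1 = c(0, 0, 23.4161, 46.25229, 16.93179, 0, 46.11487, 35.93963, 0),
  V2 = c(0, 0, 27.74736, 26.80305, 0, 24.35663, 14.91367, 0, 0),
  V3 = c(0, 0, 0, 12.0868, 12.76963, 0, 0, 0, 0)
)

# One output column per input column, always before + after + 1 rows
res <- as.data.frame(lapply(df, take_window, after = 2L))
res
```

Calling take_window with the default after = 25L on this nine-row sample returns 27 values per column, most of them NA, because the data ends well before the window does.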

Answer 2

Score: 1

Using data.table (essentially the same as benson's):

dt[, lapply(.SD, function(i){ 
  x <- min(which(i > 1))
  i[ (x - 1):(x + 2) ]
  })]

#          V1       V2       V3       V4       V5       V6
# 1:  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
# 2: 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
# 3: 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
# 4: 16.93179  0.00000  0.00000  0.00000 15.10201  0.00000

Answer 3

Score: 0

If there are many rows and the row we are looking for is generally found early on, a loop would be more efficient than comparing the whole vector:

find_first <- function(x) {
  i <- 1L
  n <- length(x)
  while (x[i] <= 1 && i <= n) i <- i + 1L
  i
}
dt[, lapply(.SD, \(x) x[find_first(x) + (-1:2)])]
#          V1       V2       V3       V4       V5       V6
#       <num>    <num>    <num>    <num>    <num>    <num>
# 1:  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
# 2: 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
# 3: 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
# 4: 16.93179  0.00000  0.00000  0.00000 15.10201  0.00000

Data:

dt <- data.table(
  V1 = c(0, 0, 23.4161, 46.25229, 16.93179, 0, 46.11487, 35.93963, 0),
  V2 = c(0, 0, 27.74736, 26.80305, 0, 24.35663, 14.91367, 0, 0),
  V3 = rep(c(0, 12.0868, 12.76963, 0), c(3L, 1L, 1L, 4L)),
  V4 = rep(c(0, 12.21179, 15.47197, 0), c(4L, 1L, 1L, 3L)),
  V5 = c(0, 0, 0, 0, 0, 11.55125, 16.51914, 15.10201, 0),
  V6 = rep(c(0, 12.40029, 13.44208, 0), c(6L, 1L, 1L, 1L))
)
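
As written, find_first returns length(x) + 1 when no value exceeds 1, and the while condition fails with an error if it meets an NA before the first match. The following is a hedged sketch of a more defensive variant in base R; the name find_first_safe and the threshold parameter are assumptions added for illustration, not part of the original answer.

```r
# Hypothetical variant of find_first: checks the bound before reading x[i],
# skips NA values, and returns NA_integer_ when nothing exceeds the threshold
find_first_safe <- function(x, threshold = 1) {
  i <- 1L
  n <- length(x)
  while (i <= n && (is.na(x[i]) || x[i] <= threshold)) i <- i + 1L
  if (i > n) NA_integer_ else i
}

find_first_safe(c(0, 0, 23.4161, 46.25229))  # 3
find_first_safe(c(0, NA, 0.5, 12))           # 4
find_first_safe(c(0, 0, 0))                  # NA
```

Putting i <= n first means && short-circuits before x[i] is read out of bounds, and the NA return value lets the caller detect a column with no match.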

Answer 4

Score: 0

Another option, assuming data.table:

df[, lapply(.SD, function(z) z[fcoalesce(cumsum(z > 1 | shift(z, type="lead") > 1) > 0, TRUE)][1:5])]
#          V1       V2       V3       V4       V5       V6
#       <num>    <num>    <num>    <num>    <num>    <num>
# 1:  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
# 2: 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
# 3: 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
# 4: 16.93179  0.00000  0.00000  0.00000 15.10201  0.00000
# 5:  0.00000 24.35663  0.00000  0.00000  0.00000       NA
  • z > 1 | shift(z, type="lead") > 1: the z > 1 | part covers the perhaps-unlikely case that row 1 has a value over 1; if you know that the first number will never match, then you can remove z > 1 | from the expression
  • cumsum(..) > 0 is akin to dplyr::cumany
  • fcoalesce is needed because the last element of shift(z, type="lead") is NA, which would otherwise incorrectly convert the last 0 to NA in V5 and V6
  • .[1:5] guarantees that we'll get length 5, filling the trailing elements with NA (as in V6); in contrast, head(., 5) can return shorter than 5, causing data.table to complain with "has 4 rows but longest item has 5; recycled with remainder", which is then recycled (filled with 0)
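
Two of the points above can be checked in base R without data.table. This is a small sketch with made-up toy vectors: it contrasts [1:5] with head(., 5) on a vector shorter than five elements, and shows how cumsum(..) > 0 stays TRUE from the first TRUE onward.

```r
# Only four values are left after filtering (values as in column V6)
z <- c(0, 12.40029, 13.44208, 0)

z[1:5]       # length 5, padded with a trailing NA
head(z, 5)   # length 4, no padding

# cumsum(cond) > 0 acts as a "cumulative any":
# FALSE until the first TRUE, then TRUE for the rest
cond <- c(FALSE, FALSE, TRUE, FALSE, FALSE)
cumsum(cond) > 0
```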

Data

df <- data.table::as.data.table(structure(list(V1 = c(0, 0, 23.4161, 46.25229, 16.93179, 0, 46.11487, 35.93963, 0), V2 = c(0, 0, 27.74736, 26.80305, 0, 24.35663, 14.91367, 0, 0), V3 = c(0, 0, 0, 12.0868, 12.76963, 0, 0, 0, 0), V4 = c(0, 0, 0, 0, 12.21179, 15.47197, 0, 0, 0), V5 = c(0, 0, 0, 0, 0, 11.55125, 16.51914, 15.10201, 0), V6 = c(0, 0, 0, 0, 0, 0, 12.40029, 13.44208, 0)), class = c("data.table", "data.frame"), row.names = c(NA, -9L)))

huangapple
  • Posted on April 17, 2023, 18:09:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76033997.html