Take n rows after a value for each column
Question
I have a data frame dt
         V1       V2       V3       V4       V5       V6
52  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
53  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
54 23.41610 27.74736  0.00000  0.00000  0.00000  0.00000
55 46.25229 26.80305 12.08680  0.00000  0.00000  0.00000
56 16.93179  0.00000 12.76963 12.21179  0.00000  0.00000
57  0.00000 24.35663  0.00000 15.47197 11.55125  0.00000
58 46.11487 14.91367  0.00000  0.00000 16.51914 12.40029
59 35.93963  0.00000  0.00000  0.00000 15.10201 13.44208
60  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
For each column, I want to find the first value greater than 1 and select the rows around it, from one row before it to 25 rows after it (rows n-1 through n+25, where n is the row of that first value), and put them into a new data table.
I've tried using data.table with
for (i in 1:ncol(df)) {df[i > 1 | shift(i > 1, n = 1L, type = "lead") | shift(i > 1, n = 25L, type = "lag")]}
but I am apparently calling my columns wrong.
I've also tried using seq_along in the same for loop structure, just to get the 25 "after" rows:
output <- seq(min(which(df[i] > 1)), length.out = 25)
This gave me the series of row numbers where the threshold was met, but only for the first column.
Thanks in advance for the help!
Answer 1
Score: 2
Based on your sample table, do you want something like this? For demonstration purposes I only used target_row + 2 instead of + 25.
As suggested in the comments, replacing min(which(df[, x] > 1)) with which(df[, x] > 1)[1L] or which.max(df[, x] > 1) might be more efficient.
sapply(seq_len(ncol(df)), \(x) {
  # row index of the first value greater than 1 in column x
  target_row <- min(which(df[, x] > 1))
  # one row before the match through two rows after it
  df[(target_row - 1):(target_row + 2), x]
})
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
[1,]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000
[2,] 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
[3,] 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
[4,] 16.93179  0.00000  0.00000  0.00000 15.10201  0.00000
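The three ways of locating the first match mentioned above agree when a match exists, but they behave differently when a column has no value above the threshold. The following is a small base-R sketch of that difference; x_hit reuses values from V1, and x_miss is a hypothetical column invented for illustration.

```r
# Three ways to locate the first element above a threshold.
x_hit  <- c(0, 0, 23.4161, 46.25229, 16.93179)  # values taken from V1
x_miss <- c(0, 0, 0)                            # hypothetical column with no match

min(which(x_hit > 1))   # 3
which(x_hit > 1)[1L]    # 3
which.max(x_hit > 1)    # 3

# With no match, the three forms disagree:
suppressWarnings(min(which(x_miss > 1)))  # Inf (min() of an empty vector, normally with a warning)
which(x_miss > 1)[1L]                     # NA
which.max(x_miss > 1)                     # 1, the position of the first FALSE
```

If some columns may never exceed the threshold, the NA returned by which(...)[1L] is probably the easiest of the three results to test for explicitly.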
Input
df <- structure(list(V1 = c(0, 0, 23.4161, 46.25229, 16.93179, 0, 46.11487,
35.93963, 0), V2 = c(0, 0, 27.74736, 26.80305, 0, 24.35663, 14.91367,
0, 0), V3 = c(0, 0, 0, 12.0868, 12.76963, 0, 0, 0, 0), V4 = c(0,
0, 0, 0, 12.21179, 15.47197, 0, 0, 0), V5 = c(0, 0, 0, 0, 0,
11.55125, 16.51914, 15.10201, 0), V6 = c(0, 0, 0, 0, 0, 0, 12.40029,
13.44208, 0)), class = "data.frame", row.names = c("52", "53",
"54", "55", "56", "57", "58", "59", "60"))
Answer 2
Score: 1
Using data.table (essentially the same as benson's):
dt[, lapply(.SD, function(i) {
  x <- min(which(i > 1))
  i[(x - 1):(x + 2)]
})]
# V1 V2 V3 V4 V5 V6
# 1: 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
# 2: 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
# 3: 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
# 4: 16.93179 0.00000 0.00000 0.00000 15.10201 0.00000
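The examples so far use a short window of +2 rows. With the +25 from the question, the window can run past the last row, or start before the first row if the match is in row 1. Below is a minimal base-R sketch of a per-column helper that pads such positions with NA so that every column returns the same length. The name window_after_first and its parameters are hypothetical, not part of data.table or of the answers here.

```r
# Hypothetical helper: elements (first - before) .. (first + after) of a vector,
# where `first` is the position of the first value above `threshold`.
window_after_first <- function(x, before = 1L, after = 25L, threshold = 1) {
  first <- which(x > threshold)[1L]
  len <- before + after + 1L
  if (is.na(first)) return(rep(NA_real_, len))  # no value above the threshold
  idx <- seq(first - before, first + after)
  idx[idx < 1L] <- NA_integer_  # positions before the first row become NA
  x[idx]                        # positions past the end also come back as NA
}

v6 <- c(0, 0, 0, 0, 0, 0, 12.40029, 13.44208, 0)  # V6 from the question
window_after_first(v6, before = 1L, after = 2L)   # 4 values, rows 6 to 9
window_after_first(v6, before = 1L, after = 4L)   # 6 values, the last two are NA
```

Assuming dt is a data.table of numeric columns, the helper could presumably be applied per column with dt[, lapply(.SD, window_after_first, after = 25L)].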
Answer 3
Score: 0
If there are many rows and the row we are looking for is generally found early on, a loop would be more efficient than comparing the whole vector:
find_first <- function(x) {
  i <- 1L
  n <- length(x)
  # check the bound first so x[i] is never read past the end of the vector
  while (i <= n && x[i] <= 1) i <- i + 1L
  i
}
dt[, lapply(.SD, \(x) x[find_first(x) + (-1:2)])]
# V1 V2 V3 V4 V5 V6
# <num> <num> <num> <num> <num> <num>
# 1: 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
# 2: 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
# 3: 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
# 4: 16.93179 0.00000 0.00000 0.00000 15.10201 0.00000
Data:
library(data.table)
dt <- data.table(
  V1 = c(0, 0, 23.4161, 46.25229, 16.93179, 0, 46.11487, 35.93963, 0),
  V2 = c(0, 0, 27.74736, 26.80305, 0, 24.35663, 14.91367, 0, 0),
  V3 = rep(c(0, 12.0868, 12.76963, 0), c(3L, 1L, 1L, 4L)),
  V4 = rep(c(0, 12.21179, 15.47197, 0), c(4L, 1L, 1L, 3L)),
  V5 = c(0, 0, 0, 0, 0, 11.55125, 16.51914, 15.10201, 0),
  V6 = rep(c(0, 12.40029, 13.44208, 0), c(6L, 1L, 1L, 1L))
)
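To make the trade-off concrete, here is a small base-R sketch comparing an early-exit loop with the vectorised lookup on a long vector. The function find_first_loop is a variant of the loop above with the bounds check placed first, and the test vector is hypothetical, built so that the first match sits at row 11.

```r
# Variant of the early-exit search; the bounds check comes first so that
# x[i] is never evaluated past the end of the vector.
find_first_loop <- function(x, threshold = 1) {
  i <- 1L
  n <- length(x)
  while (i <= n && x[i] <= threshold) i <- i + 1L
  i  # n + 1 when nothing exceeds the threshold
}

set.seed(1)  # arbitrary seed, only for reproducibility
# Hypothetical long column: ten zeros, then 100000 values between 2 and 50.
x <- c(numeric(10), runif(1e5, min = 2, max = 50))

find_first_loop(x)           # 11, after inspecting only 11 elements
which(x > 1)[1L]             # 11, but compares all 100010 elements first
find_first_loop(numeric(5))  # 6: one past the end signals "not found"
```

Whether the loop actually wins depends on where the first match tends to fall; when matches are late or absent, the vectorised comparison is likely to be faster than an interpreted loop.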
Answer 4
Score: 0
Another option, assuming data.table:
df[, lapply(.SD, function(z) z[fcoalesce(cumsum(z > 1 | shift(z, type="lead") > 1) > 0, TRUE)][1:5])]
# V1 V2 V3 V4 V5 V6
# <num> <num> <num> <num> <num> <num>
# 1: 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
# 2: 23.41610 27.74736 12.08680 12.21179 11.55125 12.40029
# 3: 46.25229 26.80305 12.76963 15.47197 16.51914 13.44208
# 4: 16.93179 0.00000 0.00000 0.00000 15.10201 0.00000
# 5: 0.00000 24.35663 0.00000 0.00000 0.00000 NA
z > 1 | shift(z, type="lead") > 1: the z > 1 | part covers the perhaps-unlikely case that row 1 has a value over 1; if you know that the first number will never match, you can remove z > 1 | from the expression.
cumsum(..) > 0 is akin to dplyr::cumany.
fcoalesce is needed because shift(z, type="lead") is NA in the last row, which would otherwise incorrectly convert the last 0 to NA in V5 and V6.
[1:5] guarantees that we get length 5, filling the trailing elements with NA (as in V6); in contrast, head(., 5) can return fewer than 5 elements, causing data.table to complain with "has 4 rows but longest item has 5; recycled with remainder", after which the shorter column is recycled (filled with 0).
Data
df <- data.table::as.data.table(structure(list(V1 = c(0, 0, 23.4161, 46.25229, 16.93179, 0, 46.11487, 35.93963, 0), V2 = c(0, 0, 27.74736, 26.80305, 0, 24.35663, 14.91367, 0, 0), V3 = c(0, 0, 0, 12.0868, 12.76963, 0, 0, 0, 0), V4 = c(0, 0, 0, 0, 12.21179, 15.47197, 0, 0, 0), V5 = c(0, 0, 0, 0, 0, 11.55125, 16.51914, 15.10201, 0), V6 = c(0, 0, 0, 0, 0, 0, 12.40029, 13.44208, 0)), class = c("data.table", "data.frame"), row.names = c(NA, -9L)))
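The one-liner is dense, so here is a step-by-step base-R sketch of the same idea on the V5 column alone. It uses c(z[-1], NA) as a stand-in for shift(z, type="lead") and an is.na() assignment as a stand-in for fcoalesce(., TRUE); both substitutions are assumptions made so the sketch runs without data.table.

```r
z <- c(0, 0, 0, 0, 0, 11.55125, 16.51914, 15.10201, 0)  # V5 from the data above

lead <- c(z[-1], NA)       # base-R stand-in for shift(z, type = "lead")
hit  <- z > 1 | lead > 1   # TRUE at a match or the row just before one; NA in the last row here
keep <- cumsum(hit) > 0    # "cumulative any": stays TRUE once a TRUE has been seen
keep[is.na(keep)] <- TRUE  # stand-in for fcoalesce(keep, TRUE)

z[keep][1:5]
# 0.00000 11.55125 16.51914 15.10201 0.00000
```

The last row is NA in hit because FALSE | NA is NA in R, which is exactly the case the fcoalesce call in the one-liner repairs.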