Conditional cumulative sum in R with NA and keeping track of episodic wave
Question
I have the following data:
df <- data.frame(time = 0:9,
                 y_ex1 = c(1,NA,0,1,NA,NA,NA,1,0,0),
                 y_ex2 = c(1,NA,0,0,NA,NA,NA,1,0,0),
                 y_ex3 = c(1,NA,0,1,NA,NA,NA,0,0,0),
                 y_ex4 = c(1,NA,0,0,NA,NA,NA,0,0,0),
                 y_ex5 = c(1,1,0,0,1,1,1,0,0,0)
)
And I'd like to score each of these vectors as follows:
df$y_ex1_scored <- c(1, 1.5, 0, 1, 2, 3, 4, 5, 0, 0)
df$y_ex2_scored <- c(1, 1.5, 0, 0, .5, 1, 1.5, 2.5, 0, 0)
df$y_ex3_scored <- c(1, 1.5, 0, 1, 1.5, 2, 2.5, 0, 0, 0)
df$y_ex4_scored <- c(1, 1.5, 0, 0, 0, 0, 0, 0, 0, 0)
df$y_ex5_scored <- c(1, 2, 0, 0, 1, 2, 3, 0, 0, 0)
Basically, what I am trying to do is calculate a cumulative sum until I hit a 0 or an NA. If I hit a 0, I want the sum to start over at 0. If I hit an NA, I want the score to depend on the values before and after the NA. If those values are the same, I want the cumulative sum to continue as though the data were not missing: if they are both 1, each NA counts as 1; if they are both 0, the NAs should be replaced with 0s. If the values are different, I want to add 0.5 at each NA.
I also want to keep track of each wave, and a wave occurs when the values shift from 0 to 1 or 1 to 0. So for the y_ex1 variable, it should look like
y_ex1_wave <- c(1, 1, 2, 2, 2, 2, 2, 2, 3, 3)
and for y_ex4 it would be
y_ex4_wave <- c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
Happy to clarify this. I have been writing a bunch of terrible loops in R to try to do this, have spent far too long, and have made little progress. Happy to have a base R or tidyverse solution.
Answer 1
Score: 1
Here's a function that'll help:
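fun <- function(z) {
  # Swap NA for Inf so that rle() collapses each NA run into one entry
  r <- rle(replace(z, is.na(z), Inf))
  for (na in which(is.infinite(r$values))) {
    if (na %in% c(1L, length(r$values))) {
      # A leading or trailing NA run has only one neighbour
      stop("oops?")
    } else {
      # Same neighbours: carry that value; different neighbours: 0.5
      r$values[na] <- ifelse(r$values[na-1] == r$values[na+1], r$values[na-1], 0.5)
    }
  }
  z2 <- inverse.rle(r)
  # Restart the cumulative sum at every 0
  ave(z2, cumsum(z2 == 0), FUN = cumsum)
}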
Demonstration:
out <- as.data.frame(lapply(df[,2:6], fun))
names(out) <- paste0(names(out), "_scored")
all.equal(out, df[,7:11])
# [1] TRUE
out
# y_ex1_scored y_ex2_scored y_ex3_scored y_ex4_scored y_ex5_scored
# 1 1.0 1.0 1.0 1.0 1
# 2 1.5 1.5 1.5 1.5 2
# 3 0.0 0.0 0.0 0.0 0
# 4 1.0 0.0 1.0 0.0 0
# 5 2.0 0.5 1.5 0.0 1
# 6 3.0 1.0 2.0 0.0 2
# 7 4.0 1.5 2.5 0.0 3
# 8 5.0 2.5 0.0 0.0 0
# 9 0.0 0.0 0.0 0.0 0
# 10 0.0 0.0 0.0 0.0 0
Brief walk-through of the function:
- rle is for run-length encoding, which converts something like c(1,1,2,2,2,3) into c(1,2,3) (values) and c(2,3,1) (lengths of each value). This is useful here so that we can easily keep track of NAs for revaluing.
- Unfortunately, rle (and most of R) treats NA as "it could be anything", so (from ?rle) "Missing values are regarded as unequal to the previous value". Because of this, NA will not be collapsed (as in rows 5-7); however, as a trick, and assuming you have all finite values, I replace NAs with Inf so that I can do what needs to be done. Any "sentinel" value would work here (such as -99); I thought Inf was unambiguous. And since we replace all of these values with the preceding value or 0.5 later, we don't care that the intermediate value was Inf.
- We need to iterate over each of the collapsed (by rle) infinite values (formerly NA), and look at the preceding and following values. Note: I explicitly stop (error) if the infinite (NA) values are first or last, but for a more robust function, you will likely need to replace stop("oops?") with some better logic.
- By replacing the singular Inf with either the preceding value or 0.5, we can then use inverse.rle which (as you might guess) converts the pair of values-and-lengths to a single vector, as before. (You might be comforted to know that inverse.rle(rle(vec)) should be an identical vector.)
- The call to ave(..) does a grouping function: cumsum(z2 == 0) helps us break the vector into groups where the first element is a 0 (or the whole original vector unbroken, if 0 is never found). On each of these subvectors, we call cumsum. For instance, if vec <- c(1,2,0,3,4), then ave(vec, cumsum(vec==0), FUN=cumsum) will first call cumsum(c(1,2)) then call cumsum(c(0,3,4)), and the cumulative sums are placed in the correct places in the returned vector. (The ave function also works well with grouping operations that do not assume contiguous groups, but that's for another question.)
From there, we call this function on each column using lapply.
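The individual steps can be tried on toy vectors. This is only a small illustration of the base R calls named in the walk-through; the variable names (vec, r, z, v) are made up for the example and are not part of the answer's function.

```r
# Run-length encoding and its inverse
vec <- c(1, 1, 2, 2, 2, 3)
r <- rle(vec)
r$values                        # 1 2 3
r$lengths                       # 2 3 1
identical(inverse.rle(r), vec)  # TRUE

# rle() keeps each NA as its own run, but collapses a sentinel such as Inf
z <- c(1, NA, NA, 0)
n_runs_na  <- length(rle(z)$values)                          # 4 runs
n_runs_inf <- length(rle(replace(z, is.na(z), Inf))$values)  # 3 runs

# Grouped cumulative sum that restarts at every 0
v <- c(1, 2, 0, 3, 4)
grouped <- ave(v, cumsum(v == 0), FUN = cumsum)              # 1 3 0 3 7
```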
Edit: I don't want to assume what logic you would want to use for the first/last value being NA, but you might do something like this:
fun <- function(z) {
  r <- rle(replace(z, is.na(z), Inf))
  for (na in which(is.infinite(r$values))) {
    if (na == 1L) {
      # Leading NA run: treat as 0
      r$values[na] <- 0
    } else if (na == length(r$values)) {
      # Trailing NA run: carry the preceding value forward
      r$values[na] <- r$values[na-1]
    } else {
      r$values[na] <- ifelse(r$values[na-1] == r$values[na+1], r$values[na-1], 0.5)
    }
  }
  z2 <- inverse.rle(r)
  ave(z2, cumsum(z2 == 0), FUN = cumsum)
}
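The function above does not return the wave counter the question also asks for. Below is a rough sketch of one way to get it, not a definitive rule: the helper names fill_na_runs and wave_id are made up here, and the wave definition is an assumption inferred from the two expected vectors in the question, where the counter only increases when a 0 follows a nonzero (filled) value.

```r
# Hypothetical helper: fill NA runs exactly as in the edited function above
fill_na_runs <- function(z) {
  r <- rle(replace(z, is.na(z), Inf))
  n <- length(r$values)
  for (na in which(is.infinite(r$values))) {
    if (na == 1L) {
      r$values[na] <- 0                  # leading NA run
    } else if (na == n) {
      r$values[na] <- r$values[na - 1]   # trailing NA run
    } else if (r$values[na - 1] == r$values[na + 1]) {
      r$values[na] <- r$values[na - 1]   # same neighbours
    } else {
      r$values[na] <- 0.5                # different neighbours
    }
  }
  inverse.rle(r)
}

# Hypothetical helper: assumed wave rule, a new wave starts
# whenever a 0 follows a nonzero value
wave_id <- function(z) {
  z2 <- fill_na_runs(z)
  prev <- c(0, head(z2, -1))             # previous filled value
  1 + cumsum(z2 == 0 & prev != 0)
}

wave_id(c(1, NA, 0, 1, NA, NA, NA, 1, 0, 0))  # 1 1 2 2 2 2 2 2 3 3
wave_id(c(1, NA, 0, 0, NA, NA, NA, 0, 0, 0))  # 1 1 2 2 2 2 2 2 2 2
```

Both calls reproduce the y_ex1_wave and y_ex4_wave vectors given in the question; whether the same rule is what is wanted for the other columns is for the asker to confirm.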
Answer 2
Score: 0
You could use the following:
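fn <- function(x){
  # Linear interpolation across NA gaps
  x <- zoo::na.approx(x)
  # Anything strictly between 0 and 1 came from a 0/1 gap: score it 0.5
  x[x>0&x<1] <- 0.5
  # New group each time the vector switches between zero and nonzero
  grp <- data.table::rleid(!x)
  data.frame(wave = grp, scored = ave(x, grp, FUN = cumsum))
}
# df1 is presumably the question's data (time, y_ex1..y_ex5) without the *_scored columns
cbind(df1, lapply(df1[-1], fn))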
y_ex1.wave y_ex1.scored y_ex2.wave y_ex2.scored y_ex3.wave y_ex3.scored y_ex4.wave y_ex4.scored y_ex5.wave y_ex5.scored
1 1 1.0 1 1.0 1 1.0 1 1.0 1 1
2 1 1.5 1 1.5 1 1.5 1 1.5 1 2
3 2 0.0 2 0.0 2 0.0 2 0.0 2 0
4 3 1.0 2 0.0 3 1.0 2 0.0 2 0
5 3 2.0 3 0.5 3 1.5 2 0.0 3 1
6 3 3.0 3 1.0 3 2.0 2 0.0 3 2
7 3 4.0 3 1.5 3 2.5 2 0.0 3 3
8 3 5.0 3 2.5 4 0.0 2 0.0 4 0
9 4 0.0 4 0.0 4 0.0 2 0.0 4 0
10 4 0.0 4 0.0 4 0.0 2 0.0 4 0
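For readers without zoo or data.table installed, the same idea can be sketched in base R. This is an approximation under stated assumptions, not the answerer's code: fn_base is a made-up name, approx() stands in for zoo::na.approx, the cumsum(diff(...)) expression stands in for data.table::rleid, and the input is assumed to have no leading or trailing NAs.

```r
# Hypothetical base-R variant of the function above
fn_base <- function(x) {
  idx <- seq_along(x)
  # Linear interpolation across NA gaps (approx() ignores the NA points)
  x <- approx(idx, x, xout = idx)$y
  # Interpolated values strictly between 0 and 1 become 0.5
  x[!is.na(x) & x > 0 & x < 1] <- 0.5
  # Start a new group whenever "is zero" flips
  grp <- cumsum(c(TRUE, diff(!x) != 0))
  data.frame(wave = grp, scored = ave(x, grp, FUN = cumsum))
}

res <- fn_base(c(1, NA, 0, 1, NA, NA, NA, 1, 0, 0))
res$wave    # 1 1 2 3 3 3 3 3 4 4
res$scored  # 1.0 1.5 0.0 1.0 2.0 3.0 4.0 5.0 0.0 0.0
```

As in the output above, the wave column counts every switch between zero and nonzero, so it differs from the y_ex1_wave vector requested in the question.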
Answer 3
Score: 0
In base R you could do:
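fn <- function(x){
  # y[i]: position of the last non-NA value at or before i
  y <- cummax(seq_along(x)*!is.na(x))
  idx <- which(is.na(x))
  # Replace each NA by the mean of the nearest non-NA values before and after it
  x[idx] <- (x[y[idx]] + x[rank(y, ties.method = 'max')[idx]+1])/2
  # The group counter increases at every 0
  grp <- cumsum(!x)
  data.frame(score = ave(x, grp, FUN = cumsum), wave = grp)
}
cbind(df1, lapply(df1[-1], fn))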
time y_ex1 y_ex2 y_ex3 y_ex4 y_ex5 y_ex1.score y_ex1.wave y_ex2.score y_ex2.wave
1 0 1 1 1 1 1 1.0 0 1.0 0
2 1 NA NA NA NA 1 1.5 0 1.5 0
3 2 0 0 0 0 0 0.0 1 0.0 1
4 3 1 0 1 0 0 1.0 1 0.0 2
5 4 NA NA NA NA 1 2.0 1 0.5 2
6 5 NA NA NA NA 1 3.0 1 1.0 2
7 6 NA NA NA NA 1 4.0 1 1.5 2
8 7 1 1 0 0 0 5.0 1 2.5 2
9 8 0 0 0 0 0 0.0 2 0.0 3
10 9 0 0 0 0 0 0.0 3 0.0 4