Conditional cumulative sum in R with NA and keeping track of episodic wave

Question

I have the following data:

df <- data.frame(time = 0:9,
  y_ex1 = c(1,NA,0,1,NA,NA,NA,1,0,0),
  y_ex2 = c(1,NA,0,0,NA,NA,NA,1,0,0),
  y_ex3 = c(1,NA,0,1,NA,NA,NA,0,0,0),
  y_ex4 = c(1,NA,0,0,NA,NA,NA,0,0,0),
  y_ex5 = c(1,1,0,0,1,1,1,0,0,0)
)

And I'd like to score each of these vectors as follows:

df$y_ex1_scored <- c(1, 1.5, 0, 1, 2, 3, 4, 5, 0, 0)
df$y_ex2_scored <- c(1, 1.5, 0, 0, .5, 1, 1.5, 2.5, 0, 0)
df$y_ex3_scored <- c(1, 1.5, 0, 1, 1.5, 2, 2.5, 0, 0, 0)
df$y_ex4_scored <- c(1, 1.5, 0, 0, 0, 0, 0, 0, 0, 0)
df$y_ex5_scored <- c(1, 2, 0, 0, 1, 2, 3, 0, 0, 0)

Basically, what I am trying to do is calculate a cumulative sum until I hit a 0 or an NA. If I hit a 0, I want the sum to start all over again at 0. If I hit an NA, I want the score to depend on what the values were before and after the NA. If the values are the same, then I want the cumulative sum to continue. For example, if they are both 1, I want the sum to continue as though the data were not missing. If they are both 0, the NAs should be replaced with 0s. If the values are different, I want to add 0.5 at each NA.

I also want to keep track of each wave, where a wave occurs when the values shift from 0 to 1 or from 1 to 0. So for the y_ex1 variable, it should look like

y_ex1_wave <- c(1, 1, 2, 2, 2, 2, 2, 2, 3, 3)

and for y_ex4 it would be

y_ex4_wave <- c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2)

Happy to clarify this. I have been writing a bunch of terrible loops in R to try to do this, have spent far too long, and have made little progress. Happy to have a base R or tidyverse solution.

Answer 1

Score: 1

Here's a function that'll help:

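fun <- function(z) {
  # Use Inf as a sentinel so that consecutive NAs collapse into one run
  r <- rle(replace(z, is.na(z), Inf))
  for (na in which(is.infinite(r$values))) {
    if (na %in% c(1L, length(r$values))) {
      # A leading or trailing NA run has only one neighbour
      stop("oops?")
    } else {
      # Same neighbours: continue with that value; different: score 0.5
      r$values[na] <- ifelse(r$values[na-1] == r$values[na+1], r$values[na-1], 0.5)
    }
  }
  z2 <- inverse.rle(r)
  # Restart the cumulative sum at every 0
  ave(z2, cumsum(z2 == 0), FUN = cumsum)
}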

Demonstration:

out <- as.data.frame(lapply(df[,2:6], fun))
names(out) <- paste0(names(out), "_scored")
all.equal(out, df[,7:11])
# [1] TRUE
out
#    y_ex1_scored y_ex2_scored y_ex3_scored y_ex4_scored y_ex5_scored
# 1           1.0          1.0          1.0          1.0            1
# 2           1.5          1.5          1.5          1.5            2
# 3           0.0          0.0          0.0          0.0            0
# 4           1.0          0.0          1.0          0.0            0
# 5           2.0          0.5          1.5          0.0            1
# 6           3.0          1.0          2.0          0.0            2
# 7           4.0          1.5          2.5          0.0            3
# 8           5.0          2.5          0.0          0.0            0
# 9           0.0          0.0          0.0          0.0            0
# 10          0.0          0.0          0.0          0.0            0

Brief walk-through of the function:

  • rle is for run-length encoding, which converts something like c(1,1,2,2,2,3) into c(1,2,3) (values) and c(2,3,1) (lengths of each value). This is useful here so that we can easily keep track of NAs for revaluing.
  • Unfortunately, rle (and most of R) treats NA as "it could be anything", so (from ?rle) "Missing values are regarded as unequal to the previous value". Because of this, NAs will not be collapsed (as in rows 5-7); however, as a trick, and assuming you have all finite values, I replace NAs with Inf so that I can do what needs to be done. Any "sentinel" value would work here (such as -99); I thought Inf was unambiguous. And since we later replace all of these values with the preceding value or 0.5, we don't care that the intermediate value was Inf.
  • We need to iterate over each of the collapsed (by rle) infinite values (formerly NA), and look at the preceding and following values. Note: I explicitly stop (error) if the infinite (NA) values are first or last, but for a more robust function, you will likely need to replace stop("oops?") with some better logic.
  • By replacing the singular Inf with either the preceding value or 0.5, we can then use inverse.rle which (as you might guess) converts the pair of values-and-lengths back to a single vector, as before. (You might be comforted to know that inverse.rle(rle(vec)) should be an identical vector.)
  • The call to ave(..) does a grouping function: cumsum(z2 == 0) helps us break the vector into groups where the first is a 0 (or the whole original vector unbroken, if 0 is never found). On each of these subvectors, we call cumsum. For instance, if vec <- c(1,2,0,3,4), then ave(vec, cumsum(vec==0), FUN=cumsum) will first call cumsum(c(1,2)) then call cumsum(c(0,3,4)), and the cumulative sums are placed in the correct places in the returned vector. (The ave function also works well with grouping operations that do not assume contiguous groups, but that's for another question.)

From there, we call this function on each column using lapply.
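To make the walk-through concrete, here is a minimal sketch that runs the same steps one at a time on y_ex1 from the question. The variable names (z, na_lengths, r, z2, scored) are only for illustration, and the loop is replaced by two hand-written assignments because this particular vector has exactly two NA runs.

```r
z <- c(1, NA, 0, 1, NA, NA, NA, 1, 0, 0)   # y_ex1 from the question

# With NA left in place, every NA becomes its own run of length 1
na_lengths <- rle(z)$lengths

# With Inf as a sentinel, consecutive NAs collapse into a single run
r <- rle(replace(z, is.na(z), Inf))
r$values    # 1 Inf 0 1 Inf 1 0
r$lengths   # 1 1 1 1 3 1 2

# Run 2 sits between 1 and 0 (different), run 5 between 1 and 1 (same)
r$values[2] <- 0.5
r$values[5] <- r$values[4]

z2 <- inverse.rle(r)                               # 1 0.5 0 1 1 1 1 1 0 0
scored <- ave(z2, cumsum(z2 == 0), FUN = cumsum)   # 1 1.5 0 1 2 3 4 5 0 0
```

The final vector matches y_ex1_scored in the question.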


Edit: I don't want to assume what logic you would want to use for first/last value being NA, but you might do something like this:

fun <- function(z) {
  r <- rle(replace(z, is.na(z), Inf))
  for (na in which(is.infinite(r$values))) {
    if (na == 1L) {
      # Leading NA run: treat it as 0
      r$values[na] <- 0
    } else if (na == length(r$values)) {
      # Trailing NA run: carry the preceding value forward
      r$values[na] <- r$values[na-1]
    } else {
      r$values[na] <- ifelse(r$values[na-1] == r$values[na+1], r$values[na-1], 0.5)
    }
  }
  z2 <- inverse.rle(r)
  ave(z2, cumsum(z2 == 0), FUN = cumsum)
}
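
The function above returns only the scores, not the wave index the question also asks for. One reading that reproduces both expected wave vectors in the question (y_ex1_wave and y_ex4_wave) is that a new wave begins whenever a run of zeros starts in the NA-filled series. The sketch below assumes that reading. wave_from_filled is a hypothetical helper, and its input is the filled vector (z2 inside fun), not the raw data.

```r
# Hypothetical helper: wave index from an NA-filled series (values 0, 0.5 or 1).
# Assumption: a new wave starts at the first 0 of each run of zeros.
wave_from_filled <- function(z2) {
  prev_nonzero <- c(FALSE, head(z2, -1) != 0)  # was the previous value non-zero?
  1L + cumsum(z2 == 0 & prev_nonzero)
}

# Filled versions of y_ex1 and y_ex4 (NA runs replaced the way fun does internally)
z2_ex1 <- c(1, 0.5, 0, 1, 1, 1, 1, 1, 0, 0)
z2_ex4 <- c(1, 0.5, 0, 0, 0, 0, 0, 0, 0, 0)

wave_ex1 <- wave_from_filled(z2_ex1)   # 1 1 2 2 2 2 2 2 3 3
wave_ex4 <- wave_from_filled(z2_ex4)   # 1 1 2 2 2 2 2 2 2 2
```

Note that this differs from a plain run-length id, which would also start a new wave on a change from 0 back to 1.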

Answer 2

Score: 0

You could use the following:

# requires the zoo and data.table packages; df1 is the question's data frame
fn <- function(x){
  x <- zoo::na.approx(x)
  x[x>0&x<1] <- 0.5
  grp <- data.table::rleid(!x)
  data.frame(wave = grp, scored = ave(x, grp, FUN = cumsum))
}

cbind(df1, lapply(df1[-1], fn))

   y_ex1.wave y_ex1.scored y_ex2.wave y_ex2.scored y_ex3.wave y_ex3.scored y_ex4.wave y_ex4.scored y_ex5.wave y_ex5.scored
1           1          1.0          1          1.0          1          1.0          1          1.0          1            1
2           1          1.5          1          1.5          1          1.5          1          1.5          1            2
3           2          0.0          2          0.0          2          0.0          2          0.0          2            0
4           3          1.0          2          0.0          3          1.0          2          0.0          2            0
5           3          2.0          3          0.5          3          1.5          2          0.0          3            1
6           3          3.0          3          1.0          3          2.0          2          0.0          3            2
7           3          4.0          3          1.5          3          2.5          2          0.0          3            3
8           3          5.0          3          2.5          4          0.0          2          0.0          4            0
9           4          0.0          4          0.0          4          0.0          2          0.0          4            0
10          4          0.0          4          0.0          4          0.0          2          0.0          4            0
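
The reason this works is that linear interpolation across an NA gap yields values strictly between 0 and 1 exactly when the neighbours on either side differ, and leaves the gap at 0 or 1 when they agree. The sketch below illustrates that with base R's approx as a stand-in for zoo::na.approx (an assumption made only to avoid the package dependency; for interior gaps the two should give the same values). The names i, ok, interp and filled are illustrative.

```r
x <- c(1, NA, 0, 0, NA, NA, NA, 1, 0, 0)   # y_ex2 from the question
i <- seq_along(x)
ok <- !is.na(x)

# Linear interpolation over the observed points, evaluated at every index
interp <- approx(i[ok], x[ok], xout = i)$y
interp   # 1 0.5 0 0 0.25 0.5 0.75 1 0 0

# Any strictly fractional value marks a gap between differing neighbours
filled <- interp
filled[filled > 0 & filled < 1] <- 0.5
filled   # 1 0.5 0 0 0.5 0.5 0.5 1 0 0
```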

Answer 3

Score: 0

In base R you could do:

fn <- function(x){
  y <- cummax(seq_along(x)*!is.na(x))
  idx <- which(is.na(x))
  x[idx] <- (x[y[idx]] + x[rank(y,ties.method = 'max')[idx]+1])/2
  grp <- cumsum(!x)
  data.frame(score = ave(x,grp , FUN = cumsum), wave = grp)
}
cbind(df1, lapply(df1[-1], fn))

   time y_ex1 y_ex2 y_ex3 y_ex4 y_ex5 y_ex1.score y_ex1.wave y_ex2.score y_ex2.wave
1     0     1     1     1     1     1         1.0          0         1.0          0
2     1    NA    NA    NA    NA     1         1.5          0         1.5          0
3     2     0     0     0     0     0         0.0          1         0.0          1
4     3     1     0     1     0     0         1.0          1         0.0          2
5     4    NA    NA    NA    NA     1         2.0          1         0.5          2
6     5    NA    NA    NA    NA     1         3.0          1         1.0          2
7     6    NA    NA    NA    NA     1         4.0          1         1.5          2
8     7     1     1     0     0     0         5.0          1         2.5          2
9     8     0     0     0     0     0         0.0          2         0.0          3
10    9     0     0     0     0     0         0.0          3         0.0          4
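
The two index expressions in this function are dense, so here is a small sketch of what they evaluate to for y_ex1. The names prev_idx, next_idx and fill are introduced only for illustration. The approach appears to rely on every NA having a non-NA value on both sides.

```r
x <- c(1, NA, 0, 1, NA, NA, NA, 1, 0, 0)   # y_ex1 from the question

# Index of the most recent non-NA value at each position
y <- cummax(seq_along(x) * !is.na(x))
y   # 1 1 3 4 4 4 4 8 9 10

idx <- which(is.na(x))        # 2 5 6 7
prev_idx <- y[idx]            # 1 4 4 4

# y is non-decreasing, so the "max" rank of a tied value is the last position
# holding it; one step further is the next non-NA value
next_idx <- rank(y, ties.method = "max")[idx] + 1   # 3 8 8 8

# Mean of the two neighbours: 0.5 if they differ, otherwise their common value
fill <- (x[prev_idx] + x[next_idx]) / 2             # 0.5 1 1 1
```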

huangapple
  • Published on July 7, 2023, 05:48:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76632703.html