How to output a tibble with if_else, and how to output a list-column with mutate?
Question
I'm attempting to extrapolate missing values from a time series. I'm comfortable with tidyverse, but seem to be running into a bit of a roadblock and am not sure if I should be taking a different approach.
I have a data-frame with unevenly spaced-in-time measurements. If the time between measurements exceeds a certain threshold, I'd like to do a linear extrapolation between the nearest time above and below, with the measured values being evenly spaced. Something like:
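```r
library(dplyr)

df <- tibble(
  time = c(1, 2, 3, 8, 9),
  meas = c(200, 300, 320, 500, 600)
)

# Typical spacing between measurements (the first lag() is NA, hence na.rm)
avg_meas_interval <- median(df$time - lag(df$time), na.rm = TRUE)

# Intended shape of the computation; this does not run as written
# (threshold is the largest gap to leave unfilled, not defined here)
df %>% mutate(
  if_else(
    time - lag(time) > threshold,
    tibble(
      time_extrp = seq(lag(time), time, avg_meas_interval),
      meas_extrp = seq(lag(meas), meas, length.out = length(time_extrp))
    ),
    NA
  )
)
```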
I would then have a column of tibbles representing the extrapolated rows that should be inserted before a given row, or NA if there are no rows to insert. I could pull this column out and bind the rows together (`bind_rows`).
Now, there are a few issues I've been bumping into with this approach:
1) `if_else` doesn't like returning tibbles, as it wants the possible outcomes of the if-else to have equal length.
2) `mutate` doesn't like outputting lists of tibbles. I can get around this by using `map`, writing a separate list to a new variable (so that it's not part of the tibble), and then binding the rows, but this seems a bit circuitous.
It seems a simple extrapolation like this, where the output size is dependent on the input data, is extremely hard to implement in pure functional style. Not sure if there's a better approach than just implementing this in procedural R or python.
# Answer 1
**Score**: 1
Perhaps this?
```r
threshold <- 3
df %>%
  mutate(
    across(c(time, meas), lag, .names = "lag_{.col}"),
    intrvl = c(1, rep(median(diff(time)), n() - 1)), # I'm inferring here ...
    extrp = purrr::pmap(
      list(time, lag_time, meas, lag_meas, intrvl),
      ~ if (!is.na(..2) && (..1 - ..2) > threshold) {
        tibble(time_extrp = seq(..2, ..1, length.out=..5),
               meas_extrp = seq(..4, ..3, length.out=..5))
      })
  )
# # A tibble: 5 × 6
#    time  meas lag_time lag_meas intrvl extrp
#   <dbl> <dbl>    <dbl>    <dbl>  <dbl> <list>
# 1     1   200       NA       NA      1 <NULL>
# 2     2   300        1      200      1 <NULL>
# 3     3   320        2      300      1 <NULL>
# 4     8   500        3      320      1 <tibble [1 × 2]>
# 5     9   600        8      500      1 <NULL>
```
Note that the `seq` calls need to return the same length, ergo `length.out=` here. In your original call, using the interval by itself would produce `time_extrp` on the order of a few elements, and `meas_extrp` on the order of 100s of elements. In its current form, this produces a 1-row tibble to be inserted; I suggest you might prefer something more akin to this:
```r
    extrp = purrr::pmap(
      list(time, lag_time, meas, lag_meas),
      ~ if (!is.na(..2) && (..1 - ..2) > threshold) {
        len <- ceiling((..1 - ..2) / threshold)
        tibble(time_extrp = seq(..2, ..1, length.out=len),
               meas_extrp = seq(..4, ..3, length.out=len))
      })
```

(inside the call to `mutate`).
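If the goal is to end up with the new rows merged into the data, one option is to have each list element reuse the original column names and then bind everything together. This is only a minimal sketch: the fill spacing `step` of 1 is my assumption (not something from the question), only points strictly inside each gap are generated so the endpoints are not duplicated, and `filled` and `result` are illustrative names.

```r
library(dplyr)
library(purrr)

df <- tibble(
  time = c(1, 2, 3, 8, 9),
  meas = c(200, 300, 320, 500, 600)
)
threshold <- 3
step <- 1  # assumed spacing for the filled-in rows

filled <- df %>%
  mutate(
    lag_time = lag(time),
    lag_meas = lag(meas),
    extrp = pmap(
      list(time, lag_time, meas, lag_meas),
      function(t1, t0, m1, m0) {
        if (!is.na(t0) && (t1 - t0) > threshold) {
          # candidate times across the gap, keeping only the interior ones
          new_time <- seq(t0, t1, by = step)
          new_time <- new_time[new_time > t0 & new_time < t1]
          tibble(
            time = new_time,
            # linear interpolation between the two measured endpoints
            meas = approx(c(t0, t1), c(m0, m1), xout = new_time)$y
          )
        }
        # rows without a large gap fall through and yield NULL
      }
    )
  )

# compact() drops the NULL entries before the new rows are bound in
result <- bind_rows(select(df, time, meas), bind_rows(compact(filled$extrp))) %>%
  arrange(time)
```

With the sample data this adds rows at times 4 through 7, with `meas` values on the straight line between (3, 320) and (8, 500).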
An alternative, since you say you're going to join it back in anyway:
```r
threshold <- 3
intrvl <- 2 # use something meaningful
df %>%
  mutate(nexttime = lead(time)) %>%
  filter(nexttime - time > threshold) %>%
  reframe(time = unlist(Map(seq, time, nexttime, by=intrvl))) %>%
  full_join(df, by = "time") %>%
  arrange(time) %>%
  mutate(meas2 = if_else(is.na(meas), approx(time, meas, xout = time)$y, meas))
# # A tibble: 7 × 3
#    time  meas meas2
#   <dbl> <dbl> <dbl>
# 1     1   200   200
# 2     2   300   300
# 3     3   320   320
# 4     5    NA   392
# 5     7    NA   464
# 6     8   500   500
# 7     9   600   600
```
I created `meas2` in order to compare the original `meas` side-by-side with the `NA`-interpolated values; feel free to overwrite `meas`. Also, there are certainly other tools available for `NA`-interpolation. This is one way; others exist (e.g., `zoo::na.approx`).
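For reference, a minimal sketch of the `zoo::na.approx` route. It assumes the zoo package is installed and that the missing times have already been added as `NA` rows; the two vectors are copied from the joined result above, and `meas_filled` is an illustrative name.

```r
library(zoo)

time <- c(1, 2, 3, 5, 7, 8, 9)
meas <- c(200, 300, 320, NA, NA, 500, 600)

# x = time makes the interpolation respect the uneven spacing;
# na.rm = FALSE keeps the output the same length as the input
meas_filled <- na.approx(meas, x = time, na.rm = FALSE)
```

Passing `x = time` matters here: without it, the values are treated as evenly spaced and the filled-in numbers would differ.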
The first thing we do here is determine where the gaps exist (`mutate(..) %>% filter(..)`), then generate the needed `time`s (`reframe`, from `dplyr_1.1.0` or later), and join them back in with the original data (`full_join`). It can be insightful to see this stepwise:
```r
df %>%
  mutate(nexttime = lead(time)) %>%
  filter(nexttime - time > threshold)
# # A tibble: 1 × 3
#    time  meas nexttime
#   <dbl> <dbl>    <dbl>
# 1     3   320        8
```

```r
... %>%
  reframe(time = unlist(Map(seq, time, nexttime, by=intrvl)))
# # A tibble: 3 × 1
#    time
#   <dbl>
# 1     3
# 2     5
# 3     7
```

```r
... %>%
  full_join(df, by = "time")
# # A tibble: 7 × 2
#    time  meas
#   <dbl> <dbl>
# 1     3   320
# 2     5    NA
# 3     7    NA
# 4     1   200
# 5     2   300
# 6     8   500
# 7     9   600
```
From there, we just `arrange` the data and interpolate the missing values.