How to output a tibble with if_else, and how to output a list-column with mutate?

Question

I'm attempting to extrapolate missing values from a time series. I'm comfortable with tidyverse, but seem to be running into a bit of a roadblock and am not sure if I should be taking a different approach.

I have a data-frame with unevenly spaced-in-time measurements. If the time between measurements exceeds a certain threshold, I'd like to do a linear extrapolation between the nearest time above and below, with the measured values being evenly spaced. Something like:

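```r
library(dplyr)

df <- tibble(
  time = c(1, 2, 3, 8, 9),
  meas = c(200, 300, 320, 500, 600)
)

# Typical spacing between consecutive measurements (the first difference is NA)
avg_meas_interval <- median(df$time - lag(df$time), na.rm = TRUE)

# Attempted approach; this does not run as written (see the issues below)
df %>% mutate(
  if_else(
    time - lag(time) > threshold,
    tibble(
      time_extrp = seq(lag(time), time, avg_meas_interval),
      meas_extrp = seq(lag(meas), meas, length.out = length(time_extrp))
    ),
    NA
  )
)
```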

I would then have a column of tibbles that represent the extrapolated rows that should be inserted before a given row, or NA if there are no rows to be inserted. I could pull this column and bind_rows it.

Now, there are a few issues I've been bumping into with this approach:
1) if_else doesn't like returning tibbles, as it wants the possible outcomes of the if-else to have equal length.
2) mutate doesn't like outputting lists of tibbles. I can get around this by using map, outputting a separate list to a new variable (so that it's not part of the tibble), and then row-binding, but this seems a bit circuitous.

It seems that a simple extrapolation like this, where the output size depends on the input data, is extremely hard to implement in a pure functional style. I'm not sure if there's a better approach than just implementing this in procedural R or Python.

# Answer 1
**Score**: 1

Perhaps this?

```r
threshold <- 3
df %>%
  mutate(
    across(c(time, meas), lag, .names = "lag_{.col}"),
    intrvl = c(1, rep(median(diff(time)), n() - 1)), # I'm inferring here ...
    extrp = purrr::pmap(
      list(time, lag_time, meas, lag_meas, intrvl),
      ~ if (!is.na(..2) && (..1 - ..2) > threshold) {
         tibble(time_extrp = seq(..2, ..1, length.out=..5),
                meas_extrp = seq(..4, ..3, length.out=..5))
      })
  )
# # A tibble: 5 × 6
#    time  meas lag_time lag_meas intrvl extrp
#   <dbl> <dbl>    <dbl>    <dbl>  <dbl> <list>
# 1     1   200       NA       NA      1 <NULL>
# 2     2   300        1      200      1 <NULL>
# 3     3   320        2      300      1 <NULL>
# 4     8   500        3      320      1 <tibble [1 × 2]>
# 5     9   600        8      500      1 <NULL>
```

Note that the two `seq` calls need to return the same length, hence `length.out=` here. In your original call, using the interval by itself would produce a `time_extrp` on the order of a few elements and a `meas_extrp` on the order of hundreds of elements. In its current form, this produces a 1-row tibble to be inserted; I suggest you might prefer something more akin to this:

```r
    extrp = purrr::pmap(
      list(time, lag_time, meas, lag_meas),
      ~ if (!is.na(..2) && (..1 - ..2) > threshold) {
         len <- ceiling((..1 - ..2) / threshold)
         tibble(time_extrp = seq(..2, ..1, length.out=len),
                meas_extrp = seq(..4, ..3, length.out=len))
      })
```

(inside the call to mutate).
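
Since the question mentions pulling the list-column and row-binding it, here is a minimal sketch of that last step. It is not the code above verbatim: `intrvl` is an assumed target spacing, the helper argument names are made up for illustration, and the endpoints of each gap are dropped so that only new interior rows are inserted.

```r
library(dplyr)
library(purrr)
library(tibble)

df <- tibble(
  time = c(1, 2, 3, 8, 9),
  meas = c(200, 300, 320, 500, 600)
)
threshold <- 3
intrvl <- 1  # assumed target spacing; pick something meaningful for your data

new_rows <- df %>%
  mutate(
    lag_time = lag(time),
    lag_meas = lag(meas),
    extrp = pmap(
      list(time, lag_time, meas, lag_meas),
      function(cur_time, prev_time, cur_meas, prev_meas) {
        # Rows without a large gap fall through and yield NULL
        if (!is.na(prev_time) && (cur_time - prev_time) > threshold) {
          n <- ceiling((cur_time - prev_time) / intrvl) + 1
          keep <- -c(1, n)  # drop both endpoints, they already exist in df
          tibble(
            time = seq(prev_time, cur_time, length.out = n)[keep],
            meas = seq(prev_meas, cur_meas, length.out = n)[keep]
          )
        }
      }
    )
  ) %>%
  pull(extrp) %>%   # extract the list-column
  compact() %>%     # discard the NULL entries
  bind_rows()       # stack the per-gap tibbles

# Insert the new rows into the original data, ordered by time
filled <- bind_rows(df, new_rows) %>% arrange(time)
```

Naming the new columns `time` and `meas` (rather than `time_extrp` and `meas_extrp`) is a deliberate choice here so that `bind_rows` lines them up with the original columns.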


An alternative, since you say you're going to join it back in anyway:

```r
threshold <- 3
intrvl <- 2 # use something meaningful
df %>%
  mutate(nexttime = lead(time)) %>%
  filter(nexttime - time > threshold) %>%
  reframe(time = unlist(Map(seq, time, nexttime, by=intrvl))) %>%
  full_join(df, by = "time") %>%
  arrange(time) %>%
  mutate(meas2 = if_else(is.na(meas), approx(time, meas, xout = time)$y, meas))
# # A tibble: 7 × 3
#    time  meas meas2
#   <dbl> <dbl> <dbl>
# 1     1   200   200
# 2     2   300   300
# 3     3   320   320
# 4     5    NA   392
# 5     7    NA   464
# 6     8   500   500
# 7     9   600   600
```

I created `meas2` in order to compare the original `meas` side by side with the NA-interpolated values; feel free to overwrite `meas` instead. There are certainly other tools available for NA-interpolation; this is just one way (another is `zoo::na.approx`, for example).
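
For comparison, a small sketch of the `zoo::na.approx` route, assuming the zoo package is installed and the data has already been joined and sorted so that the gaps show up as `NA`. The vectors below are typed in by hand to mirror the 7-row result, not computed from `df`.

```r
library(zoo)

# Times and measurements after the join/arrange step; NA marks the new rows
time <- c(1, 2, 3, 5, 7, 8, 9)
meas <- c(200, 300, 320, NA, NA, 500, 600)

# Linear interpolation of the NAs, using `time` as the x positions;
# na.rm = FALSE keeps the result the same length as the input
meas2 <- na.approx(meas, x = time, na.rm = FALSE)
```

Passing `x = time` matters here: without it the interpolation would treat the observations as evenly spaced by position rather than by their actual times.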

The first thing we do here is determine where the gaps exist (`mutate(..) %>% filter(..)`), then generate the needed times (`reframe`, from dplyr 1.1.0 or later), and join them back in with the original data (`full_join`). It can be insightful to see this stepwise:

```r
df %>%
  mutate(nexttime = lead(time)) %>%
  filter(nexttime - time > threshold)
# # A tibble: 1 × 3
#    time  meas nexttime
#   <dbl> <dbl>    <dbl>
# 1     3   320        8
```

```r
... %>%
  reframe(time = unlist(Map(seq, time, nexttime, by=intrvl)))
# # A tibble: 3 × 1
#    time
#   <dbl>
# 1     3
# 2     5
# 3     7
```

```r
... %>%
  full_join(df, by = "time")
# # A tibble: 7 × 2
#    time  meas
#   <dbl> <dbl>
# 1     3   320
# 2     5    NA
# 3     7    NA
# 4     1   200
# 5     2   300
# 6     8   500
# 7     9   600
```

From there, we just arrange the data and interpolate the missing values.
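
If this needs to be reused, the same steps could be wrapped in a small function. This is only a sketch: `fill_gaps` and its arguments are hypothetical names, it assumes columns called `time` and `meas`, and it interpolates onto the combined time grid with base `approx` instead of joining.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: insert evenly spaced rows into gaps wider than
# `threshold`, linearly interpolating `meas` at the new times.
fill_gaps <- function(df, threshold, by) {
  # Rows that are followed by a gap larger than the threshold
  gaps <- df %>%
    mutate(nexttime = lead(time)) %>%
    filter(nexttime - time > threshold)

  # Candidate times inside each gap (empty when there are no gaps)
  new_times <- unlist(mapply(
    seq, gaps$time, gaps$nexttime,
    MoreArgs = list(by = by), SIMPLIFY = FALSE
  ))

  # Combined, sorted time grid; original times keep their measured values
  grid <- sort(union(df$time, new_times))
  tibble(time = grid, meas = approx(df$time, df$meas, xout = grid)$y)
}

df <- tibble(
  time = c(1, 2, 3, 8, 9),
  meas = c(200, 300, 320, 500, 600)
)
filled <- fill_gaps(df, threshold = 3, by = 2)
```

Because `approx` returns the measured value at every original time, the result overwrites nothing; it only adds rows where a gap was found.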

huangapple
  • Posted on April 17, 2023 at 10:59:08
  • When reposting, please be sure to keep the link to this article: https://go.coder-hub.com/76031436.html