April 17, 2023, 10:59:08

How to output a tibble with if_else, and how to output a list-column with mutate?

Question

I'm attempting to extrapolate missing values from a time series. I'm comfortable with tidyverse, but seem to be running into a bit of a roadblock and am not sure if I should be taking a different approach.

I have a data-frame with unevenly spaced-in-time measurements. If the time between measurements exceeds a certain threshold, I'd like to do a linear extrapolation between the nearest time above and below, with the measured values being evenly spaced. Something like:

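```r
library(dplyr)

df <- tibble(
  time = c(1, 2, 3, 8, 9),
  meas = c(200, 300, 320, 500, 600)
)

# lag() yields NA for the first row, so drop it before taking the median
avg_meas_interval <- median(df$time - lag(df$time), na.rm = TRUE)

# Intended approach (this does not run as written; see the issues below)
df %>% mutate(
  if_else(
    time - lag(time) > threshold,
    tibble(
      time_extrp = seq(lag(time), time, avg_meas_interval),
      meas_extrp = seq(lag(meas), meas, length.out = length(time_extrp))
    ),
    NA
  )
)
```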

I would then have a column of tibbles representing the extrapolated rows that should be inserted before a given row, or NA if there are no rows to insert. I could then pull this column and bind the rows together (e.g. with bind_rows).

Now, there are a few issues I've been bumping into with this approach:
1) if_else doesn't like returning tibbles, as it wants the possible outcomes of the if-else to have equal length.
2) mutate doesn't like outputting lists of tibbles. I can get around this by using map, outputting a separate list to a new variable (so that it's not part of the tibble), and then row-binding, but this seems a bit circuitous.

It seems a simple extrapolation like this, where the output size is dependent on the input data, is extremely hard to implement in pure functional style. Not sure if there's a better approach than just implementing this in procedural R or python.

# Answer 1
**Score**: 1

Perhaps this?

```r
threshold <- 3
df %>%
  mutate(
    across(c(time, meas), lag, .names = "lag_{.col}"),
    intrvl = c(1, rep(median(diff(time)), n() - 1)), # I'm inferring here ...
    extrp = purrr::pmap(
      list(time, lag_time, meas, lag_meas, intrvl),
      ~ if (!is.na(..2) && (..1 - ..2) > threshold) {
         tibble(time_extrp = seq(..2, ..1, length.out=..5),
                meas_extrp = seq(..4, ..3, length.out=..5))
      })
  )
# # A tibble: 5 × 6
#    time  meas lag_time lag_meas intrvl extrp
#   <dbl> <dbl>    <dbl>    <dbl>  <dbl> <list>
# 1     1   200       NA       NA      1 <NULL>
# 2     2   300        1      200      1 <NULL>
# 3     3   320        2      300      1 <NULL>
# 4     8   500        3      320      1 <tibble [1 × 2]>
# 5     9   600        8      500      1 <NULL>
```

Note that the two seq calls need to return the same length, hence length.out= here. In your original call, using the interval by itself would produce a time_extrp of only a few elements and a meas_extrp on the order of hundreds of elements. In its current form, this produces a 1-row tibble to be inserted; I suggest you might prefer something more like this:

```r
    extrp = purrr::pmap(
      list(time, lag_time, meas, lag_meas),
      ~ if (!is.na(..2) && (..1 - ..2) > threshold) {
         len <- ceiling((..1 - ..2) / threshold)
         tibble(time_extrp = seq(..2, ..1, length.out=len),
                meas_extrp = seq(..4, ..3, length.out=len))
      })
```

(inside the call to mutate).
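
To go from the list-column to rows that are actually inserted, one possible sketch follows. It is not the only way to do it: the `spacing` value and the choice to drop the two endpoints of each gap (since they already exist in `df`) are assumptions made for illustration, and the new columns are named `time` and `meas` so they line up with the original data when bound.

```r
library(dplyr)
library(purrr)

df <- tibble(
  time = c(1, 2, 3, 8, 9),
  meas = c(200, 300, 320, 500, 600)
)
threshold <- 3
spacing <- 1  # assumed target spacing between inserted rows

with_gaps <- df %>%
  mutate(
    extrp = pmap(
      list(time, lag(time), meas, lag(meas)),
      function(t1, t0, m1, m0) {
        # Returns NULL (no rows to insert) unless the gap exceeds the threshold
        if (!is.na(t0) && (t1 - t0) > threshold) {
          n_pts <- floor((t1 - t0) / spacing) + 1
          keep <- seq_len(n_pts)[-c(1, n_pts)]  # interior points only
          tibble(
            time = seq(t0, t1, length.out = n_pts)[keep],
            meas = seq(m0, m1, length.out = n_pts)[keep]
          )
        }
      }
    )
  )

# Drop the NULL entries, stack the new rows, and merge them with the original
inserted <- bind_rows(compact(with_gaps$extrp))
filled <- bind_rows(df, inserted) %>% arrange(time)
```

Keeping `extrp` as a list-column is only useful if you want to inspect the per-gap tibbles; otherwise the same function could be mapped outside of mutate.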

An alternative, since you say you're going to join it back in anyway:

```r
threshold <- 3
intrvl <- 2 # use something meaningful
df %>%
  mutate(nexttime = lead(time)) %>%
  filter(nexttime - time > threshold) %>%
  reframe(time = unlist(Map(seq, time, nexttime, by=intrvl))) %>%
  full_join(df, by = "time") %>%
  arrange(time) %>%
  mutate(meas2 = if_else(is.na(meas), approx(time, meas, xout = time)$y, meas))
# # A tibble: 7 × 3
#    time  meas meas2
#   <dbl> <dbl> <dbl>
# 1     1   200   200
# 2     2   300   300
# 3     3   320   320
# 4     5    NA   392
# 5     7    NA   464
# 6     8   500   500
# 7     9   600   600
```

I created meas2 to compare the original meas side-by-side with the NA-interpolated values; feel free to overwrite meas instead. There are also other tools for NA-interpolation (e.g., zoo::na.approx); this is just one way.
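
As a sanity check on the interpolated numbers, here is a small base-R sketch (no packages needed) that reproduces them from plain vectors and compares one of them with the linear-interpolation formula worked out by hand. The vector names are only illustrative.

```r
time <- c(1, 2, 3, 5, 7, 8, 9)
meas <- c(200, 300, 320, NA, NA, 500, 600)

# approx() ignores the (time, meas) pairs where meas is NA and
# interpolates linearly between the remaining known points
filled <- approx(time, meas, xout = time)$y

# By hand for time = 5: start value plus the fraction of the gap covered
# times the total change across the gap (3 -> 8 in time, 320 -> 500 in meas)
manual_5 <- 320 + (5 - 3) / (8 - 3) * (500 - 320)
```

Here `manual_5` works out to 392, matching the fourth row of the output above.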

The first thing we do here is determine where the gaps exist (mutate(..) %>% filter(..)), then generate the needed times (reframe, from dplyr_1.1.0 or later), and join them back in with the original data (full_join). It can be insightful to see this stepwise:

```r
df %>%
  mutate(nexttime = lead(time)) %>%
  filter(nexttime - time > threshold)
# # A tibble: 1 × 3
#    time  meas nexttime
#   <dbl> <dbl>    <dbl>
# 1     3   320        8
```

```r
... %>%
  reframe(time = unlist(Map(seq, time, nexttime, by=intrvl)))
# # A tibble: 3 × 1
#    time
#   <dbl>
# 1     3
# 2     5
# 3     7
```

```r
... %>%
  full_join(df, by = "time")
# # A tibble: 7 × 2
#    time  meas
#   <dbl> <dbl>
# 1     3   320
# 2     5    NA
# 3     7    NA
# 4     1   200
# 5     2   300
# 6     8   500
# 7     9   600
```

From there, we just arrange the data and interpolate the missing values.
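
If this needs to be reused, the steps could be wrapped in a small helper. The sketch below is one way to do that, not part of dplyr: `fill_gaps` is a hypothetical name, it assumes the columns are called `time` and `meas`, it overwrites `meas` rather than creating `meas2`, and the early return for data without gaps is an added guard. It needs dplyr 1.1.0 or later for reframe.

```r
library(dplyr)

# Hypothetical helper: insert rows every `intrvl` time units wherever
# consecutive measurements are more than `threshold` apart, then
# linearly interpolate `meas` for the inserted rows.
fill_gaps <- function(data, threshold, intrvl) {
  gaps <- data %>%
    mutate(nexttime = lead(time)) %>%
    filter(nexttime - time > threshold)

  # Nothing to fill: return the input unchanged
  if (nrow(gaps) == 0) {
    return(data)
  }

  gaps %>%
    reframe(time = unlist(Map(seq, time, nexttime, by = intrvl))) %>%
    full_join(data, by = "time") %>%
    arrange(time) %>%
    mutate(meas = approx(time, meas, xout = time)$y)
}

df <- tibble(
  time = c(1, 2, 3, 8, 9),
  meas = c(200, 300, 320, 500, 600)
)
out <- fill_gaps(df, threshold = 3, intrvl = 2)
```

With the example data, `out` should contain the same seven time points as the pipeline above, with `meas` filled in for the inserted rows.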
