在一个大型的 R 数据框中,通过 tidyverse 按日期分组线性插值数值。

huangapple go评论83阅读模式
英文:

Linearly interpolating values in a large R data frame grouped by date using tidyverse

问题

我正在尝试在一个R数据框中对值进行插值(线性或其他方法)。这些值根据一个很大的数据框中的date_index分组,插值是基于f_dates进行的,以每日为单位进行的。f_dates的数量会发生变化,所以插值可能需要针对每个date_index进行分组,然后解除分组以生成更大的数据框?请参见下面的示例。

我的代码尝试如下,但似乎无法使分组和插值正常工作。

df <- data.frame("date_index"=c("20230104", "20230104", "20230104", "20230106", "20230106", "20230106"),
"f_date"=c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
"value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50)) %>%
mutate(date_index=ymd(date_index), f_date=ymd(f_date))

df %>%
group_by(date_index) %>%
arrange(f_date)

date_first <- first(df$f_date)
date_last <- last(df$f_date)

ApproxFun <- approxfun(x = df$f_date, y = df$value)

date_seq <- seq.Date(ymd(date_first), ymd(date_last), by = 1)

LinearFit <- ApproxFun(date_seq)

# 插值后的数据框
dfi <- data.frame(f_date = date_seq, value = LinearFit)

我期望的输出应该如下所示:

date_index   f_date  value
2023-01-04 2023-02-01  1.50
2023-01-04 2023-02-02  1.75
2023-01-04 2023-02-03  2.00
2023-01-04 2023-02-04  2.18
2023-01-04 2023-02-05  2.36
2023-01-04 2023-02-06  2.54
2023-01-04 2023-02-07  2.71
2023-01-04 2023-02-08  2.89
2023-01-04 2023-02-09  3.07
2023-01-04 2023-02-10  3.25
2023-01-06 2023-02-01  4.00
2023-01-06 2023-02-02  3.50
2023-01-06 2023-02-03  3.00
2023-01-06 2023-02-04  2.93
2023-01-06 2023-02-05  2.86
2023-01-06 2023-02-06  2.79
2023-01-06 2023-02-07  2.71
2023-01-06 2023-02-08  2.64
2023-01-06 2023-02-09  2.57
2023-01-06 2023-02-10  2.50
英文:

I am attempting to interpolate (linearly or by other methods) between values in an R data frame. The values are grouped by a date_index in a very large dataframe, and the interpolation is between the values on a daily basis based on the f_dates. The number of f_dates changes so the interpolation, presumably, needs to be grouped for each date_index and then ungrouped into a larger data frame? See example below.

My attempt at the code is as follows, but I can't seem to get the grouping and interpolation to work.

df &lt;- data.frame(&quot;date_index&quot;=c(&quot;20230104&quot;, &quot;20230104&quot;, &quot;20230104&quot;, &quot;20230106&quot;,  &quot;20230106&quot;, &quot;20230106&quot;),
&quot;f_date&quot;=c(&quot;20230201&quot;, &quot;20230203&quot;, &quot;20230210&quot;, &quot;20230201&quot;, &quot;20230203&quot;, &quot;20230210&quot;),
                 &quot;value&quot; = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50))  %&gt;% 
  mutate(date_index=ymd(date_index), f_date=ymd(f_date))

df %&gt;% 
  group_by(date_index) %&gt;% 
  arrange(f_date)

date_first &lt;- first(df$f_date)
date_last &lt;- last(df$f_date)

ApproxFun &lt;- approxfun(x = df$f_date, y = df$value)

date_seq &lt;- seq.Date(ymd(date_first), ymd(date_last), by = 1)

LinearFit &lt;- ApproxFun(date_seq)

# the interpolated dataframe
dfi &lt;- data.frame(f_date = date_seq,  value = LinearFit)

The output I was looking for should be as follows:

date_index	f_date  value
2023-01-04	2023-02-01	1.50
2023-01-04	2023-02-02	1.75
2023-01-04	2023-02-03	2.00
2023-01-04	2023-02-04	2.18
2023-01-04	2023-02-05	2.36
2023-01-04	2023-02-06	2.54
2023-01-04	2023-02-07	2.71
2023-01-04	2023-02-08	2.89
2023-01-04	2023-02-09	3.07
2023-01-04	2023-02-10	3.25
2023-01-06	2023-02-01	4.00
2023-01-06	2023-02-02	3.50
2023-01-06	2023-02-03	3.00
2023-01-06	2023-02-04	2.93
2023-01-06	2023-02-05	2.86
2023-01-06	2023-02-06	2.79
2023-01-06	2023-02-07	2.71
2023-01-06	2023-02-08	2.64
2023-01-06	2023-02-09	2.57
2023-01-06	2023-02-10	2.50

答案1

得分: 2

这是一个使用tidyr::complete()函数和approx()函数的一行代码示例。

library(dplyr)
library(tidyr)
library(lubridate)

df %>%
  group_by(date_index) %>%
  complete(f_date = full_seq(f_date, 1)) %>%
  mutate(value = approx(f_date[!is.na(.$value)], y = value[!is.na(.$value)], xout = f_date, method = "linear")$y)

complete函数非常方便,可以填充给定序列中的缺失值,这里是以1天为间隔填充。一旦缺失日期被定义,然后只需使用approx函数来插值NA值。

英文:

Here is a one liner using the tidyr::complete() function and the approx()

library(dplyr)
library(tidyr)
library(lubridate)

df %&gt;% group_by(date_index) %&gt;% 
   complete(f_date=full_seq(f_date, 1)) %&gt;% 
   mutate(value=approx(f_date[!is.na(.$value)], y=value[!is.na(.$value)], xout=f_date, method = &quot;linear&quot;)$y)

# A tibble: 20 &#215; 3
# Groups:   date_index [2]
  date_index f_date     value
      &lt;date&gt;     &lt;date&gt;     &lt;dbl&gt;
1 2023-01-04 2023-02-01  1.5 
2 2023-01-04 2023-02-02  1.75
3 2023-01-04 2023-02-03  2   
4 2023-01-04 2023-02-04  2.18
5 2023-01-04 2023-02-05  2.36
6 2023-01-04 2023-02-06  2.54
7 2023-01-04 2023-02-07  2.71
8 2023-01-04 2023-02-08  2.89
9 2023-01-04 2023-02-09  3.07
10 2023-01-04 2023-02-10  3.25
11 2023-01-06 2023-02-01  4   
12 2023-01-06 2023-02-02  3.5 
13 2023-01-06 2023-02-03  3   
14 2023-01-06 2023-02-04  2.93
15 2023-01-06 2023-02-05  2.86
16 2023-01-06 2023-02-06  2.79
17 2023-01-06 2023-02-07  2.71
18 2023-01-06 2023-02-08  2.64
19 2023-01-06 2023-02-09  2.57
20 2023-01-06 2023-02-10  2.5 

The complete function is a handy to have and will fill-in the missing values in a given sequence, in this case in 1 day intervals.
Once the missing dates are define, then simply use the approx function to interpolate for the NA values.

答案2

得分: 1

我写了一个函数,使用分段线性回归模型进行插值,该模型只是标识性的 - 它具有与value变量的非缺失值一样多的参数。然后,它使用该模型使用观察到的value值作为节点来生成预测。

interp_fun <- function(x,y){
  pwl <- function(x, k)sapply(k, function(K)ifelse(x >= K, x-K, 0))
  w <- which(!is.na(y))
  if(length(w) > 2){
    w <- w[-c(1, length(w))]
    kts <- x[w]
    mod <- lm(y ~ x + pwl(x, kts), na.action=na.exclude)
    ifelse(is.na(y), predict(mod, newdata=data.frame(x=x)), y)
  }else{
    mod <- lm(y ~ x, na.action=na.exclude)
    ifelse(is.na(y), predict(mod, newdata=data.frame(x=x)), y)
  }
}

有了插值函数,你可以按date_index分组数据,然后从f_date的最小日期到最大日期生成一个新的日期序列,然后将其合并回原始数据中。你可以按date_index分组,然后将插值函数应用于f_datevalue。结果看起来似乎是你想要的。这应该适用于你在观察端点之间有多少“内部”观察值。

df %>% 
  group_by(date_index) %>% 
  arrange(f_date, .by_group = TRUE) %>% 
  reframe(f_date = seq(from = min(f_date), 
                       to = max(f_date), 
                       by = "1 day")) %>% 
  left_join(df) %>%
  group_by(date_index) %>% 
  mutate(value = interp_fun(f_date, value))

希望这些翻译对你有所帮助。

英文:

I'm not sure how to make this happen with an "out-of-the-box" function, though I don't doubt that it's possible. First, let's make the data then I'll explain the solution.

library(dplyr)
library(lubridate)
df &lt;- data.frame(&quot;date_index&quot;=c(&quot;20230104&quot;, &quot;20230104&quot;, &quot;20230104&quot;, &quot;20230106&quot;,  &quot;20230106&quot;, &quot;20230106&quot;),
                 &quot;f_date&quot;=c(&quot;20230201&quot;, &quot;20230203&quot;, &quot;20230210&quot;, &quot;20230201&quot;, &quot;20230203&quot;, &quot;20230210&quot;),
                 &quot;value&quot; = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50))  %&gt;% 
  mutate(date_index=ymd(date_index), f_date=ymd(f_date))

I wrote a function that does an interpolation with a piecewise-linear regression model that is just identified - it has as many parameters as there are non-missing values of the value variable. It then uses that model to generate predictions using the observed values of value as knots.

interp_fun &lt;- function(x,y){
  pwl &lt;- function(x, k)sapply(k, function(K)ifelse(x &gt;= K, x-K, 0))
  w &lt;- which(!is.na(y))
  if(length(w) &gt; 2){
    w &lt;- w[-c(1, length(w))]
    kts &lt;- x[w]
    mod &lt;- lm(y ~ x + pwl(x, kts), na.action=na.exclude)
    ifelse(is.na(y), predict(mod, newdata=data.frame(x=x)), y)
  }else{
    mod &lt;- lm(y ~ x, na.action=na.exclude)
    ifelse(is.na(y), predict(mod, newdata=data.frame(x=x)), y)
  }
}

With the interpolation function in hand, you can group the data by date_index, then generate a new sequence of dates from the minimum to the maximum of f_date and then merge that back into the original data. You can group by date_index and then apply the interpolation function to f_date and value. The result appears as though it's the one you want. This should work regardless of how many "interior" observed values you have between the observed endpoints.


df %&gt;% 
  group_by(date_index) %&gt;% 
  arrange(f_date, .by_group = TRUE) %&gt;% 
  reframe(f_date = seq(from = min(f_date), 
                       to = max(f_date), 
                       by = &quot;1 day&quot;)) %&gt;% 
  left_join(df) %&gt;%
  group_by(date_index) %&gt;% 
  mutate(value = interp_fun(f_date, value))
#&gt; Joining with `by = join_by(date_index, f_date)`
#&gt; # A tibble: 20 &#215; 3
#&gt; # Groups:   date_index [2]
#&gt;    date_index f_date     value
#&gt;    &lt;date&gt;     &lt;date&gt;     &lt;dbl&gt;
#&gt;  1 2023-01-04 2023-02-01  1.5 
#&gt;  2 2023-01-04 2023-02-02  1.75
#&gt;  3 2023-01-04 2023-02-03  2   
#&gt;  4 2023-01-04 2023-02-04  2.18
#&gt;  5 2023-01-04 2023-02-05  2.36
#&gt;  6 2023-01-04 2023-02-06  2.54
#&gt;  7 2023-01-04 2023-02-07  2.71
#&gt;  8 2023-01-04 2023-02-08  2.89
#&gt;  9 2023-01-04 2023-02-09  3.07
#&gt; 10 2023-01-04 2023-02-10  3.25
#&gt; 11 2023-01-06 2023-02-01  4   
#&gt; 12 2023-01-06 2023-02-02  3.50
#&gt; 13 2023-01-06 2023-02-03  3   
#&gt; 14 2023-01-06 2023-02-04  2.93
#&gt; 15 2023-01-06 2023-02-05  2.86
#&gt; 16 2023-01-06 2023-02-06  2.79
#&gt; 17 2023-01-06 2023-02-07  2.71
#&gt; 18 2023-01-06 2023-02-08  2.64
#&gt; 19 2023-01-06 2023-02-09  2.57
#&gt; 20 2023-01-06 2023-02-10  2.5

<sup>Created on 2023-06-01 with reprex v2.0.2</sup>

答案3

得分: 0

这是我对它的看法,使用 summariseunnest

library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(tidyr)
interpolator <- function(x, y) {
  date_first <- first(x)
  date_last <- last(x)
  interpolant <- approxfun(x = x, y = y)
  date_seq <- seq.Date(ymd(date_first), ymd(date_last), by = 1)
  tibble(
    f_date = date_seq, value = interpolant(date_seq)
  )
}

df <- data.frame("date_index"=c("20230104", "20230104", "20230104", "20230106",  "20230106", "20230106"),
                 "f_date"=c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
                 "value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50)) %>%
  mutate(date_index=ymd(date_index), f_date=ymd(f_date))

df %>%
  group_by(date_index) %>%
  arrange(f_date) %>%
  summarise(
    tmp = list(interpolator(f_date, value))
  ) %>%
  unnest(tmp)

在 2023-06-01 使用 reprex v2.0.2 创建

英文:

Here is my take on it using summarise and unnest:

library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(tidyr)
interpolator &lt;- function(x, y) {
  date_first &lt;- first(x)
  date_last &lt;- last(x)
  interpolant &lt;- approxfun(x = x, y = y)
  date_seq &lt;- seq.Date(ymd(date_first), ymd(date_last), by = 1)
  tibble(
    f_date = date_seq, value = interpolant(date_seq)
  )
}

df &lt;- data.frame(&quot;date_index&quot;=c(&quot;20230104&quot;, &quot;20230104&quot;, &quot;20230104&quot;, &quot;20230106&quot;,  &quot;20230106&quot;, &quot;20230106&quot;),
                 &quot;f_date&quot;=c(&quot;20230201&quot;, &quot;20230203&quot;, &quot;20230210&quot;, &quot;20230201&quot;, &quot;20230203&quot;, &quot;20230210&quot;),
                 &quot;value&quot; = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50))  %&gt;% 
  mutate(date_index=ymd(date_index), f_date=ymd(f_date))

df %&gt;% 
  group_by(date_index) %&gt;% 
  arrange(f_date) %&gt;% 
  summarise(
    tmp = list(interpolator(f_date, value))
  ) %&gt;% 
  unnest(tmp)
#&gt; # A tibble: 20 &#215; 3
#&gt;    date_index f_date     value
#&gt;    &lt;date&gt;     &lt;date&gt;     &lt;dbl&gt;
#&gt;  1 2023-01-04 2023-02-01  1.5 
#&gt;  2 2023-01-04 2023-02-02  1.75
#&gt;  3 2023-01-04 2023-02-03  2   
#&gt;  4 2023-01-04 2023-02-04  2.18
#&gt;  5 2023-01-04 2023-02-05  2.36
#&gt;  6 2023-01-04 2023-02-06  2.54
#&gt;  7 2023-01-04 2023-02-07  2.71
#&gt;  8 2023-01-04 2023-02-08  2.89
#&gt;  9 2023-01-04 2023-02-09  3.07
#&gt; 10 2023-01-04 2023-02-10  3.25
#&gt; 11 2023-01-06 2023-02-01  4   
#&gt; 12 2023-01-06 2023-02-02  3.5 
#&gt; 13 2023-01-06 2023-02-03  3   
#&gt; 14 2023-01-06 2023-02-04  2.93
#&gt; 15 2023-01-06 2023-02-05  2.86
#&gt; 16 2023-01-06 2023-02-06  2.79
#&gt; 17 2023-01-06 2023-02-07  2.71
#&gt; 18 2023-01-06 2023-02-08  2.64
#&gt; 19 2023-01-06 2023-02-09  2.57
#&gt; 20 2023-01-06 2023-02-10  2.5

<sup>Created on 2023-06-01 with reprex v2.0.2</sup>

huangapple
  • 本文由 发表于 2023年6月2日 05:11:36
  • 转载请务必保留本文链接:https://go.coder-hub.com/76385735.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定