Linearly interpolating values in a large R data frame grouped by date using tidyverse

Question

I am trying to interpolate (linearly or by another method) between values in an R data frame. The values are grouped by date_index in a very large data frame, and within each group I want daily values interpolated between the observed f_date points. The number of f_date values varies from group to group, so presumably the interpolation has to be done per date_index and the results then combined back into one larger data frame. See the example below.

My attempt is below, but I can't get the grouping and interpolation to work together.

    library(dplyr)
    library(lubridate)

    df <- data.frame("date_index" = c("20230104", "20230104", "20230104", "20230106", "20230106", "20230106"),
                     "f_date" = c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
                     "value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50)) %>%
      mutate(date_index = ymd(date_index), f_date = ymd(f_date))

    # the grouped, sorted result is printed but not assigned, so df is unchanged
    df %>%
      group_by(date_index) %>%
      arrange(f_date)

    # everything below uses all of df at once, ignoring date_index
    date_first <- first(df$f_date)
    date_last <- last(df$f_date)
    ApproxFun <- approxfun(x = df$f_date, y = df$value)
    date_seq <- seq.Date(ymd(date_first), ymd(date_last), by = 1)
    LinearFit <- ApproxFun(date_seq)
    # the interpolated dataframe
    dfi <- data.frame(f_date = date_seq, value = LinearFit)

The output I was looking for should be as follows:

    date_index  f_date      value
    2023-01-04  2023-02-01  1.50
    2023-01-04  2023-02-02  1.75
    2023-01-04  2023-02-03  2.00
    2023-01-04  2023-02-04  2.18
    2023-01-04  2023-02-05  2.36
    2023-01-04  2023-02-06  2.54
    2023-01-04  2023-02-07  2.71
    2023-01-04  2023-02-08  2.89
    2023-01-04  2023-02-09  3.07
    2023-01-04  2023-02-10  3.25
    2023-01-06  2023-02-01  4.00
    2023-01-06  2023-02-02  3.50
    2023-01-06  2023-02-03  3.00
    2023-01-06  2023-02-04  2.93
    2023-01-06  2023-02-05  2.86
    2023-01-06  2023-02-06  2.79
    2023-01-06  2023-02-07  2.71
    2023-01-06  2023-02-08  2.64
    2023-01-06  2023-02-09  2.57
    2023-01-06  2023-02-10  2.50

Answer 1

Score: 2

Here is a one-liner using tidyr::complete() and approx():

    library(dplyr)
    library(tidyr)
    library(lubridate)

    df %>%
      group_by(date_index) %>%
      # add a row for every missing day within each group; value is NA there
      complete(f_date = full_seq(f_date, 1)) %>%
      # interpolate from the group's own non-missing points
      mutate(value = approx(f_date[!is.na(value)], y = value[!is.na(value)],
                            xout = f_date, method = "linear")$y)
    #> # A tibble: 20 × 3
    #> # Groups:   date_index [2]
    #>    date_index f_date     value
    #>    <date>     <date>     <dbl>
    #>  1 2023-01-04 2023-02-01  1.5
    #>  2 2023-01-04 2023-02-02  1.75
    #>  3 2023-01-04 2023-02-03  2
    #>  4 2023-01-04 2023-02-04  2.18
    #>  5 2023-01-04 2023-02-05  2.36
    #>  6 2023-01-04 2023-02-06  2.54
    #>  7 2023-01-04 2023-02-07  2.71
    #>  8 2023-01-04 2023-02-08  2.89
    #>  9 2023-01-04 2023-02-09  3.07
    #> 10 2023-01-04 2023-02-10  3.25
    #> 11 2023-01-06 2023-02-01  4
    #> 12 2023-01-06 2023-02-02  3.5
    #> 13 2023-01-06 2023-02-03  3
    #> 14 2023-01-06 2023-02-04  2.93
    #> 15 2023-01-06 2023-02-05  2.86
    #> 16 2023-01-06 2023-02-06  2.79
    #> 17 2023-01-06 2023-02-07  2.71
    #> 18 2023-01-06 2023-02-08  2.64
    #> 19 2023-01-06 2023-02-09  2.57
    #> 20 2023-01-06 2023-02-10  2.5

The complete() function is handy here: it fills in the missing values of a given sequence, in this case at 1-day intervals.
Once the missing dates have been added, approx() interpolates the resulting NA values.
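
As a sanity check on what approx() is doing, here is a minimal base-R sketch for the first group only. It assumes the three known points from the question's data for date_index 2023-01-04; the names x_known, y_known, x_daily, y_daily and by_hand are just illustrative. It compares the approx() result with the hand calculation for 2023-02-04, which lies one day into the 7-day gap between 2023-02-03 and 2023-02-10.

```r
# Known points for date_index 2023-01-04 (taken from the question's data).
x_known <- as.Date(c("2023-02-01", "2023-02-03", "2023-02-10"))
y_known <- c(1.50, 2.00, 3.25)

# Daily grid spanning the known dates.
x_daily <- seq(min(x_known), max(x_known), by = "day")

# Interpolate on the numeric day counts behind the Date values.
y_daily <- approx(x = as.numeric(x_known), y = y_known,
                  xout = as.numeric(x_daily), method = "linear")$y

# Hand calculation for 2023-02-04: start value plus 1/7 of the rise.
by_hand <- 2.00 + (3.25 - 2.00) * (1 / 7)

result <- data.frame(f_date = x_daily, value = round(y_daily, 2))
print(result)
```

The fourth row of result corresponds to 2023-02-04 and should agree with by_hand after rounding.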

Answer 2

Score: 1

I'm not sure how to make this happen with an "out-of-the-box" function, though I don't doubt that it's possible. First, let's make the data then I'll explain the solution.

    library(dplyr)
    library(lubridate)

    df <- data.frame("date_index" = c("20230104", "20230104", "20230104", "20230106", "20230106", "20230106"),
                     "f_date" = c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
                     "value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50)) %>%
      mutate(date_index = ymd(date_index), f_date = ymd(f_date))

I wrote a function that interpolates with a piecewise-linear regression model that is just identified: it has as many parameters as there are non-missing values of value. The knots are placed at the f_date positions of the interior observed values, and the fitted model is then used to predict the missing values.

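    interp_fun <- function(x, y) {
      # hinge basis: one column per knot K, equal to x - K right of K and 0 left of it
      pwl <- function(x, k) sapply(k, function(K) ifelse(x >= K, x - K, 0))
      w <- which(!is.na(y))          # positions of the observed values
      if (length(w) > 2) {
        w <- w[-c(1, length(w))]     # drop the endpoints, keep interior points
        kts <- x[w]                  # knots at the interior observed x values
        mod <- lm(y ~ x + pwl(x, kts), na.action = na.exclude)
        # keep observed values, fill missing ones with model predictions
        ifelse(is.na(y), predict(mod, newdata = data.frame(x = x)), y)
      } else {
        # no interior points: fit a single straight line
        mod <- lm(y ~ x, na.action = na.exclude)
        ifelse(is.na(y), predict(mod, newdata = data.frame(x = x)), y)
      }
    }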

With the interpolation function in hand, group the data by date_index, generate a daily sequence of dates from the minimum to the maximum f_date in each group, and join that back to the original data. Then group by date_index again and apply the interpolation function to f_date and value. The result appears to be the one you want. This should work regardless of how many "interior" observed values lie between the observed endpoints.

    df %>%
      group_by(date_index) %>%
      arrange(f_date, .by_group = TRUE) %>%
      reframe(f_date = seq(from = min(f_date),
                           to = max(f_date),
                           by = "1 day")) %>%
      left_join(df) %>%
      group_by(date_index) %>%
      mutate(value = interp_fun(f_date, value))
    #> Joining with `by = join_by(date_index, f_date)`
    #> # A tibble: 20 × 3
    #> # Groups:   date_index [2]
    #>    date_index f_date     value
    #>    <date>     <date>     <dbl>
    #>  1 2023-01-04 2023-02-01  1.5
    #>  2 2023-01-04 2023-02-02  1.75
    #>  3 2023-01-04 2023-02-03  2
    #>  4 2023-01-04 2023-02-04  2.18
    #>  5 2023-01-04 2023-02-05  2.36
    #>  6 2023-01-04 2023-02-06  2.54
    #>  7 2023-01-04 2023-02-07  2.71
    #>  8 2023-01-04 2023-02-08  2.89
    #>  9 2023-01-04 2023-02-09  3.07
    #> 10 2023-01-04 2023-02-10  3.25
    #> 11 2023-01-06 2023-02-01  4
    #> 12 2023-01-06 2023-02-02  3.50
    #> 13 2023-01-06 2023-02-03  3
    #> 14 2023-01-06 2023-02-04  2.93
    #> 15 2023-01-06 2023-02-05  2.86
    #> 16 2023-01-06 2023-02-06  2.79
    #> 17 2023-01-06 2023-02-07  2.71
    #> 18 2023-01-06 2023-02-08  2.64
    #> 19 2023-01-06 2023-02-09  2.57
    #> 20 2023-01-06 2023-02-10  2.5

Created on 2023-06-01 with reprex v2.0.2
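
To see why a just-identified piecewise-linear regression amounts to ordinary linear interpolation, here is a small self-contained sketch on toy numeric data. It is an illustration under assumptions rather than the function above: x is a plain day counter instead of a Date, and hinge, obs, knots, fit, pred and ref are hypothetical names. With an intercept, a slope, and one hinge term per interior observed point, the model has as many parameters as observations, so the fitted line passes through every observed point and its predictions should match approx().

```r
# Toy data: numeric x so the sketch does not depend on Date handling.
x <- 0:9
y <- c(1.50, NA, 2.00, NA, NA, NA, NA, NA, NA, 3.25)

# Hinge basis: one column per knot, zero left of the knot, (x - knot) right of it.
hinge <- function(x, knots) sapply(knots, function(k) pmax(x - k, 0))

obs   <- which(!is.na(y))
knots <- x[obs[-c(1, length(obs))]]   # interior observed x positions only

# Rows with missing y are dropped by lm()'s default na.action when fitting.
fit  <- lm(y ~ x + hinge(x, knots))
pred <- unname(predict(fit, newdata = data.frame(x = x)))

# Reference: ordinary linear interpolation through the same points.
ref <- approx(x[obs], y[obs], xout = x)$y

print(round(cbind(x, pred, ref), 2))
```

The pred and ref columns should agree up to floating-point noise, which is why the regression route and the approx() route give the same table for the question's data.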

Answer 3

Score: 0

Here is my take on it using summarise and unnest:

    library(dplyr, warn.conflicts = FALSE)
    library(lubridate, warn.conflicts = FALSE)
    library(tidyr)

    # interpolate one group's (x, y) points onto a daily grid
    interpolator <- function(x, y) {
      date_first <- first(x)
      date_last <- last(x)
      interpolant <- approxfun(x = x, y = y)
      date_seq <- seq.Date(ymd(date_first), ymd(date_last), by = 1)
      tibble(
        f_date = date_seq, value = interpolant(date_seq)
      )
    }

    df <- data.frame("date_index" = c("20230104", "20230104", "20230104", "20230106", "20230106", "20230106"),
                     "f_date" = c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
                     "value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50)) %>%
      mutate(date_index = ymd(date_index), f_date = ymd(f_date))

    df %>%
      group_by(date_index) %>%
      arrange(f_date) %>%
      summarise(
        tmp = list(interpolator(f_date, value))
      ) %>%
      unnest(tmp)
    #> # A tibble: 20 × 3
    #>    date_index f_date     value
    #>    <date>     <date>     <dbl>
    #>  1 2023-01-04 2023-02-01  1.5
    #>  2 2023-01-04 2023-02-02  1.75
    #>  3 2023-01-04 2023-02-03  2
    #>  4 2023-01-04 2023-02-04  2.18
    #>  5 2023-01-04 2023-02-05  2.36
    #>  6 2023-01-04 2023-02-06  2.54
    #>  7 2023-01-04 2023-02-07  2.71
    #>  8 2023-01-04 2023-02-08  2.89
    #>  9 2023-01-04 2023-02-09  3.07
    #> 10 2023-01-04 2023-02-10  3.25
    #> 11 2023-01-06 2023-02-01  4
    #> 12 2023-01-06 2023-02-02  3.5
    #> 13 2023-01-06 2023-02-03  3
    #> 14 2023-01-06 2023-02-04  2.93
    #> 15 2023-01-06 2023-02-05  2.86
    #> 16 2023-01-06 2023-02-06  2.79
    #> 17 2023-01-06 2023-02-07  2.71
    #> 18 2023-01-06 2023-02-08  2.64
    #> 19 2023-01-06 2023-02-09  2.57
    #> 20 2023-01-06 2023-02-10  2.5

Created on 2023-06-01 with reprex v2.0.2
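
The same split, interpolate, and recombine idea can also be written without tidyverse. The following is a rough base-R sketch, assuming the example data from the question; interpolate_group, pieces and dfi are hypothetical names chosen for illustration, and no claim is made about how it performs on a very large data frame.

```r
# Same example data as in the question, built with base R only.
df <- data.frame(
  date_index = as.Date(c("2023-01-04", "2023-01-04", "2023-01-04",
                         "2023-01-06", "2023-01-06", "2023-01-06")),
  f_date     = as.Date(c("2023-02-01", "2023-02-03", "2023-02-10",
                         "2023-02-01", "2023-02-03", "2023-02-10")),
  value      = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50)
)

# Interpolate one group onto a daily grid (hypothetical helper).
interpolate_group <- function(g) {
  days <- seq(min(g$f_date), max(g$f_date), by = "day")
  data.frame(
    date_index = g$date_index[1],
    f_date     = days,
    value      = approx(as.numeric(g$f_date), g$value,
                        xout = as.numeric(days))$y
  )
}

# Split by date_index, interpolate each piece, and stack the results.
pieces <- lapply(split(df, df$date_index), interpolate_group)
dfi <- do.call(rbind, pieces)
rownames(dfi) <- NULL
print(dfi)
```

Because each group builds its own daily grid from its own minimum and maximum f_date, groups with different numbers of f_date values are handled independently.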

huangapple
  • Posted on 2023-06-02 05:11:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76385735.html