英文:
Linearly interpolating values in a large R data frame grouped by date using tidyverse
问题
我正在尝试在一个R数据框中对值进行插值(线性或其他方法)。这些值根据一个很大的数据框中的date_index分组,插值是基于f_dates进行的,以每日为单位进行的。f_dates的数量会发生变化,所以插值可能需要针对每个date_index进行分组,然后解除分组以生成更大的数据框?请参见下面的示例。
我的代码尝试如下,但似乎无法使分组和插值正常工作。
df <- data.frame("date_index"=c("20230104", "20230104", "20230104", "20230106", "20230106", "20230106"),
"f_date"=c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
"value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50)) %>%
mutate(date_index=ymd(date_index), f_date=ymd(f_date))
df %>%
group_by(date_index) %>%
arrange(f_date)
date_first <- first(df$f_date)
date_last <- last(df$f_date)
ApproxFun <- approxfun(x = df$f_date, y = df$value)
date_seq <- seq.Date(ymd(date_first), ymd(date_last), by = 1)
LinearFit <- ApproxFun(date_seq)
# 插值后的数据框
dfi <- data.frame(f_date = date_seq, value = LinearFit)
我期望的输出应该如下所示:
date_index f_date value
2023-01-04 2023-02-01 1.50
2023-01-04 2023-02-02 1.75
2023-01-04 2023-02-03 2.00
2023-01-04 2023-02-04 2.18
2023-01-04 2023-02-05 2.36
2023-01-04 2023-02-06 2.54
2023-01-04 2023-02-07 2.71
2023-01-04 2023-02-08 2.89
2023-01-04 2023-02-09 3.07
2023-01-04 2023-02-10 3.25
2023-01-06 2023-02-01 4.00
2023-01-06 2023-02-02 3.50
2023-01-06 2023-02-03 3.00
2023-01-06 2023-02-04 2.93
2023-01-06 2023-02-05 2.86
2023-01-06 2023-02-06 2.79
2023-01-06 2023-02-07 2.71
2023-01-06 2023-02-08 2.64
2023-01-06 2023-02-09 2.57
2023-01-06 2023-02-10 2.50
英文:
I am attempting to interpolate (linearly or by other methods) between values in an R data frame. The values are grouped by a date_index in a very large dataframe, and the interpolation is between the values on a daily basis based on the f_dates. The number of f_dates changes so the interpolation, presumably, needs to be grouped for each date_index and then ungrouped into a larger data frame? See example below.
My attempt at the code is as follows, but I can't seem to get the grouping and interpolation to work.
df <- data.frame("date_index"=c("20230104", "20230104", "20230104", "20230106", "20230106", "20230106"),
"f_date"=c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
"value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50)) %>%
mutate(date_index=ymd(date_index), f_date=ymd(f_date))
df %>%
group_by(date_index) %>%
arrange(f_date)
date_first <- first(df$f_date)
date_last <- last(df$f_date)
ApproxFun <- approxfun(x = df$f_date, y = df$value)
date_seq <- seq.Date(ymd(date_first), ymd(date_last), by = 1)
LinearFit <- ApproxFun(date_seq)
# the interpolated dataframe
dfi <- data.frame(f_date = date_seq, value = LinearFit)
The output I was looking for should be as follows:
date_index f_date value
2023-01-04 2023-02-01 1.50
2023-01-04 2023-02-02 1.75
2023-01-04 2023-02-03 2.00
2023-01-04 2023-02-04 2.18
2023-01-04 2023-02-05 2.36
2023-01-04 2023-02-06 2.54
2023-01-04 2023-02-07 2.71
2023-01-04 2023-02-08 2.89
2023-01-04 2023-02-09 3.07
2023-01-04 2023-02-10 3.25
2023-01-06 2023-02-01 4.00
2023-01-06 2023-02-02 3.50
2023-01-06 2023-02-03 3.00
2023-01-06 2023-02-04 2.93
2023-01-06 2023-02-05 2.86
2023-01-06 2023-02-06 2.79
2023-01-06 2023-02-07 2.71
2023-01-06 2023-02-08 2.64
2023-01-06 2023-02-09 2.57
2023-01-06 2023-02-10 2.50
答案1
得分: 2
这是一个使用tidyr::complete()
函数和approx()
函数的一行代码示例。
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
group_by(date_index) %>%
complete(f_date = full_seq(f_date, 1)) %>%
mutate(value = approx(f_date[!is.na(.$value)], y = value[!is.na(.$value)], xout = f_date, method = "linear")$y)
complete
函数非常方便,可以填充给定序列中的缺失值,这里是以1天为间隔填充。一旦缺失日期被定义,然后只需使用approx
函数来插值NA值。
英文:
Here is a one liner using the tidyr::complete()
function and the approx()
library(dplyr)
library(tidyr)
library(lubridate)
df %>% group_by(date_index) %>%
complete(f_date=full_seq(f_date, 1)) %>%
mutate(value=approx(f_date[!is.na(.$value)], y=value[!is.na(.$value)], xout=f_date, method = "linear")$y)
# A tibble: 20 × 3
# Groups: date_index [2]
date_index f_date value
<date> <date> <dbl>
1 2023-01-04 2023-02-01 1.5
2 2023-01-04 2023-02-02 1.75
3 2023-01-04 2023-02-03 2
4 2023-01-04 2023-02-04 2.18
5 2023-01-04 2023-02-05 2.36
6 2023-01-04 2023-02-06 2.54
7 2023-01-04 2023-02-07 2.71
8 2023-01-04 2023-02-08 2.89
9 2023-01-04 2023-02-09 3.07
10 2023-01-04 2023-02-10 3.25
11 2023-01-06 2023-02-01 4
12 2023-01-06 2023-02-02 3.5
13 2023-01-06 2023-02-03 3
14 2023-01-06 2023-02-04 2.93
15 2023-01-06 2023-02-05 2.86
16 2023-01-06 2023-02-06 2.79
17 2023-01-06 2023-02-07 2.71
18 2023-01-06 2023-02-08 2.64
19 2023-01-06 2023-02-09 2.57
20 2023-01-06 2023-02-10 2.5
The complete function is a handy to have and will fill-in the missing values in a given sequence, in this case in 1 day intervals.
Once the missing dates are define, then simply use the approx function to interpolate for the NA values.
答案2
得分: 1
我写了一个函数,使用分段线性回归模型进行插值,该模型只是标识性的 - 它具有与value
变量的非缺失值一样多的参数。然后,它使用该模型使用观察到的value
值作为节点来生成预测。
interp_fun <- function(x,y){
pwl <- function(x, k)sapply(k, function(K)ifelse(x >= K, x-K, 0))
w <- which(!is.na(y))
if(length(w) > 2){
w <- w[-c(1, length(w))]
kts <- x[w]
mod <- lm(y ~ x + pwl(x, kts), na.action=na.exclude)
ifelse(is.na(y), predict(mod, newdata=data.frame(x=x)), y)
}else{
mod <- lm(y ~ x, na.action=na.exclude)
ifelse(is.na(y), predict(mod, newdata=data.frame(x=x)), y)
}
}
有了插值函数,你可以按date_index
分组数据,然后从f_date
的最小日期到最大日期生成一个新的日期序列,然后将其合并回原始数据中。你可以按date_index
分组,然后将插值函数应用于f_date
和value
。结果看起来似乎是你想要的。这应该适用于你在观察端点之间有多少“内部”观察值。
df %>%
group_by(date_index) %>%
arrange(f_date, .by_group = TRUE) %>%
reframe(f_date = seq(from = min(f_date),
to = max(f_date),
by = "1 day")) %>%
left_join(df) %>%
group_by(date_index) %>%
mutate(value = interp_fun(f_date, value))
希望这些翻译对你有所帮助。
英文:
I'm not sure how to make this happen with an "out-of-the-box" function, though I don't doubt that it's possible. First, let's make the data then I'll explain the solution.
library(dplyr)
library(lubridate)
df <- data.frame("date_index"=c("20230104", "20230104", "20230104", "20230106", "20230106", "20230106"),
"f_date"=c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
"value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50)) %>%
mutate(date_index=ymd(date_index), f_date=ymd(f_date))
I wrote a function that does an interpolation with a piecewise-linear regression model that is just identified - it has as many parameters as there are non-missing values of the value
variable. It then uses that model to generate predictions using the observed values of value
as knots.
interp_fun <- function(x,y){
pwl <- function(x, k)sapply(k, function(K)ifelse(x >= K, x-K, 0))
w <- which(!is.na(y))
if(length(w) > 2){
w <- w[-c(1, length(w))]
kts <- x[w]
mod <- lm(y ~ x + pwl(x, kts), na.action=na.exclude)
ifelse(is.na(y), predict(mod, newdata=data.frame(x=x)), y)
}else{
mod <- lm(y ~ x, na.action=na.exclude)
ifelse(is.na(y), predict(mod, newdata=data.frame(x=x)), y)
}
}
With the interpolation function in hand, you can group the data by date_index
, then generate a new sequence of dates from the minimum to the maximum of f_date
and then merge that back into the original data. You can group by date_index
and then apply the interpolation function to f_date
and value
. The result appears as though it's the one you want. This should work regardless of how many "interior" observed values you have between the observed endpoints.
df %>%
group_by(date_index) %>%
arrange(f_date, .by_group = TRUE) %>%
reframe(f_date = seq(from = min(f_date),
to = max(f_date),
by = "1 day")) %>%
left_join(df) %>%
group_by(date_index) %>%
mutate(value = interp_fun(f_date, value))
#> Joining with `by = join_by(date_index, f_date)`
#> # A tibble: 20 × 3
#> # Groups: date_index [2]
#> date_index f_date value
#> <date> <date> <dbl>
#> 1 2023-01-04 2023-02-01 1.5
#> 2 2023-01-04 2023-02-02 1.75
#> 3 2023-01-04 2023-02-03 2
#> 4 2023-01-04 2023-02-04 2.18
#> 5 2023-01-04 2023-02-05 2.36
#> 6 2023-01-04 2023-02-06 2.54
#> 7 2023-01-04 2023-02-07 2.71
#> 8 2023-01-04 2023-02-08 2.89
#> 9 2023-01-04 2023-02-09 3.07
#> 10 2023-01-04 2023-02-10 3.25
#> 11 2023-01-06 2023-02-01 4
#> 12 2023-01-06 2023-02-02 3.50
#> 13 2023-01-06 2023-02-03 3
#> 14 2023-01-06 2023-02-04 2.93
#> 15 2023-01-06 2023-02-05 2.86
#> 16 2023-01-06 2023-02-06 2.79
#> 17 2023-01-06 2023-02-07 2.71
#> 18 2023-01-06 2023-02-08 2.64
#> 19 2023-01-06 2023-02-09 2.57
#> 20 2023-01-06 2023-02-10 2.5
<sup>Created on 2023-06-01 with reprex v2.0.2</sup>
答案3
得分: 0
这是我对它的看法,使用 summarise
和 unnest
:
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(tidyr)
interpolator <- function(x, y) {
date_first <- first(x)
date_last <- last(x)
interpolant <- approxfun(x = x, y = y)
date_seq <- seq.Date(ymd(date_first), ymd(date_last), by = 1)
tibble(
f_date = date_seq, value = interpolant(date_seq)
)
}
df <- data.frame("date_index"=c("20230104", "20230104", "20230104", "20230106", "20230106", "20230106"),
"f_date"=c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
"value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50)) %>%
mutate(date_index=ymd(date_index), f_date=ymd(f_date))
df %>%
group_by(date_index) %>%
arrange(f_date) %>%
summarise(
tmp = list(interpolator(f_date, value))
) %>%
unnest(tmp)
在 2023-06-01 使用 reprex v2.0.2 创建
英文:
Here is my take on it using summarise
and unnest
:
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(tidyr)
interpolator <- function(x, y) {
date_first <- first(x)
date_last <- last(x)
interpolant <- approxfun(x = x, y = y)
date_seq <- seq.Date(ymd(date_first), ymd(date_last), by = 1)
tibble(
f_date = date_seq, value = interpolant(date_seq)
)
}
df <- data.frame("date_index"=c("20230104", "20230104", "20230104", "20230106", "20230106", "20230106"),
"f_date"=c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
"value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50)) %>%
mutate(date_index=ymd(date_index), f_date=ymd(f_date))
df %>%
group_by(date_index) %>%
arrange(f_date) %>%
summarise(
tmp = list(interpolator(f_date, value))
) %>%
unnest(tmp)
#> # A tibble: 20 × 3
#> date_index f_date value
#> <date> <date> <dbl>
#> 1 2023-01-04 2023-02-01 1.5
#> 2 2023-01-04 2023-02-02 1.75
#> 3 2023-01-04 2023-02-03 2
#> 4 2023-01-04 2023-02-04 2.18
#> 5 2023-01-04 2023-02-05 2.36
#> 6 2023-01-04 2023-02-06 2.54
#> 7 2023-01-04 2023-02-07 2.71
#> 8 2023-01-04 2023-02-08 2.89
#> 9 2023-01-04 2023-02-09 3.07
#> 10 2023-01-04 2023-02-10 3.25
#> 11 2023-01-06 2023-02-01 4
#> 12 2023-01-06 2023-02-02 3.5
#> 13 2023-01-06 2023-02-03 3
#> 14 2023-01-06 2023-02-04 2.93
#> 15 2023-01-06 2023-02-05 2.86
#> 16 2023-01-06 2023-02-06 2.79
#> 17 2023-01-06 2023-02-07 2.71
#> 18 2023-01-06 2023-02-08 2.64
#> 19 2023-01-06 2023-02-09 2.57
#> 20 2023-01-06 2023-02-10 2.5
<sup>Created on 2023-06-01 with reprex v2.0.2</sup>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论