Rolling mean of time series with missing dates in R
# Question
R noob (still) here, working in `tidyverse` / RStudio.

I have a tidy dataset where each row has a date, a grouping characteristic, and a value (the actual dataset is more complicated, but that's the core of it).

I group the data by `Group` for each `Date` and calculate some summary stats of the `Value`, yielding a by-group summary for each date. For instance:

```R
grouped <- data %>%
  group_by(Date, Group) %>%
  summarise(mean = mean(Value))
head(grouped)
# A tibble: 6 × 3
# Groups:   Date [4]
  Date       Group  mean
  <date>     <fct> <dbl>
1 2021-02-18 A      37.4
2 2021-02-19 B      25.5
3 2021-02-19 A      26.1
4 2021-02-22 B      34.2
5 2021-02-22 A      26.4
6 2021-02-23 B      34.2
```

(Note: the data is below for reproducibility.)

So far so good. Now I want to take the moving average of those summary stats (`mean` in this case, but could be others) by `Group`. I tried this with `zoo::rollmean`:

```R
grouped <- grouped %>%
  group_by(Group) %>%
  mutate(rolling = zoo::rollmean(mean, window_length, fill = NA))
```

But here a problem arises: ideally, the moving average should be strictly over some number of **days**, not **records**, but there are some days missing for one or both groups.

What's the best way to ensure that the moving average correctly takes into account the missing days x groups, treating them as `NA` as needed?

(I understand from [this answer][1] that `zoo::rollmean` wouldn't be able to handle `NA` values, but `zoo::rollapply` should.)

I have tried creating a simple calendar dataframe with the full set of dates to `join` the grouped data to, but that leaves the `Group` variable as `NA` as well, so the missing days x groups are still ignored by the `rollmean` / `rollapply` function.

Hope that all makes sense!
-----
```R
data <- structure(list(Date = structure(c(18676, 18677, 18677, 18680,
18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680,
18680, 18680, 18681, 18681, 18681, 18681, 18681, 18681, 18681,
18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681,
18681, 18681, 18681, 18682, 18682, 18682, 18682, 18682, 18683,
18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683,
18683, 18683, 18683, 18684, 18684, 18684, 18684, 18684, 18684,
18684, 18684, 18684, 18684, 18684, 18685, 18685, 18685, 18685,
18685, 18685, 18685, 18685, 18685, 18685, 18685, 18687, 18687,
18687, 18687, 18687, 18687, 18687, 18687, 18687, 18688, 18688,
18688, 18688, 18688, 18688, 18688, 18688, 18688, 18689, 18689,
18689, 18689, 18689, 18689, 18690, 18690, 18690, 18690, 18690,
18690, 18690, 18690, 18691, 18691, 18691, 18691, 18691, 18691,
18691, 18691, 18691, 18691, 18692, 18692, 18692, 18692, 18692,
18692, 18692, 18692, 18692, 18692, 18692, 18692, 18693, 18694,
18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694,
18694, 18694, 18695, 18695, 18695, 18695, 18695, 18695, 18695,
18695, 18695, 18696, 18696, 18696, 18696, 18696, 18696, 18696,
18696, 18696, 18697, 18697, 18697, 18697, 18697, 18697, 18697,
18697, 18697, 18698, 18698, 18698, 18698, 18698, 18698, 18698,
18698, 18698, 18699, 18699, 18699, 18699, 18699, 18699, 18699,
18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699,
18699, 18699, 18699, 18700, 18701, 18701, 18701, 18701, 18701,
18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701,
18701, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702,
18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702,
18702, 18702, 18703, 18703, 18703, 18703, 18703, 18703, 18703,
18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703,
18703, 18703, 18703), class = "Date"), Group = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L,
2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L,
1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("B", "A"), class = "factor"),
Value = c(37.43, 26.13, 25.54, 31.65, 26.95, 15.29, 35.93,
28.59, 17.14, 30.42, 20.52, 33.4, 35.3, 36.87, 28.32, 21.78,
25.49, 34.13, 20.35, 40.21, 16, 24.58, 23.61, 38.94, 36.76,
29.68, 15.97, 20.79, 17.83, 14.65, 16.76, 35.74, 31.5, 25.6,
32.96, 14.1, 40.4, 24.53, 39.57, 21.38, 14.49, 22.11, 27.12,
16.46, 17.65, 37.32, 15.74, 17.07, 28.52, 14.72, 27.75, 36.69,
39.47, 26.13, 35.57, 24.08, 24.39, 13.1, 16.75, 24.49, 23.61,
15.04, 23.22, 37.3, 36.76, 15.77, 28.34, 35.06, 28.32, 29.39,
19.09, 35.68, 35.9, 37.13, 36.1, 40.55, 33.97, 24.03, 37.25,
34.39, 13.05, 21.64, 40.02, 26.17, 19.39, 25.76, 40.92, 24.21,
20.35, 27.7, 29.53, 14.19, 15.64, 32.74, 31.42, 14.01, 12.85,
17.31, 31.67, 23.63, 17.29, 36.71, 18.19, 17.78, 34.87, 36.87,
19.27, 24.97, 41.66, 16.83, 34.79, 14.94, 34.39, 40.66, 31.35,
31.74, 36.19, 18.28, 37.61, 37.19, 29.58, 17.04, 28.84, 16.6,
41.97, 32.36, 27.91, 21.7, 40.45, 35.38, 41.19, 35.68, 19.49,
20.94, 23.99, 14.28, 39.24, 12.19, 18.02, 39.14, 40.61, 33.32,
38.68, 39.18, 31.76, 22.64, 38.18, 36.75, 30.91, 38.82, 30.68,
14.2, 39.34, 18.91, 12.7, 28.2, 37.85, 34.06, 12.88, 40.03,
29.95, 14.61, 17.01, 35.64, 20.49, 39.51, 29.29, 18.84, 36.42,
37.88, 32.65, 19.7, 19.84, 38.75, 21.25, 40.68, 17.89, 26.3,
37.22, 18.03, 17.33, 36.26, 41.98, 19.4, 20.54, 18.6, 26.92,
15.23, 20.22, 15.2, 35.49, 15.14, 14.43, 30.82, 14.79, 17.74,
36.8, 17.09, 18.09, 19.92, 39.64, 23.87, 22.67, 24.66, 24.33,
16.82, 17.91, 21.66, 30.79, 32.91, 25.16, 38.98, 15.49, 21.33,
38.47, 34.46, 24.22, 36.93, 22.25, 15.33, 41.38, 34.49, 23.44,
30.53, 10.62, 23.8, 28.94, 12.49, 22, 24.51, 14.72, 15.53,
23.23, 38.93, 16.06, 19.36, 35.91, 22.2, 15.85, 33.36, 31.75,
19.69, 29.86, 16.3, 19.42, 19.17, 14.41, 13.18, 20.67, 17.02
)), row.names = c(NA, -250L), class = c("tbl_df", "tbl",
"data.frame"))
[1]: https://stackoverflow.com/questions/17765001/using-rollmean-when-there-are-missing-values-na
# Answer 1

**Score**: 2
**1)** Assuming a mean of 3 days (the current point and the prior 2 days) rather than 3 rows, and that dates are already sorted within Group (which is the case in the question), we calculate the number of rows to use at each point (a vector, since each point can have a different number of rows) and pass that to `rollapplyr`. At each row it averages all rows at or before the current row whose dates lie within w days prior to the current date. This performs the averaging on the original data frame without adding extra NA rows. You can find more examples of this in the Examples section of `?rollapply`.
```r
library(dplyr)
library(zoo)

w <- 3

data %>%
  group_by(Group) %>%
  # Npoints[i] = number of rows in the window (Date[i] - w, Date[i]] up to row i
  mutate(Npoints = 1:n() - findInterval(Date - w, Date),
         Mean3 = rollapplyr(Value, Npoints, mean, partial = TRUE, fill = NA)) %>%
  ungroup
```
giving:
```r
# A tibble: 250 × 5
   Date       Group Value Npoints Mean3
   <date>     <fct> <dbl>   <int> <dbl>
 1 2021-02-18 A      37.4       1  37.4
 2 2021-02-19 A      26.1       2  31.8
 3 2021-02-19 B      25.5       1  25.5
 4 2021-02-22 A      31.6       1  31.6
 5 2021-02-22 A      27.0       2  29.3
 6 2021-02-22 A      15.3       3  24.6
 7 2021-02-22 A      35.9       4  27.5
 8 2021-02-22 A      28.6       5  27.7
 9 2021-02-22 A      17.1       6  25.9
10 2021-02-22 B      30.4       1  30.4
# … with 240 more rows
```
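As a quick illustration of how the `Npoints` vector works, here is a minimal sketch on toy dates (not the question's data; the names `d` and `npts` are just for this sketch):

```r
library(zoo)

# Toy sorted dates within one group
d <- as.Date(c("2021-02-18", "2021-02-19", "2021-02-22", "2021-02-22", "2021-02-23"))
w <- 3

# findInterval(d - w, d) counts, for each row, how many rows fall on or before
# Date[i] - w; subtracting that from 1:n() leaves the rows in the window
# (Date[i] - w, Date[i]], i.e. the current day plus the prior w - 1 days.
npts <- seq_along(d) - findInterval(d - w, d)
npts
#> [1] 1 2 1 2 3
```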
**2)** If instead you also want to include rows that come after the current row whenever they share the current row's date, then use this. Here `L` is a list of offset vectors for `rollapply` to use, such that `L[[i]]` is the vector of offsets to use at the i-th row.
```r
data %>%
  group_by(Group) %>%
  mutate(L = lapply(1:n(),
                    \(i) which(Date %in% seq(Date[i] - w, Date[i], "day")) - i),
         Mean3 = rollapplyr(Value, L, mean, partial = TRUE, fill = NA)) %>%
  ungroup %>%
  select(-L)
```
giving:
```r
# A tibble: 250 × 4
   Date       Group Value Mean3
   <date>     <fct> <dbl> <dbl>
 1 2021-02-18 A      37.4  37.4
 2 2021-02-19 A      26.1  31.8
 3 2021-02-19 B      25.5  25.5
 4 2021-02-22 A      31.6  26.4
 5 2021-02-22 A      27.0  26.4
 6 2021-02-22 A      15.3  26.4
 7 2021-02-22 A      35.9  26.4
 8 2021-02-22 A      28.6  26.4
 9 2021-02-22 A      17.1  26.4
10 2021-02-22 B      30.4  32.0
# ℹ 240 more rows
# ℹ Use `print(n = ...)` to see more rows
```
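To make the offset construction concrete, here is a minimal sketch (toy dates, not the question's data) of the offsets computed at one row:

```r
# Offsets at row i = 4 (2021-02-22): positions of rows whose Date falls in
# [Date[i] - w, Date[i]], expressed relative to row i
d <- as.Date(c("2021-02-18", "2021-02-19", "2021-02-22", "2021-02-22"))
w <- 3
i <- 4
which(d %in% seq(d[i] - w, d[i], "day")) - i
#> [1] -2 -1  0
```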
**3)** Another approach is to use sqldf. This gives a similar answer to (2). Note that `group` is a reserved word in SQL, so we escape it with `[...]`. The query performs a self join on `Group` and the date condition.
```r
library(sqldf)

sqldf("select a.Date, a.[Group], a.Value, avg(b.Value) Mean3
       from data a
       left join data b on a.[Group] = b.[Group] and
                           b.Date between a.Date - 3 and a.Date
       group by a.rowid
       order by a.rowid")
```
# Answer 2

**Score**: 0
**Edit:** After revisiting the thread, I'd like to propose another solution using my function `time_roll_mean()`.

It accounts for time gaps and duplicates, and accepts groups through the `g` argument, which is specialised in that it performs a single calculation for all groups instead of one calculation per group.

It accepts unsorted data, and where there are duplicates, the mean of each last duplicate is propagated across each duplicate group.

It also accepts lubridate `period` and `duration` objects.

The downside is that it can only calculate "right-aligned" rolling means.
```r
# Uncomment below to install timeplyr
# remotes::install_github("NicChr/timeplyr")
library(timeplyr)
library(dplyr)
library(lubridate)

data %>%
  mutate(mean = time_roll_mean(Value, window = days(3), time = Date, g = Group,
                               close_left_boundary = TRUE))
#> # A tibble: 250 x 4
#>    Date       Group Value  mean
#>    <date>     <fct> <dbl> <dbl>
#>  1 2021-02-18 A      37.4  37.4
#>  2 2021-02-19 A      26.1  31.8
#>  3 2021-02-19 B      25.5  25.5
#>  4 2021-02-22 A      31.6  26.4
#>  5 2021-02-22 A      27.0  26.4
#>  6 2021-02-22 A      15.3  26.4
#>  7 2021-02-22 A      35.9  26.4
#>  8 2021-02-22 A      28.6  26.4
#>  9 2021-02-22 A      17.1  26.4
#> 10 2021-02-22 B      30.4  32.0
#> # i 240 more rows
```

<sup>Created on 2023-07-24 with reprex v2.0.2</sup>
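Note that this output matches variant (2) of Answer 1 row for row; judging by the argument name, `close_left_boundary = TRUE` treats the left edge of the 3-day window as closed, so rows sharing the current date are included.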
# Answer 3

**Score**: -1
```r
library(dplyr)
library(zoo)

# Create a calendar dataframe with the full set of dates
calendar <- data.frame(Date = seq(min(data$Date), max(data$Date), by = "day"))

# Join the data to the calendar on the "Date" column (the added calendar
# rows have NA for Group and Value)
data_full <- full_join(data, calendar, by = c("Date"))

# Group the data by date and group and calculate the summary statistic of the value
grouped <- data_full %>%
  group_by(Date, Group) %>%
  summarise(mean = mean(Value))

# Group the resulting summary statistics by group
grouped_by_group <- grouped %>%
  group_by(Group)

# Use rollapply() to calculate the moving average for each group separately;
# base::mean is spelled out because `mean` inside mutate() refers to the column
window_length <- 7  # the desired number of days for the moving average window

grouped_by_group <- grouped_by_group %>%
  mutate(rolling = rollapply(mean, window_length, base::mean,
                             fill = NA, align = "right"))
```
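As written, this still runs into the issue raised in the question: the joined-in calendar rows have `Group = NA`, so within each real group the missing days are still absent and a `window_length`-row window is not guaranteed to span `window_length` days. A minimal sketch of one way to repair that, assuming `tidyr` is available (the column name `daily_mean` is just for this sketch), is to complete the full date × group grid before rolling:

```r
library(dplyr)
library(tidyr)
library(zoo)

grouped_full <- data %>%
  group_by(Date, Group) %>%
  summarise(daily_mean = mean(Value), .groups = "drop") %>%
  # Expand to every Date x Group combination; days missing from a group
  # get rows with daily_mean = NA instead of disappearing
  complete(Date = seq(min(Date), max(Date), by = "day"), Group) %>%
  arrange(Group, Date) %>%
  group_by(Group) %>%
  # Each group now has exactly one row per calendar day, so a 7-row window
  # really is a 7-day window; na.rm = TRUE is passed through to mean() so
  # the NA gap days are skipped (a window with no data at all yields NaN)
  mutate(rolling = rollapplyr(daily_mean, 7, mean, na.rm = TRUE,
                              fill = NA, partial = TRUE))
```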