使用基于data.table的函数完成time.series

huangapple go评论131阅读模式
英文:

Complete time.series using a function based on data.table

问题

我试图学习更多关于data.table以及正确的使用方式。

可复现的示例

这是我的数据:

  1. set.seed(1)
  2. df <- data.frame(group1 = rep(c("a", "b"), each = 9),
  3. group2 = rep(c("A", "B", "C"), each = 3),
  4. date = as.Date("2023-01-01") + c(1,2,5,0,3,4,2,4,6,0,2,3,1,2,3,4,6,7),
  5. x = rnorm(18), y = rnorm(18))

我想要在每个组中添加缺失的日期。每个组都以其自己的最大日期和最小日期开始和结束。每个组由group1group2定义。

任务

使用Tidyverse,我会这样做:

  1. fill <- list(x=0, y=1)
  2. out1 <- df %>%
  3. dplyr::group_by(group1, group2) %>%
  4. tidyr::complete(date = seq.Date(min(date), max(date), "day"), fill = fill) %>%
  5. dplyr::ungroup() %>%
  6. as.data.frame()

使用data.table(我认为)解决方案如下:

  1. library(data.table)
  2. fill <- list(x=0, y=1)
  3. dt <- as.data.table(df)
  4. indx <- dt[, .(date = seq(min(date), max(date), "day")), by = list(group1, group2)]
  5. dt <- dt[indx, on = .(date, group1, group2)]
  6. if (length(fill)) dt[, names(fill) := Map(replace, .SD, lapply(.SD, is.na), fill), .SDcols=names(fill)]
  7. out2 <- as.data.frame(dt)

我得到相同的结果:

  1. identical(out1, out2)
  2. # TRUE

将任务制作成函数

现在,我想创建一个函数,该函数接受数据框、日期列、多个分组列和多个要填充的列作为输入。

使用Tidyverse,我会这样做:

  1. complete_ts_tv <- function(df, date, ..., fill = list()){
  2. df %>%
  3. dplyr::group_by(...) %>%
  4. tidyr::complete({{date}} := seq.Date(min({{date}}), max({{date}}), "day"), fill = fill) %>%
  5. dplyr::ungroup() %>%
  6. as.data.frame()
  7. }
  8. identical(out1, complete_ts_tv(df, date, group1, group2, fill = fill))
  9. # TRUE

如何使用data.table创建类似的函数?

我尝试过的方法

我找到了一个解决方案,但如果日期列实际上被称为“date”,则无法正常工作。

  1. library(data.table)
  2. set.seed(1)
  3. df <- data.frame(group1 = rep(c("a", "b"), each = 9),
  4. group2 = rep(c("A", "B", "C"), each = 3),
  5. date = as.Date("2023-01-01") + c(1,2,5,0,3,4,2,4,6,0,2,3,1,2,3,4,6,7),
  6. x = rnorm(18), y = rnorm(18))
  7. fill <- list(x=0, y=1)
  8. complete_ts_dt <- function(df, date, ..., fill = list()) {
  9. dt <- as.data.table(df)
  10. # 为每个组合创建日期序列
  11. indx <- dt[, .(date = seq(min(get(date)), max(get(date)), "day")), by = c(...)]
  12. setnames(indx, "date", date)
  13. # 在原始数据表上执行联接以填充缺失的日期
  14. dt <- dt[indx, on = c(date, ...)]
  15. # 用填充值替换指定列中的NA值
  16. if (length(fill)) {
  17. dt[, (names(fill)) := Map(replace, .SD, lapply(.SD, is.na), fill), .SDcols = names(fill)]
  18. }
  19. # 将结果转换回数据框
  20. out <- as.data.frame(dt)
  21. return(out)
  22. }
  23. # 无法正常工作!
  24. complete_ts_dt(df,
  25. date = "date", "group1", "group2", fill = list(x = 0, y = 1))
  26. # Error in get(date): invalid first argument
  27. # 可以正常工作!
  28. complete_ts_dt(df = dplyr::rename(df, DATE = date),
  29. date = "DATE", "group1", "group2", fill = list(x = 0, y = 1))

有什么帮助吗?
我还对如何使我的函数更适合正确使用data.table 感兴趣。

更新:最终,我这样做了。看起来足够符合data.table吗?

  1. generate_dates <- function(date){
  2. seq.Date(min(date), max(date), by = "day")
  3. }
  4. complete_ts_dt <- function(df, date, ..., fill = list()) {
  5. dt <- as.data.table(df)
  6. # 为每个组合创建日期序列
  7. indx <- dt[, lapply(.SD, generate_dates), by = c(...), .SDcols = date]
  8. # 在原始数据表上执行联接以填充缺失的日期
  9. dt <- dt[indx, on = c(date, ...)]
  10. # 用填充值替换指定列中的NA值
  11. if (length(fill)) {
  12. dt[, (names(fill)) := Map(replace, .SD, lapply(.SD, is.na), fill), .SDcols = names(fill)]
  13. }
  14. # 将结果转换回数据框
  15. out <- as.data.frame(dt)
  16. return(out)
  17. }

现在它可以正常工作了。

英文:

I'm trying to learn more about data.table and the proper way to use it.

Reproducible example

This is my data:

  1. set.seed(1)
  2. df &lt;- data.frame(group1 = rep(c(&quot;a&quot;, &quot;b&quot;), each = 9),
  3. group2 = rep(c(&quot;A&quot;, &quot;B&quot;, &quot;C&quot;), each = 3),
  4. date = as.Date(&quot;2023-01-01&quot;) + c(1,2,5,0,3,4,2,4,6,0,2,3,1,2,3,4,6,7),
  5. x = rnorm(18), y = rnorm(18))

I want to add the missing dates in my dataset in each group.
Each group starts and ends at its own max and min date. Each group is defined by group1 and group2.

Task

With Tidyverse I would do the following:

  1. fill &lt;- list(x=0, y=1)
  2. out1 &lt;- df |&gt;
  3. dplyr::group_by(group1, group2) |&gt;
  4. tidyr::complete(date = seq.Date(min(date), max(date), &quot;day&quot;), fill = fill) |&gt;
  5. dplyr::ungroup() |&gt;
  6. as.data.frame()

With data.table (I think) the solution would be:

  1. library(data.table)
  2. fill &lt;- list(x=0, y=1)
  3. dt &lt;- as.data.table(df)
  4. indx &lt;- dt[, .(date = seq(min(date), max(date), &quot;day&quot;)), by = list(group1, group2)]
  5. dt &lt;- dt[indx, on = .(date, group1, group2)]
  6. if (length(fill)) dt[, names(fill) := Map(replace, .SD, lapply(.SD, is.na), fill), .SDcols=names(fill)]
  7. out2 &lt;- as.data.frame(dt)

I get the same result:

  1. identical(out1, out2)
  2. # TRUE

Make a function out of the task

Now, I want to create a function that takes in input a data.frame, a date column, multiple grouping columns and multiple columns to be filled.

With Tidyverse I would do:

  1. complete_ts_tv &lt;- function(df, date, ..., fill = list()){
  2. df |&gt;
  3. dplyr::group_by(...) |&gt;
  4. tidyr::complete({{date}} := seq.Date(min({{date}}), max({{date}}), &quot;day&quot;), fill = fill) |&gt;
  5. dplyr::ungroup() |&gt;
  6. as.data.frame()
  7. }
  8. identical(out1, complete_ts_tv(df, date, group1, group2, fill = fill))
  9. # TRUE

How do I create a similar function with data.table?

What I tried so far

I found a solution but it doesn't work if the date column is actually called "date".

  1. library(data.table)
  2. set.seed(1)
  3. df &lt;- data.frame(group1 = rep(c(&quot;a&quot;, &quot;b&quot;), each = 9),
  4. group2 = rep(c(&quot;A&quot;, &quot;B&quot;, &quot;C&quot;), each = 3),
  5. date = as.Date(&quot;2023-01-01&quot;) + c(1,2,5,0,3,4,2,4,6,0,2,3,1,2,3,4,6,7),
  6. x = rnorm(18), y = rnorm(18))
  7. fill &lt;- list(x=0, y=1)
  8. complete_ts_dt &lt;- function(df, date, ..., fill = list()) {
  9. dt &lt;- as.data.table(df)
  10. # Create a sequence of dates for each group combination
  11. indx &lt;- dt[, .(date = seq(min(get(date)), max(get(date)), &quot;day&quot;)), by = c(...)]
  12. setnames(indx, &quot;date&quot;, date)
  13. # Perform a join on the original data.table to fill in missing dates
  14. dt &lt;- dt[indx, on = c(date, ...)]
  15. # Replace NA values in the specified columns with the fill values
  16. if (length(fill)) {
  17. dt[, (names(fill)) := Map(replace, .SD, lapply(.SD, is.na), fill), .SDcols = names(fill)]
  18. }
  19. # Convert the result back to a data.frame
  20. out &lt;- as.data.frame(dt)
  21. return(out)
  22. }
  23. # doesnt work!
  24. complete_ts_dt(df,
  25. date = &quot;date&quot;, &quot;group1&quot;, &quot;group2&quot;, fill = list(x = 0, y = 1))
  26. #&gt; Error in get(date): invalid first argument
  27. # works!
  28. complete_ts_dt(df = dplyr::rename(df, DATE = date),
  29. date = &quot;DATE&quot;, &quot;group1&quot;, &quot;group2&quot;, fill = list(x = 0, y = 1))
  30. #&gt; group1 group2 DATE x y
  31. #&gt; 1 a A 2023-01-02 -0.62645381 0.82122120
  32. #&gt; 2 a A 2023-01-03 0.18364332 0.59390132
  33. #&gt; 3 a A 2023-01-04 0.00000000 1.00000000
  34. #&gt; 4 a A 2023-01-05 0.00000000 1.00000000
  35. #&gt; 5 a A 2023-01-06 -0.83562861 0.91897737
  36. #&gt; 6 a B 2023-01-01 1.59528080 0.78213630
  37. #&gt; 7 a B 2023-01-02 0.00000000 1.00000000
  38. #&gt; 8 a B 2023-01-03 0.00000000 1.00000000
  39. #&gt; 9 a B 2023-01-04 0.32950777 0.07456498
  40. #&gt; 10 a B 2023-01-05 -0.82046838 -1.98935170
  41. #&gt; 11 a C 2023-01-03 0.48742905 0.61982575
  42. #&gt; 12 a C 2023-01-04 0.00000000 1.00000000
  43. #&gt; 13 a C 2023-01-05 0.73832471 -0.05612874
  44. #&gt; 14 a C 2023-01-06 0.00000000 1.00000000
  45. #&gt; 15 a C 2023-01-07 0.57578135 -0.15579551
  46. #&gt; 16 b A 2023-01-01 -0.30538839 -1.47075238
  47. #&gt; 17 b A 2023-01-02 0.00000000 1.00000000
  48. #&gt; 18 b A 2023-01-03 1.51178117 -0.47815006
  49. #&gt; 19 b A 2023-01-04 0.38984324 0.41794156
  50. #&gt; 20 b B 2023-01-02 -0.62124058 1.35867955
  51. #&gt; 21 b B 2023-01-03 -2.21469989 -0.10278773
  52. #&gt; 22 b B 2023-01-04 1.12493092 0.38767161
  53. #&gt; 23 b C 2023-01-05 -0.04493361 -0.05380504
  54. #&gt; 24 b C 2023-01-06 0.00000000 1.00000000
  55. #&gt; 25 b C 2023-01-07 -0.01619026 -1.37705956
  56. #&gt; 26 b C 2023-01-08 0.94383621 -0.41499456

Any help?
I would be also interested in understanding how I can make my function more appropriate to a correct use of data.table.


Update: eventually I did like this. Does it look data.table-like enough?

  1. generate_dates &lt;- function(date){
  2. seq.Date(min(date), max(date), by = &quot;day&quot;)
  3. }
  4. complete_ts_dt &lt;- function(df, date, ..., fill = list()) {
  5. dt &lt;- as.data.table(df)
  6. # Create a sequence of dates for each group combination
  7. indx &lt;- dt[, lapply(.SD, generate_dates), by = c(...), .SDcols = date]
  8. # Perform a join on the original data.table to fill in missing dates
  9. dt &lt;- dt[indx, on = c(date, ...)]
  10. # Replace NA values in the specified columns with the fill values
  11. if (length(fill)) {
  12. dt[, (names(fill)) := Map(replace, .SD, lapply(.SD, is.na), fill), .SDcols = names(fill)]
  13. }
  14. # Convert the result back to a data.frame
  15. out &lt;- as.data.frame(dt)
  16. return(out)
  17. }

It works with my examples now.

答案1

得分: 1

Here's the translated code you provided:

  1. 如果你正在使用 `data.table`,请使用 `setDT` `df` 转换为 `data.table`,然后保持不变。没有必要将函数输出再转换为 `data.frame`
  2. 连接是一个不错的方法。尽管如此,我不认为需要使用 `lapply` 或一个辅助函数,而且 `NA` 填充可以简化。
  3. complete_ts_dt1 <- function(df, date, ..., fill = list()) {
  4. # 创建一个包含所有日期的 data.table
  5. dt <- setnames(
  6. setDT(df)[
  7. ,.(date = seq(min(.SD[[1]]), max(.SD[[1]]), "day")),
  8. c(...), .SDcols = date # 按照 ...
  9. ], "date", date
  10. )
  11. # 将原始数据连接到包含所有缺失日期的 data.table
  12. df[dt, on = c(..., date)][
  13. is.na(get(names(fill)[[1]])), names(fill) := fill # NA 填充
  14. ]
  15. }
  16. 另一种方法:
  17. complete_ts_dt2 <- function(df, date, ..., fill = list()) {
  18. setorderv( # 排序生成的 data.table
  19. rbindlist( # 将原始数据绑定到包含缺失日期的 data.table 上
  20. list(
  21. setDT(df),
  22. df[ # 创建一个包含缺失日期的 data.table
  23. ,.(
  24. date = as.Date(
  25. sequence(as.integer(diff(.SD[[1]]) - 1L), .SD[[1]][-.N] + 1L)
  26. )
  27. ), c(...), .SDcols = date # 按照 ...
  28. ][,names(fill) := fill]
  29. ), use.names = FALSE
  30. ), c(..., date) # 按照 ... 和日期排序
  31. )
  32. }
  33. data.table 在性能方面表现出色。计时:
  34. microbenchmark::microbenchmark(
  35. complete_ts_dt1 = complete_ts_dt1(df, date = "date", "group1", "group2", fill = list(x = 0, y = 1)),
  36. complete_ts_dt2 = complete_ts_dt2(df, date = "date", "group1", "group2", fill = list(x = 0, y = 1)),
  37. complete_ts_tv = complete_ts_tv(df, date, group1, group2, fill = list(x = 0, y = 1)),
  38. times = 10
  39. )
  40. #> Unit: milliseconds
  41. #> expr min lq mean median uq max neval
  42. #> complete_ts_dt1 2.6638 2.7706 4.18143 2.93130 3.3745 9.7147 10
  43. #> complete_ts_dt2 1.4248 1.5499 2.31117 1.89895 2.0148 6.8276 10
  44. #> complete_ts_tv 20.3337 20.7733 24.21463 24.83460 26.8032 28.2474 10

I've provided the translation of the code you provided, focusing on the code portions and omitting the explanation and additional information. If you have any specific questions or need further assistance with this code, please let me know.

英文:

If you're working with data.table, use setDT to convert df to a data.table by reference and then leave it that way. No reason to convert the function output back to a data.frame.

The join is a good approach. Although, I don't see the need for lapply or a helper function, and the NA fill can be simplified.

  1. complete_ts_dt1 &lt;- function(df, date, ..., fill = list()) {
  2. # create a data.table with all dates filled in
  3. dt &lt;- setnames(
  4. setDT(df)[
  5. ,.(date = seq(min(.SD[[1]]), max(.SD[[1]]), &quot;day&quot;)),
  6. c(...), .SDcols = date # group by ...
  7. ], &quot;date&quot;, date
  8. )
  9. # join the original data to a data.table with all the missing dates filled in
  10. df[dt, on = c(..., date)][
  11. is.na(get(names(fill)[[1]])), names(fill) := fill # NA fill
  12. ]
  13. }

An alternative approach:

  1. complete_ts_dt2 &lt;- function(df, date, ..., fill = list()) {
  2. setorderv( # sort the resulting data.table
  3. rbindlist( # bind the original data to a data.table with the missing dates
  4. list(
  5. setDT(df),
  6. df[ # create a data.table with the missing dates
  7. ,.(
  8. date = as.Date(
  9. sequence(as.integer(diff(.SD[[1]]) - 1L), .SD[[1]][-.N] + 1L)
  10. )
  11. ), c(...), .SDcols = date # group by ...
  12. ][,names(fill) := fill]
  13. ), use.names = FALSE
  14. ), c(..., date) # sort by ... and date
  15. )
  16. }

data.table shines especially when performance is a concern. Timings:

  1. microbenchmark::microbenchmark(
  2. complete_ts_dt1 = complete_ts_dt1(df, date = &quot;date&quot;, &quot;group1&quot;, &quot;group2&quot;, fill = list(x = 0, y = 1)),
  3. complete_ts_dt2 = complete_ts_dt2(df, date = &quot;date&quot;, &quot;group1&quot;, &quot;group2&quot;, fill = list(x = 0, y = 1)),
  4. complete_ts_tv = complete_ts_tv(df, date, group1, group2, fill = list(x = 0, y = 1)),
  5. times = 10
  6. )
  7. #&gt; Unit: milliseconds
  8. #&gt; expr min lq mean median uq max neval
  9. #&gt; complete_ts_dt1 2.6638 2.7706 4.18143 2.93130 3.3745 9.7147 10
  10. #&gt; complete_ts_dt2 1.4248 1.5499 2.31117 1.89895 2.0148 6.8276 10
  11. #&gt; complete_ts_tv 20.3337 20.7733 24.21463 24.83460 26.8032 28.2474 10

Here's a full reprex to show that all the results are the same:

  1. library(tidyverse)
  2. library(data.table)
  3. set.seed(1)
  4. df &lt;- data.frame(group1 = rep(c(&quot;a&quot;, &quot;b&quot;), each = 9),
  5. group2 = rep(c(&quot;A&quot;, &quot;B&quot;, &quot;C&quot;), each = 3),
  6. date = as.Date(&quot;2023-01-01&quot;) + c(1,2,5,0,3,4,2,4,6,0,2,3,1,2,3,4,6,7),
  7. x = rnorm(18), y = rnorm(18))
  8. complete_ts_tv &lt;- function(df, date, ..., fill = list()) {
  9. df |&gt;
  10. group_by(...) |&gt;
  11. complete({{date}} := seq.Date(min({{date}}), max({{date}}), &quot;day&quot;), fill = fill) |&gt;
  12. ungroup() |&gt;
  13. as.data.frame()
  14. }
  15. generate_dates &lt;- function(date) seq.Date(min(date), max(date), by = &quot;day&quot;)
  16. complete_ts_dt &lt;- function(df, date, ..., fill = list()) {
  17. dt &lt;- as.data.table(df)
  18. indx &lt;- dt[, lapply(.SD, generate_dates), by = c(...), .SDcols = date]
  19. dt &lt;- dt[indx, on = c(date, ...)]
  20. if (length(fill)) {
  21. dt[, (names(fill)) := Map(replace, .SD, lapply(.SD, is.na), fill), .SDcols = names(fill)]
  22. }
  23. dt
  24. }
  25. complete_ts_dt1 &lt;- function(df, date, ..., fill = list()) {
  26. # create a data.table with all dates filled in
  27. dt &lt;- setnames(
  28. setDT(df)[
  29. ,.(date = seq(min(.SD[[1]]), max(.SD[[1]]), &quot;day&quot;)),
  30. c(...), .SDcols = date # group by ...
  31. ], &quot;date&quot;, date
  32. )
  33. # join the original data to a data.table with all the missing dates filled in
  34. df[dt, on = c(..., date)][
  35. is.na(get(names(fill)[[1]])), names(fill) := fill # NA fill
  36. ]
  37. }
  38. complete_ts_dt2 &lt;- function(df, date, ..., fill = list()) {
  39. setorderv( # sort the resulting data.table
  40. rbindlist( # bind the original data to a data.table with the missing dates
  41. list(
  42. setDT(df),
  43. df[ # create a data.table with the missing dates
  44. ,.(
  45. date = as.Date(
  46. sequence(as.integer(diff(.SD[[1]]) - 1L), .SD[[1]][-.N] + 1L)
  47. )
  48. ), c(...), .SDcols = date # group by ...
  49. ][,names(fill) := fill]
  50. ), use.names = FALSE
  51. ), c(..., date) # sort by ... and date
  52. )
  53. }
  54. complete_ts_dt2(df, &quot;date&quot;, &quot;group1&quot;, &quot;group2&quot;, fill = list(x = 0, y = 1))[]
  55. #&gt; group1 group2 date x y
  56. #&gt; 1: a A 2023-01-02 -0.62645381 0.82122120
  57. #&gt; 2: a A 2023-01-03 0.18364332 0.59390132
  58. #&gt; 3: a A 2023-01-04 0.00000000 1.00000000
  59. #&gt; 4: a A 2023-01-05 0.00000000 1.00000000
  60. #&gt; 5: a A 2023-01-06 -0.83562861 0.91897737
  61. #&gt; 6: a B 2023-01-01 1.59528080 0.78213630
  62. #&gt; 7: a B 2023-01-02 0.00000000 1.00000000
  63. #&gt; 8: a B 2023-01-03 0.00000000 1.00000000
  64. #&gt; 9: a B 2023-01-04 0.32950777 0.07456498
  65. #&gt; 10: a B 2023-01-05 -0.82046838 -1.98935170
  66. #&gt; 11: a C 2023-01-03 0.48742905 0.61982575
  67. #&gt; 12: a C 2023-01-04 0.00000000 1.00000000
  68. #&gt; 13: a C 2023-01-05 0.73832471 -0.05612874
  69. #&gt; 14: a C 2023-01-06 0.00000000 1.00000000
  70. #&gt; 15: a C 2023-01-07 0.57578135 -0.15579551
  71. #&gt; 16: b A 2023-01-01 -0.30538839 -1.47075238
  72. #&gt; 17: b A 2023-01-02 0.00000000 1.00000000
  73. #&gt; 18: b A 2023-01-03 1.51178117 -0.47815006
  74. #&gt; 19: b A 2023-01-04 0.38984324 0.41794156
  75. #&gt; 20: b B 2023-01-02 -0.62124058 1.35867955
  76. #&gt; 21: b B 2023-01-03 -2.21469989 -0.10278773
  77. #&gt; 22: b B 2023-01-04 1.12493092 0.38767161
  78. #&gt; 23: b C 2023-01-05 -0.04493361 -0.05380504
  79. #&gt; 24: b C 2023-01-06 0.00000000 1.00000000
  80. #&gt; 25: b C 2023-01-07 -0.01619026 -1.37705956
  81. #&gt; 26: b C 2023-01-08 0.94383621 -0.41499456
  82. #&gt; group1 group2 date x y
  83. res &lt;- list(
  84. setDT(complete_ts_tv(df, date, group1, group2, fill = list(x = 0, y = 1))),
  85. complete_ts_dt(df, &quot;date&quot;, &quot;group1&quot;, &quot;group2&quot;, fill = list(x = 0, y = 1)),
  86. complete_ts_dt1(df, &quot;date&quot;, &quot;group1&quot;, &quot;group2&quot;, fill = list(x = 0, y = 1)),
  87. complete_ts_dt2(df, &quot;date&quot;, &quot;group1&quot;, &quot;group2&quot;, fill = list(x = 0, y = 1))
  88. )
  89. identical(res[-4], res[-1])
  90. #&gt; [1] TRUE
  91. Created on 2023-08-29 with reprex v2.0.2

huangapple
  • 本文由 发表于 2023年8月5日 02:24:00
  • 转载请务必保留本文链接:https://go.coder-hub.com/76838367.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定