有dbplyr方法可以基于日期范围计算滚动总和吗?

huangapple go评论82阅读模式
英文:

Is there a dbplyr method to calculate rolling sum based on a date range

问题

我正在使用dbplyr访问Redshift表中的数据,并使用Tidyverse进行数据整理。我正在尝试计算7天内的滚动累计金额。

不幸的是,我找到的建议在Stack Overflow上的包,如'slider'和'runner',似乎不兼容dbplyr(或我正在使用的Redshift表和SQL)。

我正在尝试实现类似于这个表的结果。在这个示例中,'cum_sum_7Days'列中的值是'amount'列中的值的总和,其中'start_date'适合在'start_date'和'previous_7Day'列之间。

  1. Acct Start_Date Previous_7Day Amount Cum_sum_7Days
  2. YYYY 8/07/2022 7:04 1/07/2022 7:04 500 500
  3. YYYY 8/07/2022 12:49 1/07/2022 12:49 200 700
  4. YYYY 9/07/2022 11:47 2/07/2022 11:47 300 1000
  5. YYYY 9/07/2022 11:52 2/07/2022 11:52 45.6 1045.6
  6. YYYY 12/07/2022 13:03 5/07/2022 13:03 200 1245.6
  7. YYYY 15/07/2022 13:53 8/07/2022 13:53 200 745.6
  8. YYYY 16/07/2022 12:58 9/07/2022 12:58 300 700
  9. YYYY 16/07/2022 13:28 9/07/2022 13:28 500 1200
  10. YYYY 19/07/2022 12:22 12/07/2022 12:22 200 1400
  11. YYYY 23/07/2022 5:52 16/07/2022 5:52 200 1200
  12. YYYY 26/07/2022 13:01 19/07/2022 13:01 100 300
  13. YYYY 29/07/2022 13:50 22/07/2022 13:50 200 500
  14. YYYY 30/07/2022 13:57 23/07/2022 13:57 300 600
  15. YYYY 3/08/2022 6:17 27/07/2022 6:17 200 700
  16. YYYY 5/08/2022 13:30 29/07/2022 13:30 200 900
  17. YYYY 9/08/2022 13:44 2/08/2022 13:44 200 600
  18. YYYY 12/08/2022 12:13 5/08/2022 12:13 200 600

请注意:
*

英文:

I'm using dbplyr to access data in Redshift tables and Tidyverse to do the data wrangling. I'm trying to calculate a rolling sum amount over 7 days.

Unfortunately, none of the packages that I've found suggested on stack overflow, such as 'slider' and 'runner', seem to be compatible with dbplyr (or with the Redshift tables and sql that I'm using).

I'm trying to achieve similar results to this table. In this example, the value in the cum_sum_7Days column is the sum of values in the amount column where the start_date fits between the start_date and previous_7Day columns.

  1. Acct Start_Date Previous_7Day Amount Cum_sum_7Days
  2. YYYY 8/07/2022 7:04 1/07/2022 7:04 500 500
  3. YYYY 8/07/2022 12:49 1/07/2022 12:49 200 700
  4. YYYY 9/07/2022 11:47 2/07/2022 11:47 300 1000
  5. YYYY 9/07/2022 11:52 2/07/2022 11:52 45.6 1045.6
  6. YYYY 12/07/2022 13:03 5/07/2022 13:03 200 1245.6
  7. YYYY 15/07/2022 13:53 8/07/2022 13:53 200 745.6
  8. YYYY 16/07/2022 12:58 9/07/2022 12:58 300 700
  9. YYYY 16/07/2022 13:28 9/07/2022 13:28 500 1200
  10. YYYY 19/07/2022 12:22 12/07/2022 12:22 200 1400
  11. YYYY 23/07/2022 5:52 16/07/2022 5:52 200 1200
  12. YYYY 26/07/2022 13:01 19/07/2022 13:01 100 300
  13. YYYY 29/07/2022 13:50 22/07/2022 13:50 200 500
  14. YYYY 30/07/2022 13:57 23/07/2022 13:57 300 600
  15. YYYY 3/08/2022 6:17 27/07/2022 6:17 200 700
  16. YYYY 5/08/2022 13:30 29/07/2022 13:30 200 900
  17. YYYY 9/08/2022 13:44 2/08/2022 13:44 200 600
  18. YYYY 12/08/2022 12:13 5/08/2022 12:13 200 600

Note that:

  • My dates are not consecutive
  • The date-time fields are required as the 7 days must be accurate to the hour-minute
  • The rolling window size may be changed from 7 days (14 days, 1 year, etc.)

So, any solution need to handle this.

答案1

得分: 0

是的,这在dbplyr中是可能的。但是,方法将不同于纯R的方式。

dbplyr通过将tidyverse命令转化为SQL来工作。如果对于某个函数没有定义翻译,那么它可能无法正常工作。这就是为什么使用来自runner包的sum_run等函数会出现"no applicable method"错误。dbplyr不知道如何将它们翻译成SQL。

我们可以使用的方法遵循与在SQL中解决此问题类似的逻辑:

  1. output = tbl1 %>%
  2. inner_join(tbl1, by = "Acct", suffix = c(".x", ".y")) %>%
  3. filter(
  4. Previous_7Day.x <= Start_Date.y,
  5. Start_Date.y <= Start_Date.x
  6. ) %>%
  7. group_by(Acct.x, Start_Date.x, Previous_7Day.x, Amount.x) %>%
  8. summarise(Cum_sum_7Days = sum(Amount.y) %>%
  9. select(
  10. Acct = Acct.x,
  11. Start_Date = Start_Date.x,
  12. Previous_7Day = Previous_7Day.x,
  13. Amount = Amount.x,
  14. Cum_sum_7Days
  15. )

这个想法的核心是将表与自身连接。在筛选之后,我们拥有了所有.y日期在.x日期之间的组合。因此,我们可以对所有.y金额求和以产生滚动总和。

英文:

Yes, this should be possible with dbplyr. But the method will look different that via pure R.

dbplyr works by translating tidyverse commands into SQL. If no translation is defined for a function then it is unlikely to work correctly. This is why using functions like sum_run from the runner package give a no applicable method error. dbplyr does not know how to translate them into SQL.

The method we can use follows similar logic to how we would approach this problem in SQL:

  1. output = tbl1 %&gt;%
  2. inner_join(tbl1, by = &quot;Acct&quot;, suffix = c(&quot;.x&quot;, &quot;.y&quot;)) %&gt;%
  3. filter(
  4. Previous_7Day.x &lt;= Start_Date.y,
  5. Start_Date.y &lt;= Start_Date.x
  6. ) %&gt;%
  7. group_by(Acct.x, Start_Date.x, Previous_7Day.x, Amount.x) %&gt;%
  8. summarise(Cum_sum_7Days = sum(Amount.y) %&gt;%
  9. select(
  10. Acct = Acct.x,
  11. Start_Date = Start_Date.x,
  12. Previous_7Day = Previous_7Day.x,
  13. Amount = Amount.x,
  14. Cum_sum_7Days
  15. )

The core of the idea is to join the table to itself. After filtering we have all combinations where the .y date is between the .x dates. So we can sum all the .y amounts to produce the rolling sum.

答案2

得分: 0

这是代码示例,不需要翻译。

英文:

A variant on the other answer using join_by:

  1. library(dplyr, warn.conflicts = FALSE)
  2. library(DBI)
  3. df &lt;- tribble(
  4. ~Acct, ~Start_Date, ~Amount,
  5. &quot;YYYY&quot;, &quot;2022-07-08 07:04:00 AEST&quot;, 500,
  6. &quot;YYYY&quot;, &quot;2022-07-08 12:49:00 AEST&quot;, 200,
  7. &quot;YYYY&quot;, &quot;2022-07-09 11:47:00 AEST&quot;, 300,
  8. &quot;YYYY&quot;, &quot;2022-07-09 11:52:00 AEST&quot;, 45.6,
  9. &quot;YYYY&quot;, &quot;2022-07-12 13:03:00 AEST&quot;, 200,
  10. &quot;YYYY&quot;, &quot;2022-07-15 13:53:00 AEST&quot;, 200,
  11. &quot;YYYY&quot;, &quot;2022-07-16 12:58:00 AEST&quot;, 300,
  12. &quot;YYYY&quot;, &quot;2022-07-16 13:28:00 AEST&quot;, 500,
  13. &quot;YYYY&quot;, &quot;2022-07-19 12:22:00 AEST&quot;, 200,
  14. &quot;YYYY&quot;, &quot;2022-07-23 05:52:00 AEST&quot;, 200,
  15. &quot;YYYY&quot;, &quot;2022-07-26 13:01:00 AEST&quot;, 100,
  16. &quot;YYYY&quot;, &quot;2022-07-29 13:50:00 AEST&quot;, 200,
  17. &quot;YYYY&quot;, &quot;2022-07-30 13:57:00 AEST&quot;, 300,
  18. &quot;YYYY&quot;, &quot;2022-08-03 06:17:00 AEST&quot;, 200,
  19. &quot;YYYY&quot;, &quot;2022-08-05 13:30:00 AEST&quot;, 200,
  20. &quot;YYYY&quot;, &quot;2022-08-09 13:44:00 AEST&quot;, 200,
  21. &quot;YYYY&quot;, &quot;2022-08-12 12:13:00 AEST&quot;, 200
  22. ) %&gt;%
  23. mutate(
  24. Start_Date = as.POSIXct(Start_Date)
  25. )
  26. pg &lt;- dbConnect(RPostgres::Postgres())
  27. tbl1 &lt;-
  28. df %&gt;%
  29. copy_to(pg, ., overwrite = TRUE)
  30. tbl1 %&gt;%
  31. mutate(Previous_7Day = Start_Date - days(7)) %&gt;%
  32. inner_join(tbl1,
  33. join_by(Acct,
  34. between(y$Start_Date, x$Previous_7Day, x$Start_Date)),
  35. suffix = c(&quot;&quot;, &quot;_prev&quot;)) %&gt;%
  36. group_by(Acct, Start_Date, Amount) %&gt;%
  37. summarize(Cum_sum_7Days = sum(Amount_prev)) %&gt;%
  38. arrange(Start_Date)
  39. #&gt; `summarise()` has grouped output by &quot;Acct&quot; and &quot;Start_Date&quot;. You can override
  40. #&gt; using the `.groups` argument.
  41. #&gt; Warning: Missing values are always removed in SQL aggregation functions.
  42. #&gt; Use `na.rm = TRUE` to silence this warning
  43. #&gt; This warning is displayed once every 8 hours.
  44. #&gt; # Source: SQL [?? x 4]
  45. #&gt; # Database: postgres [iangow@/tmp:5432/iangow]
  46. #&gt; # Groups: Acct, Start_Date
  47. #&gt; # Ordered by: Start_Date
  48. #&gt; Acct Start_Date Amount Cum_sum_7Days
  49. #&gt; &lt;chr&gt; &lt;dttm&gt; &lt;dbl&gt; &lt;dbl&gt;
  50. #&gt; 1 YYYY 2022-07-07 21:04:00 500 500
  51. #&gt; 2 YYYY 2022-07-08 02:49:00 200 700
  52. #&gt; 3 YYYY 2022-07-09 01:47:00 300 1000
  53. #&gt; 4 YYYY 2022-07-09 01:52:00 45.6 1046.
  54. #&gt; 5 YYYY 2022-07-12 03:03:00 200 1246.
  55. #&gt; 6 YYYY 2022-07-15 03:53:00 200 746.
  56. #&gt; 7 YYYY 2022-07-16 02:58:00 300 700
  57. #&gt; 8 YYYY 2022-07-16 03:28:00 500 1200
  58. #&gt; 9 YYYY 2022-07-19 02:22:00 200 1400
  59. #&gt; 10 YYYY 2022-07-22 19:52:00 200 1200
  60. #&gt; # … with more rows

<sup>Created on 2023-04-03 with reprex v2.0.2</sup>

huangapple
  • 本文由 发表于 2023年2月24日 15:32:56
  • 转载请务必保留本文链接:https://go.coder-hub.com/75553694.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定