英文:
Is there a dbplyr method to calculate rolling sum based on a date range
问题
我正在使用dbplyr访问Redshift表中的数据,并使用Tidyverse进行数据整理。我正在尝试计算7天内的滚动累计金额。
不幸的是,我找到的建议在Stack Overflow上的包,如'slider'和'runner',似乎不兼容dbplyr(或我正在使用的Redshift表和SQL)。
我正在尝试实现类似于这个表的结果。在这个示例中,'cum_sum_7Days'列中的值是'amount'列中的值的总和,其中'start_date'适合在'start_date'和'previous_7Day'列之间。
Acct Start_Date Previous_7Day Amount Cum_sum_7Days
YYYY 8/07/2022 7:04 1/07/2022 7:04 500 500
YYYY 8/07/2022 12:49 1/07/2022 12:49 200 700
YYYY 9/07/2022 11:47 2/07/2022 11:47 300 1000
YYYY 9/07/2022 11:52 2/07/2022 11:52 45.6 1045.6
YYYY 12/07/2022 13:03 5/07/2022 13:03 200 1245.6
YYYY 15/07/2022 13:53 8/07/2022 13:53 200 745.6
YYYY 16/07/2022 12:58 9/07/2022 12:58 300 700
YYYY 16/07/2022 13:28 9/07/2022 13:28 500 1200
YYYY 19/07/2022 12:22 12/07/2022 12:22 200 1400
YYYY 23/07/2022 5:52 16/07/2022 5:52 200 1200
YYYY 26/07/2022 13:01 19/07/2022 13:01 100 300
YYYY 29/07/2022 13:50 22/07/2022 13:50 200 500
YYYY 30/07/2022 13:57 23/07/2022 13:57 300 600
YYYY 3/08/2022 6:17 27/07/2022 6:17 200 700
YYYY 5/08/2022 13:30 29/07/2022 13:30 200 900
YYYY 9/08/2022 13:44 2/08/2022 13:44 200 600
YYYY 12/08/2022 12:13 5/08/2022 12:13 200 600
请注意:
*
英文:
I'm using dbplyr to access data in Redshift tables and Tidyverse to do the data wrangling. I'm trying to calculate a rolling sum amount over 7 days.
Unfortunately, none of the packages that I've found suggested on stack overflow, such as 'slider' and 'runner', seem to be compatible with dbplyr (or with the Redshift tables and sql that I'm using).
I'm trying to achieve similar results to this table. In this example, the value in the cum_sum_7Days
column is the sum of values in the amount
column where the start_date
fits between the start_date
and previous_7Day
columns.
Acct Start_Date Previous_7Day Amount Cum_sum_7Days
YYYY 8/07/2022 7:04 1/07/2022 7:04 500 500
YYYY 8/07/2022 12:49 1/07/2022 12:49 200 700
YYYY 9/07/2022 11:47 2/07/2022 11:47 300 1000
YYYY 9/07/2022 11:52 2/07/2022 11:52 45.6 1045.6
YYYY 12/07/2022 13:03 5/07/2022 13:03 200 1245.6
YYYY 15/07/2022 13:53 8/07/2022 13:53 200 745.6
YYYY 16/07/2022 12:58 9/07/2022 12:58 300 700
YYYY 16/07/2022 13:28 9/07/2022 13:28 500 1200
YYYY 19/07/2022 12:22 12/07/2022 12:22 200 1400
YYYY 23/07/2022 5:52 16/07/2022 5:52 200 1200
YYYY 26/07/2022 13:01 19/07/2022 13:01 100 300
YYYY 29/07/2022 13:50 22/07/2022 13:50 200 500
YYYY 30/07/2022 13:57 23/07/2022 13:57 300 600
YYYY 3/08/2022 6:17 27/07/2022 6:17 200 700
YYYY 5/08/2022 13:30 29/07/2022 13:30 200 900
YYYY 9/08/2022 13:44 2/08/2022 13:44 200 600
YYYY 12/08/2022 12:13 5/08/2022 12:13 200 600
Note that:
- My dates are not consecutive
- The date-time fields are required as the 7 days must be accurate to the hour-minute
- The rolling window size may be changed from 7 days (14 days, 1 year, etc.)
So, any solution need to handle this.
答案1
得分: 0
是的,这在dbplyr中是可能的。但是,方法将不同于纯R的方式。
dbplyr通过将tidyverse命令转化为SQL来工作。如果对于某个函数没有定义翻译,那么它可能无法正常工作。这就是为什么使用来自runner
包的sum_run
等函数会出现"no applicable method"错误。dbplyr不知道如何将它们翻译成SQL。
我们可以使用的方法遵循与在SQL中解决此问题类似的逻辑:
output = tbl1 %>%
inner_join(tbl1, by = "Acct", suffix = c(".x", ".y")) %>%
filter(
Previous_7Day.x <= Start_Date.y,
Start_Date.y <= Start_Date.x
) %>%
group_by(Acct.x, Start_Date.x, Previous_7Day.x, Amount.x) %>%
summarise(Cum_sum_7Days = sum(Amount.y) %>%
select(
Acct = Acct.x,
Start_Date = Start_Date.x,
Previous_7Day = Previous_7Day.x,
Amount = Amount.x,
Cum_sum_7Days
)
这个想法的核心是将表与自身连接。在筛选之后,我们拥有了所有.y
日期在.x
日期之间的组合。因此,我们可以对所有.y
金额求和以产生滚动总和。
英文:
Yes, this should be possible with dbplyr. But the method will look different that via pure R.
dbplyr works by translating tidyverse commands into SQL. If no translation is defined for a function then it is unlikely to work correctly. This is why using functions like sum_run
from the runner
package give a no applicable method
error. dbplyr does not know how to translate them into SQL.
The method we can use follows similar logic to how we would approach this problem in SQL:
output = tbl1 %>%
inner_join(tbl1, by = "Acct", suffix = c(".x", ".y")) %>%
filter(
Previous_7Day.x <= Start_Date.y,
Start_Date.y <= Start_Date.x
) %>%
group_by(Acct.x, Start_Date.x, Previous_7Day.x, Amount.x) %>%
summarise(Cum_sum_7Days = sum(Amount.y) %>%
select(
Acct = Acct.x,
Start_Date = Start_Date.x,
Previous_7Day = Previous_7Day.x,
Amount = Amount.x,
Cum_sum_7Days
)
The core of the idea is to join the table to itself. After filtering we have all combinations where the .y
date is between the .x
dates. So we can sum all the .y
amounts to produce the rolling sum.
答案2
得分: 0
这是代码示例,不需要翻译。
英文:
A variant on the other answer using join_by
:
library(dplyr, warn.conflicts = FALSE)
library(DBI)
df <- tribble(
~Acct, ~Start_Date, ~Amount,
"YYYY", "2022-07-08 07:04:00 AEST", 500,
"YYYY", "2022-07-08 12:49:00 AEST", 200,
"YYYY", "2022-07-09 11:47:00 AEST", 300,
"YYYY", "2022-07-09 11:52:00 AEST", 45.6,
"YYYY", "2022-07-12 13:03:00 AEST", 200,
"YYYY", "2022-07-15 13:53:00 AEST", 200,
"YYYY", "2022-07-16 12:58:00 AEST", 300,
"YYYY", "2022-07-16 13:28:00 AEST", 500,
"YYYY", "2022-07-19 12:22:00 AEST", 200,
"YYYY", "2022-07-23 05:52:00 AEST", 200,
"YYYY", "2022-07-26 13:01:00 AEST", 100,
"YYYY", "2022-07-29 13:50:00 AEST", 200,
"YYYY", "2022-07-30 13:57:00 AEST", 300,
"YYYY", "2022-08-03 06:17:00 AEST", 200,
"YYYY", "2022-08-05 13:30:00 AEST", 200,
"YYYY", "2022-08-09 13:44:00 AEST", 200,
"YYYY", "2022-08-12 12:13:00 AEST", 200
) %>%
mutate(
Start_Date = as.POSIXct(Start_Date)
)
pg <- dbConnect(RPostgres::Postgres())
tbl1 <-
df %>%
copy_to(pg, ., overwrite = TRUE)
tbl1 %>%
mutate(Previous_7Day = Start_Date - days(7)) %>%
inner_join(tbl1,
join_by(Acct,
between(y$Start_Date, x$Previous_7Day, x$Start_Date)),
suffix = c("", "_prev")) %>%
group_by(Acct, Start_Date, Amount) %>%
summarize(Cum_sum_7Days = sum(Amount_prev)) %>%
arrange(Start_Date)
#> `summarise()` has grouped output by "Acct" and "Start_Date". You can override
#> using the `.groups` argument.
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # Source: SQL [?? x 4]
#> # Database: postgres [iangow@/tmp:5432/iangow]
#> # Groups: Acct, Start_Date
#> # Ordered by: Start_Date
#> Acct Start_Date Amount Cum_sum_7Days
#> <chr> <dttm> <dbl> <dbl>
#> 1 YYYY 2022-07-07 21:04:00 500 500
#> 2 YYYY 2022-07-08 02:49:00 200 700
#> 3 YYYY 2022-07-09 01:47:00 300 1000
#> 4 YYYY 2022-07-09 01:52:00 45.6 1046.
#> 5 YYYY 2022-07-12 03:03:00 200 1246.
#> 6 YYYY 2022-07-15 03:53:00 200 746.
#> 7 YYYY 2022-07-16 02:58:00 300 700
#> 8 YYYY 2022-07-16 03:28:00 500 1200
#> 9 YYYY 2022-07-19 02:22:00 200 1400
#> 10 YYYY 2022-07-22 19:52:00 200 1200
#> # … with more rows
<sup>Created on 2023-04-03 with reprex v2.0.2</sup>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论