有dbplyr方法可以基于日期范围计算滚动总和吗?

huangapple go评论67阅读模式
英文:

Is there a dbplyr method to calculate rolling sum based on a date range

问题

我正在使用dbplyr访问Redshift表中的数据,并使用Tidyverse进行数据整理。我正在尝试计算7天内的滚动累计金额。

不幸的是,我找到的建议在Stack Overflow上的包,如'slider'和'runner',似乎不兼容dbplyr(或我正在使用的Redshift表和SQL)。

我正在尝试实现类似于这个表的结果。在这个示例中,'cum_sum_7Days'列中的值是'amount'列中的值的总和,其中'start_date'适合在'start_date'和'previous_7Day'列之间。

Acct          Start_Date           Previous_7Day           Amount       Cum_sum_7Days

YYYY	   8/07/2022 7:04	   1/07/2022 7:04  	            500     	   500
YYYY	   8/07/2022 12:49	   1/07/2022 12:49  	        200     	   700
YYYY	   9/07/2022 11:47 	   2/07/2022 11:47 	            300     	   1000
YYYY	   9/07/2022 11:52 	   2/07/2022 11:52 	            45.6     	   1045.6
YYYY	   12/07/2022 13:03	   5/07/2022 13:03 	            200     	   1245.6
YYYY	   15/07/2022 13:53	   8/07/2022 13:53 	            200     	   745.6
YYYY	   16/07/2022 12:58	   9/07/2022 12:58 	            300     	   700
YYYY	   16/07/2022 13:28	   9/07/2022 13:28 	            500     	   1200
YYYY	   19/07/2022 12:22	   12/07/2022 12:22	            200     	   1400
YYYY	   23/07/2022 5:52 	   16/07/2022 5:52 	            200     	   1200
YYYY	   26/07/2022 13:01	   19/07/2022 13:01	            100     	   300
YYYY	   29/07/2022 13:50	   22/07/2022 13:50	            200     	   500
YYYY	   30/07/2022 13:57	   23/07/2022 13:57	            300     	   600
YYYY	   3/08/2022 6:17  	   27/07/2022 6:17	            200     	   700
YYYY	   5/08/2022 13:30	   29/07/2022 13:30	            200     	   900
YYYY	   9/08/2022 13:44	   2/08/2022 13:44 	            200     	   600
YYYY	   12/08/2022 12:13	   5/08/2022 12:13 	            200     	   600

请注意:
*

英文:

I'm using dbplyr to access data in Redshift tables and Tidyverse to do the data wrangling. I'm trying to calculate a rolling sum amount over 7 days.

Unfortunately, none of the packages that I've found suggested on stack overflow, such as 'slider' and 'runner', seem to be compatible with dbplyr (or with the Redshift tables and sql that I'm using).

I'm trying to achieve similar results to this table. In this example, the value in the cum_sum_7Days column is the sum of values in the amount column where the start_date fits between the start_date and previous_7Day columns.

Acct          Start_Date           Previous_7Day           Amount       Cum_sum_7Days

YYYY	   8/07/2022 7:04	   1/07/2022 7:04  	            500     	   500
YYYY	   8/07/2022 12:49	   1/07/2022 12:49  	        200     	   700
YYYY	   9/07/2022 11:47 	   2/07/2022 11:47 	            300     	   1000
YYYY	   9/07/2022 11:52 	   2/07/2022 11:52 	            45.6     	   1045.6
YYYY	   12/07/2022 13:03	   5/07/2022 13:03 	            200     	   1245.6
YYYY	   15/07/2022 13:53	   8/07/2022 13:53 	            200     	   745.6
YYYY	   16/07/2022 12:58	   9/07/2022 12:58 	            300     	   700
YYYY	   16/07/2022 13:28	   9/07/2022 13:28 	            500     	   1200
YYYY	   19/07/2022 12:22	   12/07/2022 12:22	            200     	   1400
YYYY	   23/07/2022 5:52 	   16/07/2022 5:52 	            200     	   1200
YYYY	   26/07/2022 13:01	   19/07/2022 13:01	            100     	   300
YYYY	   29/07/2022 13:50	   22/07/2022 13:50	            200     	   500
YYYY	   30/07/2022 13:57	   23/07/2022 13:57	            300     	   600
YYYY	   3/08/2022 6:17  	   27/07/2022 6:17	            200     	   700
YYYY	   5/08/2022 13:30	   29/07/2022 13:30	            200     	   900
YYYY	   9/08/2022 13:44	   2/08/2022 13:44 	            200     	   600
YYYY	   12/08/2022 12:13	   5/08/2022 12:13 	            200     	   600

Note that:

  • My dates are not consecutive
  • The date-time fields are required as the 7 days must be accurate to the hour-minute
  • The rolling window size may be changed from 7 days (14 days, 1 year, etc.)

So, any solution need to handle this.

答案1

得分: 0

是的,这在dbplyr中是可能的。但是,方法将不同于纯R的方式。

dbplyr通过将tidyverse命令转化为SQL来工作。如果对于某个函数没有定义翻译,那么它可能无法正常工作。这就是为什么使用来自runner包的sum_run等函数会出现"no applicable method"错误。dbplyr不知道如何将它们翻译成SQL。

我们可以使用的方法遵循与在SQL中解决此问题类似的逻辑:

output = tbl1 %>%
  inner_join(tbl1, by = "Acct", suffix = c(".x", ".y")) %>%
  filter(
    Previous_7Day.x <= Start_Date.y,
    Start_Date.y <= Start_Date.x
  ) %>%
  group_by(Acct.x, Start_Date.x, Previous_7Day.x, Amount.x) %>%
  summarise(Cum_sum_7Days = sum(Amount.y) %>%
  select(
    Acct = Acct.x,
    Start_Date = Start_Date.x,
    Previous_7Day = Previous_7Day.x,
    Amount = Amount.x,
    Cum_sum_7Days
  )

这个想法的核心是将表与自身连接。在筛选之后,我们拥有了所有.y日期在.x日期之间的组合。因此,我们可以对所有.y金额求和以产生滚动总和。

英文:

Yes, this should be possible with dbplyr. But the method will look different that via pure R.

dbplyr works by translating tidyverse commands into SQL. If no translation is defined for a function then it is unlikely to work correctly. This is why using functions like sum_run from the runner package give a no applicable method error. dbplyr does not know how to translate them into SQL.

The method we can use follows similar logic to how we would approach this problem in SQL:

output = tbl1 %&gt;%
  inner_join(tbl1, by = &quot;Acct&quot;, suffix = c(&quot;.x&quot;, &quot;.y&quot;)) %&gt;%
  filter(
    Previous_7Day.x &lt;= Start_Date.y,
    Start_Date.y &lt;= Start_Date.x
  ) %&gt;%
  group_by(Acct.x, Start_Date.x, Previous_7Day.x, Amount.x) %&gt;%
  summarise(Cum_sum_7Days = sum(Amount.y) %&gt;%
  select(
    Acct = Acct.x,
    Start_Date = Start_Date.x,
    Previous_7Day = Previous_7Day.x,
    Amount = Amount.x,
    Cum_sum_7Days
  )

The core of the idea is to join the table to itself. After filtering we have all combinations where the .y date is between the .x dates. So we can sum all the .y amounts to produce the rolling sum.

答案2

得分: 0

这是代码示例,不需要翻译。

英文:

A variant on the other answer using join_by:

library(dplyr, warn.conflicts = FALSE)
library(DBI)

df &lt;- tribble(
  ~Acct,  ~Start_Date,                ~Amount,
  &quot;YYYY&quot;, &quot;2022-07-08 07:04:00 AEST&quot;, 500,    
  &quot;YYYY&quot;, &quot;2022-07-08 12:49:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-09 11:47:00 AEST&quot;, 300,    
  &quot;YYYY&quot;, &quot;2022-07-09 11:52:00 AEST&quot;, 45.6,   
  &quot;YYYY&quot;, &quot;2022-07-12 13:03:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-15 13:53:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-16 12:58:00 AEST&quot;, 300,    
  &quot;YYYY&quot;, &quot;2022-07-16 13:28:00 AEST&quot;, 500,    
  &quot;YYYY&quot;, &quot;2022-07-19 12:22:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-23 05:52:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-26 13:01:00 AEST&quot;, 100,    
  &quot;YYYY&quot;, &quot;2022-07-29 13:50:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-30 13:57:00 AEST&quot;, 300,    
  &quot;YYYY&quot;, &quot;2022-08-03 06:17:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-08-05 13:30:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-08-09 13:44:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-08-12 12:13:00 AEST&quot;, 200
) %&gt;%
  mutate(
    Start_Date = as.POSIXct(Start_Date)
  )

pg &lt;- dbConnect(RPostgres::Postgres())

tbl1 &lt;-
  df %&gt;% 
  copy_to(pg, ., overwrite = TRUE)

tbl1 %&gt;%
  mutate(Previous_7Day = Start_Date - days(7)) %&gt;%
  inner_join(tbl1, 
             join_by(Acct, 
                     between(y$Start_Date, x$Previous_7Day, x$Start_Date)),
             suffix = c(&quot;&quot;, &quot;_prev&quot;)) %&gt;%
  group_by(Acct, Start_Date, Amount) %&gt;%
  summarize(Cum_sum_7Days = sum(Amount_prev)) %&gt;%
  arrange(Start_Date)          
#&gt; `summarise()` has grouped output by &quot;Acct&quot; and &quot;Start_Date&quot;. You can override
#&gt; using the `.groups` argument.
#&gt; Warning: Missing values are always removed in SQL aggregation functions.
#&gt; Use `na.rm = TRUE` to silence this warning
#&gt; This warning is displayed once every 8 hours.
#&gt; # Source:     SQL [?? x 4]
#&gt; # Database:   postgres  [iangow@/tmp:5432/iangow]
#&gt; # Groups:     Acct, Start_Date
#&gt; # Ordered by: Start_Date
#&gt;    Acct  Start_Date          Amount Cum_sum_7Days
#&gt;    &lt;chr&gt; &lt;dttm&gt;               &lt;dbl&gt;         &lt;dbl&gt;
#&gt;  1 YYYY  2022-07-07 21:04:00  500            500 
#&gt;  2 YYYY  2022-07-08 02:49:00  200            700 
#&gt;  3 YYYY  2022-07-09 01:47:00  300           1000 
#&gt;  4 YYYY  2022-07-09 01:52:00   45.6         1046.
#&gt;  5 YYYY  2022-07-12 03:03:00  200           1246.
#&gt;  6 YYYY  2022-07-15 03:53:00  200            746.
#&gt;  7 YYYY  2022-07-16 02:58:00  300            700 
#&gt;  8 YYYY  2022-07-16 03:28:00  500           1200 
#&gt;  9 YYYY  2022-07-19 02:22:00  200           1400 
#&gt; 10 YYYY  2022-07-22 19:52:00  200           1200 
#&gt; # … with more rows

<sup>Created on 2023-04-03 with reprex v2.0.2</sup>

huangapple
  • 本文由 发表于 2023年2月24日 15:32:56
  • 转载请务必保留本文链接:https://go.coder-hub.com/75553694.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定