2023年2月24日 15:32:56go评论96阅读模式

英文:

Is there a dbplyr method to calculate rolling sum based on a date range

问题

我正在使用dbplyr访问Redshift表中的数据，并使用Tidyverse进行数据整理。我正在尝试计算7天内的滚动累计金额。

不幸的是，我找到的建议在Stack Overflow上的包，如'slider'和'runner'，似乎不兼容dbplyr（或我正在使用的Redshift表和SQL）。

我正在尝试实现类似于这个表的结果。在这个示例中，'cum_sum_7Days'列中的值是'amount'列中的值的总和，其中'start_date'适合在'start_date'和'previous_7Day'列之间。

Acct          Start_Date           Previous_7Day           Amount       Cum_sum_7Days
YYYY	   8/07/2022 7:04	   1/07/2022 7:04  	            500     	   500
YYYY	   8/07/2022 12:49	   1/07/2022 12:49  	        200     	   700
YYYY	   9/07/2022 11:47 	   2/07/2022 11:47 	            300     	   1000
YYYY	   9/07/2022 11:52 	   2/07/2022 11:52 	            45.6     	   1045.6
YYYY	   12/07/2022 13:03	   5/07/2022 13:03 	            200     	   1245.6
YYYY	   15/07/2022 13:53	   8/07/2022 13:53 	            200     	   745.6
YYYY	   16/07/2022 12:58	   9/07/2022 12:58 	            300     	   700
YYYY	   16/07/2022 13:28	   9/07/2022 13:28 	            500     	   1200
YYYY	   19/07/2022 12:22	   12/07/2022 12:22	            200     	   1400
YYYY	   23/07/2022 5:52 	   16/07/2022 5:52 	            200     	   1200
YYYY	   26/07/2022 13:01	   19/07/2022 13:01	            100     	   300
YYYY	   29/07/2022 13:50	   22/07/2022 13:50	            200     	   500
YYYY	   30/07/2022 13:57	   23/07/2022 13:57	            300     	   600
YYYY	   3/08/2022 6:17  	   27/07/2022 6:17	            200     	   700
YYYY	   5/08/2022 13:30	   29/07/2022 13:30	            200     	   900
YYYY	   9/08/2022 13:44	   2/08/2022 13:44 	            200     	   600
YYYY	   12/08/2022 12:13	   5/08/2022 12:13 	            200     	   600

请注意：
*

英文:

I'm using dbplyr to access data in Redshift tables and Tidyverse to do the data wrangling. I'm trying to calculate a rolling sum amount over 7 days.

Unfortunately, none of the packages that I've found suggested on stack overflow, such as 'slider' and 'runner', seem to be compatible with dbplyr (or with the Redshift tables and sql that I'm using).

I'm trying to achieve similar results to this table. In this example, the value in the cum_sum_7Days column is the sum of values in the amount column where the start_date fits between the start_date and previous_7Day columns.

Acct          Start_Date           Previous_7Day           Amount       Cum_sum_7Days
YYYY	   8/07/2022 7:04	   1/07/2022 7:04  	            500     	   500
YYYY	   8/07/2022 12:49	   1/07/2022 12:49  	        200     	   700
YYYY	   9/07/2022 11:47 	   2/07/2022 11:47 	            300     	   1000
YYYY	   9/07/2022 11:52 	   2/07/2022 11:52 	            45.6     	   1045.6
YYYY	   12/07/2022 13:03	   5/07/2022 13:03 	            200     	   1245.6
YYYY	   15/07/2022 13:53	   8/07/2022 13:53 	            200     	   745.6
YYYY	   16/07/2022 12:58	   9/07/2022 12:58 	            300     	   700
YYYY	   16/07/2022 13:28	   9/07/2022 13:28 	            500     	   1200
YYYY	   19/07/2022 12:22	   12/07/2022 12:22	            200     	   1400
YYYY	   23/07/2022 5:52 	   16/07/2022 5:52 	            200     	   1200
YYYY	   26/07/2022 13:01	   19/07/2022 13:01	            100     	   300
YYYY	   29/07/2022 13:50	   22/07/2022 13:50	            200     	   500
YYYY	   30/07/2022 13:57	   23/07/2022 13:57	            300     	   600
YYYY	   3/08/2022 6:17  	   27/07/2022 6:17	            200     	   700
YYYY	   5/08/2022 13:30	   29/07/2022 13:30	            200     	   900
YYYY	   9/08/2022 13:44	   2/08/2022 13:44 	            200     	   600
YYYY	   12/08/2022 12:13	   5/08/2022 12:13 	            200     	   600

Note that:

My dates are not consecutive
The date-time fields are required as the 7 days must be accurate to the hour-minute
The rolling window size may be changed from 7 days (14 days, 1 year, etc.)

So, any solution need to handle this.

答案1

得分: 0

是的，这在dbplyr中是可能的。但是，方法将不同于纯R的方式。

dbplyr通过将tidyverse命令转化为SQL来工作。如果对于某个函数没有定义翻译，那么它可能无法正常工作。这就是为什么使用来自runner包的sum_run等函数会出现"no applicable method"错误。dbplyr不知道如何将它们翻译成SQL。

我们可以使用的方法遵循与在SQL中解决此问题类似的逻辑：

output = tbl1 %>%
  inner_join(tbl1, by = "Acct", suffix = c(".x", ".y")) %>%
  filter(
    Previous_7Day.x <= Start_Date.y,
    Start_Date.y <= Start_Date.x
  ) %>%
  group_by(Acct.x, Start_Date.x, Previous_7Day.x, Amount.x) %>%
  summarise(Cum_sum_7Days = sum(Amount.y) %>%
  select(
    Acct = Acct.x,
    Start_Date = Start_Date.x,
    Previous_7Day = Previous_7Day.x,
    Amount = Amount.x,
    Cum_sum_7Days
  )

这个想法的核心是将表与自身连接。在筛选之后，我们拥有了所有.y日期在.x日期之间的组合。因此，我们可以对所有.y金额求和以产生滚动总和。

英文:

Yes, this should be possible with dbplyr. But the method will look different that via pure R.

dbplyr works by translating tidyverse commands into SQL. If no translation is defined for a function then it is unlikely to work correctly. This is why using functions like sum_run from the runner package give a no applicable method error. dbplyr does not know how to translate them into SQL.

The method we can use follows similar logic to how we would approach this problem in SQL:

output = tbl1 %&gt;%
  inner_join(tbl1, by = &quot;Acct&quot;, suffix = c(&quot;.x&quot;, &quot;.y&quot;)) %&gt;%
  filter(
    Previous_7Day.x &lt;= Start_Date.y,
    Start_Date.y &lt;= Start_Date.x
  ) %&gt;%
  group_by(Acct.x, Start_Date.x, Previous_7Day.x, Amount.x) %&gt;%
  summarise(Cum_sum_7Days = sum(Amount.y) %&gt;%
  select(
    Acct = Acct.x,
    Start_Date = Start_Date.x,
    Previous_7Day = Previous_7Day.x,
    Amount = Amount.x,
    Cum_sum_7Days
  )

The core of the idea is to join the table to itself. After filtering we have all combinations where the .y date is between the .x dates. So we can sum all the .y amounts to produce the rolling sum.

答案2

得分: 0

这是代码示例，不需要翻译。

英文:

A variant on the other answer using join_by:

library(dplyr, warn.conflicts = FALSE)
library(DBI)
df &lt;- tribble(
  ~Acct,  ~Start_Date,                ~Amount,
  &quot;YYYY&quot;, &quot;2022-07-08 07:04:00 AEST&quot;, 500,    
  &quot;YYYY&quot;, &quot;2022-07-08 12:49:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-09 11:47:00 AEST&quot;, 300,    
  &quot;YYYY&quot;, &quot;2022-07-09 11:52:00 AEST&quot;, 45.6,   
  &quot;YYYY&quot;, &quot;2022-07-12 13:03:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-15 13:53:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-16 12:58:00 AEST&quot;, 300,    
  &quot;YYYY&quot;, &quot;2022-07-16 13:28:00 AEST&quot;, 500,    
  &quot;YYYY&quot;, &quot;2022-07-19 12:22:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-23 05:52:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-26 13:01:00 AEST&quot;, 100,    
  &quot;YYYY&quot;, &quot;2022-07-29 13:50:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-07-30 13:57:00 AEST&quot;, 300,    
  &quot;YYYY&quot;, &quot;2022-08-03 06:17:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-08-05 13:30:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-08-09 13:44:00 AEST&quot;, 200,    
  &quot;YYYY&quot;, &quot;2022-08-12 12:13:00 AEST&quot;, 200
) %&gt;%
  mutate(
    Start_Date = as.POSIXct(Start_Date)
  )
pg &lt;- dbConnect(RPostgres::Postgres())
tbl1 &lt;-
  df %&gt;% 
  copy_to(pg, ., overwrite = TRUE)
tbl1 %&gt;%
  mutate(Previous_7Day = Start_Date - days(7)) %&gt;%
  inner_join(tbl1, 
             join_by(Acct, 
                     between(y$Start_Date, x$Previous_7Day, x$Start_Date)),
             suffix = c(&quot;&quot;, &quot;_prev&quot;)) %&gt;%
  group_by(Acct, Start_Date, Amount) %&gt;%
  summarize(Cum_sum_7Days = sum(Amount_prev)) %&gt;%
  arrange(Start_Date)          
#&gt; `summarise()` has grouped output by &quot;Acct&quot; and &quot;Start_Date&quot;. You can override
#&gt; using the `.groups` argument.
#&gt; Warning: Missing values are always removed in SQL aggregation functions.
#&gt; Use `na.rm = TRUE` to silence this warning
#&gt; This warning is displayed once every 8 hours.
#&gt; # Source:     SQL [?? x 4]
#&gt; # Database:   postgres  [iangow@/tmp:5432/iangow]
#&gt; # Groups:     Acct, Start_Date
#&gt; # Ordered by: Start_Date
#&gt;    Acct  Start_Date          Amount Cum_sum_7Days
#&gt;    &lt;chr&gt; &lt;dttm&gt;               &lt;dbl&gt;         &lt;dbl&gt;
#&gt;  1 YYYY  2022-07-07 21:04:00  500            500 
#&gt;  2 YYYY  2022-07-08 02:49:00  200            700 
#&gt;  3 YYYY  2022-07-09 01:47:00  300           1000 
#&gt;  4 YYYY  2022-07-09 01:52:00   45.6         1046.
#&gt;  5 YYYY  2022-07-12 03:03:00  200           1246.
#&gt;  6 YYYY  2022-07-15 03:53:00  200            746.
#&gt;  7 YYYY  2022-07-16 02:58:00  300            700 
#&gt;  8 YYYY  2022-07-16 03:28:00  500           1200 
#&gt;  9 YYYY  2022-07-19 02:22:00  200           1400 
#&gt; 10 YYYY  2022-07-22 19:52:00  200           1200 
#&gt; # … with more rows

<sup>Created on 2023-04-03 with reprex v2.0.2</sup>

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

有dbplyr方法可以基于日期范围计算滚动总和吗？

问题

答案1

答案2

将多列中的值扩展为二进制值

ggplotly使多边形的边界变为透明。有解决方法吗？

如何在R中将经纬度坐标转换为兰伯特等积圆锥投影

选择数据框中的特定数据以替代，使用行和列名称。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。