SQL: How can I generate a time series from timestamp data and calculate cumulative sums across different event types?

Question

Building a timeseries with cumulative sums based on single grouped events

To begin, I should note that I am quite new to SQL; the current state was achieved with help from Stack Overflow.

I have a table containing events of different kinds as they occur. I would like to transform this data into a time series, filling the time gaps between the events and accumulating the values for each kind of event.

So it is all about creating a time series from single events and calculating the running/cumulative sum per group of events.

Here the example of source data:

Source:

event_timestamp    type  value
01.01.2023 10:00   1     10
03.01.2023 10:00   2     10
05.01.2023 10:00   2     10
07.01.2023 10:00   1     10

Desired outcome:

event_timestamp    type value cumulative_sum
01.01.2023 10:00   1    10    10
02.01.2023 10:00   1    0     10
03.01.2023 10:00   1    0     10
03.01.2023 10:00   2    10    10
04.01.2023 10:00   1    0     10
04.01.2023 10:00   2    0     10
05.01.2023 10:00   1    0     10
05.01.2023 10:00   2    10    20
06.01.2023 10:00   1    0     10
06.01.2023 10:00   2    0     20
07.01.2023 10:00   1    10    20
07.01.2023 10:00   2    0     20

I got to a point where I can create this (limited to a single event type):

time               type value cumulative_sum
01.01.2023 10:00   1    10    10
02.01.2023 10:00   1    0     10
03.01.2023 10:00   1    10    20
04.01.2023 10:00   1    0     20
05.01.2023 10:00   1    0     20
06.01.2023 10:00   1    0     20
07.01.2023 10:00   1    0     20

with the following SQL statement (PostgreSQL):

SELECT
  generate_series AS timestamp,
  -- hard coded event type below
  COALESCE(events.type, 1) AS type,
  COALESCE(events.value, 0) AS value,
  COALESCE(SUM(events.value) OVER (ORDER BY generate_series), 0) AS cumulative_sum
FROM
  generate_series('2023-01-01'::timestamp, '2023-01-07'::timestamp, '1 day') AS generate_series
LEFT JOIN
  -- hard coded event type below
  events ON generate_series = events.event_timestamp AND events.type = 1
ORDER BY
  generate_series;

Now I am wondering whether I should look for a single SQL statement that does all of this in a better way (I have no idea how), or whether I should use a Python script to run this statement once per event type, sequentially, and insert the data into the table.

Questions summarized:

  • Would it be advisable to do such a calculation with a combination of SQL and, e.g., Python?
  • Would it be better to split the creation of the timeseries and the cumulative sum?
  • If the pure SQL way is recommended, how can this be done while taking the groups into account?

Answer 1

Score: 0

You're very close. All you need is to create the series for each type. A cross join can do that easily.

Check this out:

```sql
SELECT generate_series AS timestamp
     , etypes.type
     , COALESCE(events.value, 0) AS value
     , COALESCE(SUM(events.value) OVER (PARTITION BY etypes.type ORDER BY generate_series), 0) AS cumulative_sum
  FROM generate_series('2023-01-01'::timestamp, '2023-01-07'::timestamp, '1 day') AS generate_series
  CROSS JOIN (SELECT DISTINCT type FROM events) AS etypes
  LEFT JOIN events ON generate_series = events.event_timestamp AND events.type = etypes.type
 ORDER BY generate_series, etypes.type
;
```

With no data in the given range, the result is this:

timestamp type value cumulative_sum
2023-01-01 00:00:00 1 0 0
2023-01-01 00:00:00 3 0 0
2023-01-02 00:00:00 1 0 0
2023-01-02 00:00:00 3 0 0
2023-01-03 00:00:00 1 0 0
2023-01-03 00:00:00 3 0 0
2023-01-04 00:00:00 1 0 0
2023-01-04 00:00:00 3 0 0
2023-01-05 00:00:00 1 0 0
2023-01-05 00:00:00 3 0 0
2023-01-06 00:00:00 1 0 0
2023-01-06 00:00:00 3 0 0
2023-01-07 00:00:00 1 0 0
2023-01-07 00:00:00 3 0 0
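
The cross join pairs every generated timestamp with every distinct type, the LEFT JOIN attaches a value where an event exists, and the window function keeps one running total per type. The Python sketch below mirrors that logic on the question's sample data, so the row shape can be checked without a database. The function name `cross_join_series` and the in-memory `events` list are illustrative assumptions, not part of the original answer. Because the join compares timestamps for exact equality, the generated series has to start at the same time of day as the events (10:00 in the sample data).

```python
from datetime import datetime, timedelta
from itertools import product

# Sample rows from the question: (event_timestamp, type, value).
events = [
    (datetime(2023, 1, 1, 10), 1, 10),
    (datetime(2023, 1, 3, 10), 2, 10),
    (datetime(2023, 1, 5, 10), 2, 10),
    (datetime(2023, 1, 7, 10), 1, 10),
]


def cross_join_series(events, start, stop, step=timedelta(days=1)):
    """Return (timestamp, type, value, cumulative_sum) for every timestamp/type pair."""
    # Stand-in for generate_series(start, stop, step); stop is inclusive.
    stamps = []
    current = start
    while current <= stop:
        stamps.append(current)
        current += step

    # Stand-in for SELECT DISTINCT type FROM events.
    types = sorted({etype for _, etype, _ in events})

    # Lookup replacing the LEFT JOIN; assumes at most one event per (timestamp, type).
    values = {(ts, etype): value for ts, etype, value in events}

    running = dict.fromkeys(types, 0)  # one running total per type (PARTITION BY)
    rows = []
    for ts, etype in product(stamps, types):  # ordered by timestamp, then type
        value = values.get((ts, etype), 0)    # COALESCE(value, 0)
        running[etype] += value
        rows.append((ts, etype, value, running[etype]))
    return rows


rows = cross_join_series(events, datetime(2023, 1, 1, 10), datetime(2023, 1, 7, 10))
for ts, etype, value, total in rows:
    print(f"{ts:%d.%m.%Y %H:%M}  {etype}  {value:>2}  {total:>2}")
```

With this approach every type gets a row for every timestamp in the range, including the days before its first event, where both the value and the running total are 0.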

Test case (with a little data added)

You could also create the series with different range for each type, based on the start and end for that type.

Here's an example of that:

SELECT generate_series AS timestamp
     , etypes.type
     , COALESCE(events.value, 0) AS value
     , COALESCE(SUM(events.value) OVER (PARTITION BY etypes.type ORDER BY generate_series), 0) AS cumulative_sum
  FROM (SELECT type
             , MIN(event_timestamp) AS mints
             , MAX(event_timestamp) AS maxts
          FROM events
         GROUP BY type
       ) AS etypes
  JOIN LATERAL (SELECT generate_series(mints, maxts, '1 day') AS generate_series) AS gs ON 1 = 1
  LEFT JOIN events ON generate_series = events.event_timestamp AND events.type = etypes.type
 ORDER BY generate_series, etypes.type
;

Result with your data:

timestamp type value cumulative_sum
2023-01-01 00:00:00 1 10 10
2023-01-02 00:00:00 1 0 10
2023-01-03 00:00:00 1 0 10
2023-01-03 00:00:00 2 10 10
2023-01-04 00:00:00 1 0 10
2023-01-04 00:00:00 2 0 10
2023-01-05 00:00:00 1 0 10
2023-01-05 00:00:00 2 10 20
2023-01-06 00:00:00 1 0 10
2023-01-07 00:00:00 1 10 20
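
The LATERAL variant generates a separate series per type, bounded by that type's first and last event. That is why type 2 has no rows after 2023-01-05 in the result above, whereas the question's desired table carries every type through to 07.01. A Python sketch of the same idea follows, run on the question's sample data. `per_type_series` and its `extend_to_global_max` flag are hypothetical names introduced for illustration; the flag shows one possible way to end every series at the overall latest event instead.

```python
from datetime import datetime, timedelta

# Sample rows from the question: (event_timestamp, type, value).
events = [
    (datetime(2023, 1, 1, 10), 1, 10),
    (datetime(2023, 1, 3, 10), 2, 10),
    (datetime(2023, 1, 5, 10), 2, 10),
    (datetime(2023, 1, 7, 10), 1, 10),
]


def per_type_series(events, step=timedelta(days=1), extend_to_global_max=False):
    """Return (timestamp, type, value, cumulative_sum) rows, one series per type."""
    global_max = max(ts for ts, _, _ in events)
    # Lookup replacing the LEFT JOIN; assumes at most one event per (timestamp, type).
    values = {(ts, etype): value for ts, etype, value in events}

    rows = []
    for etype in sorted({etype for _, etype, _ in events}):
        stamps = [ts for ts, t, _ in events if t == etype]
        current = min(stamps)  # MIN(event_timestamp) for this type
        # MAX(event_timestamp) for this type, or the overall maximum if requested.
        stop = global_max if extend_to_global_max else max(stamps)
        total = 0
        while current <= stop:
            value = values.get((current, etype), 0)  # COALESCE(value, 0)
            total += value
            rows.append((current, etype, value, total))
            current += step
    # Equivalent of ORDER BY timestamp, type.
    return sorted(rows, key=lambda row: (row[0], row[1]))


answer_rows = per_type_series(events)
desired_rows = per_type_series(events, extend_to_global_max=True)
print(len(answer_rows), len(desired_rows))  # prints: 10 12
```

On this sample, the default call yields the 10 rows of the answer's result, and the extended call yields the 12 rows of the question's desired table.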

huangapple
  • Published on 2023-05-20 21:21:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76295454.html