SQL: How can I generate a time series from timestamp data and calculate cumulative sums across different event types?
Question
Building a timeseries with cumulative sums based on single grouped events
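To build a time series with cumulative sums from single grouped events, one approach is the following.
The SQL statement below turns the source data into a time series and calculates the cumulative sum. It assumes the source data is in a table named "events".
```sql
WITH event_series AS (
    SELECT
        gs.event_timestamp,
        t.type,
        COALESCE(e.value, 0) AS value,
        COALESCE(SUM(e.value) OVER (PARTITION BY t.type ORDER BY gs.event_timestamp), 0) AS cumulative_sum
    FROM
        generate_series(
            (SELECT MIN(event_timestamp) FROM events),
            (SELECT MAX(event_timestamp) FROM events),
            '1 day'::interval
        ) AS gs(event_timestamp)
    CROSS JOIN
        -- one copy of the series per event type
        (SELECT DISTINCT type FROM events) AS t
    LEFT JOIN
        events AS e ON e.event_timestamp = gs.event_timestamp AND e.type = t.type
)
SELECT * FROM event_series
ORDER BY event_timestamp, type;
```
The statement uses generate_series to create one timestamp per day between the first and the last event, cross joins that series with the distinct event types, and left joins the "events" table to pick up the event values. The SUM window function, partitioned by type, then calculates the cumulative sum per event type, and the final ORDER BY sorts the result by timestamp and type.
The result contains the event timestamp, type, value and cumulative sum for every day and every type. The query can be run from a Python script if it has to be scheduled, but the calculation itself needs no Python.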
Building a timeseries with cumulative sums based on single grouped events
To begin, I should point out that I am quite new to SQL and that I reached the current state with the help of Stack Overflow.
I have a table containing events of different kinds as they occur. I would like to transform this data into a time series, filling the time gaps between the events and accumulating the numbers across the different kinds of events.
So it is all about creating a timeseries from single events and calculating the running/cumulative sum across groups of events.
Here is an example of the source data:
Source:
event_timestamp | type | value |
---|---|---|
01.01.2023 10:00 | 1 | 10 |
03.01.2023 10:00 | 2 | 10 |
05.01.2023 10:00 | 2 | 10 |
07.01.2023 10:00 | 1 | 10 |
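
To reproduce the example, the source data can be loaded into a small test table. This is only a sketch: the table and column names follow the question, but the column types (`timestamp`, `integer`) are assumptions, since the question does not show the table definition.

```sql
-- Hypothetical schema: names taken from the question, types assumed.
CREATE TABLE events (
    event_timestamp timestamp NOT NULL,
    type            integer   NOT NULL,
    value           integer   NOT NULL
);

-- The four source rows from the table above.
INSERT INTO events (event_timestamp, type, value) VALUES
    ('2023-01-01 10:00', 1, 10),
    ('2023-01-03 10:00', 2, 10),
    ('2023-01-05 10:00', 2, 10),
    ('2023-01-07 10:00', 1, 10);
```

Note that these events carry a time of 10:00. A generated series only matches them in an equality join if its timestamps also fall on 10:00; a series that starts at midnight would leave every joined value empty.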
Desired outcome:
event_timestamp | type | value | cumulative_sum |
---|---|---|---|
01.01.2023 10:00 | 1 | 10 | 10 |
02.01.2023 10:00 | 1 | 0 | 10 |
03.01.2023 10:00 | 1 | 0 | 10 |
03.01.2023 10:00 | 2 | 10 | 10 |
04.01.2023 10:00 | 1 | 0 | 10 |
04.01.2023 10:00 | 2 | 0 | 10 |
05.01.2023 10:00 | 1 | 0 | 10 |
05.01.2023 10:00 | 2 | 10 | 20 |
06.01.2023 10:00 | 1 | 0 | 10 |
06.01.2023 10:00 | 2 | 0 | 20 |
07.01.2023 10:00 | 1 | 10 | 20 |
07.01.2023 10:00 | 2 | 0 | 20 |
I got to a point, where I can create this (limited to a single event type):
time | type | value | cumulative_sum |
---|---|---|---|
01.01.2023 10:00 | 1 | 10 | 10 |
02.01.2023 10:00 | 1 | 0 | 10 |
03.01.2023 10:00 | 1 | 10 | 20 |
04.01.2023 10:00 | 1 | 0 | 20 |
05.01.2023 10:00 | 1 | 0 | 20 |
06.01.2023 10:00 | 1 | 0 | 20 |
07.01.2023 10:00 | 1 | 0 | 20 |
with the following SQL statement (PostgreSQL):
```sql
SELECT
    generate_series AS timestamp,
    -- hard coded event type below
    COALESCE(events.type, 1) AS type,
    COALESCE(events.value, 0) AS value,
    COALESCE(SUM(events.value) OVER (ORDER BY generate_series), 0) AS cumulative_sum
FROM
    generate_series('2023-01-01'::timestamp, '2023-01-07'::timestamp, '1 day') AS generate_series
LEFT JOIN
    -- hard coded event type below
    events ON generate_series = events.event_timestamp AND events.type = 1
ORDER BY
    generate_series;
```
Now I am wondering whether I should try to find a SQL statement that does it all in a better way (no idea how), or whether I should use a Python script to run this statement once per event type and insert the data into the table sequentially.
Questions summarized:
- Would it be advisable to do such a calculation with a combination of SQL and e.g. Python?
- Would it be better to split the creation of the timeseries and the cumulative sum?
- If the pure SQL way is recommended, how can this be done considering the groups?
Answer 1
Score: 0
You're very close. All you need is to create the series for each type. A cross join can do that easily.
Check this out:
```sql
SELECT generate_series AS timestamp
     , etypes.type
     , COALESCE(events.value, 0) AS value
     , COALESCE(SUM(events.value) OVER (PARTITION BY etypes.type ORDER BY generate_series), 0) AS cumulative_sum
  FROM generate_series('2023-01-01'::timestamp, '2023-01-07'::timestamp, '1 day') AS generate_series
 CROSS JOIN (SELECT DISTINCT type FROM events) AS etypes
  LEFT JOIN events ON generate_series = events.event_timestamp AND events.type = etypes.type
 ORDER BY generate_series, etypes.type
;
```
With no data in the given range, the result is this:
timestamp | type | value | cumulative_sum |
---|---|---|---|
2023-01-01 00:00:00 | 1 | 0 | 0 |
2023-01-01 00:00:00 | 3 | 0 | 0 |
2023-01-02 00:00:00 | 1 | 0 | 0 |
2023-01-02 00:00:00 | 3 | 0 | 0 |
2023-01-03 00:00:00 | 1 | 0 | 0 |
2023-01-03 00:00:00 | 3 | 0 | 0 |
2023-01-04 00:00:00 | 1 | 0 | 0 |
2023-01-04 00:00:00 | 3 | 0 | 0 |
2023-01-05 00:00:00 | 1 | 0 | 0 |
2023-01-05 00:00:00 | 3 | 0 | 0 |
2023-01-06 00:00:00 | 1 | 0 | 0 |
2023-01-06 00:00:00 | 3 | 0 | 0 |
2023-01-07 00:00:00 | 1 | 0 | 0 |
2023-01-07 00:00:00 | 3 | 0 | 0 |
Test case (with a little data added)
You could also create the series with a different range for each type, based on the first and last event of that type.
Here's an example of that:
```sql
SELECT generate_series AS timestamp
     , etypes.type
     , COALESCE(events.value, 0) AS value
     , COALESCE(SUM(events.value) OVER (PARTITION BY etypes.type ORDER BY generate_series), 0) AS cumulative_sum
  FROM (SELECT type
             , MIN(event_timestamp) AS mints
             , MAX(event_timestamp) AS maxts
          FROM events
         GROUP BY type
       ) AS etypes
  JOIN LATERAL (SELECT generate_series(mints, maxts, '1 day') AS generate_series) AS gs ON 1 = 1
  LEFT JOIN events ON generate_series = events.event_timestamp AND events.type = etypes.type
 ORDER BY generate_series, etypes.type
;
```
Result with your data:
timestamp | type | value | cumulative_sum |
---|---|---|---|
2023-01-01 00:00:00 | 1 | 10 | 10 |
2023-01-02 00:00:00 | 1 | 0 | 10 |
2023-01-03 00:00:00 | 1 | 0 | 10 |
2023-01-03 00:00:00 | 2 | 10 | 10 |
2023-01-04 00:00:00 | 1 | 0 | 10 |
2023-01-04 00:00:00 | 2 | 0 | 10 |
2023-01-05 00:00:00 | 1 | 0 | 10 |
2023-01-05 00:00:00 | 2 | 10 | 20 |
2023-01-06 00:00:00 | 1 | 0 | 10 |
2023-01-07 00:00:00 | 1 | 10 | 20 |
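
Neither query above reproduces the desired outcome from the question exactly: with a fixed range every type starts on the first day, and with a per-type range each type stops at its own last event. The following is a sketch of a third variant, not part of the original answer, that starts each type at its first event and runs every type to the latest event overall. It also splits the work into two steps, first the gap-free series and then the running sum. The CTE names `bounds` and `grid` and the aliases `first_ts`, `last_ts` and `ts` are made up for illustration; the `events` table and its columns are as in the question.

```sql
WITH bounds AS (
    -- One row per type: its first event, plus the latest event of any type.
    SELECT type,
           MIN(event_timestamp) AS first_ts,
           (SELECT MAX(event_timestamp) FROM events) AS last_ts
    FROM events
    GROUP BY type
),
grid AS (
    -- Step 1: the gap-free series, one row per type and day.
    SELECT b.type, gs.ts
    FROM bounds AS b
    CROSS JOIN LATERAL generate_series(b.first_ts, b.last_ts, interval '1 day') AS gs(ts)
)
-- Step 2: attach the events and accumulate the values per type.
SELECT g.ts AS event_timestamp,
       g.type,
       COALESCE(e.value, 0) AS value,
       SUM(COALESCE(e.value, 0)) OVER (PARTITION BY g.type ORDER BY g.ts) AS cumulative_sum
FROM grid AS g
LEFT JOIN events AS e
       ON e.event_timestamp = g.ts
      AND e.type = g.type
ORDER BY g.ts, g.type;
```

Because each series starts at the type's own first event, the daily steps inherit that event's time of day. Events of the same type at a different time of day would not match the equality join, and several events with the same timestamp and type would each produce their own row, so data like that would need to be bucketed and aggregated per day first.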