How to aggregate time series data on a Postgres table for an arbitrary time range such that the returned result has no more than 100 rows

Question
The table statuses has 3 columns

  • id (uuid, PK)
  • time (timestamp with time zone, NOT NULL)
  • duration (integer, NOT NULL)

The records it contains are not evenly distributed; there could be a single record in a day or thousands in some other day.

I need to aggregate the duration (average) for any arbitrary time range such that the returned result has no more than 100 records.

Had it been grouping by day or by hour, it would have been pretty straightforward. But what's throwing me off a bit is the arbitrary time range and finding that sweet spot so the records are within the limit.

Example:

Actual data:

time duration
2023-06-30 03:49:43+00 96
2023-06-30 03:51:25+00 132
2023-06-30 03:51:55+00 124
2023-06-30 03:52:25+00 102
2023-06-30 03:52:55+00 119
2023-06-30 03:53:25+00 136
2023-06-30 03:53:55+00 92
2023-06-30 03:54:25+00 86
2023-06-30 03:54:55+00 99
2023-06-30 03:55:25+00 105
2023-06-30 03:55:55+00 94
2023-06-30 03:56:25+00 76
2023-06-30 03:56:55+00 77
2023-06-30 03:57:25+00 66
2023-06-30 03:57:55+00 80
2023-06-30 03:59:13+00 98
2023-06-30 03:59:43+00 201
2023-06-30 04:00:13+00 652
2023-06-30 04:00:43+00 154
2023-06-30 04:01:13+00 272

Assuming I want at most 10 output rows (instead of the 100 originally mentioned in the question), I would expect the following output:

time duration
2023-06-30 03:49:43+00 114
2023-06-30 03:51:55+00 113
2023-06-30 03:52:55+00 127.5
2023-06-30 03:53:55+00 89.0
...6 more rows
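One way to read the expected output: sort the rows by time, deal them into at most 10 equal-count buckets (20 rows here, so 2 per bucket), average each bucket, and label it with its earliest timestamp. A minimal Python sketch of that arithmetic, using the duration values from the sample data above:

```python
from math import ceil

# Durations from the sample data, already sorted by time.
durations = [96, 132, 124, 102, 119, 136, 92, 86, 99, 105,
             94, 76, 77, 66, 80, 98, 201, 652, 154, 272]

limit = 10                           # maximum number of output rows wanted
size = ceil(len(durations) / limit)  # rows per bucket (2 here)

# Average each equal-count chunk; its first row would supply the label timestamp.
averages = [sum(chunk) / len(chunk)
            for chunk in (durations[i:i + size]
                          for i in range(0, len(durations), size))]
print(averages[:4])   # [114.0, 113.0, 127.5, 89.0]
```

This matches the expected rows above (114, 113, 127.5, 89.0) and never produces more than `limit` buckets, since the chunk size is rounded up.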

Answer 1 (Score: 1)

Is this what you want? (Obviously you'd substitute 'x' and 'y' with the limits of your date range.)

with data_set as (select duration from statuses where time between 'x' and 'y' limit 100)
select avg(duration)
from data_set;

If this isn't what you want, can you update your question with some sample data and the result you want to achieve?

Solution 2

The following code divides the range of timestamps into 10 equal buckets (using the WIDTH_BUCKET function) and then averages all the durations for records that fall into each bucket - and uses the minimum timestamp in each bucket as a (semi-arbitrary) time to assign to the average.

To actually get 10 records, there needs to be data in each bucket, so the data must be relatively evenly distributed between the min and max timestamps. If it isn't, you can create your own arbitrary-sized buckets using the form of WIDTH_BUCKET that takes an array of the lower bounds for each bucket.

I've split the code into multiple CTEs to make it clearer what it is doing - you could probably combine some of them to shorten the code.

with data1 as (
  select
    *,
    -- encode the timestamp as a sortable integer; hh24 avoids 12-hour-clock collisions
    cast(to_char(event_time::timestamp, 'yyyymmddhh24miss') as bigint) as event_time_int
  from
    event_data
),
data2 as (
  select
    *,
    (select min(event_time_int) from data1) as min_ts,
    (select max(event_time_int) from data1) as max_ts
  from data1
),
data3 as (
  select
    *,
    width_bucket(event_time_int, min_ts, max_ts, 10) as bucket
  from data2
)
select
  min(event_time),
  avg(event_duration)
from
  data3
group by
  bucket;
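For reference, WIDTH_BUCKET(operand, low, high, n) returns a bucket number 1..n for values in [low, high), 0 for values below the range, and n+1 for values at or above the upper bound. A small Python model of that documented behavior (my own sketch, not Postgres code):

```python
import math

def width_bucket(operand: float, low: float, high: float, n: int) -> int:
    """Mimic Postgres WIDTH_BUCKET for an ascending range [low, high)."""
    if operand < low:
        return 0                  # below the range
    if operand >= high:
        return n + 1              # at or above the upper bound -> overflow bucket
    return int(math.floor((operand - low) / ((high - low) / n))) + 1

print(width_bucket(5, 0, 10, 5))   # 3: 5 falls in the third of five 2-wide buckets
print(width_bucket(10, 0, 10, 5))  # 6: the upper bound lands in the overflow bucket
```

Note the overflow bucket: the row holding the maximum timestamp gets bucket n+1, which is why a query like the one above can return up to 11 groups for n = 10.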

Answer 2 (Score: 0)

Using window functions with lead.

create table job_table(line_id integer generated always as identity, ts timestamptz, duration integer);

insert into job_table(ts, duration) values
('2023-06-30 03:49:43+00', 96),
('2023-06-30 03:51:25+00', 132),
('2023-06-30 03:51:55+00', 124),
('2023-06-30 03:52:25+00', 102),
('2023-06-30 03:52:55+00', 119),
('2023-06-30 03:53:25+00', 136),
('2023-06-30 03:53:55+00', 92),
('2023-06-30 03:54:25+00', 86),
('2023-06-30 03:54:55+00', 99),
('2023-06-30 03:55:25+00', 105),
('2023-06-30 03:55:55+00', 94),
('2023-06-30 03:56:25+00', 76),
('2023-06-30 03:56:55+00', 77),
('2023-06-30 03:57:25+00', 66),
('2023-06-30 03:57:55+00', 80),
('2023-06-30 03:59:13+00', 98),
('2023-06-30 03:59:43+00', 201),
('2023-06-30 04:00:13+00', 652),
('2023-06-30 04:00:43+00', 154),
('2023-06-30 04:01:13+00', 272);

select
    ts, round((duration + lead(duration) over(order by ts))/2.0, 2)
from
    job_table
where
    ts between '2023-06-30 03:49:43+00' and '2023-06-30 04:01:13+00'
order by ts
limit 10;

           ts            | duration 
-------------------------+----------
 06/29/2023 20:49:43 PDT |   114.00
 06/29/2023 20:51:25 PDT |   128.00
 06/29/2023 20:51:55 PDT |   113.00
 06/29/2023 20:52:25 PDT |   110.50
 06/29/2023 20:52:55 PDT |   127.50
 06/29/2023 20:53:25 PDT |   114.00
 06/29/2023 20:53:55 PDT |    89.00
 06/29/2023 20:54:25 PDT |    92.50
 06/29/2023 20:54:55 PDT |   102.00
 06/29/2023 20:55:25 PDT |    99.50
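Note that this query computes a rolling average of each row with its successor rather than bucketing, so it still emits one row per input row (capped by LIMIT). The arithmetic can be sketched in Python, using the first durations from the sample data:

```python
durations = [96, 132, 124, 102, 119, 136, 92, 86, 99, 105]

# lead(duration) pairs each row with the next one ordered by ts;
# the final row of the full set pairs with NULL and is skipped here.
rolling = [round((a + b) / 2.0, 2) for a, b in zip(durations, durations[1:])]
print(rolling[:4])   # [114.0, 128.0, 113.0, 110.5]
```

These values match the first rows of the query output above.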




Answer 3 (Score: 0)

I have this time series as input: 3 hour's worth of data, unevenly distributed:

SELECT
  date_trunc('hour', ts) AS thehour
, count(*) AS rc
FROM tseries
GROUP BY 1
ORDER BY 1;
-- out        thehour       | rc 
-- out ---------------------+----
-- out  2023-07-01 00:00:00 | 44
-- out  2023-07-01 01:00:00 | 35
-- out  2023-07-01 02:00:00 | 21

The data look like this: ...

         ts          |  nam   
---------------------+--------
 2023-07-01 00:00:00 | row 00
 2023-07-01 00:01:00 | row 01
 2023-07-01 00:03:00 | row 02
 2023-07-01 00:04:00 | row 03
[. . .]
 2023-07-01 02:42:00 | row 96
 2023-07-01 02:45:00 | row 97
 2023-07-01 02:47:00 | row 98
 2023-07-01 02:56:00 | row 99
(100 rows)

All I do is add a row number column, which tells me whether a row is the first, second, third, etc. when ordered by timestamp. The second step is to integer-divide this row number by 10 in my example (by 100 in yours) to get a grouping criterion. Like this:

  SELECT
    ROW_NUMBER() OVER(ORDER BY ts) AS rn
  , ROW_NUMBER() OVER(ORDER BY ts) / 10 AS grp
  , *
  FROM tseries;

 rn  | grp |         ts          |  nam   
-----+-----+---------------------+--------
   1 |   0 | 2023-07-01 00:00:00 | row 00
   2 |   0 | 2023-07-01 00:01:00 | row 01
   3 |   0 | 2023-07-01 00:03:00 | row 02
   4 |   0 | 2023-07-01 00:04:00 | row 03
[. . .]
   9 |   0 | 2023-07-01 00:08:00 | row 08
  10 |   1 | 2023-07-01 00:10:00 | row 09
  11 |   1 | 2023-07-01 00:10:00 | row 10
[. . .]
  96 |   9 | 2023-07-01 02:39:00 | row 95
  97 |   9 | 2023-07-01 02:42:00 | row 96
  98 |   9 | 2023-07-01 02:45:00 | row 97
  99 |   9 | 2023-07-01 02:47:00 | row 98
 100 |  10 | 2023-07-01 02:56:00 | row 99

Then, I can group by the grp column in an outside query; and when I add a LEAD() function call to the inner query, I have the choice to subtract the minimal timestamp of the group from the maximal timestamp of the group, or from the maximal next_ts of the group:

WITH
w_rn AS (
  SELECT
    ROW_NUMBER() OVER w      AS rn
  , ROW_NUMBER() OVER w / 10 AS grp
  , LEAD(ts)     OVER w      AS next_ts
  , *
  FROM tseries
  WINDOW w AS (ORDER BY ts)
)
SELECT
  grp
, MIN(ts)                AS start_ts
, MAX(ts)                AS end_ts
, MAX(next_ts)           AS next_ts
, MAX(ts) - MIN(ts)      AS duration_inside
, MAX(next_ts) - MIN(ts) AS duration_outside
FROM w_rn
GROUP BY 1
ORDER BY 1;
 grp |      start_ts       |       end_ts        |       next_ts       | duration_inside | duration_outside 
-----+---------------------+---------------------+---------------------+-----------------+------------------
   0 | 2023-07-01 00:00:00 | 2023-07-01 00:08:00 | 2023-07-01 00:10:00 | 00:08:00        | 00:10:00
   1 | 2023-07-01 00:10:00 | 2023-07-01 00:27:00 | 2023-07-01 00:31:00 | 00:17:00        | 00:21:00
   2 | 2023-07-01 00:31:00 | 2023-07-01 00:43:00 | 2023-07-01 00:44:00 | 00:12:00        | 00:13:00
   3 | 2023-07-01 00:44:00 | 2023-07-01 00:52:00 | 2023-07-01 00:55:00 | 00:08:00        | 00:11:00
   4 | 2023-07-01 00:55:00 | 2023-07-01 01:10:00 | 2023-07-01 01:10:00 | 00:15:00        | 00:15:00
   5 | 2023-07-01 01:10:00 | 2023-07-01 01:35:00 | 2023-07-01 01:40:00 | 00:25:00        | 00:30:00
   6 | 2023-07-01 01:40:00 | 2023-07-01 01:48:00 | 2023-07-01 01:50:00 | 00:08:00        | 00:10:00
   7 | 2023-07-01 01:50:00 | 2023-07-01 01:59:00 | 2023-07-01 02:02:00 | 00:09:00        | 00:12:00
   8 | 2023-07-01 02:02:00 | 2023-07-01 02:21:00 | 2023-07-01 02:23:00 | 00:19:00        | 00:21:00
   9 | 2023-07-01 02:23:00 | 2023-07-01 02:47:00 | 2023-07-01 02:56:00 | 00:24:00        | 00:33:00
  10 | 2023-07-01 02:56:00 | 2023-07-01 02:56:00 | (null)              | 00:00:00        | (null)
(11 rows)

Please do scroll the last report to the right to see how it calculates.
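One subtlety, visible in the report above: with rn starting at 1, rn / 10 yields 11 distinct groups for 100 rows (group 0 holds only 9 rows and group 10 holds 1). Subtracting 1 before dividing gives exactly 10 equal groups. A quick Python check of that arithmetic (my own illustration, not part of the answer):

```python
from collections import Counter

rows = range(1, 101)   # rn values for 100 rows

naive = Counter(rn // 10 for rn in rows)          # grouping as in the answer
shifted = Counter((rn - 1) // 10 for rn in rows)  # zero-based variant

print(len(naive), len(shifted))   # 11 10
print(naive[0], naive[10])        # 9 1
```

With the zero-based variant every group has exactly 10 rows, and the row count never exceeds the requested limit.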

Answer 4 (Score: 0)

The following query demonstrates using ntile to aggregate time series data with irregular intervals into buckets:

WITH statuses(start_time, duration) AS (
  VALUES ('2023-06-30 03:49:43+00'::timestamptz, 96),
         ('2023-06-30 03:51:25+00'::timestamptz, 132),
         ('2023-06-30 03:51:55+00'::timestamptz, 124),
         ('2023-06-30 03:52:25+00'::timestamptz, 102),
         ('2023-06-30 03:52:55+00'::timestamptz, 119),
         ('2023-06-30 03:53:25+00'::timestamptz, 136),
         ('2023-06-30 03:53:55+00'::timestamptz, 92),
         ('2023-06-30 03:54:25+00'::timestamptz, 86),
         ('2023-06-30 03:54:55+00'::timestamptz, 99),
         ('2023-06-30 03:55:25+00'::timestamptz, 105),
         ('2023-06-30 03:55:55+00'::timestamptz, 94),
         ('2023-06-30 03:56:25+00'::timestamptz, 76),
         ('2023-06-30 03:56:55+00'::timestamptz, 77),
         ('2023-06-30 03:57:25+00'::timestamptz, 66),
         ('2023-06-30 03:57:55+00'::timestamptz, 80),
         ('2023-06-30 03:59:13+00'::timestamptz, 98),
         ('2023-06-30 03:59:43+00'::timestamptz, 201),
         ('2023-06-30 04:00:13+00'::timestamptz, 652),
         ('2023-06-30 04:00:43+00'::timestamptz, 154),
         ('2023-06-30 04:01:13+00'::timestamptz, 272)),
buckets AS (
  SELECT statuses.*, ntile(10) OVER (ORDER BY statuses.start_time) AS bucket
    FROM statuses)
SELECT buckets.bucket, min(buckets.start_time) AS start_time, avg(buckets.duration) AS mean_duration
  FROM buckets
  GROUP BY buckets.bucket
  ORDER BY buckets.bucket;

The argument to ntile determines the maximum number of buckets. There will never be more buckets than data points. Running the preceding query results in the following output:

 bucket |       start_time       |    mean_duration     
--------+------------------------+----------------------
      1 | 2023-06-29 23:49:43-04 | 114.0000000000000000
      2 | 2023-06-29 23:51:55-04 | 113.0000000000000000
      3 | 2023-06-29 23:52:55-04 | 127.5000000000000000
      4 | 2023-06-29 23:53:55-04 |  89.0000000000000000
      5 | 2023-06-29 23:54:55-04 | 102.0000000000000000
      6 | 2023-06-29 23:55:55-04 |  85.0000000000000000
      7 | 2023-06-29 23:56:55-04 |  71.5000000000000000
      8 | 2023-06-29 23:57:55-04 |  89.0000000000000000
      9 | 2023-06-29 23:59:43-04 | 426.5000000000000000
     10 | 2023-06-30 00:00:43-04 | 213.0000000000000000
(10 rows)

Adding a time range to the buckets subquery efficiently limits the data points included in the buckets.
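ntile(n) deals rows out into n nearly equal groups, giving any remainder to the leading groups, so bucket sizes never differ by more than one. A Python sketch of that distribution (my own model of the documented behavior, not Postgres code):

```python
def ntile_sizes(n_rows: int, n_tiles: int) -> list:
    """Bucket sizes produced by ntile(n_tiles) over n_rows ordered rows."""
    base, rem = divmod(n_rows, n_tiles)
    # The remainder rows go to the first `rem` buckets; with fewer rows
    # than tiles, only n_rows buckets (of one row each) are produced.
    return [base + 1 if i < rem else base for i in range(min(n_tiles, n_rows))]

print(ntile_sizes(20, 10))  # [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
print(ntile_sizes(20, 7))   # [3, 3, 3, 3, 3, 3, 2]
print(ntile_sizes(3, 10))   # [1, 1, 1] -- never more buckets than rows
```

This is why the query above produces exactly 10 buckets of 2 rows each for the 20 sample rows, and why the output can never exceed the argument passed to ntile.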

huangapple
  • Published on 2023-07-04 21:30:25
  • When reposting, please retain the original link: https://go.coder-hub.com/76613160.html