How to aggregate time series data on a Postgres table for an arbitrary time range such that the returned result has no more than 100 rows

Question
The table statuses has 3 columns

  • id (uuid, PK)
  • time (timestamp with time zone, NOT NULL)
  • duration (integer, NOT NULL)

The records it contains are not evenly distributed; there could be a single record in a day or thousands in some other day.

I need to aggregate the duration (average) for any arbitrary time range such that the returned result has no more than 100 records.

Had it been grouping by day or by hour, it would have been pretty straightforward. But what's throwing me off a bit is the arbitrary time range and finding that sweet spot so the records are within the limit.

Example:

Actual data:

time duration
2023-06-30 03:49:43+00 96
2023-06-30 03:51:25+00 132
2023-06-30 03:51:55+00 124
2023-06-30 03:52:25+00 102
2023-06-30 03:52:55+00 119
2023-06-30 03:53:25+00 136
2023-06-30 03:53:55+00 92
2023-06-30 03:54:25+00 86
2023-06-30 03:54:55+00 99
2023-06-30 03:55:25+00 105
2023-06-30 03:55:55+00 94
2023-06-30 03:56:25+00 76
2023-06-30 03:56:55+00 77
2023-06-30 03:57:25+00 66
2023-06-30 03:57:55+00 80
2023-06-30 03:59:13+00 98
2023-06-30 03:59:43+00 201
2023-06-30 04:00:13+00 652
2023-06-30 04:00:43+00 154
2023-06-30 04:01:13+00 272

Assuming I want at most 10 output rows (instead of the 100 originally mentioned in the question), I would expect the following output:

time duration
2023-06-30 03:49:43+00 114
2023-06-30 03:51:55+00 113
2023-06-30 03:52:55+00 127.5
2023-06-30 03:53:55+00 89.0
...6 more rows
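One way to read the expected output: sort the rows by time, deal them into at most 10 equal-count buckets (20 rows here, so 2 per bucket), average each bucket, and label it with its earliest timestamp. A minimal Python sketch of that arithmetic, using the duration values from the sample data above:

```python
from math import ceil

# Durations from the sample data, already sorted by time.
durations = [96, 132, 124, 102, 119, 136, 92, 86, 99, 105,
             94, 76, 77, 66, 80, 98, 201, 652, 154, 272]

limit = 10                           # maximum number of output rows wanted
size = ceil(len(durations) / limit)  # rows per bucket (2 here)

# Average each equal-count chunk; its first row would supply the label timestamp.
averages = [sum(chunk) / len(chunk)
            for chunk in (durations[i:i + size]
                          for i in range(0, len(durations), size))]
print(averages[:4])   # [114.0, 113.0, 127.5, 89.0]
```

This matches the expected rows above (114, 113, 127.5, 89.0) and never produces more than `limit` buckets, since the chunk size is rounded up.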

Answer 1 (Score: 1)

Is this what you want? (Obviously you'd substitute 'x' and 'y' with the limits of your date range.)

with data_set as (select duration from statuses where time between 'x' and 'y' limit 100)
select avg(duration)
from data_set;

If this isn't what you want, can you update your question with some sample data and the result you want to achieve?

Solution 2

The following code divides the range of timestamps into 10 equal buckets (using the WIDTH_BUCKET function) and then averages all the durations for records that fall into each bucket - and uses the minimum timestamp in each bucket as a (semi-arbitrary) time to assign to the average.

To actually get 10 records, there needs to be data in each bucket, so the data must be relatively evenly distributed between the min and max timestamps. If it isn't, you can create your own arbitrary-sized buckets using the form of WIDTH_BUCKET that takes an array of the lower bounds for each bucket.

I've split the code into multiple CTEs to make it clearer what it is doing - you could probably combine some of them to shorten the code.

with data1 as (
  select
    *,
    -- encode the timestamp as a sortable integer; hh24 avoids 12-hour-clock collisions
    cast(to_char(event_time::timestamp, 'yyyymmddhh24miss') as bigint) as event_time_int
  from
    event_data
),
data2 as (
  select
    *,
    (select min(event_time_int) from data1) as min_ts,
    (select max(event_time_int) from data1) as max_ts
  from data1
),
data3 as (
  select
    *,
    width_bucket(event_time_int, min_ts, max_ts, 10) as bucket
  from data2
)
select
  min(event_time),
  avg(event_duration)
from
  data3
group by
  bucket;
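For reference, WIDTH_BUCKET(operand, low, high, n) returns a bucket number 1..n for values in [low, high), 0 for values below the range, and n+1 for values at or above the upper bound. A small Python model of that documented behavior (my own sketch, not Postgres code):

```python
import math

def width_bucket(operand: float, low: float, high: float, n: int) -> int:
    """Mimic Postgres WIDTH_BUCKET for an ascending range [low, high)."""
    if operand < low:
        return 0                  # below the range
    if operand >= high:
        return n + 1              # at or above the upper bound -> overflow bucket
    return int(math.floor((operand - low) / ((high - low) / n))) + 1

print(width_bucket(5, 0, 10, 5))   # 3: 5 falls in the third of five 2-wide buckets
print(width_bucket(10, 0, 10, 5))  # 6: the upper bound lands in the overflow bucket
```

Note the overflow bucket: the row holding the maximum timestamp gets bucket n+1, which is why a query like the one above can return up to 11 groups for n = 10.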

Answer 2 (Score: 0)

Using window functions with lead.

create table job_table(line_id integer generated always as identity, ts timestamptz, duration integer);

insert into job_table(ts, duration) values
('2023-06-30 03:49:43+00', 96),
('2023-06-30 03:51:25+00', 132),
('2023-06-30 03:51:55+00', 124),
('2023-06-30 03:52:25+00', 102),
('2023-06-30 03:52:55+00', 119),
('2023-06-30 03:53:25+00', 136),
('2023-06-30 03:53:55+00', 92),
('2023-06-30 03:54:25+00', 86),
('2023-06-30 03:54:55+00', 99),
('2023-06-30 03:55:25+00', 105),
('2023-06-30 03:55:55+00', 94),
('2023-06-30 03:56:25+00', 76),
('2023-06-30 03:56:55+00', 77),
('2023-06-30 03:57:25+00', 66),
('2023-06-30 03:57:55+00', 80),
('2023-06-30 03:59:13+00', 98),
('2023-06-30 03:59:43+00', 201),
('2023-06-30 04:00:13+00', 652),
('2023-06-30 04:00:43+00', 154),
('2023-06-30 04:01:13+00', 272);

select
    ts, round((duration + lead(duration) over(order by ts))/2.0, 2)
from
    job_table
where
    ts between '2023-06-30 03:49:43+00' and '2023-06-30 04:01:13+00'
order by ts
limit 10;

           ts            | duration 
-------------------------+----------
 06/29/2023 20:49:43 PDT |   114.00
 06/29/2023 20:51:25 PDT |   128.00
 06/29/2023 20:51:55 PDT |   113.00
 06/29/2023 20:52:25 PDT |   110.50
 06/29/2023 20:52:55 PDT |   127.50
 06/29/2023 20:53:25 PDT |   114.00
 06/29/2023 20:53:55 PDT |    89.00
 06/29/2023 20:54:25 PDT |    92.50
 06/29/2023 20:54:55 PDT |   102.00
 06/29/2023 20:55:25 PDT |    99.50
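Note that this query computes a rolling average of each row with its successor rather than bucketing, so it still emits one row per input row (capped by LIMIT). The arithmetic can be sketched in Python, using the first durations from the sample data:

```python
durations = [96, 132, 124, 102, 119, 136, 92, 86, 99, 105]

# lead(duration) pairs each row with the next one ordered by ts;
# the final row of the full set pairs with NULL and is skipped here.
rolling = [round((a + b) / 2.0, 2) for a, b in zip(durations, durations[1:])]
print(rolling[:4])   # [114.0, 128.0, 113.0, 110.5]
```

These values match the first rows of the query output above.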




Answer 3 (Score: 0)

I have this time series as input: 3 hour's worth of data, unevenly distributed:

SELECT
  date_trunc('hour', ts) AS thehour
, count(*) AS rc
FROM tseries
GROUP BY 1
ORDER BY 1;
-- out        thehour       | rc 
-- out ---------------------+----
-- out  2023-07-01 00:00:00 | 44
-- out  2023-07-01 01:00:00 | 35
-- out  2023-07-01 02:00:00 | 21

The data look like this: ...

         ts          |  nam   
---------------------+--------
 2023-07-01 00:00:00 | row 00
 2023-07-01 00:01:00 | row 01
 2023-07-01 00:03:00 | row 02
 2023-07-01 00:04:00 | row 03
[. . .]
 2023-07-01 02:42:00 | row 96
 2023-07-01 02:45:00 | row 97
 2023-07-01 02:47:00 | row 98
 2023-07-01 02:56:00 | row 99
(100 rows)

All I do is add a row number column, which tells me whether a row is the first, second, third, etc. when ordered by timestamp. The second step is to integer-divide this row number by 10 in my example (by 100 in yours) to get a grouping criterion. Like this:

  SELECT
    ROW_NUMBER() OVER(ORDER BY ts) AS rn
  , ROW_NUMBER() OVER(ORDER BY ts) / 10 AS grp
  , *
  FROM tseries;

 rn  | grp |         ts          |  nam   
-----+-----+---------------------+--------
   1 |   0 | 2023-07-01 00:00:00 | row 00
   2 |   0 | 2023-07-01 00:01:00 | row 01
   3 |   0 | 2023-07-01 00:03:00 | row 02
   4 |   0 | 2023-07-01 00:04:00 | row 03
[. . .]
   9 |   0 | 2023-07-01 00:08:00 | row 08
  10 |   1 | 2023-07-01 00:10:00 | row 09
  11 |   1 | 2023-07-01 00:10:00 | row 10
[. . .]
  96 |   9 | 2023-07-01 02:39:00 | row 95
  97 |   9 | 2023-07-01 02:42:00 | row 96
  98 |   9 | 2023-07-01 02:45:00 | row 97
  99 |   9 | 2023-07-01 02:47:00 | row 98
 100 |  10 | 2023-07-01 02:56:00 | row 99

Then, I can group by the grp column in an outside query; and when I add a LEAD() function call to the inner query, I have the choice to subtract the minimal timestamp of the group from the maximal timestamp of the group, or from the maximal next_ts of the group:

WITH
w_rn AS (
  SELECT
    ROW_NUMBER() OVER w      AS rn
  , ROW_NUMBER() OVER w / 10 AS grp
  , LEAD(ts)     OVER w      AS next_ts
  , *
  FROM tseries
  WINDOW w AS (ORDER BY ts)
)
SELECT
  grp
, MIN(ts)                AS start_ts
, MAX(ts)                AS end_ts
, MAX(next_ts)           AS next_ts
, MAX(ts) - MIN(ts)      AS duration_inside
, MAX(next_ts) - MIN(ts) AS duration_outside
FROM w_rn
GROUP BY 1
ORDER BY 1;
 grp |      start_ts       |       end_ts        |       next_ts       | duration_inside | duration_outside 
-----+---------------------+---------------------+---------------------+-----------------+------------------
   0 | 2023-07-01 00:00:00 | 2023-07-01 00:08:00 | 2023-07-01 00:10:00 | 00:08:00        | 00:10:00
   1 | 2023-07-01 00:10:00 | 2023-07-01 00:27:00 | 2023-07-01 00:31:00 | 00:17:00        | 00:21:00
   2 | 2023-07-01 00:31:00 | 2023-07-01 00:43:00 | 2023-07-01 00:44:00 | 00:12:00        | 00:13:00
   3 | 2023-07-01 00:44:00 | 2023-07-01 00:52:00 | 2023-07-01 00:55:00 | 00:08:00        | 00:11:00
   4 | 2023-07-01 00:55:00 | 2023-07-01 01:10:00 | 2023-07-01 01:10:00 | 00:15:00        | 00:15:00
   5 | 2023-07-01 01:10:00 | 2023-07-01 01:35:00 | 2023-07-01 01:40:00 | 00:25:00        | 00:30:00
   6 | 2023-07-01 01:40:00 | 2023-07-01 01:48:00 | 2023-07-01 01:50:00 | 00:08:00        | 00:10:00
   7 | 2023-07-01 01:50:00 | 2023-07-01 01:59:00 | 2023-07-01 02:02:00 | 00:09:00        | 00:12:00
   8 | 2023-07-01 02:02:00 | 2023-07-01 02:21:00 | 2023-07-01 02:23:00 | 00:19:00        | 00:21:00
   9 | 2023-07-01 02:23:00 | 2023-07-01 02:47:00 | 2023-07-01 02:56:00 | 00:24:00        | 00:33:00
  10 | 2023-07-01 02:56:00 | 2023-07-01 02:56:00 | (null)              | 00:00:00        | (null)
(11 rows)

Please do scroll the last report to the right to see how it calculates.
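One subtlety, visible in the report above: with rn starting at 1, rn / 10 yields 11 distinct groups for 100 rows (group 0 holds only 9 rows and group 10 holds 1). Subtracting 1 before dividing gives exactly 10 equal groups. A quick Python check of that arithmetic (my own illustration, not part of the answer):

```python
from collections import Counter

rows = range(1, 101)   # rn values for 100 rows

naive = Counter(rn // 10 for rn in rows)          # grouping as in the answer
shifted = Counter((rn - 1) // 10 for rn in rows)  # zero-based variant

print(len(naive), len(shifted))   # 11 10
print(naive[0], naive[10])        # 9 1
```

With the zero-based variant every group has exactly 10 rows, and the row count never exceeds the requested limit.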

Answer 4 (Score: 0)

The following query demonstrates using ntile to aggregate time series data with irregular intervals into buckets:

WITH statuses(start_time, duration) AS (
  VALUES ('2023-06-30 03:49:43+00'::timestamptz, 96),
         ('2023-06-30 03:51:25+00'::timestamptz, 132),
         ('2023-06-30 03:51:55+00'::timestamptz, 124),
         ('2023-06-30 03:52:25+00'::timestamptz, 102),
         ('2023-06-30 03:52:55+00'::timestamptz, 119),
         ('2023-06-30 03:53:25+00'::timestamptz, 136),
         ('2023-06-30 03:53:55+00'::timestamptz, 92),
         ('2023-06-30 03:54:25+00'::timestamptz, 86),
         ('2023-06-30 03:54:55+00'::timestamptz, 99),
         ('2023-06-30 03:55:25+00'::timestamptz, 105),
         ('2023-06-30 03:55:55+00'::timestamptz, 94),
         ('2023-06-30 03:56:25+00'::timestamptz, 76),
         ('2023-06-30 03:56:55+00'::timestamptz, 77),
         ('2023-06-30 03:57:25+00'::timestamptz, 66),
         ('2023-06-30 03:57:55+00'::timestamptz, 80),
         ('2023-06-30 03:59:13+00'::timestamptz, 98),
         ('2023-06-30 03:59:43+00'::timestamptz, 201),
         ('2023-06-30 04:00:13+00'::timestamptz, 652),
         ('2023-06-30 04:00:43+00'::timestamptz, 154),
         ('2023-06-30 04:01:13+00'::timestamptz, 272)),
buckets AS (
  SELECT statuses.*, ntile(10) OVER (ORDER BY statuses.start_time) AS bucket
    FROM statuses)
SELECT buckets.bucket, min(buckets.start_time) AS start_time, avg(buckets.duration) AS mean_duration
  FROM buckets
  GROUP BY buckets.bucket
  ORDER BY buckets.bucket;

The argument to ntile determines the maximum number of buckets. There will never be more buckets than data points. Running the preceding query results in the following output:

 bucket |       start_time       |    mean_duration     
--------+------------------------+----------------------
      1 | 2023-06-29 23:49:43-04 | 114.0000000000000000
      2 | 2023-06-29 23:51:55-04 | 113.0000000000000000
      3 | 2023-06-29 23:52:55-04 | 127.5000000000000000
      4 | 2023-06-29 23:53:55-04 |  89.0000000000000000
      5 | 2023-06-29 23:54:55-04 | 102.0000000000000000
      6 | 2023-06-29 23:55:55-04 |  85.0000000000000000
      7 | 2023-06-29 23:56:55-04 |  71.5000000000000000
      8 | 2023-06-29 23:57:55-04 |  89.0000000000000000
      9 | 2023-06-29 23:59:43-04 | 426.5000000000000000
     10 | 2023-06-30 00:00:43-04 | 213.0000000000000000
(10 rows)

Adding a time range to the buckets subquery efficiently limits the data points included in the buckets.
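ntile(n) deals rows out into n nearly equal groups, giving any remainder to the leading groups, so bucket sizes never differ by more than one. A Python sketch of that distribution (my own model of the documented behavior, not Postgres code):

```python
def ntile_sizes(n_rows: int, n_tiles: int) -> list:
    """Bucket sizes produced by ntile(n_tiles) over n_rows ordered rows."""
    base, rem = divmod(n_rows, n_tiles)
    # The remainder rows go to the first `rem` buckets; with fewer rows
    # than tiles, only n_rows buckets (of one row each) are produced.
    return [base + 1 if i < rem else base for i in range(min(n_tiles, n_rows))]

print(ntile_sizes(20, 10))  # [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
print(ntile_sizes(20, 7))   # [3, 3, 3, 3, 3, 3, 2]
print(ntile_sizes(3, 10))   # [1, 1, 1] -- never more buckets than rows
```

This is why the query above produces exactly 10 buckets of 2 rows each for the 20 sample rows, and why the output can never exceed the argument passed to ntile.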

huangapple
  • Published on 2023-07-04 21:30:25
  • When reposting, please retain the original link: https://go.coder-hub.com/76613160.html