How to partition a table based on a pattern(in this case an increasing sub sequence) that is found in a column in presto sql

huangapple go评论72阅读模式
英文:

How to partition a table based on a pattern(in this case an increasing sub sequence) that is found in a column in presto sql

问题

我遇到了以下问题。我有一个表T,在其中有一个名为offset的列,它将始终是一个以零开头的递增整数子序列集合。

 T

**行号, offset, col_1, col_2, col_3 ....**
     1       0
     2      10
     3      22
     4       32
     5       0
     6       0
     7       12
     8       32 
     9       44

我想根据offset列对表进行分区,使属于每个严格递增子序列的行放入一个分区。在上面的示例中,行[1,2,3,4],[0],[0,12,32,44]是我正在寻找的分区。我该如何继续,或者在SQL中是否可能实现这个?我尝试使用lead和lag分区函数,想法是创建一个辅助列,存储分区号,然后根据这个列对结果表进行分组。但我认为这样的努力可能需要循环功能,而我不认为SQL中存在这样的功能。

英文:

I am encountering the following problem. I have a table T, in which there is column named offset which will always be a collection of increasing sub sequences of integers beginning with zero.

Table T:

**row_num, offset ,col_1, col_2, col_3 ....**
     1       0
     2      10
     3      22
     4       32
     5       0
     6       0
     7       12
     8       32 
     9       44
             

I want to partition the table based on offset column such that the rows belonging to each strictly increasing subsequence are put into one partition. In the above example rows [1,2,3,4] , [0], [0,12,32,44] are the partitions that i am looking for. How do i proceed or is this even possible in sql?

I tried using lead and lag partition functions, the idea was to come up with an auxiliary column which stores partition numbers and then group by the resulting table based on this column. But i believe such an effort would require the loop functionality which i don't think exists in sql.

答案1

得分: 4

您可以使用 MATCH_RECOGNIZE 来识别满足某种模式的行。在您的情况下,您可以定义一个递增值 offset 的模式:

WITH t(row_num, offset) AS (
    VALUES
             (1, 0),
             (2, 10),
             (3, 22),
             (4, 32),
             (5, 0),
             (6, 0),
             (7, 12),
             (8, 32),
             (9, 44)
)
SELECT row_num, offset, partition FROM t
    MATCH_RECOGNIZE (
        ORDER BY row_num
        MEASURES MATCH_NUMBER() AS partition
        ALL ROWS PER MATCH
        AFTER MATCH SKIP PAST LAST ROW
        PATTERN (START UP*)
        DEFINE UP AS offset > PREV(offset)
    )

上面查询中的 MATCH_RECOGNIZE 子句的含义是:

  • 查找符合模式 START UP* 的一系列行(按 row_num 排序)。START 元素应该出现一次,其后可以出现零次或多次 UP 元素。STARTUP 是用来表示序列中的行的任意名称。START 被隐式地定义,它匹配任何行,而 UP 被定义为其 offset 列的值大于序列中前一行的 offset 列的值(DEFINE UP AS offset > PREV(offset))。
  • 计算一个名为 partition 的输出列,其值是与每个匹配项关联的合成数字(MEASURES MATCH_NUMBER() AS partition)。
  • 输出所有带有 partition 列注释的输入行(ALL ROWS PER MATCH)。

该查询会产生以下结果:

 row_num | offset | partition
---------+--------+-----------
       1 |      0 |         1
       2 |     10 |         1
       3 |     22 |         1
       4 |     32 |         1
       5 |      0 |         2
       6 |      0 |         3
       7 |     12 |         3
       8 |     32 |         3
       9 |     44 |         3
(9 rows)

详细信息,请参阅:

请注意,此功能仅在 Trino 和基于 Trino 的 Athena 版本 3 中可用。

英文:

You can use MATCH_RECOGNIZE to identify rows that satisfy a pattern. In your case, you can define a pattern of increasing values of the column offset:

WITH t(row_num, offset) AS (
    VALUES
             (1, 0),
             (2, 10),
             (3, 22),
             (4, 32),
             (5, 0),
             (6, 0),
             (7, 12),
             (8, 32),
             (9, 44)
)
SELECT row_num, offset, partition FROM t
    MATCH_RECOGNIZE (
        ORDER BY row_num
        MEASURES MATCH_NUMBER() AS partition
        ALL ROWS PER MATCH
        AFTER MATCH SKIP PAST LAST ROW
        PATTERN (START UP*)
        DEFINE UP AS offset > PREV(offset)
    )

What the MATCH_RECOGNIZE clause says in the query above is:

  • Find a sequence of rows (ordered by row_num) that matches the pattern START UP*. The START element should appear once and the UP element should appear zero or more times following it. START and UP are arbitrary names to represent rows in the sequence. START is defined implicitly and it matches any row, while UP is defined as a row whose offset column has a value larger than the offset column of the previous row in the sequence (DEFINE UP AS offset > PREV(offset))
  • Compute an output column named partition whose value is a synthetic number associated with each match. (MEASURES MATCH_NUMBER() AS partition)
  • Output all the input rows with the partition column annotation. (ALL ROWS PER MATCH)

The query produces:

 row_num | offset | partition
---------+--------+-----------
       1 |      0 |         1
       2 |     10 |         1
       3 |     22 |         1
       4 |     32 |         1
       5 |      0 |         2
       6 |      0 |         3
       7 |     12 |         3
       8 |     32 |         3
       9 |     44 |         3
(9 rows)

For more details, please see:

Note that this functionality is only available in Trino and Athena version 3, which is based on Trino.

答案2

得分: 1

你可以使用一个运行总和函数,在row_num递增时,每当发现偏移值为零时,总和加1:

select row_num, offset,
  sum(case when offset = 0 then 1 else 0 end) over (order by row_num) prt
from table_name

此查询的输出结果为:

row_num   offset   prt
1         0        1
2         10       1
3         22       1
4         32       1
5         0        2
6         0        3
7         12       3
8         32       3
9         44       3
英文:

You could use a running sum function that increased by 1 whenever a zero offset value is found over the increasing of row_num:

select row_num, offset,
  sum(case when offset = 0 then 1 else 0 end) over (order by row_num) prt
from table_name

The output of this query:

row_num	  offset	prt
1	      0	        1
2	      10	    1
3	      22	    1
4	      32	    1
5	      0	        2
6	      0	        3
7	      12    	3
8	      32	    3
9	      44	    3

huangapple
  • 本文由 发表于 2023年2月8日 13:18:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/75381617.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定