两个相似的Presto SQL查询之间的差异

huangapple go评论83阅读模式
英文:

Discrepancy between two similar presto SQL queries

问题

以下是您要翻译的内容:

Query 1:

SELECT visitor_id, array_agg(timestamp) as time, array_agg(offset) as offset_list 
from
(SELECT * FROM 
(
SELECT visitor_id, timestamp,
cast(json_extract(uri_args, '$.offset') AS int) as offset


FROM table_t 
where year = 2023 and month = 1 and day = 27 and 

request_uri = '/home_page')
order by visitor_id, timestamp)
group by visitor_id 
order by cardinality(offset_list) desc 

Query 2:

SELECT visitor_id ,array_agg(offset) as offset_list 
from
(SELECT * FROM 
(
SELECT visitor_id, timestamp,
cast(json_extract(uri_args, '$.offset') AS int) as offset


FROM table_t 
where year = 2023 and month = 1 and day = 27 and 

request_uri = '/home_page')
order by visitor_id, timestamp)
group by visitor_id 
order by cardinality(offset_list) desc

Here uri_args is simply a json file which under the key 'offset' contains the value of the offset for the particular API response. This is from response log table of a server.

Although the two queries are similar and according to me ought to return the same values in the offset_list column i find the following discrepancy :

I will consider a particular visitor_id to convey it clearly, for a visitor_id ='12345' query i returns the following row in the offset_list col

[0, 0, 0, 10, 0, 10, 20, 32, 42, 0, 0, 20, 53, 77, 57, 0, 10, 20, 31, 10, 41, 0, 10, 41, 54, 77, 0, 10, 31, 41, 54, 10, 31, 54, 57, 77, 10, 20, 32, 0, 10, 21, 33, 44, 72, 52, 0, 10, 20, 31, 41]

and for query 2 the output is as follows :

[20, 32, 42, 0, 0, 20, 53, 77, 57, 0, 10, 20, 31, 10, 41, 0, 10, 41, 54, 77, 0, 10, 31, 41, 54, 10, 31, 54, 77, 57, 10, 20, 32, 0, 10, 21, 33, 44, 72, 52, 0, 10, 20, 31, 41, 0, 0, 0, 10, 0, 10]

I can observe the the two are circular permutations of each other but fail to see why this is happening. Please help me understand what the difference is the inner working of each query. The first reply suits the intent of my quest which is to capture the visitors journey on the homepage.

英文:

I have the following SQL queries that should written similar results wrt offset column, but dont.

Query 1:

SELECT visitor_id, array_agg(timestamp) as time, array_agg(offset) as offset_list 
from
(SELECT * FROM 
(
SELECT visitor_id, timestamp,
cast(json_extract(uri_args, '$.offset') AS int) as offset


FROM table_t 
where year = 2023 and month = 1 and day = 27 and 

request_uri = '/home_page')
order by visitor_id, timestamp)
group by visitor_id 
order by cardinality(offset_list) desc 

Query 2:

SELECT visitor_id ,array_agg(offset) as offset_list 
from
(SELECT * FROM 
(
SELECT visitor_id, timestamp,
cast(json_extract(uri_args, '$.offset') AS int) as offset


FROM table_t 
where year = 2023 and month = 1 and day = 27 and 

request_uri = '/home_page')
order by visitor_id, timestamp)
group by visitor_id 
order by cardinality(offset_list) desc

Here uri_args is simply a json file which under the key 'offset' contains the value of the offset for the particular API response. This is from response log table of a server.

Although the two queries are similar and according to me ought to return the same values in the offset_list column i find the following discrepancy :

I will consider a particular visitor_id to convey it clearly, for a visitor_id ='12345' query i returns the following row in the offset_list col

[0, 0, 0, 10, 0, 10, 20, 32, 42, 0, 0, 20, 53, 77, 57, 0, 10, 20, 31, 10, 41, 0, 10, 41, 54, 77, 0, 10, 31, 41, 54, 10, 31, 54, 57, 77, 10, 20, 32, 0, 10, 21, 33, 44, 72, 52, 0, 10, 20, 31, 41]

and for query 2 the output is as follows :

[20, 32, 42, 0, 0, 20, 53, 77, 57, 0, 10, 20, 31, 10, 41, 0, 10, 41, 54, 77, 0, 10, 31, 41, 54, 10, 31, 54, 77, 57, 10, 20, 32, 0, 10, 21, 33, 44, 72, 52, 0, 10, 20, 31, 41, 0, 0, 0, 10, 0, 10]

I can observe the the two are circular permutations of each other but fail to see why this is happening. Please help me understand what the difference is the inner working of each query. The first reply suits the intent of my quest which is to capture the visitors journey on the homepage.

答案1

得分: 2

如果需要确定性排序的数组元素,请在聚合函数中指定 order by 子句,如文档中所述。

一些聚合函数,如 array_agg(),根据输入值的顺序产生不同的结果。可以通过在聚合函数内部编写 ORDER BY 子句来指定排序。

SELECT visitor_id, array_agg(offset ORDER BY timestamp) AS offset_list 
-- ...
英文:

If you need your array elements deterministically ordered specify order by clause for aggregate function as mentioned in the docs

> Some aggregate functions such as array_agg() produce different results depending on the order of input values. This ordering can be specified by writing an ORDER BY Clause within the aggregate function

SELECT visitor_id ,array_agg(offset order by timestamp) as offset_list 
-- ...

huangapple
  • 本文由 发表于 2023年2月7日 03:56:47
  • 转载请务必保留本文链接:https://go.coder-hub.com/75365979.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定