基于 Athena 中的数值进行数据分组

huangapple go评论70阅读模式
英文:

Grouping data based on a value in Athena

问题

我正在尝试根据登录页面名称创建新的ID。每当遇到登录页面名称时,它将被视为一个会话,直到找到下一个登录页面名称为止。新的ID将基于该ID的会话计数生成。

示例数据集:

id page_name visited_time
ABC-123 logon 2023-02-23 04:04:40.000
ABC-123 smscode 2023-02-23 04:20:40.000
ABC-123 acct balance 2023-02-23 04:21:40.000
ABC-123 logon 2023-02-23 04:54:40.000
ABC-123 transfer 2023-02-23 04:54:40.000
CDE-123 logon 2023-02-23 04:58:40.000

使用的SQL:

with session_cnt
as
(
select id, page_name, visited_time, 
row_number() over (partition by id, page_name order by visited_time asc) as session_count, 
visited_time
from table
)
select id, page_name, visited_time, session_count, array_join(array_agg(session_count), ''-'', ''null)) as new_id
from session_cnt

上述SQL似乎根据其他页面名称返回新的ID。我希望根据ID和page_name=logon来分组会话计数。

期望的结果:

id page_name visited_time session_count new_id
ABC-123 logon 2023-02-23 04:04:40.000 1 ABC-123-1
ABC-123 smscode 2023-02-23 04:20:40.000 1 ABC-123-1
ABC-123 acct balance 2023-02-23 04:21:40.000 1 ABC-123-1
ABC-123 logon 2023-02-23 04:54:40.000 2 ABC-123-2
ABC-123 transfer 2023-02-23 04:54:40.000 2 ABC-123-2
CDE-123 logon 2023-02-23 04:58:40.000 1 CDE-123-1

请告诉我如何在Athena中实现这个目标。

英文:

I am trying to create new id based on logon page name. Whenever logon page_name is encountered, it will be considered as one session until it finds next logon page name. The new id will be generated based on session count for that id.

Sample data set:

id page_name visited_time
ABC-123 logon 2023-02-23 04:04:40.000
ABC-123 smscode 2023-02-23 04:20:40.000
ABC-123 acct balance 2023-02-23 04:21:40.000
ABC-123 logon 2023-02-23 04:54:40.000
ABC-123 transfer 2023-02-23 04:54:40.000
CDE-123 logon 2023-02-23 04:58:40.000

SQL used:

with session_cnt
as
(
select id, page_name, visited_time, 
row_number() over (partition by id, page_name order by visited_time asc) as session_count, 
visited_time
from table
)
select id, page_name, visited_time, session_count, array_join(array_agg(session_count), '-', 'null)) as new_id
from session_cnt

Above SQL seems to return new id based on other page names too. I would like to group the session count based on id and page_name=logon.

Expected result:

id page_name visited_time session_count new_id
ABC-123 logon 2023-02-23 04:04:40.000 1 ABC-123-1
ABC-123 smscode 2023-02-23 04:20:40.000 1 ABC-123-1
ABC-123 acct balance 2023-02-23 04:21:40.000 1 ABC-123-1
ABC-123 logon 2023-02-23 04:54:40.000 2 ABC-123-2
ABC-123 transfer 2023-02-23 04:54:40.000 2 ABC-123-2
CDE-123 logon 2023-02-23 04:58:40.000 1 CDE-123-1

Please advise how I can achieve this in Athena

答案1

得分: 1

使用累积求和(使用窗口函数)在page_name等于'logon'时:

-- 示例数据
with dataset (id, page_name, visited_time) as (
    values ('ABC-123', 'logon', '2023-02-23 04:04:40.000'),
    ('ABC-123', 'smscode', '2023-02-23 04:20:40.000'),
    ('ABC-123', 'acct balance', '2023-02-23 04:21:40.000'),
    ('ABC-123', 'logon', '2023-02-23 04:54:40.000'),
    ('ABC-123', 'transfer', '2023-02-23 04:54:40.000'),
    ('CDE-123', 'logon', '2023-02-23 04:58:40.000')
)

-- 查询
select *,
  sum(if(page_name = 'logon', 1)) over(partition by id order by visited_time) 
      as session_id 
from dataset;

输出:

id page_name visited_time session_id
CDE-123 logon 2023-02-23 04:58:40.000 1
ABC-123 logon 2023-02-23 04:04:40.000 1
ABC-123 smscode 2023-02-23 04:20:40.000 1
ABC-123 acct balance 2023-02-23 04:21:40.000 1
ABC-123 logon 2023-02-23 04:54:40.000 2
ABC-123 transfer 2023-02-23 04:54:40.000 2
英文:

Use cumulative sum (using window functions) over page_name equal to 'logon':

-- sample data
with dataset (id, page_name, visited_time) as (
    values ('ABC-123',	'logon',	'2023-02-23 04:04:40.000'),
    ('ABC-123',	'smscode',	'2023-02-23 04:20:40.000'),
    ('ABC-123',	'acct balance',	'2023-02-23 04:21:40.000'),
    ('ABC-123',	'logon',	'2023-02-23 04:54:40.000'),
    ('ABC-123',	'transfer',	'2023-02-23 04:54:40.000'),
    ('CDE-123',	'logon',	'2023-02-23 04:58:40.000')
)

-- query
select *,
  sum(if(page_name = 'logon', 1)) over(partition by id order by visited_time) 
      as session_id 
from dataset;

Output:

id page_name visited_time session_id
CDE-123 logon 2023-02-23 04:58:40.000 1
ABC-123 logon 2023-02-23 04:04:40.000 1
ABC-123 smscode 2023-02-23 04:20:40.000 1
ABC-123 acct balance 2023-02-23 04:21:40.000 1
ABC-123 logon 2023-02-23 04:54:40.000 2
ABC-123 transfer 2023-02-23 04:54:40.000 2

huangapple
  • 本文由 发表于 2023年2月24日 03:15:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/75549370.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定