英文:
Grouping data based on a value in Athena
问题
我正在尝试根据登录页面名称创建新的ID。每当遇到登录页面名称时,它将被视为一个会话,直到找到下一个登录页面名称为止。新的ID将基于该ID的会话计数生成。
示例数据集:
id | page_name | visited_time |
---|---|---|
ABC-123 | logon | 2023-02-23 04:04:40.000 |
ABC-123 | smscode | 2023-02-23 04:20:40.000 |
ABC-123 | acct balance | 2023-02-23 04:21:40.000 |
ABC-123 | logon | 2023-02-23 04:54:40.000 |
ABC-123 | transfer | 2023-02-23 04:54:40.000 |
CDE-123 | logon | 2023-02-23 04:58:40.000 |
使用的SQL:
with session_cnt
as
(
select id, page_name, visited_time,
row_number() over (partition by id, page_name order by visited_time asc) as session_count,
visited_time
from table
)
select id, page_name, visited_time, session_count, array_join(array_agg(session_count), ''-'', ''null)) as new_id
from session_cnt
上述SQL似乎根据其他页面名称返回新的ID。我希望根据ID和page_name=logon来分组会话计数。
期望的结果:
id | page_name | visited_time | session_count | new_id |
---|---|---|---|---|
ABC-123 | logon | 2023-02-23 04:04:40.000 | 1 | ABC-123-1 |
ABC-123 | smscode | 2023-02-23 04:20:40.000 | 1 | ABC-123-1 |
ABC-123 | acct balance | 2023-02-23 04:21:40.000 | 1 | ABC-123-1 |
ABC-123 | logon | 2023-02-23 04:54:40.000 | 2 | ABC-123-2 |
ABC-123 | transfer | 2023-02-23 04:54:40.000 | 2 | ABC-123-2 |
CDE-123 | logon | 2023-02-23 04:58:40.000 | 1 | CDE-123-1 |
请告诉我如何在Athena中实现这个目标。
英文:
I am trying to create new id based on logon page name. Whenever logon page_name is encountered, it will be considered as one session until it finds next logon page name. The new id will be generated based on session count for that id.
Sample data set:
id | page_name | visited_time |
---|---|---|
ABC-123 | logon | 2023-02-23 04:04:40.000 |
ABC-123 | smscode | 2023-02-23 04:20:40.000 |
ABC-123 | acct balance | 2023-02-23 04:21:40.000 |
ABC-123 | logon | 2023-02-23 04:54:40.000 |
ABC-123 | transfer | 2023-02-23 04:54:40.000 |
CDE-123 | logon | 2023-02-23 04:58:40.000 |
SQL used:
with session_cnt
as
(
select id, page_name, visited_time,
row_number() over (partition by id, page_name order by visited_time asc) as session_count,
visited_time
from table
)
select id, page_name, visited_time, session_count, array_join(array_agg(session_count), '-', 'null)) as new_id
from session_cnt
Above SQL seems to return new id based on other page names too. I would like to group the session count based on id and page_name=logon.
Expected result:
id | page_name | visited_time | session_count | new_id |
---|---|---|---|---|
ABC-123 | logon | 2023-02-23 04:04:40.000 | 1 | ABC-123-1 |
ABC-123 | smscode | 2023-02-23 04:20:40.000 | 1 | ABC-123-1 |
ABC-123 | acct balance | 2023-02-23 04:21:40.000 | 1 | ABC-123-1 |
ABC-123 | logon | 2023-02-23 04:54:40.000 | 2 | ABC-123-2 |
ABC-123 | transfer | 2023-02-23 04:54:40.000 | 2 | ABC-123-2 |
CDE-123 | logon | 2023-02-23 04:58:40.000 | 1 | CDE-123-1 |
Please advise how I can achieve this in Athena
答案1
得分: 1
使用累积求和(使用窗口函数)在page_name
等于'logon'
时:
-- 示例数据
with dataset (id, page_name, visited_time) as (
values ('ABC-123', 'logon', '2023-02-23 04:04:40.000'),
('ABC-123', 'smscode', '2023-02-23 04:20:40.000'),
('ABC-123', 'acct balance', '2023-02-23 04:21:40.000'),
('ABC-123', 'logon', '2023-02-23 04:54:40.000'),
('ABC-123', 'transfer', '2023-02-23 04:54:40.000'),
('CDE-123', 'logon', '2023-02-23 04:58:40.000')
)
-- 查询
select *,
sum(if(page_name = 'logon', 1)) over(partition by id order by visited_time)
as session_id
from dataset;
输出:
id | page_name | visited_time | session_id |
---|---|---|---|
CDE-123 | logon | 2023-02-23 04:58:40.000 | 1 |
ABC-123 | logon | 2023-02-23 04:04:40.000 | 1 |
ABC-123 | smscode | 2023-02-23 04:20:40.000 | 1 |
ABC-123 | acct balance | 2023-02-23 04:21:40.000 | 1 |
ABC-123 | logon | 2023-02-23 04:54:40.000 | 2 |
ABC-123 | transfer | 2023-02-23 04:54:40.000 | 2 |
英文:
Use cumulative sum (using window functions) over page_name
equal to 'logon'
:
-- sample data
with dataset (id, page_name, visited_time) as (
values ('ABC-123', 'logon', '2023-02-23 04:04:40.000'),
('ABC-123', 'smscode', '2023-02-23 04:20:40.000'),
('ABC-123', 'acct balance', '2023-02-23 04:21:40.000'),
('ABC-123', 'logon', '2023-02-23 04:54:40.000'),
('ABC-123', 'transfer', '2023-02-23 04:54:40.000'),
('CDE-123', 'logon', '2023-02-23 04:58:40.000')
)
-- query
select *,
sum(if(page_name = 'logon', 1)) over(partition by id order by visited_time)
as session_id
from dataset;
Output:
id | page_name | visited_time | session_id |
---|---|---|---|
CDE-123 | logon | 2023-02-23 04:58:40.000 | 1 |
ABC-123 | logon | 2023-02-23 04:04:40.000 | 1 |
ABC-123 | smscode | 2023-02-23 04:20:40.000 | 1 |
ABC-123 | acct balance | 2023-02-23 04:21:40.000 | 1 |
ABC-123 | logon | 2023-02-23 04:54:40.000 | 2 |
ABC-123 | transfer | 2023-02-23 04:54:40.000 | 2 |
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论