在PostgreSQL中计算平均会话时长。

huangapple go评论63阅读模式
英文:

Calculating the average session length in PostgreSQL

问题

I am your Chinese translator, providing the translated content as per your request:

我需要一些帮助。我在PostgreSQL中有一个表,记录了用户会话的日志。
该表有3个字段:id、action和action_date,其中action_date为时间戳格式。
action字段有两个可能的值:“page_load”或“page_exit”。
action_date字段包含每个会话的开始时间和结束时间。
我需要计算访问网站的每个用户的平均会话时长。
问题在于,对于一些会话(可能是由于数据记录错误),action字段对于同一用户(id)可能包含不同数量的“page_load”或“page_exit”值。此外,同一用户在同一天内可能有多个会话。
还有一些情况下,两个连续的“page_load”动作之间有很短的时间差,然后是一个“page_exit”动作,表示第二个“page_load”会话的结束:

8658	page_load	2023-02-01 21:58:44.000
8658	page_load	2023-02-02 08:31:32.000
8658	page_exit	2017-01-28 08:42:18.000

由于每个用户的“page_load”或“page_exit”动作数量不同,我在进行计算时遇到了困难。
我尝试了根据会话类型将它们分成不同的数据集,然后执行连接,但由于具有不同数量“page_load”和“page_exit”动作的会话,这是不正确的。
我还尝试使用了lead()函数,但由于“page_load”或“page_exit”动作可以连续发生,所以得到了错误的计算结果。
我试图过滤掉“page_load”或“page_exit”动作数量不匹配的行,但最终数据发生了偏移,计算结果也是错误的。
有没有办法消除这些多余的行,只与匹配的配对进行计算呢?我已经没有主意了,希望社区能够提供可能的帮助。
以下是我试图消除“page_load”和“page_exit”不匹配的行的其中一种方法,以便获得两个我以后可以用来计算平均长度的单独列。但是,这个查询是不正确的,因为我得到了“page_load”和“page_exit”之间的显著差异,而这在原始数据中并不存在。

WITH ses_events AS (   
    SELECT id,
           action,
           action_date,
           ROW_NUMBER() OVER (PARTITION BY id, action ORDER BY action_date) AS event_seq   
      FROM users_act ), 
arr AS ( 
   SELECT id,   
          array_agg(CASE WHEN action = 'page_load' THEN action_date END) AS ses_start,   
          array_agg(CASE WHEN action = 'page_exit' THEN action_date END) AS ses_end 
     FROM ses_events  
 GROUP BY id, event_seq 
   HAVING COUNT(*) FILTER (WHERE action IN ('page_load', 'page_exit')) % 2 = 0 ), 
dr_arr AS ( 
    SELECT user_id,   
           unnest(array_remove(ses_start, NULL))::timestamp AS ses_start,   
           unnest(array_remove(ses_end, NULL))::timestamp AS ses_end FROM arr) 
SELECT * FROM dr_arr ORDER BY ses_start, ses_end;

另一种尝试使用偏移量进行计算,但由于缺少“page_load”或“page_exit”的行,它也返回错误的数据:

WITH ld AS (   SELECT id,
         action,
         action_date,
         LEAD(action_date) over(PARTITION BY id ORDER BY action_date) AS end_s
    FROM users_act ), l AS ( SELECT id,
       (end_s - action_date) AS length_ses   FROM ld  WHERE event = 'page_load' ) SELECT id,
       avg(length_ses) AS avg_us_ses   FROM l  GROUP BY id;

我尝试了几种不同的方法,但都没有得到正确的结果。我相信唯一能帮助解决这个问题的方法是将行旋转,并创建两个单独的字段,start_ses(时间戳)和end_ses(时间戳)。然而,我至今的尝试并没有产生正确的结果。我希望有人能提供指导或指引我找到这个问题的正确解决方案。

英文:

I need some help. I have a table in PostgreSQL with logs of user sessions.
The table has 3 fields: id, action, and action_date in the timestamp format.
The action field has 2 possible values: "page_load" or "page_exit".
The action_date field contains the start time and end time of each session, respectively.
I need to calculate the average session length for each user who visited the website.
The problem is that for some sessions (probably due to errors in data recording), the action field contains a different number of "page_load" or "page_exit" values for the same user (id). Additionally, there can be multiple sessions for the same user within one day.
There are also cases where two consecutive "page_load" actions occur with a small time difference, followed by a "page_exit" action, indicating the end of the second "page_load" session:

8658	page_load	2023-02-01 21:58:44.000
8658	page_load	2023-02-02 08:31:32.000
8658	page_exit	2017-01-28 08:42:18.000

Due to the varying number of "page_load" or "page_exit" actions for each user, I'm having trouble performing the calculation.
I've tried creating separate datasets by dividing them based on the session type and then performing a join, but it's incorrect due to sessions with different counts of "page_load" and "page_exit" actions.
I also attempted to use the lead() function, but I get incorrect calculations due to cases where "page_load" or "page_exit" actions can occur consecutively.
I've tried filtering out the rows where the count of "page_load" or "page_exit" actions doesn't match, but I end up with a shift in the data and incorrect calculations as well.
Is there any way to eliminate these extra rows and perform calculations only with matching pairs? I've run out of ideas, and I'm hoping for possible assistance from the community.
Here is one of the approaches I'm trying to eliminate rows where "page_load" and "page_exit" don't match, in order to obtain two separate columns that I can later use to calculate the average length. However, this query is incorrect because I'm getting a significant difference between "page_load" and "page_exit" that doesn't exist in the original data.

WITH ses_events AS (   
    SELECT id,
           action,
           action_date,
           ROW_NUMBER() OVER (PARTITION BY id, action ORDER BY action_date) AS event_seq   
      FROM users_act ), 
arr AS ( 
   SELECT id,   
          array_agg(CASE WHEN action = 'page_load' THEN action_date END) AS ses_start,   
          array_agg(CASE WHEN action = 'page_exit' THEN action_date END) AS ses_end 
     FROM ses_events  
 GROUP BY id, event_seq 
   HAVING COUNT(*) FILTER (WHERE action IN ('page_load', 'page_exit')) % 2 = 0 ), 
dr_arr AS ( 
    SELECT user_id,   
           unnest(array_remove(ses_start, NULL))::timestamp AS ses_start,   
           unnest(array_remove(ses_end, NULL))::timestamp AS ses_end FROM arr) 
SELECT * FROM dr_arr ORDER BY ses_start, ses_end;

Another attempt to calculate using offsets, but it also returns incorrect data due to rows that lack either "page_load" or "page_exit":

WITH ld AS (   SELECT id,
         action,
         action_date,
         LEAD(action_date) over(PARTITION BY id ORDER BY action_date) AS end_s
    FROM users_act ), l AS ( SELECT id,
       (end_s - action_date) AS length_ses   FROM ld  WHERE event = 'page_load' ) SELECT id,
       avg(length_ses) AS avg_us_ses   FROM l  GROUP BY id;

I have tried several different approaches, but none of them have yielded the correct result. I believe that the only thing that can help solve this problem is to pivot the rows and create two separate fields, start_ses (timestamp) and end_ses (timestamp). However, my attempts to write such a solution have not produced the correct result so far. I hope someone can provide guidance or point me towards the right solution for this problem.

答案1

得分: 0

选择你要如何处理“离群值”行并没有明确说明。一个简单的方法是使用`lead()``lag()`来识别“合法”行,即页面加载紧接着页面退出。然后,我们会忽略所有其他行,并计算这些成对行的平均值。


    选择 id, avg(action_date - lag_action_date) avg_session_duration
     (
        选择 e.*,
            lag(action)      (id分区内按action_date顺序) 作为 lag_action,
            lag(action_date) (id分区内按action_date顺序) 作为 lag_action_date
         ses_events e
    ) e
    其中 action = 'page_logout'  lag_action = 'page_login'
     id 分组
英文:

You do not exactly tell how you want to handle "outlier" rows. A simple approach is to use lead() or lag() to identify "legitimate" rows, ie a page load immediately followed by a page exit. We would then ignore all other rows, and compute the average of those pairs.

select id, avg(action_date - lag_action_date) avg_session_duration
from (
    select e.*,
        lag(action)      over(partition by id order by action_date) lag_action,
        lag(action_date) over(partition by id order by action_date) lag_action_date
    from ses_events e
) e
where action = 'page_logout' and lag_action = 'page_login'
group by id

答案2

得分: 0

以下是翻译的内容:

通常情况下,您希望在这种情况下引入一个“清理视图”,其中包含详细(非聚合)数据,但提供一些额外信息以帮助对数据进行分类。

一个示例,将OK中的“终止”记录分类为前一记录是开启记录,并计算会话持续时间。

with session_view as (
select
  id, "action", action_date,
  lag(action_date) over (partition by id order by action_date) action_date_lag,
  lag("action") over (partition by id order by action_date) action_lag
from dt)
select 
  id, "action", action_date,
  case 
    when "action"  = 'page_load' then null /* 会话开始 */
    when "action" = 'page_exit' and action_lag = 'page_load' then 'OK' else '非法会话' end as session_status,
  case when "action" = 'page_exit' then action_date - action_date_lag end as session_duration
from session_view
order by id, action_date

您只会考虑具有“OK”状态的会话进行平均计算。

此外,您可以检查持续时间并丢弃异常值,使用某些过长或过短会话的阈值。

或者使用中位数,它对异常值的依赖性较小。

您可以使用EXTRACT(epoch来计算持续时间,以秒或其他单位为单位。

英文:

You typically want to instoduce a cleaning view in such case containing still the detailed (not aggregated) data, but providing some additional information helping to classify the data.

An example that classifies the terminating records in OK if a previous record is the opening record and calculates the session duration.

with session_view as (
select
  id, "action", action_date,
  lag(action_date) over (partition by id order by action_date) action_date_lag,
  lag("action") over (partition by id order by action_date) action_lag
from dt)
select 
  id, "action", action_date,
  case 
    when "action"  = 'page_load' then null /* session start */
    when "action" = 'page_exit' and action_lag = 'page_load' then 'OK' else 'illegal session' end as session_status,
  case when "action" = 'page_exit' then action_date - action_date_lag end as session_duration
from session_view
order by id, action_date

id  |action   |action_date        |session_status |session_duration|
----+---------+-------------------+---------------+----------------+
8658|page_exit|2017-01-28 08:42:18|illegal session|                |
8658|page_load|2023-02-01 21:58:44|               |                |
8658|page_exit|2023-02-01 22:58:44|OK             |        01:00:00|
8658|page_load|2023-02-02 08:31:32|               |                |
8658|page_exit|2023-02-02 08:41:32|OK             |        00:10:00|

You will consider only the sessions with the status OK in the average calculation.

Additionaly you may check the duration and discard outliers using soem threshold for too long or too short sessions.

Alternatively use median which is less error dependent on the outliers.

You may want to use EXTRACT(epoch to calculate the duration in seconds or other units.

huangapple
  • 本文由 发表于 2023年5月18日 03:28:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/76275596.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定