2023年5月18日 03:28:01go评论88阅读模式

英文:

Calculating the average session length in PostgreSQL

问题

I am your Chinese translator, providing the translated content as per your request:

我需要一些帮助。我在PostgreSQL中有一个表，记录了用户会话的日志。
该表有3个字段：id、action和action_date，其中action_date为时间戳格式。
action字段有两个可能的值：“page_load”或“page_exit”。
action_date字段包含每个会话的开始时间和结束时间。
我需要计算访问网站的每个用户的平均会话时长。
问题在于，对于一些会话（可能是由于数据记录错误），action字段对于同一用户（id）可能包含不同数量的“page_load”或“page_exit”值。此外，同一用户在同一天内可能有多个会话。
还有一些情况下，两个连续的“page_load”动作之间有很短的时间差，然后是一个“page_exit”动作，表示第二个“page_load”会话的结束：

8658	page_load	2023-02-01 21:58:44.000
8658	page_load	2023-02-02 08:31:32.000
8658	page_exit	2017-01-28 08:42:18.000

由于每个用户的“page_load”或“page_exit”动作数量不同，我在进行计算时遇到了困难。
我尝试了根据会话类型将它们分成不同的数据集，然后执行连接，但由于具有不同数量“page_load”和“page_exit”动作的会话，这是不正确的。
我还尝试使用了lead()函数，但由于“page_load”或“page_exit”动作可以连续发生，所以得到了错误的计算结果。
我试图过滤掉“page_load”或“page_exit”动作数量不匹配的行，但最终数据发生了偏移，计算结果也是错误的。
有没有办法消除这些多余的行，只与匹配的配对进行计算呢？我已经没有主意了，希望社区能够提供可能的帮助。
以下是我试图消除“page_load”和“page_exit”不匹配的行的其中一种方法，以便获得两个我以后可以用来计算平均长度的单独列。但是，这个查询是不正确的，因为我得到了“page_load”和“page_exit”之间的显著差异，而这在原始数据中并不存在。

WITH ses_events AS (   
    SELECT id,
           action,
           action_date,
           ROW_NUMBER() OVER (PARTITION BY id, action ORDER BY action_date) AS event_seq   
      FROM users_act ), 
arr AS ( 
   SELECT id,   
          array_agg(CASE WHEN action = 'page_load' THEN action_date END) AS ses_start,   
          array_agg(CASE WHEN action = 'page_exit' THEN action_date END) AS ses_end 
     FROM ses_events  
 GROUP BY id, event_seq 
   HAVING COUNT(*) FILTER (WHERE action IN ('page_load', 'page_exit')) % 2 = 0 ), 
dr_arr AS ( 
    SELECT user_id,   
           unnest(array_remove(ses_start, NULL))::timestamp AS ses_start,   
           unnest(array_remove(ses_end, NULL))::timestamp AS ses_end FROM arr) 
SELECT * FROM dr_arr ORDER BY ses_start, ses_end;

另一种尝试使用偏移量进行计算，但由于缺少“page_load”或“page_exit”的行，它也返回错误的数据：

WITH ld AS (   SELECT id,
         action,
         action_date,
         LEAD(action_date) over(PARTITION BY id ORDER BY action_date) AS end_s
    FROM users_act ), l AS ( SELECT id,
       (end_s - action_date) AS length_ses   FROM ld  WHERE event = 'page_load' ) SELECT id,
       avg(length_ses) AS avg_us_ses   FROM l  GROUP BY id;

我尝试了几种不同的方法，但都没有得到正确的结果。我相信唯一能帮助解决这个问题的方法是将行旋转，并创建两个单独的字段，start_ses（时间戳）和end_ses（时间戳）。然而，我至今的尝试并没有产生正确的结果。我希望有人能提供指导或指引我找到这个问题的正确解决方案。

英文:

I need some help. I have a table in PostgreSQL with logs of user sessions.
The table has 3 fields: id, action, and action_date in the timestamp format.
The action field has 2 possible values: "page_load" or "page_exit".
The action_date field contains the start time and end time of each session, respectively.
I need to calculate the average session length for each user who visited the website.
The problem is that for some sessions (probably due to errors in data recording), the action field contains a different number of "page_load" or "page_exit" values for the same user (id). Additionally, there can be multiple sessions for the same user within one day.
There are also cases where two consecutive "page_load" actions occur with a small time difference, followed by a "page_exit" action, indicating the end of the second "page_load" session:

8658	page_load	2023-02-01 21:58:44.000
8658	page_load	2023-02-02 08:31:32.000
8658	page_exit	2017-01-28 08:42:18.000

Due to the varying number of "page_load" or "page_exit" actions for each user, I'm having trouble performing the calculation.
I've tried creating separate datasets by dividing them based on the session type and then performing a join, but it's incorrect due to sessions with different counts of "page_load" and "page_exit" actions.
I also attempted to use the lead() function, but I get incorrect calculations due to cases where "page_load" or "page_exit" actions can occur consecutively.
I've tried filtering out the rows where the count of "page_load" or "page_exit" actions doesn't match, but I end up with a shift in the data and incorrect calculations as well.
Is there any way to eliminate these extra rows and perform calculations only with matching pairs? I've run out of ideas, and I'm hoping for possible assistance from the community.
Here is one of the approaches I'm trying to eliminate rows where "page_load" and "page_exit" don't match, in order to obtain two separate columns that I can later use to calculate the average length. However, this query is incorrect because I'm getting a significant difference between "page_load" and "page_exit" that doesn't exist in the original data.

WITH ses_events AS (   
    SELECT id,
           action,
           action_date,
           ROW_NUMBER() OVER (PARTITION BY id, action ORDER BY action_date) AS event_seq   
      FROM users_act ), 
arr AS ( 
   SELECT id,   
          array_agg(CASE WHEN action = &#39;page_load&#39; THEN action_date END) AS ses_start,   
          array_agg(CASE WHEN action = &#39;page_exit&#39; THEN action_date END) AS ses_end 
     FROM ses_events  
 GROUP BY id, event_seq 
   HAVING COUNT(*) FILTER (WHERE action IN (&#39;page_load&#39;, &#39;page_exit&#39;)) % 2 = 0 ), 
dr_arr AS ( 
    SELECT user_id,   
           unnest(array_remove(ses_start, NULL))::timestamp AS ses_start,   
           unnest(array_remove(ses_end, NULL))::timestamp AS ses_end FROM arr) 
SELECT * FROM dr_arr ORDER BY ses_start, ses_end;

Another attempt to calculate using offsets, but it also returns incorrect data due to rows that lack either "page_load" or "page_exit":

WITH ld AS (   SELECT id,
         action,
         action_date,
         LEAD(action_date) over(PARTITION BY id ORDER BY action_date) AS end_s
    FROM users_act ), l AS ( SELECT id,
       (end_s - action_date) AS length_ses   FROM ld  WHERE event = &#39;page_load&#39; ) SELECT id,
       avg(length_ses) AS avg_us_ses   FROM l  GROUP BY id;

I have tried several different approaches, but none of them have yielded the correct result. I believe that the only thing that can help solve this problem is to pivot the rows and create two separate fields, start_ses (timestamp) and end_ses (timestamp). However, my attempts to write such a solution have not produced the correct result so far. I hope someone can provide guidance or point me towards the right solution for this problem.

答案1

得分: 0

选择你要如何处理“离群值”行并没有明确说明。一个简单的方法是使用`lead()`或`lag()`来识别“合法”行，即页面加载紧接着页面退出。然后，我们会忽略所有其他行，并计算这些成对行的平均值。
    选择 id, avg(action_date - lag_action_date) avg_session_duration
    从 (
        选择 e.*,
            lag(action)      在(id分区内按action_date顺序) 作为 lag_action,
            lag(action_date) 在(id分区内按action_date顺序) 作为 lag_action_date
        从 ses_events e
    ) e
    其中 action = 'page_logout' 且 lag_action = 'page_login'
    按 id 分组

英文:

You do not exactly tell how you want to handle "outlier" rows. A simple approach is to use lead() or lag() to identify "legitimate" rows, ie a page load immediately followed by a page exit. We would then ignore all other rows, and compute the average of those pairs.

select id, avg(action_date - lag_action_date) avg_session_duration
from (
    select e.*,
        lag(action)      over(partition by id order by action_date) lag_action,
        lag(action_date) over(partition by id order by action_date) lag_action_date
    from ses_events e
) e
where action = &#39;page_logout&#39; and lag_action = &#39;page_login&#39;
group by id

答案2

得分: 0

以下是翻译的内容：

通常情况下，您希望在这种情况下引入一个“清理视图”，其中包含详细（非聚合）数据，但提供一些额外信息以帮助对数据进行分类。

一个示例，将OK中的“终止”记录分类为前一记录是开启记录，并计算会话持续时间。

with session_view as (
select
  id, "action", action_date,
  lag(action_date) over (partition by id order by action_date) action_date_lag,
  lag("action") over (partition by id order by action_date) action_lag
from dt)
select 
  id, "action", action_date,
  case 
    when "action"  = 'page_load' then null /* 会话开始 */
    when "action" = 'page_exit' and action_lag = 'page_load' then 'OK' else '非法会话' end as session_status,
  case when "action" = 'page_exit' then action_date - action_date_lag end as session_duration
from session_view
order by id, action_date

您只会考虑具有“OK”状态的会话进行平均计算。

此外，您可以检查持续时间并丢弃异常值，使用某些过长或过短会话的阈值。

或者使用中位数，它对异常值的依赖性较小。

您可以使用EXTRACT(epoch来计算持续时间，以秒或其他单位为单位。

英文:

You typically want to instoduce a cleaning view in such case containing still the detailed (not aggregated) data, but providing some additional information helping to classify the data.

An example that classifies the terminating records in OK if a previous record is the opening record and calculates the session duration.

with session_view as (
select
  id, &quot;action&quot;, action_date,
  lag(action_date) over (partition by id order by action_date) action_date_lag,
  lag(&quot;action&quot;) over (partition by id order by action_date) action_lag
from dt)
select 
  id, &quot;action&quot;, action_date,
  case 
    when &quot;action&quot;  = &#39;page_load&#39; then null /* session start */
    when &quot;action&quot; = &#39;page_exit&#39; and action_lag = &#39;page_load&#39; then &#39;OK&#39; else &#39;illegal session&#39; end as session_status,
  case when &quot;action&quot; = &#39;page_exit&#39; then action_date - action_date_lag end as session_duration
from session_view
order by id, action_date
id  |action   |action_date        |session_status |session_duration|
----+---------+-------------------+---------------+----------------+
8658|page_exit|2017-01-28 08:42:18|illegal session|                |
8658|page_load|2023-02-01 21:58:44|               |                |
8658|page_exit|2023-02-01 22:58:44|OK             |        01:00:00|
8658|page_load|2023-02-02 08:31:32|               |                |
8658|page_exit|2023-02-02 08:41:32|OK             |        00:10:00|

You will consider only the sessions with the status OK in the average calculation.

Additionaly you may check the duration and discard outliers using soem threshold for too long or too short sessions.

Alternatively use median which is less error dependent on the outliers.

You may want to use EXTRACT(epoch to calculate the duration in seconds or other units.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在PostgreSQL中计算平均会话时长。

问题

答案1

答案2

我是新手学习PHP，正在尝试让提交按钮读取一个SQL表。

打包成JAR时出现问题（postgresql）

将字符串属性映射到JSONB

类未找到异常 Postgresql。已在 pom 中导入 maven 依赖。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。