Running sum of unique users in Redshift
Question
I have a table of user visits by day, as follows:
| date | user_id |
|:-------- |:-------- |
| 01/31/23 | a |
| 01/31/23 | a |
| 01/31/23 | b |
| 01/30/23 | c |
| 01/30/23 | a |
| 01/29/23 | c |
| 01/28/23 | d |
| 01/28/23 | e |
| 01/01/23 | a |
| 12/31/22 | c |
I want a running count of unique user_id values over the last 30 days. Here is the expected output:
| date | distinct_users |
|:-------- |:-------- |
| 01/31/23 | 5 |
| 01/30/23 | 4 |
.
.
.
Here is the query I tried:

    SELECT date
    , SUM(COUNT(DISTINCT user_id)) over (order by date rows between 30 preceding and current row) AS unique_users
    FROM mytable
    GROUP BY date
    ORDER BY date DESC

The problem I am running into is that this query does not count unique user_id values across the window. For instance, the result I get for 01/31/23 is 9 instead of 5, because user_id 'a' is counted again on every date it occurs.
Thank you, appreciate your help!
Answer 1
Score: 0
Not the most performant approach, but you could use a correlated subquery to find the distinct count of users over a window of the past 30 days:
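<!-- language: sql -->

    SELECT
        t1.date,
        -- Distinct users seen from 30 days before t1.date through t1.date.
        (SELECT COUNT(DISTINCT t2.user_id)
         FROM mytable t2
         WHERE t2.date BETWEEN t1.date - INTERVAL '30 day' AND t1.date) AS distinct_users
    FROM (SELECT DISTINCT date FROM mytable) t1  -- one output row per date
    ORDER BY t1.date;
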
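To sanity-check the correlated-subquery idea against the sample data from the question, here is a minimal sketch that runs an equivalent query in SQLite through Python's `sqlite3` module. This is not Redshift: the dates are stored as ISO strings and `date(..., '-30 days')` stands in for `- INTERVAL '30 day'`, so treat the syntax as an assumption that needs adapting. The `ROWS` list, the `QUERY` string, and the `rolling_distinct_users` helper are names made up for this illustration.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO strings.
ROWS = [
    ("2023-01-31", "a"), ("2023-01-31", "a"), ("2023-01-31", "b"),
    ("2023-01-30", "c"), ("2023-01-30", "a"),
    ("2023-01-29", "c"),
    ("2023-01-28", "d"), ("2023-01-28", "e"),
    ("2023-01-01", "a"),
    ("2022-12-31", "c"),
]

# SQLite flavour of the correlated subquery; one output row per date.
QUERY = """
SELECT d.date,
       (SELECT COUNT(DISTINCT t2.user_id)
        FROM mytable t2
        WHERE t2.date BETWEEN date(d.date, '-30 days') AND d.date) AS distinct_users
FROM (SELECT DISTINCT date FROM mytable) d
ORDER BY d.date DESC
"""


def rolling_distinct_users(rows):
    """Load rows into an in-memory table and return (date, distinct_users) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (date TEXT, user_id TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for day, users in rolling_distinct_users(ROWS):
        print(day, users)
```

On the sample data the first two rows match the expected output in the question (5 for 01/31/23 and 4 for 01/30/23).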
Answer 2
Score: 0
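There are a few things going on here. First, window functions run after GROUP BY and aggregation, so COUNT(DISTINCT user_id) produces the distinct count for each date on its own, and the window function then simply sums those per-date counts. A user who appears on several dates is therefore counted once per date. Second, a window frame set up like this works over the past 30 rows, not the past 30 days, so you will need to fill in the missing dates before a row-based frame lines up with calendar days.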
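To make both points concrete, here is a small pure-Python sketch (not SQL, and not anything Redshift runs) that mimics the two calculations on the sample data from the question. The names `VISITS`, `summed_daily_counts`, and `distinct_in_window` are invented for this illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows from the question as (visit date, user_id) pairs.
VISITS = [
    (date(2023, 1, 31), "a"), (date(2023, 1, 31), "a"), (date(2023, 1, 31), "b"),
    (date(2023, 1, 30), "c"), (date(2023, 1, 30), "a"),
    (date(2023, 1, 29), "c"),
    (date(2023, 1, 28), "d"), (date(2023, 1, 28), "e"),
    (date(2023, 1, 1), "a"),
    (date(2022, 12, 31), "c"),
]


def summed_daily_counts(visits, as_of, preceding_rows=30):
    """Mimic SUM(COUNT(DISTINCT user_id)) OVER (ROWS 30 PRECEDING):
    distinct users are counted per date first, then those counts are summed
    over the current row and up to `preceding_rows` earlier rows."""
    per_day = defaultdict(set)
    for day, user in visits:
        per_day[day].add(user)
    days = sorted(d for d in per_day if d <= as_of)
    frame = days[-(preceding_rows + 1):]  # rows, not calendar days
    return sum(len(per_day[d]) for d in frame)


def distinct_in_window(visits, as_of, days=30):
    """Count distinct users across the whole calendar window in one pass."""
    start = as_of - timedelta(days=days)
    return len({user for day, user in visits if start <= day <= as_of})


if __name__ == "__main__":
    target = date(2023, 1, 31)
    print(summed_daily_counts(VISITS, target))  # per-date counts added up
    print(distinct_in_window(VISITS, target))   # true distinct users
```

For 01/31/23 the first function returns 9 and the second returns 5, matching the numbers in the question. Note that the row-based frame also reaches back to 12/31/22, which lies outside the 30-day calendar window.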
As to how to do this: I can only think of the "expand the data so each date and user_id has a row" method. This requires a CTE that generates the last 2 years of dates plus 30 days, so that the look-back window works for the earliest dates. Then, for each user_id and date, window over the past 30 days to see whether that user_id appears anywhere in the window, setting the value to NULL if it does not. Finally, count the non-NULL user_ids grouped by date alone to get the number of unique user_ids for that date.
This means expanding the data significantly, but I see no other way to get truly unique user_ids over the past 30 days. I can help code this up if you need, but it will look something like:
WITH RECURSIVE CTE to generate the needed dates,
CTE to cross join these dates with the distinct set of all user_ids in use over the past 2 years,
CTE to join the date/user_id data set with the table of real data for the past 2 years and 30 days, then window back counting non-NULL user_ids (partition by user_id, order by date) and set any zero counts to NULL with a DECODE() or CASE expression,
SELECT that groups by date alone and counts the user_ids for each date;
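The outline above can be sketched roughly as follows. This is an illustrative version that runs in SQLite (3.25 or later, for window functions) through Python's `sqlite3` module, not tested Redshift SQL: the date arithmetic, the CTE names (`calendar`, `users`, `visits`, `grid`), and the helper `expanded_distinct_users` are all assumptions made for this example. It generates dates only between the first and last visit in the sample data rather than a fixed 2-year range.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO strings.
ROWS = [
    ("2023-01-31", "a"), ("2023-01-31", "a"), ("2023-01-31", "b"),
    ("2023-01-30", "c"), ("2023-01-30", "a"),
    ("2023-01-29", "c"),
    ("2023-01-28", "d"), ("2023-01-28", "e"),
    ("2023-01-01", "a"),
    ("2022-12-31", "c"),
]

EXPAND_QUERY = """
WITH RECURSIVE calendar(day) AS (
    -- Every calendar day between the first and last visit.
    SELECT MIN(date) FROM mytable
    UNION ALL
    SELECT date(day, '+1 day') FROM calendar
    WHERE day < (SELECT MAX(date) FROM mytable)
),
users AS (
    SELECT DISTINCT user_id FROM mytable
),
visits AS (
    -- At most one row per (date, user_id), so one grid cell is one row.
    SELECT DISTINCT date, user_id FROM mytable
),
grid AS (
    -- One row per day and user; hits = visits by that user in the
    -- current day plus the 30 preceding days (rows now equal days).
    SELECT c.day, u.user_id,
           SUM(CASE WHEN v.user_id IS NULL THEN 0 ELSE 1 END) OVER (
               PARTITION BY u.user_id
               ORDER BY c.day
               ROWS BETWEEN 30 PRECEDING AND CURRENT ROW
           ) AS hits
    FROM calendar c
    CROSS JOIN users u
    LEFT JOIN visits v ON v.date = c.day AND v.user_id = u.user_id
)
-- CASE yields NULL for users with no hits, and COUNT skips NULLs.
SELECT day, COUNT(CASE WHEN hits > 0 THEN user_id END) AS distinct_users
FROM grid
GROUP BY day
ORDER BY day DESC
"""


def expanded_distinct_users(rows):
    """Run the expand-then-window query and return (day, distinct_users) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (date TEXT, user_id TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
    result = conn.execute(EXPAND_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for day, users in expanded_distinct_users(ROWS)[:2]:
        print(day, users)
```

Unlike the original table, the result has a row for every calendar day in the range, including days with no visits. The grid holds one row per day per user, which is the data expansion the answer warns about.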