SQL running total group by id and limited by a condition (over window)

Question

  [1]: http://sqlfiddle.com/#!9/28cc4f/13

I have the following data in an SQL table. See the [SQL fiddle][1] to play around with it.

|             id | pointsEarned |                createdAt |
|----------------|--------------|--------------------------|
| 234-00000206-0 |          500 | 2023-05-03T09:05:05.034Z |
| 234-00000206-0 |         1000 | 2023-05-12T09:05:05.034Z |
| 234-00000206-0 |          800 | 2023-05-15T09:05:05.034Z |
| 234-00000206-0 |          300 | 2023-05-21T09:05:05.034Z |
| 234-00000206-0 |         1100 | 2023-05-28T09:05:05.034Z |

The data is created as follows.

    CREATE TABLE activities (
      id varchar(14),
      pointsEarned int,
      createdAt varchar(24)
    );

    INSERT INTO activities (id, pointsEarned, createdAt)
    VALUES ('234-00000206-0', 500, '2023-05-03T09:05:05.034Z');

    INSERT INTO activities (id, pointsEarned, createdAt)
    VALUES ('234-00000206-0', 1000, '2023-05-12T09:05:05.034Z');

    INSERT INTO activities (id, pointsEarned, createdAt)
    VALUES ('234-00000206-0', 800, '2023-05-15T09:05:05.034Z');

    INSERT INTO activities (id, pointsEarned, createdAt)
    VALUES ('234-00000206-0', 300, '2023-05-21T09:05:05.034Z');

    INSERT INTO activities (id, pointsEarned, createdAt)
    VALUES ('234-00000206-0', 1100, '2023-05-28T09:05:05.034Z');

From this data, I'm looking for the IDs that have passed 1,000 points and the date on which that happened. In the sample data, this would be May 12th, in the second row.

I tried this SQL statement, but the result is of course May 28th.

    SELECT
      id,
      SUM(pointsEarned) as points,
      MAX(createdAt) as lastActivity
    FROM
      activities
    GROUP BY
      id
    HAVING
      points > 1000;

Then I thought about using window functions and tried the following, but I get an error saying the statement is wrong. Any idea how to fix the error (see the SQL fiddle link above) and how to limit the query to 1,000 points?

    SELECT
      id,
      SUM(pointsEarned) OVER(ORDER BY createdAt) points
    FROM activities;

PS: The SQL fiddle uses MySQL. In the end, though, I need the statement for Google BigQuery.

Answer 1

Score: 3

Your code doesn't work in MySQL 5.6 because window functions were not yet available in that version (they arrived in MySQL 8.0). Furthermore, you shouldn't test on MySQL if you need a Google BigQuery solution, because they are two very different products with different feature sets.

Given that, BigQuery offers the QUALIFY clause, which lets you filter on the result of window functions. You can use QUALIFY twice to check:

  • when the running sum of points is higher than 1000
  • when the date corresponds to the first date of your filtered running sum

by partitioning on "id" (assuming you can have multiple ids) and ordering by your date.

    WITH cte AS (
        -- Keep only the rows whose per-id running total already exceeds 1000
        SELECT id,
               SUM(pointsEarned) OVER(PARTITION BY id ORDER BY createdAt) AS runningPoints,
               createdAt
        FROM activities
        QUALIFY runningPoints > 1000
    )
    -- Of those rows, keep the earliest one for each id
    SELECT id, runningPoints, createdAt
    FROM cte
    QUALIFY createdAt = MIN(createdAt) OVER(PARTITION BY id ORDER BY runningPoints)

This answer was tested on BigQuery and gave the following output:

|             id | runningPoints |                createdAt |
|----------------|---------------|--------------------------|
| 234-00000206-0 |          1500 | 2023-05-12T09:05:05.034Z |
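
Both window clauses order by createdAt, which the question's table stores as varchar(24) rather than as a timestamp. That should be safe here because every value uses the same fixed-width ISO 8601 UTC format, so string order matches chronological order. A small check of that assumption on the sample values, written in Python purely for illustration (`parse_utc` is a hypothetical helper name, not part of any answer):

```python
from datetime import datetime

# createdAt values from the question, deliberately out of order
created_at = [
    "2023-05-21T09:05:05.034Z",
    "2023-05-03T09:05:05.034Z",
    "2023-05-28T09:05:05.034Z",
    "2023-05-12T09:05:05.034Z",
    "2023-05-15T09:05:05.034Z",
]


def parse_utc(ts: str) -> datetime:
    """Parse the fixed-width 'YYYY-MM-DDTHH:MM:SS.mmmZ' format used in the table."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")


by_string = sorted(created_at)               # what ORDER BY does on a varchar column
by_time = sorted(created_at, key=parse_utc)  # true chronological order

print(by_string == by_time)
```

If the column ever mixed time zone offsets or precisions, the two orders could diverge, and converting to a real timestamp type before ordering would be the safer route.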

Answer 2

Score: 2

This can be accomplished by first using the window function SUM() with an ORDER BY clause to obtain the running total, then using ROW_NUMBER() to number the rows above the threshold in order of that running total, and finally selecting only the row with rn = 1:

    with cte as (
      -- running total per id, in date order
      SELECT *, SUM(pointsEarned) over (partition by id order by createdAt) points
      FROM activities
    ),
    cte2 as (
      -- number the rows above the threshold, lowest running total first
      select *, row_number() over (partition by id order by points) as rn
      from cte
      where points > 1000
    )
    select id, createdAt
    from cte2
    where rn = 1

To get the data for a specific id:

    with cte as (
      SELECT *, SUM(pointsEarned) over (order by createdAt) points
      FROM activities
      where id = '234-00000206-0'
    ),
    cte2 as (
      select *, row_number() over (order by points) as rn
      from cte
      where points > 1000
    )
    select id, createdAt
    from cte2
    where rn = 1


Answer 3

Score: 2

You can express this in a single pass, with a single window function call. In BigQuery:

    select a.*,
        sum(pointsEarned) over(partition by id order by createdAt) as runningPoints
    from activities a
    -- keep rows where this activity lifts the running total from below 1000 to 1000 or more
    qualify runningPoints >= 1000 and runningPoints - pointsEarned < 1000

The qualify clause identifies the row where the running sum reaches the threshold.

Note: this works as long as you don't have negative pointsEarned values, which is the case in your data (otherwise the query would return one row for each occurrence where the threshold is reached again).
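
The caveat about negative values can be made concrete with a small sketch. It is not the BigQuery statement itself: it assumes Python's built-in sqlite3 module as a stand-in engine (SQLite 3.25 or newer is needed for window functions), and because SQLite has no QUALIFY clause, the same condition is moved into an outer WHERE. The `crossings` helper and the 800-point deduction are hypothetical, added only for illustration.

```python
import sqlite3

# Same condition as the QUALIFY clause, rewritten as a subquery plus WHERE
# because SQLite (an assumed stand-in for BigQuery) has no QUALIFY.
CROSSING_QUERY = """
SELECT id, createdAt, runningPoints
FROM (
    SELECT id, pointsEarned, createdAt,
           SUM(pointsEarned) OVER (PARTITION BY id ORDER BY createdAt) AS runningPoints
    FROM activities
) AS t
WHERE runningPoints >= 1000 AND runningPoints - pointsEarned < 1000
ORDER BY createdAt
"""


def crossings(rows):
    """Return the rows on which the running total moves from below 1000 to 1000 or more."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE activities (id TEXT, pointsEarned INTEGER, createdAt TEXT)")
    conn.executemany("INSERT INTO activities VALUES (?, ?, ?)", rows)
    result = conn.execute(CROSSING_QUERY).fetchall()
    conn.close()
    return result


# Sample data from the question
sample = [
    ("234-00000206-0", 500, "2023-05-03T09:05:05.034Z"),
    ("234-00000206-0", 1000, "2023-05-12T09:05:05.034Z"),
    ("234-00000206-0", 800, "2023-05-15T09:05:05.034Z"),
    ("234-00000206-0", 300, "2023-05-21T09:05:05.034Z"),
    ("234-00000206-0", 1100, "2023-05-28T09:05:05.034Z"),
]

# Hypothetical variant: the third activity deducts 800 points instead of adding them
with_deduction = list(sample)
with_deduction[2] = ("234-00000206-0", -800, "2023-05-15T09:05:05.034Z")

print(crossings(sample))
print(crossings(with_deduction))
```

With the original sample, only the May 12 row (running total 1500) is returned. In the variant, the running total goes 500, 1500, 700, 1000, 2100, so both the May 12 and the May 21 rows satisfy the condition. If only the first crossing is wanted in that situation, an additional per-id ranking by createdAt would be needed.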

Answer 4

Score: 1

You more or less answered this yourself: your SQL is valid (maybe you used an old version of MySQL in your fiddle).

    SELECT
      id,
      SUM(pointsEarned) OVER(ORDER BY createdAt) points
    FROM activities;

which yields

Result (MySQL v8.0)

|             id | pointsEarned |                createdAt | points |
|----------------|--------------|--------------------------|--------|
| 234-00000206-0 |          500 | 2023-05-03T09:05:05.034Z |    500 |
| 234-00000206-0 |         1000 | 2023-05-12T09:05:05.034Z |   1500 |
| 234-00000206-0 |          800 | 2023-05-15T09:05:05.034Z |   2300 |
| 234-00000206-0 |          300 | 2023-05-21T09:05:05.034Z |   2600 |
| 234-00000206-0 |         1100 | 2023-05-28T09:05:05.034Z |   3700 |

Now you only need to select the first row with a running total above 1000, like this:

    SELECT sum_t.id, sum_t.createdAt
    FROM (
        SELECT *, SUM(pointsEarned) OVER(ORDER BY createdAt) points
        FROM activities
    ) AS sum_t
    WHERE sum_t.points > 1000
    -- Without an ORDER BY, LIMIT 1 may return any qualifying row
    ORDER BY sum_t.createdAt
    LIMIT 1

which leads to the following result

Result (MySQL v8.0)

|             id |                createdAt |
|----------------|--------------------------|
| 234-00000206-0 | 2023-05-12T09:05:05.034Z |
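
The query above computes a single running total over the whole table, which is fine for the one id in the sample data. The question asks for "those IDs", though, so here is a rough sketch of what changes once a second id is present. It assumes Python's built-in sqlite3 module as a stand-in engine (SQLite 3.25 or newer for window functions), and the id 999-00000001-0 with its two rows is hypothetical data invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activities (id TEXT, pointsEarned INTEGER, createdAt TEXT)")
conn.executemany(
    "INSERT INTO activities VALUES (?, ?, ?)",
    [
        ("234-00000206-0", 500, "2023-05-03T09:05:05.034Z"),
        ("234-00000206-0", 1000, "2023-05-12T09:05:05.034Z"),
        # Hypothetical second id, not part of the original sample data
        ("999-00000001-0", 700, "2023-05-04T09:05:05.034Z"),
        ("999-00000001-0", 600, "2023-05-20T09:05:05.034Z"),
    ],
)

# One running total over the whole table: points of both ids are mixed together.
global_first = conn.execute("""
    SELECT id, createdAt, points
    FROM (
        SELECT *, SUM(pointsEarned) OVER (ORDER BY createdAt) AS points
        FROM activities
    ) AS sum_t
    WHERE points > 1000
    ORDER BY createdAt
    LIMIT 1
""").fetchall()

# Running total restarted for each id; ROW_NUMBER picks each id's first crossing.
per_id_first = conn.execute("""
    SELECT id, createdAt, points
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY createdAt) AS rn
        FROM (
            SELECT *, SUM(pointsEarned) OVER (PARTITION BY id ORDER BY createdAt) AS points
            FROM activities
        ) AS running
        WHERE points > 1000
    ) AS ranked
    WHERE rn = 1
    ORDER BY id
""").fetchall()

print(global_first)
print(per_id_first)
conn.close()
```

The unpartitioned version reports 999-00000001-0 on May 4 with 1200 points, even though that id had earned only 700 by then. The partitioned version returns one row per id: May 12 (1500 points) for 234-00000206-0 and May 20 (1300 points) for 999-00000001-0.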


huangapple
  • Posted on June 5, 2023, 18:06:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76405341.html