如何正确获取分区的第二行

huangapple go评论67阅读模式
英文:

How to correctly get the second row of a partition by

问题

第一种方法中,问题可能出现在仅按照 'tid' 和 'created_date' 列对数据进行排序时。这可能导致在 'tid' 相同的情况下,无法准确地确定哪个记录是最新的。对于 'tid' 相同的记录,它们的 'created_date' 可能会出现混淆。

第二种方法也存在相同的问题,因为它仍然只根据 'tid' 和 'created_date' 进行排序,并且分配了相同的行号。

要解决这个问题,你可以使用窗口函数来找到最新的记录。下面是一个修改后的查询:

WITH RankedData AS (
  SELECT
    created_date,
    tid,
    fmid,
    ROW_NUMBER() OVER (PARTITION BY tid ORDER BY created_date DESC) AS row_num
  FROM
    table
)

SELECT
  created_date,
  tid,
  fmid
FROM
  RankedData
WHERE
  row_num = 1;

这个查询将根据 'tid' 分区,并按 'created_date' 降序排序,然后为每个分区中的记录分配行号。最后,它仅选择行号为 1 的记录,这将是每个 'tid' 分区中的最新记录。

英文:

First of all, I'm using BigQuery. In my table I have some duplicated rows due to a change in one of the columns (called fmid). When this happens, I have to get the latest record, which is defined by the time (called created_date). The only thing these duplicated records have in common is the id (called tid).

I have tried two different options to set which row is the first and which one is the latest, so then I could remove what I want.

First one:

SELECT *, row_number() OVER( PARTITION BY tid) as number

FROM (SELECT created_date,tid, fmid

FROM table 

ORDER BY tid,created_date)

ORDER BY tid, created_date

Results:
Sometimes it's correct, but the order of the id 296644550 is not right.

1	2023-02-03 22:11:07 UTC	233330088	4	1
2	2023-02-04 10:07:06 UTC	233330088	18	2
3	2023-02-11 11:06:42 UTC	284447799	4	1
4	2023-02-11 11:22:13 UTC	284447799	18	2
5	2023-02-10 10:39:23 UTC	296644550	4	2
6	2023-02-10 19:32:08 UTC	296644550	18	1

Second one:

SELECT *, row_number() OVER( PARTITION BY tid,created_date) as number

FROM (SELECT created_date,tid, fmid

FROM table 

ORDER BY tid,created_date)

ORDER BY tid, created_date

Results:
The column "number" gets the same result.

Row	created_date	tid	fmid	number
1	2023-02-03 22:11:07 UTC	233330088	4	1
2	2023-02-04 10:07:06 UTC	233330088	18	1
3	2023-02-11 11:06:42 UTC	284447799	4	1
4	2023-02-11 11:22:13 UTC	284447799	18	1
5	2023-02-10 10:39:23 UTC	296644550	4	1
6	2023-02-10 19:32:08 UTC	296644550	18	1

Any idea of what's wrong?

答案1

得分: 0

排序必须在row_number的窗口内进行,以便获得正确的编号

SELECT created_date, tid, fmid
FROM (
    SELECT created_date, tid, fmid,
           row_number() OVER (PARTITION BY tid ORDER BY created_date) as number
    FROM table
)
WHERE number = 2
ORDER BY tid

或者作为CTE:

WITH CTE AS (
    SELECT created_date, tid, fmid,
           row_number() OVER (PARTITION BY tid ORDER BY created_date) as number
    FROM table
)

SELECT created_date, tid, fmid
FROM CTE
WHERE number = 2
ORDER BY tid
英文:

the sorting has to happen in the window of row_number, so that you get the correct number

SELECT created_date,tid, fmid
FROM (SELECT created_date,tid, fmid
      , row_number() OVER( PARTITION BY tid ORDER BY created_date) as number
       FROM table 
)
WHERE number = 2
ORDER BY tid

or as CTE

WITH CTE AS (
SELECT created_date,tid, fmid
      , row_number() OVER( PARTITION BY tid ORDER BY created_date) as number
       FROM table )

SELECT created_date,tid, fmid
FROM CTE
WHERE number = 2
ORDER BY tid

huangapple
  • 本文由 发表于 2023年6月26日 16:17:52
  • 转载请务必保留本文链接:https://go.coder-hub.com/76554799.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定