Delete from SQL table if there are Duplicates AND if duplicates are older than 30 days
Question
I am trying to write a SQL query against a Microsoft SQL Server table. I want to find all rows that have duplicates and delete the duplicates only if they are older than 30 days. Here is an example table:
INSERT INTO [dbo].[test]
    (id1, id2, firstName, lastName, dayTime)
VALUES
    (12, 13, 'Syed', 'Abbas', '05-02-2023'),
    (12, 13, 'Syed', 'Abbas', '07-02-2023'),
    (12, 14, 'Adam', 'Johnson', '07-02-2023'),
    (10, 9, 'Monique', 'Brown', '03-03-2023');
And this is what I have written for my query:
DELETE T
FROM
(
    SELECT *
        , DupRank = ROW_NUMBER() OVER (
            PARTITION BY id1, id2
            ORDER BY (SELECT NULL)
        )
    FROM [dbo].[test]
) AS T
WHERE DupRank > 1 AND dayTime < DATEADD(day, -30, GETDATE());
The outcome I want is that only row 1 (12, 13, Syed, Abbas, 05-02-2023) is deleted and the rest of the rows stay. However, when I run this query, it does not delete anything: no errors, just 0 rows affected.
I have tried the two conditions separately and each works fine (i.e., when I only delete duplicates, it removes row 2, and when I only delete rows older than 30 days, it removes rows 1 and 4). Am I combining them with AND incorrectly?
Answer 1
Score: 1
I'm guessing (although it's hard to say without seeing the query plan) that the non-deterministic ORDER BY is causing problems.
When you write ORDER BY (SELECT NULL), the server is free to assign the row numbers in any order. So it could be that the older row is being numbered 1 and the newer row 2. Then the filter DupRank > 1 AND dayTime < DATEADD(day, -30, GETDATE()) matches neither row: the older row fails DupRank > 1, and the newer row fails the date check, so nothing is deleted.
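One way to check this guess is to run the numbering as a plain SELECT and look at which row in each group gets which number. This is only a diagnostic sketch against the table from the question; the IsOlderThan30Days column is a name I made up for illustration, and it deletes nothing.

```sql
-- Preview only: shows how rows are numbered, nothing is deleted.
SELECT
    t.id1,
    t.id2,
    t.dayTime,
    DupRank = ROW_NUMBER() OVER (
        PARTITION BY t.id1, t.id2
        ORDER BY (SELECT NULL)   -- same arbitrary ordering as the original query
    ),
    -- Hypothetical helper column: 1 when the row passes the date check.
    IsOlderThan30Days = CASE
        WHEN t.dayTime < DATEADD(day, -30, GETDATE()) THEN 1
        ELSE 0
    END
FROM dbo.test AS t
ORDER BY t.id1, t.id2, t.dayTime;
```

A row is only deleted when it has DupRank > 1 and IsOlderThan30Days = 1 at the same time. If the old duplicate shows DupRank = 1 here, that would explain the 0 rows affected, though the numbering may differ from one run to the next.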
So just use a deterministic numbering. The logical thing to do here is to number from newest to oldest, so that you always keep the newest row, plus any others that are less than 30 days old.
DELETE T
FROM
(
    SELECT *,
        DupRank = ROW_NUMBER() OVER (
            PARTITION BY T.id1, T.id2
            ORDER BY T.dayTime DESC)
    FROM dbo.test T
) AS T
WHERE T.DupRank > 1
  AND T.dayTime < DATEADD(day, -30, GETDATE());
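If you want to see what this would remove before committing to it, one option is to run the same delete inside a transaction with an OUTPUT clause and then roll it back. This is a sketch, not something from the question: the aliases t and d are my own, and it assumes the table has no enabled triggers, since OUTPUT without INTO is not allowed on a table that has them.

```sql
BEGIN TRANSACTION;

-- Same delete as above, but OUTPUT returns each row that gets removed.
DELETE d
OUTPUT deleted.id1, deleted.id2, deleted.firstName, deleted.lastName, deleted.dayTime
FROM
(
    SELECT t.*,
        DupRank = ROW_NUMBER() OVER (
            PARTITION BY t.id1, t.id2
            ORDER BY t.dayTime DESC)   -- newest row in each group gets 1
    FROM dbo.test AS t
) AS d
WHERE d.DupRank > 1
  AND d.dayTime < DATEADD(day, -30, GETDATE());

-- Undo the delete; switch to COMMIT TRANSACTION once the output looks right.
ROLLBACK TRANSACTION;
```

With the sample data from the question, the only row I would expect in the output is (12, 13, Syed, Abbas, 05-02-2023), because it is the only row with a newer duplicate.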