More performant SQL Delete than using NOT EXISTS?

Question

I have a query that deletes a lot of data from a table, shown below. It uses a WHILE loop to delete in batches so as not to overwhelm the transaction log, but the Customers table has about 200 million records and the query is deleting approximately 2 million of them. I was wondering whether replacing the NOT EXISTS would help at all.

WHILE (1=1)
BEGIN
  DELETE TOP(10000) FROM Customers
  WHERE NOT EXISTS (SELECT * FROM CustomerInvoices
                    WHERE CustomerInvoices.CustomerId = Customers.CustomerId)
  IF (@@ROWCOUNT = 0)
    BREAK
END

Answer 1

Score: 1

Your problem is that the number of rows in Customers that must be checked before 10,000 rows matching the NOT EXISTS are found grows with every batch.

The ratio of matching rows will steadily drop until, by the final batch, you are scanning all 198 million remaining rows to find the last 10,000.

You are doing 200 batches. On average each batch reads 100 million rows in Customers (the earliest batches far fewer and the later ones far more), which totals 20 billion rows read from that table alone, plus a similar number in CustomerInvoices.

If the execution plan is a serial scan, then each batch will most likely re-read all the rows that previous batches already examined and found ineligible before finally reaching the rows of interest.
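
The estimate above can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: it uses the approximate figures from the question (about 2 million rows to delete, batches of 10,000) and the assumed average of 100 million rows read per batch, and the variable names are purely illustrative.

```sql
-- Rough cost estimate for the original DELETE TOP(10000) loop.
-- All figures are the approximate ones quoted in the question and answer.
DECLARE @RowsToDelete    BIGINT = 2000000;    -- approx. rows matching the NOT EXISTS
DECLARE @BatchSize       BIGINT = 10000;      -- TOP(10000) per iteration
DECLARE @AvgRowsPerBatch BIGINT = 100000000;  -- assumed average Customers rows read per batch

DECLARE @Batches       BIGINT = @RowsToDelete / @BatchSize;   -- 200 batches
DECLARE @TotalRowsRead BIGINT = @Batches * @AvgRowsPerBatch;  -- 20,000,000,000 rows

SELECT @Batches AS Batches, @TotalRowsRead AS TotalCustomersRowsRead;
```

In other words, roughly 10,000 rows are read from Customers for every row actually deleted, which is why identifying the candidates once up front is attractive.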

You can create a temp table with a sequential integer column...

DECLARE @LastRow INT

CREATE TABLE #DeleteCandidates(Id int PRIMARY KEY, CustomerId INT);

INSERT #DeleteCandidates
SELECT ROW_NUMBER()
         OVER (
           ORDER BY (SELECT 0)) AS Id,
       Customers.CustomerId
FROM   Customers
WHERE  NOT EXISTS (SELECT *
                   FROM   CustomerInvoices
                   WHERE  CustomerInvoices.CustomerId = Customers.CustomerId)

SET @LastRow = @@ROWCOUNT 

Then write some code to process that temp table in <batch_size> chunks of Id ranges.

e.g. as below

DECLARE @BatchSize INT = 10000
DECLARE @MinId INT = 1

-- Walk the candidate table in fixed-size Id ranges: [@MinId, @MinId + @BatchSize)
WHILE @MinId <= @LastRow
  BEGIN

      DELETE FROM Customers
      WHERE  Customers.CustomerId IN (SELECT dc.CustomerId
                                      FROM   #DeleteCandidates dc
                                      WHERE  dc.Id >= @MinId
                                             AND dc.Id < @MinId + @BatchSize)
             -- Re-check eligibility at delete time
             AND NOT EXISTS (SELECT *
                             FROM   CustomerInvoices/*WITH (HOLDLOCK )*/
                             WHERE  CustomerInvoices.CustomerId = Customers.CustomerId)

      -- Advance to the next Id range
      SET @MinId = @MinId + @BatchSize
  END

You still need the NOT EXISTS on the actual DELETE in case rows inserted into CustomerInvoices since the identification step mean that a delete candidate is no longer eligible.

You might also consider the HOLDLOCK hint to deal with the possibility of truly concurrent inserts whilst the DELETE query itself is running.
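
To see the re-check in action at a small scale, the following sketch runs the same pattern against two tiny temp tables. The tables #Customers and #CustomerInvoices, their contents, and the batch size of 2 are hypothetical stand-ins chosen for illustration; the HOLDLOCK hint is enabled here as one option, not as a requirement.

```sql
-- Hypothetical stand-ins for the real Customers / CustomerInvoices tables.
CREATE TABLE #Customers (CustomerId INT PRIMARY KEY);
CREATE TABLE #CustomerInvoices (InvoiceId INT PRIMARY KEY, CustomerId INT);

INSERT #Customers (CustomerId) VALUES (1), (2), (3), (4), (5), (6);
INSERT #CustomerInvoices (InvoiceId, CustomerId) VALUES (100, 1), (101, 3);

-- Step 1: identify the delete candidates once (customers 2, 4, 5 and 6).
CREATE TABLE #DeleteCandidates (Id INT PRIMARY KEY, CustomerId INT);
DECLARE @LastRow INT;

INSERT #DeleteCandidates (Id, CustomerId)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS Id,
       c.CustomerId
FROM   #Customers c
WHERE  NOT EXISTS (SELECT *
                   FROM   #CustomerInvoices ci
                   WHERE  ci.CustomerId = c.CustomerId);

SET @LastRow = @@ROWCOUNT;

-- Simulate an invoice arriving for candidate 4 after identification.
INSERT #CustomerInvoices (InvoiceId, CustomerId) VALUES (102, 4);

-- Step 2: delete in Id ranges, re-checking eligibility at delete time.
DECLARE @BatchSize INT = 2;
DECLARE @MinId INT = 1;

WHILE @MinId <= @LastRow
BEGIN
    DELETE c
    FROM   #Customers c
    WHERE  c.CustomerId IN (SELECT dc.CustomerId
                            FROM   #DeleteCandidates dc
                            WHERE  dc.Id >= @MinId
                                   AND dc.Id < @MinId + @BatchSize)
           AND NOT EXISTS (SELECT *
                           FROM   #CustomerInvoices ci WITH (HOLDLOCK)
                           WHERE  ci.CustomerId = c.CustomerId);

    SET @MinId = @MinId + @BatchSize;
END;

-- Customers 1, 3 and 4 remain; customer 4 survived because of the re-check.
SELECT CustomerId FROM #Customers ORDER BY CustomerId;
```

Without the NOT EXISTS in the DELETE, customer 4 would have been removed despite now having an invoice.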

huangapple
  • Published on June 15, 2023, 03:46:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76477067.html