2023年6月8日 17:38:48go评论70阅读模式

英文:

Need help understanding where MySQL deadlocks

问题

我们的数据库中一直存在一个与特定死锁有关的问题。冲突的查询几乎总是以下两个：

UPDATE db.tbl 
    SET pid = 68111
WHERE pid IS NULL 
AND processed IS NULL 
AND `time` <= now() 
LIMIT 10

和

UPDATE `db`.`tbl` 
    SET `processed` = -1,
        `updated_on` = NOW() 
WHERE (`tbl`.`id` = 108588129)

pid、id 和 processed 的值可能会有所不同，但查询是相同的。

我不明白的是，为什么会发生死锁，因为第二个查询通过其ID更新特定记录。只有一个锁需要获取，但要创建死锁至少需要2个锁。而且也没有会锁定其他行的事务 - 每个查询都是独立的。

为什么会发生这种情况，我该如何避免它？

show engine status innodb; 的匿名/缩短输出如下：

------------------------
LATEST DETECTED DEADLOCK
------------------------
2023-06-08 08:16:27 0x7f70f3ef1700
*** (1) TRANSACTION:
TRANSACTION 2857804352, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 2480 lock struct(s), heap size 286928, 9581 row lock(s)
MySQL thread id 25966272, OS thread handle 140144661681920, query id 847014117 x.x.x.x db_user updating
UPDATE db.tbl SET pid = &#39;&#39;68111&#39;&#39; WHERE pid IS NULL AND processed IS NULL AND `time` &lt;= now() LIMIT 10

*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 2714 page no 86552 n bits 1552 index processed of table `db`.`tbl` trx id 2857804352 lock_mode X
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: SQL NULL;
 1: len 4; hex 8678cf9b; asc  x  ;;

Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: SQL NULL;
 1: len 4; hex 8678cfad; asc  x  ;;

<Snip a lot of Record Locks just like the one above>

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2714 page no 113955 n bits 160 index PRIMARY of table `db`.`tbl` trx id 2857804352 lock_mode X locks rec but not gap waiting
Record lock, heap no 92 PHYSICAL RECORD: n_fields 23; compact format; info bits 0
 0: len 4; hex 8678ec61; asc  x a;;
 1: len 6; hex 0000aa56a25d; asc    V ];;
 2: len 7; hex 0100002bc01686; asc    +   ;;
...
*** (2) TRANSACTION:
TRANSACTION 2857804381, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 25966282, OS thread handle 140122483259136, query id 847014368 x.x.x.x other_db_user updating
UPDATE `db`.`tbl` SET `processed` = &#39;&#39;-1&#39;&#39;, `updated_on` = NOW() WHERE (`tbl`.`id` = 108588129)

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 2714 page no 113955 n bits 160 index PRIMARY of table `db`.`tbl` trx id 2857804381 lock_mode X locks rec but not gap
Record lock, heap no 92 PHYSICAL RECORD: n_fields 23; compact format; info bits 0
 0: len 4; hex 8678ec61; asc  x a;;
 1: len 6; hex 0000aa56a25d; asc    V ];;
 2: len 7; hex 0100002bc01686; asc    +   ;;
...
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2714 page no 86552 n bits 1552 index processed of table `db`.`tbl` trx id 2857804381 lock_mode X locks rec but not gap waiting
Record lock, heap no 1462 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: SQL NULL;
 1: len 4; hex 8678ec61; asc  x a;;
...
*** WE ROLL BACK TRANSACTION (2)

此外，一个简化的表定义如下：

CREATE TABLE `tbl` (
  `id` int NOT NULL AUTO_INCREMENT,
  `pid` int DEFAULT NULL,
  `processed` tinyint DEFAULT NULL,
  `time` datetime DEFAULT NULL,
  `created_on` datetime DEFAULT NULL,
  `updated_on` datetime DEFAULT NULL,

  `other` varchar(10) NOT NULL,
  
  PRIMARY KEY (`id`),
  KEY `ix_other` (`other`),
  KEY `ix_processed` (`processed`),
  KEY `ix_time` (`time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

还要注意，表中有许多行（目前超过5000行，但在一天内的时间不断变化），其中processed=null, pid=null但time在未来。

补充： 根据评论提供的更多上下文：

目前在这个表中有一点多500万行。这是一种队列，我们每个月大约会得到140万行。最早的行是在二月份。当time在未来时，几乎总是在同一天。

pid和processed与队列相关。当插入一行（作业）时，pid和processed都是null。time是作业应该执行的时间（有些作业需要在指定的时间之后执行，其他作业需要尽快执行）。

当其中一个处理器准备执行一些作业时，它首先运行第一个查询，并通过将其PID（Linux进程ID）设置为这些行来“保留”一批10

英文:

We have a long standing problem with a certain deadlock in our DB. The conflicting queries are almost always these two:

UPDATE db.tbl 
    SET pid = 68111
WHERE pid IS NULL 
AND processed IS NULL 
AND `time` &lt;= now() 
LIMIT 10

and

UPDATE `db`.`tbl` 
    SET `processed` = -1,
        `updated_on` = NOW() 
WHERE (`tbl`.`id` = 108588129)

The pid, id and processed values vary, but the queries are the same.

What I don't understand is what exactly could deadlock there, since the second query updates a specific record by its ID. There's just one lock to be had, but you need at least 2 to create a deadlock. There are also no transactions that would lock other rows - each of these queries is standalone.

Why is this happening and how could I avoid it?

Anonymized/shortened output of show engine status innodb;:

------------------------
LATEST DETECTED DEADLOCK
------------------------
2023-06-08 08:16:27 0x7f70f3ef1700
*** (1) TRANSACTION:
TRANSACTION 2857804352, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 2480 lock struct(s), heap size 286928, 9581 row lock(s)
MySQL thread id 25966272, OS thread handle 140144661681920, query id 847014117 x.x.x.x db_user updating
UPDATE db.tbl SET pid = &#39;&#39;68111&#39;&#39; WHERE pid IS NULL AND processed IS NULL AND `time` &lt;= now() LIMIT 10

*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 2714 page no 86552 n bits 1552 index processed of table `db`.`tbl` trx id 2857804352 lock_mode X
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: SQL NULL;
 1: len 4; hex 8678cf9b; asc  x  ;;

Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: SQL NULL;
 1: len 4; hex 8678cfad; asc  x  ;;

&lt;Snip a lot of Record Locks just like the one above&gt;

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2714 page no 113955 n bits 160 index PRIMARY of table `db`.`tbl` trx id 2857804352 lock_mode X locks rec but not gap waiting
Record lock, heap no 92 PHYSICAL RECORD: n_fields 23; compact format; info bits 0
 0: len 4; hex 8678ec61; asc  x a;;
 1: len 6; hex 0000aa56a25d; asc    V ];;
 2: len 7; hex 0100002bc01686; asc    +   ;;
 3: len 4; hex 803ff583; asc  ?  ;;
 4: len 4; hex 73746f70; asc stop;;
 5: len 4; hex 80000002; asc     ;;
 6: len 1; hex 83; asc  ;;
 7: len 4; hex 80005737; asc   W7;;
 8: len 6; hex 4b4a38363431; asc KJ8641;;
 9: len 8; hex 80000008a515e59b; asc         ;;
 10: len 4; hex 800001be; asc     ;;
 11: len 5; hex 99b050b41a; asc   P  ;;
 12: len 4; hex 84ad8892; asc     ;;
 13: len 3; hex 736d73; asc sms;;
 14: SQL NULL;
 15: SQL NULL;
 16: len 1; hex 7f; asc  ;;
 17: SQL NULL;
 18: SQL NULL;
 19: SQL NULL;
 20: len 4; hex 53746f70; asc Stop;;
 21: len 5; hex 99b050b41b; asc   P  ;;
 22: len 5; hex 99b050b41b; asc   P  ;;


*** (2) TRANSACTION:
TRANSACTION 2857804381, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 25966282, OS thread handle 140122483259136, query id 847014368 x.x.x.x other_db_user updating
UPDATE `db`.`tbl` SET `processed` = &#39;&#39;-1&#39;&#39;, `updated_on` = NOW() WHERE (`tbl`.`id` = 108588129)

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 2714 page no 113955 n bits 160 index PRIMARY of table `db`.`tbl` trx id 2857804381 lock_mode X locks rec but not gap
Record lock, heap no 92 PHYSICAL RECORD: n_fields 23; compact format; info bits 0
 0: len 4; hex 8678ec61; asc  x a;;
 1: len 6; hex 0000aa56a25d; asc    V ];;
 2: len 7; hex 0100002bc01686; asc    +   ;;
 3: len 4; hex 803ff583; asc  ?  ;;
 4: len 4; hex 73746f70; asc stop;;
 5: len 4; hex 80000002; asc     ;;
 6: len 1; hex 83; asc  ;;
 7: len 4; hex 80005737; asc   W7;;
 8: len 6; hex 4b4a38363431; asc KJ8641;;
 9: len 8; hex 80000008a515e59b; asc         ;;
 10: len 4; hex 800001be; asc     ;;
 11: len 5; hex 99b050b41a; asc   P  ;;
 12: len 4; hex 84ad8892; asc     ;;
 13: len 3; hex 736d73; asc sms;;
 14: SQL NULL;
 15: SQL NULL;
 16: len 1; hex 7f; asc  ;;
 17: SQL NULL;
 18: SQL NULL;
 19: SQL NULL;
 20: len 4; hex 53746f70; asc Stop;;
 21: len 5; hex 99b050b41b; asc   P  ;;
 22: len 5; hex 99b050b41b; asc   P  ;;


*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2714 page no 86552 n bits 1552 index processed of table `db`.`tbl` trx id 2857804381 lock_mode X locks rec but not gap waiting
Record lock, heap no 1462 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: SQL NULL;
 1: len 4; hex 8678ec61; asc  x a;;

*** WE ROLL BACK TRANSACTION (2)

Also, a simplified table definition:

CREATE TABLE `tbl` (
  `id` int NOT NULL AUTO_INCREMENT,
  `pid` int DEFAULT NULL,
  `processed` tinyint DEFAULT NULL,
  `time` datetime DEFAULT NULL,
  `created_on` datetime DEFAULT NULL,
  `updated_on` datetime DEFAULT NULL,

  `other` varchar(10) NOT NULL,
  
  PRIMARY KEY (`id`),
  KEY `ix_other` (`other`),
  KEY `ix_processed` (`processed`),
  KEY `ix_time` (`time`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

I should also note that there are many rows (currently over 5000, but the number constantly varies during the day) with processed=null, pid=null but time in the future.

Added: A bit more context, as per comments:

There are currently a little over 5 million rows in this table. It's a sort of a queue and we get roughly 1.4M rows per month. The earliest rows are in February. When time is in the future, it's almost always in the same day.

pid and processed are related to the queue. When a row (job) is inserted, both pid and processed are null. time is when the job should be performed (some jobs need to be performed only AFTER a specified moment in time; others need to be performed ASAP).

When one of the processors is ready to execute some jobs, it starts by running the first query and "reserving" a batch of 10 jobs for itself by settings its PID (Linux process ID) to those rows. It then selects those rows and works through them one-by-one, setting the processed flag when it's done (the second query).

However sometimes those two queries clash and we get a deadlock.

答案1

得分: 1

我对锁定和死锁的理解非常有限，因此这个解释过于简化，因为我不确定“在主要的UPDATE查询中请求哪个锁”（初始的次要索引扫描，主键，然后由更新受到影响的其他次要索引？）：
会话1发送初始的UPDATE查询，选择ix_processed索引作为最具选择性的索引，并开始锁定索引的行，直到收集足够的行来满足查询。
会话2发送UPDATE主键查询，立即在聚簇索引中的行上获得x锁。
会话2试图在ix_processed上应用对processed列的更新，这需要ix_processed上的x锁，但它已经被会话1锁定，因此会话2等待。
会话1尝试在1.中收集的行上获得聚簇索引的x锁，但正在被会话2（2.）锁定的行，因此会话1等待。
死锁。

当前的UPDATE需要的锁比你想象的要多得多。

在使用可重复读隔离级别（select @@session.tx_isolation）时，这是默认设置，对于UPDATE查询所检查的每一行都会放置排它锁，而不仅仅是那些被更新的行。

你可以使用以下方式进行测试：

START TRANSACTION;

UPDATE tbl 
    SET pid = 68111
WHERE pid IS NULL 
AND processed IS NULL 
AND `time` <= now() 
LIMIT 10;

SELECT dl.object_name, dl.index_name, dl.lock_type, dl.lock_mode, dl.lock_status, dl.lock_data, tbl.*
FROM performance_schema.data_locks dl
JOIN performance_schema.threads t ON dl.THREAD_ID = t.THREAD_ID
JOIN tbl ON tbl.id = dl.lock_data
WHERE dl.object_name = 'tbl'
AND dl.index_name = 'PRIMARY'
AND t.PROCESSLIST_ID = CONNECTION_ID();

ROLLBACK;

在(processed, pid, time)上添加一个复合索引将减少UPDATE语句所需的锁数量，仅为10。

ALTER TABLE tbl ADD INDEX ix_processed_pid_time (processed, pid, time);

如果您不想添加索引，那么您可以尝试在UPDATE中为processed列设置一个不同的值（也许是-2），这将立即排除它不会被后续的更新影响。

如果您使用的是MySQL 8，另一种方法可能是使用SELECT ... FOR UPDATE SKIP LOCKED：

START TRANSACTION;

SELECT id FROM tbl
WHERE pid IS NULL AND processed IS NULL AND `time` <= NOW()
LIMIT 10
FOR UPDATE SKIP LOCKED;

UPDATE tbl
    SET pid = 68111,
        processed = -2
WHERE id IN ( ... );

COMMIT;

英文:

I have a very limited understanding of locking and deadlocks, so this is over-simplified, as I am not sure "which lock is requested when" (initial secondary index scan, PRIMARY and then other secondary indices impacted by update?) by the main UPDATE query:

Session 1 sends the initial UPDATE query, chooses the ix_processed index as the most selective, and starts locking the index rows, until it has gathered enough rows to satisfy the query.
Session 2 sends the UPDATE on PK query which immediately gets the x-lock on row in the clustered index.
Session 2 tries to apply the update to processed column which requires x-lock on ix_processed but it is already locked by Session 1, so it waits.
Session 1 tries to get the x-lock on the clustered index for rows gathered in 1., but the row being updated is locked by Session 2 (2.), so it waits.
💣 deadlock 💣

Your current UPDATE requires a lot more locks than you might think.

When using the repeatable-read isolation level (select @@session.tx_isolation), which is the default, the exclusive lock will be placed on every row examined for the UPDATE, not just those that are updated.

You can test this with something like:

START TRANSACTION;

UPDATE tbl 
    SET pid = 68111
WHERE pid IS NULL 
AND processed IS NULL 
AND `time` &lt;= now() 
LIMIT 10;

SELECT dl.object_name, dl.index_name, dl.lock_type, dl.lock_mode, dl.lock_status, dl.lock_data, tbl.*
FROM performance_schema.data_locks dl
JOIN performance_schema.threads t ON dl.THREAD_ID = t.THREAD_ID
JOIN tbl ON tbl.id = dl.lock_data
WHERE dl.object_name = &#39;tbl&#39;
AND dl.index_name = &#39;PRIMARY&#39;
AND t.PROCESSLIST_ID = CONNECTION_ID();

ROLLBACK;

Adding a composite index on (processed, pid, time) will reduce the number of locks required for the UPDATE statement to just 10.

ALTER TABLE tbl ADD INDEX ix_processed_pid_time (processed, pid, time);

If you do not want to add the index, then you could try setting a distinct value for the processed column (-2 maybe) in the UPDATE, as this would mean it is immediately excluded from subsequent updates.

If you are using MySQL 8, another approach might be to use SELECT ... FOR UPDATE SKIP LOCKED:

START TRANSACTION;

SELECT id FROM tbl
WHERE pid IS NULL AND processed IS NULL AND `time` &lt;= NOW()
LIMIT 10
FOR UPDATE SKIP LOCKED;

UPDATE tbl
    SET pid = 68111,
        processed = -2
WHERE id IN ( ... );

COMMIT;

答案2

得分: 1

"…理解…"

如果没有合适的索引（请参阅user1191247的回答），将锁定太多行，导致死锁。

如果你使用的是MySQL的最新版本，EXPLAIN UPDATE ... 将提供关于使用的索引更多信息。

（我怀疑使用子查询是否有帮助。）

英文:

"... Understanding ..."

Without a suitable index (see user1191247 answer), too many rows will be locked, leading to a deadlock.

If you have a recent version of MySQL, EXPLAIN UPDATE ... will provide more information about what index is being used.

(I doubt if using a subquery will help.)

答案3

得分: 0

我建议将您的更新查询更改为以下形式：

UPDATE db.tbl 
SET pid = 68111
WHERE id IN (
    SELECT t2.id
    FROM db.tbl t2
    WHERE t2.time <= NOW()
    AND t2.pid IS NULL
    AND t2.processed IS NULL
    LIMIT 10
)

为了优化选择查询，我建议创建一个包含所需部分的单一索引，而无需访问原始数据页面。因此，不仅在TIME列上创建索引，还应该在以下列上创建索引：

(time, pid, processed, id)

这样整个WHERE条件都能得到验证，同时也包括了用于返回外部UPDATE的ID IN条件。

英文:

I would suggest updating your update

UPDATE db.tbl 
    SET pid = 68111
WHERE pid IS NULL 
AND processed IS NULL 
AND `time` &lt;= now() 
LIMIT 10

To an update where ID IN (select). At least my thinking about it. The update is locking while it is trying to gather and figure out what it SHOULD update. By pre-querying what you want as a select, and updating based on those IDs, the Select does not lock, finishes getting the few records it needs, then applies the update to them by their key ID.

So the time index is used first for the select (should be fast), then the update is done based on the PKID returned from the select. Something like

UPDATE db.tbl 
   SET pid = 68111
   WHERE id in ( select t2.id
                    from db.tbl t2
                    where t2.time &lt;= now()
                      and t2.pid is null
                      and t2.processed is null 
                    limit 10 )

Now to optimize the Select query, I would have a single index to include the parts it needs without having to go to the raw data pages. So instead of your index on just the TIME column, I would have an index on

(time, pid, processed, id )

So the entire where is qualified, but also has the ID for returning the outer update WHERE ID IN condition.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

需要帮助理解MySQL死锁发生的位置。

问题

答案1

答案2

答案3

Golang HelloActor出现致命错误：所有goroutine都处于休眠状态-死锁。

如何使用Hibernate和Ajax请求读取表格。

Deadlock with Django / MYSQL and filter on select_for_update.

有没有办法在MySQL中连接子查询时停止反复编写相同的子查询？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论