英文:
Need help understanding where MySQL deadlocks
问题
我们的数据库中一直存在一个与特定死锁有关的问题。冲突的查询几乎总是以下两个:
UPDATE db.tbl
SET pid = 68111
WHERE pid IS NULL
AND processed IS NULL
AND `time` <= now()
LIMIT 10
和
UPDATE `db`.`tbl`
SET `processed` = -1,
`updated_on` = NOW()
WHERE (`tbl`.`id` = 108588129)
pid
、id
和 processed
的值可能会有所不同,但查询是相同的。
我不明白的是,为什么会发生死锁,因为第二个查询通过其ID更新特定记录。只有一个锁需要获取,但要创建死锁至少需要2个锁。而且也没有会锁定其他行的事务 - 每个查询都是独立的。
为什么会发生这种情况,我该如何避免它?
show engine status innodb;
的匿名/缩短输出如下:
------------------------
LATEST DETECTED DEADLOCK
------------------------
2023-06-08 08:16:27 0x7f70f3ef1700
*** (1) TRANSACTION:
TRANSACTION 2857804352, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 2480 lock struct(s), heap size 286928, 9581 row lock(s)
MySQL thread id 25966272, OS thread handle 140144661681920, query id 847014117 x.x.x.x db_user updating
UPDATE db.tbl SET pid = ''68111'' WHERE pid IS NULL AND processed IS NULL AND `time` <= now() LIMIT 10
*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 2714 page no 86552 n bits 1552 index processed of table `db`.`tbl` trx id 2857804352 lock_mode X
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: SQL NULL;
1: len 4; hex 8678cf9b; asc x ;;
Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: SQL NULL;
1: len 4; hex 8678cfad; asc x ;;
<Snip a lot of Record Locks just like the one above>
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2714 page no 113955 n bits 160 index PRIMARY of table `db`.`tbl` trx id 2857804352 lock_mode X locks rec but not gap waiting
Record lock, heap no 92 PHYSICAL RECORD: n_fields 23; compact format; info bits 0
0: len 4; hex 8678ec61; asc x a;;
1: len 6; hex 0000aa56a25d; asc V ];;
2: len 7; hex 0100002bc01686; asc + ;;
...
*** (2) TRANSACTION:
TRANSACTION 2857804381, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 25966282, OS thread handle 140122483259136, query id 847014368 x.x.x.x other_db_user updating
UPDATE `db`.`tbl` SET `processed` = ''-1'', `updated_on` = NOW() WHERE (`tbl`.`id` = 108588129)
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 2714 page no 113955 n bits 160 index PRIMARY of table `db`.`tbl` trx id 2857804381 lock_mode X locks rec but not gap
Record lock, heap no 92 PHYSICAL RECORD: n_fields 23; compact format; info bits 0
0: len 4; hex 8678ec61; asc x a;;
1: len 6; hex 0000aa56a25d; asc V ];;
2: len 7; hex 0100002bc01686; asc + ;;
...
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2714 page no 86552 n bits 1552 index processed of table `db`.`tbl` trx id 2857804381 lock_mode X locks rec but not gap waiting
Record lock, heap no 1462 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: SQL NULL;
1: len 4; hex 8678ec61; asc x a;;
...
*** WE ROLL BACK TRANSACTION (2)
此外,一个简化的表定义如下:
CREATE TABLE `tbl` (
`id` int NOT NULL AUTO_INCREMENT,
`pid` int DEFAULT NULL,
`processed` tinyint DEFAULT NULL,
`time` datetime DEFAULT NULL,
`created_on` datetime DEFAULT NULL,
`updated_on` datetime DEFAULT NULL,
`other` varchar(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `ix_other` (`other`),
KEY `ix_processed` (`processed`),
KEY `ix_time` (`time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
还要注意,表中有许多行(目前超过5000行,但在一天内的时间不断变化),其中processed=null, pid=null
但time
在未来。
补充: 根据评论提供的更多上下文:
目前在这个表中有一点多500万行。这是一种队列,我们每个月大约会得到140万行。最早的行是在二月份。当time
在未来时,几乎总是在同一天。
pid
和processed
与队列相关。当插入一行(作业)时,pid
和processed
都是null
。time
是作业应该执行的时间(有些作业需要在指定的时间之后执行,其他作业需要尽快执行)。
当其中一个处理器准备执行一些作业时,它首先运行第一个查询,并通过将其PID(Linux进程ID)设置为这些行来“保留”一批10
英文:
We have a long standing problem with a certain deadlock in our DB. The conflicting queries are almost always these two:
UPDATE db.tbl
SET pid = 68111
WHERE pid IS NULL
AND processed IS NULL
AND `time` <= now()
LIMIT 10
and
UPDATE `db`.`tbl`
SET `processed` = -1,
`updated_on` = NOW()
WHERE (`tbl`.`id` = 108588129)
The pid
, id
and processed
values vary, but the queries are the same.
What I don't understand is what exactly could deadlock there, since the second query updates a specific record by its ID. There's just one lock to be had, but you need at least 2 to create a deadlock. There are also no transactions that would lock other rows - each of these queries is standalone.
Why is this happening and how could I avoid it?
Anonymized/shortened output of show engine status innodb;
:
------------------------
LATEST DETECTED DEADLOCK
------------------------
2023-06-08 08:16:27 0x7f70f3ef1700
*** (1) TRANSACTION:
TRANSACTION 2857804352, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 2480 lock struct(s), heap size 286928, 9581 row lock(s)
MySQL thread id 25966272, OS thread handle 140144661681920, query id 847014117 x.x.x.x db_user updating
UPDATE db.tbl SET pid = ''68111'' WHERE pid IS NULL AND processed IS NULL AND `time` <= now() LIMIT 10
*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 2714 page no 86552 n bits 1552 index processed of table `db`.`tbl` trx id 2857804352 lock_mode X
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: SQL NULL;
1: len 4; hex 8678cf9b; asc x ;;
Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: SQL NULL;
1: len 4; hex 8678cfad; asc x ;;
<Snip a lot of Record Locks just like the one above>
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2714 page no 113955 n bits 160 index PRIMARY of table `db`.`tbl` trx id 2857804352 lock_mode X locks rec but not gap waiting
Record lock, heap no 92 PHYSICAL RECORD: n_fields 23; compact format; info bits 0
0: len 4; hex 8678ec61; asc x a;;
1: len 6; hex 0000aa56a25d; asc V ];;
2: len 7; hex 0100002bc01686; asc + ;;
3: len 4; hex 803ff583; asc ? ;;
4: len 4; hex 73746f70; asc stop;;
5: len 4; hex 80000002; asc ;;
6: len 1; hex 83; asc ;;
7: len 4; hex 80005737; asc W7;;
8: len 6; hex 4b4a38363431; asc KJ8641;;
9: len 8; hex 80000008a515e59b; asc ;;
10: len 4; hex 800001be; asc ;;
11: len 5; hex 99b050b41a; asc P ;;
12: len 4; hex 84ad8892; asc ;;
13: len 3; hex 736d73; asc sms;;
14: SQL NULL;
15: SQL NULL;
16: len 1; hex 7f; asc ;;
17: SQL NULL;
18: SQL NULL;
19: SQL NULL;
20: len 4; hex 53746f70; asc Stop;;
21: len 5; hex 99b050b41b; asc P ;;
22: len 5; hex 99b050b41b; asc P ;;
*** (2) TRANSACTION:
TRANSACTION 2857804381, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 25966282, OS thread handle 140122483259136, query id 847014368 x.x.x.x other_db_user updating
UPDATE `db`.`tbl` SET `processed` = ''-1'', `updated_on` = NOW() WHERE (`tbl`.`id` = 108588129)
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 2714 page no 113955 n bits 160 index PRIMARY of table `db`.`tbl` trx id 2857804381 lock_mode X locks rec but not gap
Record lock, heap no 92 PHYSICAL RECORD: n_fields 23; compact format; info bits 0
0: len 4; hex 8678ec61; asc x a;;
1: len 6; hex 0000aa56a25d; asc V ];;
2: len 7; hex 0100002bc01686; asc + ;;
3: len 4; hex 803ff583; asc ? ;;
4: len 4; hex 73746f70; asc stop;;
5: len 4; hex 80000002; asc ;;
6: len 1; hex 83; asc ;;
7: len 4; hex 80005737; asc W7;;
8: len 6; hex 4b4a38363431; asc KJ8641;;
9: len 8; hex 80000008a515e59b; asc ;;
10: len 4; hex 800001be; asc ;;
11: len 5; hex 99b050b41a; asc P ;;
12: len 4; hex 84ad8892; asc ;;
13: len 3; hex 736d73; asc sms;;
14: SQL NULL;
15: SQL NULL;
16: len 1; hex 7f; asc ;;
17: SQL NULL;
18: SQL NULL;
19: SQL NULL;
20: len 4; hex 53746f70; asc Stop;;
21: len 5; hex 99b050b41b; asc P ;;
22: len 5; hex 99b050b41b; asc P ;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2714 page no 86552 n bits 1552 index processed of table `db`.`tbl` trx id 2857804381 lock_mode X locks rec but not gap waiting
Record lock, heap no 1462 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: SQL NULL;
1: len 4; hex 8678ec61; asc x a;;
*** WE ROLL BACK TRANSACTION (2)
Also, a simplified table definition:
CREATE TABLE `tbl` (
`id` int NOT NULL AUTO_INCREMENT,
`pid` int DEFAULT NULL,
`processed` tinyint DEFAULT NULL,
`time` datetime DEFAULT NULL,
`created_on` datetime DEFAULT NULL,
`updated_on` datetime DEFAULT NULL,
`other` varchar(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `ix_other` (`other`),
KEY `ix_processed` (`processed`),
KEY `ix_time` (`time`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I should also note that there are many rows (currently over 5000, but the number constantly varies during the day) with processed=null, pid=null
but time
in the future.
Added: A bit more context, as per comments:
There are currently a little over 5 million rows in this table. It's a sort of a queue and we get roughly 1.4M rows per month. The earliest rows are in February. When time
is in the future, it's almost always in the same day.
pid
and processed
are related to the queue. When a row (job) is inserted, both pid
and processed
are null
. time
is when the job should be performed (some jobs need to be performed only AFTER a specified moment in time; others need to be performed ASAP).
When one of the processors is ready to execute some jobs, it starts by running the first query and "reserving" a batch of 10 jobs for itself by settings its PID (Linux process ID) to those rows. It then selects those rows and works through them one-by-one, setting the processed
flag when it's done (the second query).
However sometimes those two queries clash and we get a deadlock.
答案1
得分: 1
-
我对锁定和死锁的理解非常有限,因此这个解释过于简化,因为我不确定“在主要的
UPDATE
查询中请求哪个锁”(初始的次要索引扫描,主键,然后由更新受到影响的其他次要索引?): -
会话1发送初始的
UPDATE
查询,选择ix_processed
索引作为最具选择性的索引,并开始锁定索引的行,直到收集足够的行来满足查询。 -
会话2发送
UPDATE主键
查询,立即在聚簇索引中的行上获得x锁。 -
会话2试图在
ix_processed
上应用对processed
列的更新,这需要ix_processed
上的x锁,但它已经被会话1锁定,因此会话2等待。 -
会话1尝试在
1.
中收集的行上获得聚簇索引的x锁,但正在被会话2(2.
)锁定的行,因此会话1等待。 -
死锁。
当前的UPDATE
需要的锁比你想象的要多得多。
在使用可重复读隔离级别(select @@session.tx_isolation
)时,这是默认设置,对于UPDATE
查询所检查的每一行都会放置排它锁,而不仅仅是那些被更新的行。
你可以使用以下方式进行测试:
START TRANSACTION;
UPDATE tbl
SET pid = 68111
WHERE pid IS NULL
AND processed IS NULL
AND `time` <= now()
LIMIT 10;
SELECT dl.object_name, dl.index_name, dl.lock_type, dl.lock_mode, dl.lock_status, dl.lock_data, tbl.*
FROM performance_schema.data_locks dl
JOIN performance_schema.threads t ON dl.THREAD_ID = t.THREAD_ID
JOIN tbl ON tbl.id = dl.lock_data
WHERE dl.object_name = 'tbl'
AND dl.index_name = 'PRIMARY'
AND t.PROCESSLIST_ID = CONNECTION_ID();
ROLLBACK;
在(processed, pid, time)
上添加一个复合索引将减少UPDATE
语句所需的锁数量,仅为10。
ALTER TABLE tbl ADD INDEX ix_processed_pid_time (processed, pid, time);
如果您不想添加索引,那么您可以尝试在UPDATE
中为processed
列设置一个不同的值(也许是-2),这将立即排除它不会被后续的更新影响。
如果您使用的是MySQL 8,另一种方法可能是使用SELECT ... FOR UPDATE SKIP LOCKED
:
START TRANSACTION;
SELECT id FROM tbl
WHERE pid IS NULL AND processed IS NULL AND `time` <= NOW()
LIMIT 10
FOR UPDATE SKIP LOCKED;
UPDATE tbl
SET pid = 68111,
processed = -2
WHERE id IN ( ... );
COMMIT;
英文:
I have a very limited understanding of locking and deadlocks, so this is over-simplified, as I am not sure "which lock is requested when" (initial secondary index scan, PRIMARY and then other secondary indices impacted by update?) by the main UPDATE
query:
-
Session 1 sends the initial
UPDATE
query, chooses theix_processed
index as the most selective, and starts locking the index rows, until it has gathered enough rows to satisfy the query. -
Session 2 sends the
UPDATE on PK
query which immediately gets the x-lock on row in the clustered index. -
Session 2 tries to apply the update to
processed
column which requires x-lock onix_processed
but it is already locked by Session 1, so it waits. -
Session 1 tries to get the x-lock on the clustered index for rows gathered in
1.
, but the row being updated is locked by Session 2 (2.
), so it waits. -
💣 deadlock 💣
Your current UPDATE
requires a lot more locks than you might think.
When using the repeatable-read isolation level (select @@session.tx_isolation
), which is the default, the exclusive lock will be placed on every row examined for the UPDATE, not just those that are updated.
You can test this with something like:
START TRANSACTION;
UPDATE tbl
SET pid = 68111
WHERE pid IS NULL
AND processed IS NULL
AND `time` <= now()
LIMIT 10;
SELECT dl.object_name, dl.index_name, dl.lock_type, dl.lock_mode, dl.lock_status, dl.lock_data, tbl.*
FROM performance_schema.data_locks dl
JOIN performance_schema.threads t ON dl.THREAD_ID = t.THREAD_ID
JOIN tbl ON tbl.id = dl.lock_data
WHERE dl.object_name = 'tbl'
AND dl.index_name = 'PRIMARY'
AND t.PROCESSLIST_ID = CONNECTION_ID();
ROLLBACK;
Adding a composite index on (processed, pid, time)
will reduce the number of locks required for the UPDATE
statement to just 10.
ALTER TABLE tbl ADD INDEX ix_processed_pid_time (processed, pid, time);
If you do not want to add the index, then you could try setting a distinct value for the processed
column (-2 maybe) in the UPDATE
, as this would mean it is immediately excluded from subsequent updates.
If you are using MySQL 8, another approach might be to use SELECT ... FOR UPDATE SKIP LOCKED
:
START TRANSACTION;
SELECT id FROM tbl
WHERE pid IS NULL AND processed IS NULL AND `time` <= NOW()
LIMIT 10
FOR UPDATE SKIP LOCKED;
UPDATE tbl
SET pid = 68111,
processed = -2
WHERE id IN ( ... );
COMMIT;
答案2
得分: 1
"…理解…"
如果没有合适的索引(请参阅user1191247的回答),将锁定太多行,导致死锁。
如果你使用的是MySQL的最新版本,EXPLAIN UPDATE ...
将提供关于使用的索引更多信息。
(我怀疑使用子查询是否有帮助。)
英文:
"... Understanding ..."
Without a suitable index (see user1191247 answer), too many rows will be locked, leading to a deadlock.
If you have a recent version of MySQL, EXPLAIN UPDATE ...
will provide more information about what index is being used.
(I doubt if using a subquery will help.)
答案3
得分: 0
我建议将您的更新查询更改为以下形式:
UPDATE db.tbl
SET pid = 68111
WHERE id IN (
SELECT t2.id
FROM db.tbl t2
WHERE t2.time <= NOW()
AND t2.pid IS NULL
AND t2.processed IS NULL
LIMIT 10
)
为了优化选择查询,我建议创建一个包含所需部分的单一索引,而无需访问原始数据页面。因此,不仅在TIME列上创建索引,还应该在以下列上创建索引:
(time, pid, processed, id)
这样整个WHERE条件都能得到验证,同时也包括了用于返回外部UPDATE的ID IN条件。
英文:
I would suggest updating your update
UPDATE db.tbl
SET pid = 68111
WHERE pid IS NULL
AND processed IS NULL
AND `time` <= now()
LIMIT 10
To an update where ID IN (select). At least my thinking about it. The update is locking while it is trying to gather and figure out what it SHOULD update. By pre-querying what you want as a select, and updating based on those IDs, the Select does not lock, finishes getting the few records it needs, then applies the update to them by their key ID.
So the time index is used first for the select (should be fast), then the update is done based on the PKID returned from the select. Something like
UPDATE db.tbl
SET pid = 68111
WHERE id in ( select t2.id
from db.tbl t2
where t2.time <= now()
and t2.pid is null
and t2.processed is null
limit 10 )
Now to optimize the Select query, I would have a single index to include the parts it needs without having to go to the raw data pages. So instead of your index on just the TIME column, I would have an index on
(time, pid, processed, id )
So the entire where is qualified, but also has the ID for returning the outer update WHERE ID IN condition.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论