English:
Subquery with GROUP BY and MAX won't use index
Question
In PostgreSQL 15.3 (corresponding to a Django model), you have a table named "public.myapp1_task" along with some related code. Your goal is, for each "myapp2_item_id", to find the row with the highest sequence. You added the last two indexes, which relate to the "sequence" column. You are using the Django ORM to filter a queryset.
Here is your SQL query:
SELECT "myapp1_task"."id"
FROM "myapp1_task"
LEFT OUTER JOIN "myapp2_item"
ON ("myapp1_task"."myapp2_item_id" = "myapp2_item"."id")
LEFT OUTER JOIN "myapp2_user" ON ("myapp2_item"."user_id" = "myapp2_user"."id")
LEFT OUTER JOIN "myapp2_category"
ON ("myapp2_item"."myapp2_category_id" = "myapp2_category"."id")
LEFT OUTER JOIN "myapp2_user" T5 ON ("myapp1_task"."user_id" = T5."id")
WHERE "myapp1_task"."sequence" = (SELECT "subquery"."max_seq"
FROM (
SELECT MAX(U0."sequence") AS "max_seq", U0."sequence"
FROM "myapp1_task" U0
WHERE (U0."myapp2_item_id" = ("myapp1_task"."myapp2_item_id"))
GROUP BY U0."sequence"
ORDER BY U0."sequence" DESC
LIMIT 1) subquery)
The problem you are running into is that this query is very slow on a larger table, and there is a "seq scan" on the subquery. Here are some optimization suggestions:
Index optimization:
- Make sure the columns used by the query on the "myapp1_task" table are properly indexed, in particular the "myapp2_item_id" and "sequence" columns.
- Add a composite index on the combination of "myapp2_item_id" and "sequence" on "myapp1_task" to speed up the subquery (see the sketch after this list).
Query optimization:
- Try replacing the "=" subquery with an "EXISTS"-style subquery, which is usually more efficient (also sketched below).
- You can try moving the "ORDER BY" clause from the subquery into the main query so the database can plan the query better.
Analyzing the execution plan:
- Use the EXPLAIN ANALYZE command to analyze the query's execution plan and find out which part performs poorly. This helps you pin down the root cause and choose the right optimization.
Hardware and configuration:
- Make sure the database server's hardware and configuration are sufficient to handle complex queries on large tables. You may need to consider upgrading the hardware or tuning database configuration parameters.
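As a rough illustration of the index, EXISTS and EXPLAIN ANALYZE suggestions above (a sketch only: the index name is made up, and the NOT EXISTS rewrite is just one possible way to express "highest sequence per item"):
-- Composite index covering the correlated lookup (illustrative name):
CREATE INDEX myapp1_task_item_seq_idx
    ON myapp1_task (myapp2_item_id, sequence);

-- An EXISTS-style anti-join: keep a row only if no other row of the same
-- item has a higher sequence (assumes NULL sequences should be skipped):
SELECT t.id
FROM   myapp1_task t
WHERE  t.sequence IS NOT NULL
AND    NOT EXISTS (
         SELECT 1
         FROM   myapp1_task x
         WHERE  x.myapp2_item_id = t.myapp2_item_id
         AND    x.sequence > t.sequence
       );

-- Re-check the plan, with actual run times, after adding the index:
EXPLAIN ANALYZE
SELECT t.id
FROM   myapp1_task t
WHERE  t.sequence = (SELECT max(sequence)
                     FROM   myapp1_task
                     WHERE  myapp2_item_id = t.myapp2_item_id);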
Note that query performance optimization is a complex process that needs to be adjusted and tested for your specific situation. Before making any changes, be sure to back up the database and test outside of production. If the problem persists, consider working with a database administrator or a performance specialist for more detailed support.
English:
I have this table in PostgreSQL 15.3 (corresponding to a Django model):
Table "public.myapp1_task"
Column | Type | Collation | Nullable | Default
-------------------------+--------------------------+-----------+----------+-------------------------------------------------
id | bigint | | not null | nextval('myapp1_task_id_seq'::regclass)
created_at | timestamp with time zone | | not null |
updated_at | timestamp with time zone | | not null |
kind | character varying(12) | | not null |
status | character varying(12) | | not null |
environment | character varying(7) | | not null |
data | jsonb | | not null |
result | jsonb | | not null |
sent_at | timestamp with time zone | | |
response_at | timestamp with time zone | | |
priority | smallint | | not null |
sequence | integer | | |
result_attachment | character varying(100) | | not null |
taxes | jsonb | | not null |
myapp2_item_id | bigint | | |
source | character varying(8) | | not null |
user_id | bigint | | |
custom_actions | jsonb | | not null |
Indexes:
"myapp1_task_pkey" PRIMARY KEY, btree (id)
"myapp1_task_user_id_76a104e9" btree (user_id)
"myapp1_task_myapp2_item_idd_441d91cb" btree (myapp2_item_id)
"sequence_idx" btree (sequence DESC NULLS LAST)
"sequence_mc_idx" btree (sequence, myapp2_item_id DESC NULLS LAST)
Goals: for each myapp2_item_id, find the row with the highest sequence.
I added the last two indexes related to the sequence column.
Using Django ORM, I'm trying to filter a queryset, here's the code:
queryset = Task.objects.all()
sequences = queryset.filter(item=OuterRef("item")).exclude(sequence__isnull=True).order_by("-sequence").distinct().values("sequence")
max_sequences = sequences.annotate(max_seq=Max("sequence")).values("max_seq")[:1]
filtered_queryset = queryset.filter(sequence=Subquery(max_sequences))
print(filtered_queryset.query)
which translates that into this SQL statement. Note the subquery with group by and max aggregates:
SELECT "myapp1_task"."id"
FROM "myapp1_task"
LEFT OUTER JOIN "myapp2_item"
ON ("myapp1_task"."myapp2_item_id" = "myapp2_item"."id")
LEFT OUTER JOIN "myapp2_user" ON ("myapp2_item"."user_id" = "myapp2_user"."id")
LEFT OUTER JOIN "myapp2_category"
ON ("myapp2_item"."myapp2_category_id" = "myapp2_category"."id")
LEFT OUTER JOIN "myapp2_user" T5 ON ("myapp1_task"."user_id" = T5."id")
WHERE "myapp1_task"."sequence" = (SELECT "subquery"."max_seq"
FROM (
SELECT MAX(U0."sequence") AS "max_seq", U0."sequence"
FROM "myapp1_task" U0
WHERE (U0."myapp2_item_id" =
("myapp1_task"."myapp2_item_id"))
GROUP BY U0."sequence"
ORDER BY U0."sequence" DESC
LIMIT 1) subquery)
Sadly, it's very slow on a fairly large table (>1M rows). Inspecting the explain result, I got this -> seq scan on the subquery, so none of the new indexes are used:
Seq Scan on myapp1_task (cost=0.00..5525.25 rows=3 width=8)
Filter: (sequence = (SubPlan 1))
SubPlan 1
-> Subquery Scan on subquery (cost=8.30..8.33 rows=1 width=4)
-> Limit (cost=8.30..8.32 rows=1 width=8)
-> GroupAggregate (cost=8.30..8.32 rows=1 width=8)
Group Key: u0.sequence
-> Sort (cost=8.30..8.31 rows=1 width=4)
Sort Key: u0.sequence DESC
-> Index Scan using myapp1_task_myapp2_item_idd_441d91cb on myapp1_task u0 (cost=0.28..8.29 rows=1 width=4)
Index Cond: (myapp2_item_id = myapp1_task.myapp2_item_id)
Not sure what I'm doing wrong. How can this be improved?
Answer 1
Score: 2
Either you or your ORM (or both) have twisted and obfuscated the SQL statement to the point that any RDBMS would have a hard time distilling an efficient query plan from it. After stripping away a lot of unnecessary cruft, the (equivalent) statement reads:
SELECT t.id
FROM myapp1_task t
LEFT JOIN myapp2_item i ON t.myapp2_item_id = i.id
LEFT JOIN myapp2_user iu ON i.user_id = iu.id
LEFT JOIN myapp2_category c ON i.myapp2_category_id = c.id
LEFT JOIN myapp2_user tu ON t.user_id = tu.id
WHERE t.sequence = (
SELECT t1.sequence
FROM myapp1_task t1
WHERE t1.myapp2_item_id = t.myapp2_item_id
ORDER BY t1.sequence DESC
LIMIT 1
);
(In the original version, the GROUP BY and MAX were redundant noise.)
The correlated subquery in the WHERE clause filters myapp1_task - in a very expensive way - for rows whose sequence sorts first, in descending order, among rows with the same myapp2_item_id. Because of your peculiar query and table definition, it also eliminates any row where myapp2_item_id or sequence is null, or where any other row with the same myapp2_item_id and sequence IS NULL exists.
All the LEFT JOIN rows are just noise, because the SELECT list only returns myapp1_task.id. The only possible effect of those joins would be to multiply the row count if the left side has duplicates, which seems an unlikely intent.
Solution
You later added:
> Goals: for each myapp2_item_id, find the row with the highest sequence.
That still does not clarify how to handle null values, nor how to deal with duplicates, nor what exactly should be returned. All of which matters.
Assuming:
- The combination (myapp2_item_id, sequence) is actually UNIQUE.
- You want the highest non-null sequence.
- You want to return whole rows (SELECT *) - which is often wasteful (a trimmed-down variant is sketched below the query).
Then the query boils down to just:
SELECT DISTINCT ON (myapp2_item_id) *
FROM myapp1_task
ORDER BY myapp2_item_id, sequence DESC NULLS LAST;
See:
About DESC NULLS LAST:
- https://stackoverflow.com/questions/9510509/sort-by-column-asc-but-null-values-first/9511492#9511492
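Since the answer notes that returning whole rows is often wasteful, a trimmed-down variant returning only a few columns could look like this (a sketch; the column choice is just an example):
-- Return only the columns actually needed (illustrative selection):
SELECT DISTINCT ON (myapp2_item_id)
       id, myapp2_item_id, sequence
FROM   myapp1_task
ORDER  BY myapp2_item_id, sequence DESC NULLS LAST;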
Index
The best index for this query is a multicolumn index with matching sort order, leading with myapp2_item_id: (myapp2_item_id, sequence DESC NULLS LAST).
With only a few rows per myapp2_item_id value, the index may not help much - especially for SELECT *. A sequential scan can be just as fast or faster. The index becomes more useful as the number of rows per group grows. For large numbers of rows, special query techniques work better. See:
- https://stackoverflow.com/questions/66893968/select-distinct-is-slower-than-expected-on-my-table-in-postgresql/66894500#66894500
- https://stackoverflow.com/questions/25536422/optimize-group-by-query-to-retrieve-latest-record-per-user/25536748#25536748
- https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/34715134#34715134
PostgreSQL 16, expected in late 2023, will include a number of performance optimizations in this area.
English:
Either you or your ORM (or both) have twisted and obfuscated the SQL statement to a degree that any RDBMS would have a hard time distilling an efficient query plan from it. After removing a lot of cruft, the (equivalent) statement reads:
SELECT t.id
FROM myapp1_task t
LEFT JOIN myapp2_item i ON t.myapp2_item_id = i.id
LEFT JOIN myapp2_user iu ON i.user_id = iu.id
LEFT JOIN myapp2_category c ON i.myapp2_category_id = c.id
LEFT JOIN myapp2_user tu ON t.user_id = tu.id
WHERE t.sequence = (
SELECT t1.sequence
FROM myapp1_task t1
WHERE t1.myapp2_item_id = t.myapp2_item_id
ORDER BY t1.sequence DESC
LIMIT 1
);
(GROUP BY and MAX were useless noise in the original.)
The correlated subquery in the WHERE clause filters rows from myapp1_task where sequence sorts first in descending sort order among rows with the same myapp2_item_id - in a very expensive way. Due to your peculiar query and table definition, any rows are eliminated where myapp2_item_id or sequence is null, or where any other row with the same myapp2_item_id and sequence IS NULL exists.
All LEFT JOIN rows are just noise since the SELECT list only returns myapp1_task.id anyway. The only possible effect of those joins would be to multiply rows if the left side has duplicates, which seems an unlikely endeavor.
Solution
You later added:
> Goals: for each myapp2_item_id, find the row with the highest sequence.
Still does not clarify how to deal with null values. Nor how to deal with duplicates. Nor what to return exactly. All of which matters.
Assuming:
- The combination (myapp2_item_id, sequence) is actually UNIQUE.
- You want the highest not-null sequence.
- You want to return whole rows (SELECT *) - which is often wasteful nonsense.
Then the query boils down to just (!):
SELECT DISTINCT ON (myapp2_item_id) *
FROM myapp1_task
ORDER BY myapp2_item_id, sequence DESC NULLS LAST;
See:
About DESC NULLS LAST:
- https://stackoverflow.com/questions/9510509/sort-by-column-asc-but-null-values-first/9511492#9511492
Index
The best index for this query is a multicolumn index with matching sort order, leading with myapp2_item_id: (myapp2_item_id, sequence DESC NULLS LAST).
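For example, such an index could be created like this (a sketch only; the index name is just illustrative):
CREATE INDEX myapp1_task_item_seq_desc_idx
    ON myapp1_task (myapp2_item_id, sequence DESC NULLS LAST);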
For only a few rows per value in myapp2_item_id, the index will not help (much) - especially for SELECT *. A sequential scan can be as fast or faster. The index becomes more useful with a growing number of rows per group. For large numbers, special query techniques are superior (one such technique is sketched after the links below). See:
- https://stackoverflow.com/questions/66893968/select-distinct-is-slower-than-expected-on-my-table-in-postgresql/66894500#66894500
- https://stackoverflow.com/questions/25536422/optimize-group-by-query-to-retrieve-latest-record-per-user/25536748#25536748
- https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/34715134#34715134
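As an illustration of such a technique, here is a rough sketch of a "loose index scan" (an emulated index skip scan with a recursive CTE), adapted to the table above - my own adaptation, not part of the original answer. It relies on the (myapp2_item_id, sequence DESC NULLS LAST) index recommended above:
WITH RECURSIVE cte AS (
   (  -- lowest myapp2_item_id, together with its highest sequence
   SELECT id, myapp2_item_id, sequence
   FROM   myapp1_task
   WHERE  myapp2_item_id IS NOT NULL
   ORDER  BY myapp2_item_id, sequence DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT n.id, n.myapp2_item_id, n.sequence
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT t.id, t.myapp2_item_id, t.sequence
      FROM   myapp1_task t
      WHERE  t.myapp2_item_id > c.myapp2_item_id  -- skip to the next item
      ORDER  BY t.myapp2_item_id, t.sequence DESC NULLS LAST
      LIMIT  1
      ) n
   )
SELECT * FROM cte;
Each recursion step can be resolved from that index alone, so the cost grows with the number of distinct myapp2_item_id values rather than with the total row count.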
Postgres 16 will ship with a number of performance optimizations in this area, due late 2023.