English:
Subquery with GROUP BY and MAX won't use index
Question
In PostgreSQL 15.3 (corresponding to a Django model), you have a table named "public.myapp1_task" along with some related code. Your goal is, for each "myapp2_item_id", to find the row with the highest sequence. You added the last two indexes, which relate to the "sequence" column. You are using the Django ORM to filter a queryset.
Here is your SQL query:
SELECT "myapp1_task"."id"
FROM "myapp1_task"
LEFT OUTER JOIN "myapp2_item"
ON ("myapp1_task"."myapp2_item_id" = "myapp2_item"."id")
LEFT OUTER JOIN "myapp2_user" ON ("myapp2_item"."user_id" = "myapp2_user"."id")
LEFT OUTER JOIN "myapp2_category"
ON ("myapp2_item"."myapp2_category_id" = "myapp2_category"."id")
LEFT OUTER JOIN "myapp2_user" T5 ON ("myapp1_task"."user_id" = T5."id")
WHERE "myapp1_task"."sequence" = (SELECT "subquery"."max_seq"
FROM (
SELECT MAX(U0."sequence") AS "max_seq", U0."sequence"
FROM "myapp1_task" U0
WHERE (U0."myapp2_item_id" = ("myapp1_task"."myapp2_item_id"))
GROUP BY U0."sequence"
ORDER BY U0."sequence" DESC
LIMIT 1) subquery)
The problem you are running into is that this query is very slow on a larger table, and there is a "seq scan" on the subquery. Here are some optimization suggestions:
Index optimization:
- Make sure the columns used by the query on the "myapp1_task" table are properly indexed, in particular the "myapp2_item_id" and "sequence" columns.
- Add a composite index on the combination of "myapp2_item_id" and "sequence" on "myapp1_task" to speed up the subquery (see the sketch after this list).
Query optimization:
- Try replacing the "=" subquery with an "EXISTS"-style subquery, which is usually more efficient (also sketched below).
- You can try moving the "ORDER BY" clause from the subquery into the main query so the database can plan the query better.
Analyzing the execution plan:
- Use the EXPLAIN ANALYZE command to analyze the query's execution plan and find out which part performs poorly. This helps you pin down the root cause and choose the right optimization.
Hardware and configuration:
- Make sure the database server's hardware and configuration are sufficient to handle complex queries on large tables. You may need to consider upgrading the hardware or tuning database configuration parameters.
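As a rough illustration of the index, EXISTS and EXPLAIN ANALYZE suggestions above (a sketch only: the index name is made up, and the NOT EXISTS rewrite is just one possible way to express "highest sequence per item"):
-- Composite index covering the correlated lookup (illustrative name):
CREATE INDEX myapp1_task_item_seq_idx
    ON myapp1_task (myapp2_item_id, sequence);

-- An EXISTS-style anti-join: keep a row only if no other row of the same
-- item has a higher sequence (assumes NULL sequences should be skipped):
SELECT t.id
FROM   myapp1_task t
WHERE  t.sequence IS NOT NULL
AND    NOT EXISTS (
         SELECT 1
         FROM   myapp1_task x
         WHERE  x.myapp2_item_id = t.myapp2_item_id
         AND    x.sequence > t.sequence
       );

-- Re-check the plan, with actual run times, after adding the index:
EXPLAIN ANALYZE
SELECT t.id
FROM   myapp1_task t
WHERE  t.sequence = (SELECT max(sequence)
                     FROM   myapp1_task
                     WHERE  myapp2_item_id = t.myapp2_item_id);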
Note that query performance optimization is a complex process that needs to be adjusted and tested for your specific situation. Before making any changes, be sure to back up the database and test outside of production. If the problem persists, consider working with a database administrator or a performance specialist for more detailed support.
English:
I have this table in PostgreSQL 15.3 (corresponding to a Django model):
Table "public.myapp1_task"
Column | Type | Collation | Nullable | Default
-------------------------+--------------------------+-----------+----------+-------------------------------------------------
id | bigint | | not null | nextval('myapp1_task_id_seq'::regclass)
created_at | timestamp with time zone | | not null |
updated_at | timestamp with time zone | | not null |
kind | character varying(12) | | not null |
status | character varying(12) | | not null |
environment | character varying(7) | | not null |
data | jsonb | | not null |
result | jsonb | | not null |
sent_at | timestamp with time zone | | |
response_at | timestamp with time zone | | |
priority | smallint | | not null |
sequence | integer | | |
result_attachment | character varying(100) | | not null |
taxes | jsonb | | not null |
myapp2_item_id | bigint | | |
source | character varying(8) | | not null |
user_id | bigint | | |
custom_actions | jsonb | | not null |
Indexes:
"myapp1_task_pkey" PRIMARY KEY, btree (id)
"myapp1_task_user_id_76a104e9" btree (user_id)
"myapp1_task_myapp2_item_idd_441d91cb" btree (myapp2_item_id)
"sequence_idx" btree (sequence DESC NULLS LAST)
"sequence_mc_idx" btree (sequence, myapp2_item_id DESC NULLS LAST)
Goals: for each myapp2_item_id, find the row with the highest sequence.
I added the last two indexes related to the sequence column.
Using Django ORM, I'm trying to filter a queryset, here's the code:
queryset = Task.objects.all()
sequences = queryset.filter(item=OuterRef("item")).exclude(sequence__isnull=True).order_by("-sequence").distinct().values("sequence")
max_sequences = sequences.annotate(max_seq=Max("sequence")).values("max_seq")[:1]
filtered_queryset = queryset.filter(sequence=Subquery(max_sequences))
print(filtered_queryset.query)
which translates that into this SQL statement. Note the subquery with group by and max aggregates:
SELECT "myapp1_task"."id"
FROM "myapp1_task"
LEFT OUTER JOIN "myapp2_item"
ON ("myapp1_task"."myapp2_item_id" = "myapp2_item"."id")
LEFT OUTER JOIN "myapp2_user" ON ("myapp2_item"."user_id" = "myapp2_user"."id")
LEFT OUTER JOIN "myapp2_category"
ON ("myapp2_item"."myapp2_category_id" = "myapp2_category"."id")
LEFT OUTER JOIN "myapp2_user" T5 ON ("myapp1_task"."user_id" = T5."id")
WHERE "myapp1_task"."sequence" = (SELECT "subquery"."max_seq"
FROM (
SELECT MAX(U0."sequence") AS "max_seq", U0."sequence"
FROM "myapp1_task" U0
WHERE (U0."myapp2_item_id" =
("myapp1_task"."myapp2_item_id"))
GROUP BY U0."sequence"
ORDER BY U0."sequence" DESC
LIMIT 1) subquery)
Sadly, it's very slow on a fairly large table (>1M rows). Inspecting the explain result, I got this -> seq scan on the subquery, so none of the new indexes are used:
Seq Scan on myapp1_task (cost=0.00..5525.25 rows=3 width=8)
Filter: (sequence = (SubPlan 1))
SubPlan 1
-> Subquery Scan on subquery (cost=8.30..8.33 rows=1 width=4)
-> Limit (cost=8.30..8.32 rows=1 width=8)
-> GroupAggregate (cost=8.30..8.32 rows=1 width=8)
Group Key: u0.sequence
-> Sort (cost=8.30..8.31 rows=1 width=4)
Sort Key: u0.sequence DESC
-> Index Scan using myapp1_task_myapp2_item_idd_441d91cb on myapp1_task u0 (cost=0.28..8.29 rows=1 width=4)
Index Cond: (myapp2_item_id = myapp1_task.myapp2_item_id)
Not sure what I'm doing wrong. How can this be improved?
Answer 1
Score: 2
Either you or your ORM (or both) have twisted and obfuscated the SQL statement to the point that any RDBMS would have a hard time distilling an efficient query plan from it. After stripping away a lot of unnecessary cruft, the (equivalent) statement reads:
SELECT t.id
FROM myapp1_task t
LEFT JOIN myapp2_item i ON t.myapp2_item_id = i.id
LEFT JOIN myapp2_user iu ON i.user_id = iu.id
LEFT JOIN myapp2_category c ON i.myapp2_category_id = c.id
LEFT JOIN myapp2_user tu ON t.user_id = tu.id
WHERE t.sequence = (
SELECT t1.sequence
FROM myapp1_task t1
WHERE t1.myapp2_item_id = t.myapp2_item_id
ORDER BY t1.sequence DESC
LIMIT 1
);
(In the original version, the GROUP BY and MAX were redundant noise.)
The correlated subquery in the WHERE clause filters myapp1_task - in a very expensive way - for rows whose sequence sorts first, in descending order, among rows with the same myapp2_item_id. Because of your peculiar query and table definition, it also eliminates any row where myapp2_item_id or sequence is null, or where any other row with the same myapp2_item_id and sequence IS NULL exists.
All the LEFT JOIN rows are just noise, because the SELECT list only returns myapp1_task.id. The only possible effect of those joins would be to multiply the row count if the left side has duplicates, which seems an unlikely intent.
Solution
You later added:
> Goals: for each myapp2_item_id, find the row with the highest sequence.
That still does not clarify how to handle null values, nor how to deal with duplicates, nor what exactly should be returned. All of which matters.
Assuming:
- The combination (myapp2_item_id, sequence) is actually UNIQUE.
- You want the highest non-null sequence.
- You want to return whole rows (SELECT *) - which is often wasteful (a trimmed-down variant is sketched below the query).
Then the query boils down to just:
SELECT DISTINCT ON (myapp2_item_id) *
FROM myapp1_task
ORDER BY myapp2_item_id, sequence DESC NULLS LAST;
See:
About DESC NULLS LAST:
- https://stackoverflow.com/questions/9510509/sort-by-column-asc-but-null-values-first/9511492#9511492
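Since the answer notes that returning whole rows is often wasteful, a trimmed-down variant returning only a few columns could look like this (a sketch; the column choice is just an example):
-- Return only the columns actually needed (illustrative selection):
SELECT DISTINCT ON (myapp2_item_id)
       id, myapp2_item_id, sequence
FROM   myapp1_task
ORDER  BY myapp2_item_id, sequence DESC NULLS LAST;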
Index
The best index for this query is a multicolumn index with matching sort order, leading with myapp2_item_id: (myapp2_item_id, sequence DESC NULLS LAST).
With only a few rows per myapp2_item_id value, the index may not help much - especially for SELECT *. A sequential scan can be just as fast or faster. The index becomes more useful as the number of rows per group grows. For large numbers of rows, special query techniques work better. See:
- https://stackoverflow.com/questions/66893968/select-distinct-is-slower-than-expected-on-my-table-in-postgresql/66894500#66894500
- https://stackoverflow.com/questions/25536422/optimize-group-by-query-to-retrieve-latest-record-per-user/25536748#25536748
- https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/34715134#34715134
PostgreSQL 16, expected in late 2023, will include a number of performance optimizations in this area.
English:
Either you or your ORM (or both) have twisted and obfuscated the SQL statement to a degree that any RDBMS would have a hard time distilling an efficient query plan from it. After removing a lot of cruft, the (equivalent) statement reads:
SELECT t.id
FROM myapp1_task t
LEFT JOIN myapp2_item i ON t.myapp2_item_id = i.id
LEFT JOIN myapp2_user iu ON i.user_id = iu.id
LEFT JOIN myapp2_category c ON i.myapp2_category_id = c.id
LEFT JOIN myapp2_user tu ON t.user_id = tu.id
WHERE t.sequence = (
SELECT t1.sequence
FROM myapp1_task t1
WHERE t1.myapp2_item_id = t.myapp2_item_id
ORDER BY t1.sequence DESC
LIMIT 1
);
(GROUP BY and MAX were useless noise in the original.)
The correlated subquery in the WHERE clause filters rows from myapp1_task where sequence sorts first in descending sort order among rows with the same myapp2_item_id - in a very expensive way. Due to your peculiar query and table definition, any rows are eliminated where myapp2_item_id or sequence is null, or where any other row with the same myapp2_item_id and sequence IS NULL exists.
All LEFT JOIN rows are just noise since the SELECT list only returns myapp1_task.id anyway. The only possible effect of those joins would be to multiply rows if the left side has duplicates, which seems an unlikely endeavor.
Solution
You later added:
> Goals: for each myapp2_item_id, find the row with the highest sequence.
Still does not clarify how to deal with null values. Nor how to deal with duplicates. Nor what to return exactly. All of which matters.
Assuming:
- The combination (myapp2_item_id, sequence) is actually UNIQUE.
- You want the highest not-null sequence.
- You want to return whole rows (SELECT *) - which is often wasteful nonsense.
Then the query boils down to just (!):
SELECT DISTINCT ON (myapp2_item_id) *
FROM myapp1_task
ORDER BY myapp2_item_id, sequence DESC NULLS LAST;
See:
About DESC NULLS LAST:
- https://stackoverflow.com/questions/9510509/sort-by-column-asc-but-null-values-first/9511492#9511492
Index
The best index for this query is a multicolumn index with matching sort order, leading with myapp2_item_id: (myapp2_item_id, sequence DESC NULLS LAST).
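For example, such an index could be created like this (a sketch only; the index name is just illustrative):
CREATE INDEX myapp1_task_item_seq_desc_idx
    ON myapp1_task (myapp2_item_id, sequence DESC NULLS LAST);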
For only a few rows per value in myapp2_item_id, the index will not help (much) - especially for SELECT *. A sequential scan can be as fast or faster. The index becomes more useful with a growing number of rows per group. For large numbers, special query techniques are superior (one such technique is sketched after the links below). See:
- https://stackoverflow.com/questions/66893968/select-distinct-is-slower-than-expected-on-my-table-in-postgresql/66894500#66894500
- https://stackoverflow.com/questions/25536422/optimize-group-by-query-to-retrieve-latest-record-per-user/25536748#25536748
- https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/34715134#34715134
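As an illustration of such a technique, here is a rough sketch of a "loose index scan" (an emulated index skip scan with a recursive CTE), adapted to the table above - my own adaptation, not part of the original answer. It relies on the (myapp2_item_id, sequence DESC NULLS LAST) index recommended above:
WITH RECURSIVE cte AS (
   (  -- lowest myapp2_item_id, together with its highest sequence
   SELECT id, myapp2_item_id, sequence
   FROM   myapp1_task
   WHERE  myapp2_item_id IS NOT NULL
   ORDER  BY myapp2_item_id, sequence DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT n.id, n.myapp2_item_id, n.sequence
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT t.id, t.myapp2_item_id, t.sequence
      FROM   myapp1_task t
      WHERE  t.myapp2_item_id > c.myapp2_item_id  -- skip to the next item
      ORDER  BY t.myapp2_item_id, t.sequence DESC NULLS LAST
      LIMIT  1
      ) n
   )
SELECT * FROM cte;
Each recursion step can be resolved from that index alone, so the cost grows with the number of distinct myapp2_item_id values rather than with the total row count.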
Postgres 16 will ship with a number of performance optimizations in this area, due late 2023.