可以使用PostgreSQL递归减少左连接中的行数吗?

huangapple go评论72阅读模式
英文:

Is it possible to reduce the number of rows in a left join using postgre recursion?

问题

I understand that you want a translation of the code portion you provided. Here's the translated code:

我有一个数据库查询,当我添加最后一个频道条件(**channel.code**)时,查询需要超过两分钟才能完成。主要的概念问题是,我只需要每页获取**10个订单**,而我在子查询中过滤订单表,然后仍然需要进行一些左连接操作来达到最终条件,此时我正在将500万个匹配订单与包含数百万记录的所有表连接。

我想知道是否可能在Postgres中以某种方式创建一个循环,例如,它可能会获取100个订单(根据表的条件进行初步筛选),然后仅在最后连接它们并满足条件,然后如果选定的100个订单中有合适的订单,则将它们添加到最终结果中。这可以实现吗?

所有索引已经存在。这是我想要优化的查询示例:

SELECT ord.billing_number,
ord.amount_total_sum,
ord.fare_sum,
ord.service_fee_sum,
ord.service_time_limit,
ord.tax_sum,
ord.currency,
ord.pos,
ord.updated,
ord.created,
ord.user_id,
ord.hidden,
ord.id,
ord.marker,
status.code as status,
status.title as status_title,
partner.code as code,
partner.fee as fee,
partner.fee_refund as fee_refund,
partner.config as config,
partner.project,
COALESCE(contract_parent_partner.code, parent_partner.code) as parent_code,
u.email as user_email
FROM (SELECT * FROM tol.order WHERE billing_number::char <> '4') ord
LEFT JOIN tol.service_cart service_cart ON ord.id = service_cart.order_id
INNER JOIN tol.ticket_avia_v2 ticket
ON service_cart.ticket_uid = ticket.id AND service_cart.service_table_name = 'vw_ticket_avia'
LEFT JOIN tol.user u ON u.id = ord.user_id
LEFT JOIN tol.partner partner ON ord.partner_id = partner.id
LEFT JOIN tol.status status ON ord.status_id = status.id
LEFT JOIN tol.order_channel order_channel ON ord.id = order_channel.order_id
LEFT JOIN tol.partner_contract partner_contract ON ord.partner_contract_id = partner_contract.id
LEFT JOIN tol.partner parent_partner ON parent_partner.id = partner.parent_id
LEFT JOIN tol.channel channel ON order_channel.channel_id = channel.id
LEFT JOIN tol.partner contract_parent_partner
ON contract_parent_partner.id = partner_contract.parent_partner_id
WHERE (EXISTS (SELECT 1
FROM tol.flight
WHERE tol.flight.ticket_uid = ticket.id
AND tol.flight.carrier_code = 'carrier_code'
ORDER BY tol.flight.id))
AND (channel.code = 'channel_code')
ORDER BY ord.id DESC
LIMIT 10;


尝试在连接中使用子查询,但由于我使用了左连接,它与当前查询产生不同的结果。

这是**EXPLAIN(ANALYZE,BUFFERS,SETTINGS)**的结果。

Limit (cost=3226993.73..3226993.74 rows=1 width=515) (actual time=1051.938..1055.143 rows=0 loops=1)
...
(省略部分内容)
...
Execution Time: 1078.301 ms


请注意,这是您提供的代码的翻译版本。如果需要进一步的解释或帮助,请告诉我。

<details>
<summary>英文:</summary>

I have a database query that takes over two minutes to complete when I add the last channel condition (**channel.code**) the main conceptual problem is that I only need to get **10 orders per page**, while I am filtering the orders table in a subquery, but then I still need to do a few left joins to arrive at the final condition, at which point I&#39;m joining 5 million matching orders with all tables that contain some several million records.

I want to know if it is possible to somehow make a cycle using Postgres, in which, for example, it would take, for example, 100 orders (preliminarily filtered by the conditions of the table) and only then connect them with the rest of the tables, and fulfill the conditions at the end, then if among the selected 100 orders were suitable, then he added them to the final result. This can be done?

All indexes are already there. Here is an example query that I want to optimize:

SELECT ord.billing_number,
ord.amount_total_sum,
ord.fare_sum,
ord.service_fee_sum,
ord.service_time_limit,
ord.tax_sum,
ord.currency,
ord.pos,
ord.updated,
ord.created,
ord.user_id,
ord.hidden,
ord.id,
ord.marker,
status.code as status,
status.title as status_title,
partner.code as code,
partner.fee as fee,
partner.fee_refund as fee_refund,
partner.config as config,
partner.project,
COALESCE(contract_parent_partner.code, parent_partner.code) as parent_code,
u.email as user_email
FROM (SELECT * FROM tol.order WHERE billing_number::char <> '4') ord
LEFT JOIN tol.service_cart service_cart ON ord.id = service_cart.order_id
INNER JOIN tol.ticket_avia_v2 ticket
ON service_cart.ticket_uid = ticket.id AND service_cart.service_table_name = 'vw_ticket_avia'
LEFT JOIN tol.user u ON u.id = ord.user_id
LEFT JOIN tol.partner partner ON ord.partner_id = partner.id
LEFT JOIN tol.status status ON ord.status_id = status.id
LEFT JOIN tol.order_channel order_channel ON ord.id = order_channel.order_id
LEFT JOIN tol.partner_contract partner_contract ON ord.partner_contract_id = partner_contract.id
LEFT JOIN tol.partner parent_partner ON parent_partner.id = partner.parent_id
LEFT JOIN tol.channel channel ON order_channel.channel_id = channel.id
LEFT JOIN tol.partner contract_parent_partner
ON contract_parent_partner.id = partner_contract.parent_partner_id
WHERE (EXISTS (SELECT 1
FROM tol.flight
WHERE tol.flight.ticket_uid = ticket.id
AND tol.flight.carrier_code = 'carrier_code'
ORDER BY tol.flight.id))
AND (channel.code = 'channel_code')
ORDER BY ord.id DESC
LIMIT 10;


Tried to use subqueries in joins, but since I use a left join, it gives a different result than the current query.
Here is **EXPLAIN (ANALYZE, BUFFERS, SETTINGS)**
Limit  (cost=3226993.73..3226993.74 rows=1 width=515) (actual time=1051.938..1055.143 rows=0 loops=1)
Buffers: shared hit=65 read=2
-&gt;  Sort  (cost=3226993.73..3226993.74 rows=1 width=515) (actual time=32.786..35.991 rows=0 loops=1)
&quot;        Sort Key: &quot;&quot;order&quot;&quot;.id DESC&quot;
Sort Method: quicksort  Memory: 25kB
Buffers: shared hit=65 read=2
-&gt;  Nested Loop Left Join  (cost=2038882.89..3226993.72 rows=1 width=515) (actual time=32.724..35.926 rows=0 loops=1)
Join Filter: (contract_parent_partner.id = partner_contract.parent_partner_id)
Buffers: shared hit=62 read=2
-&gt;  Nested Loop Left Join  (cost=2038882.89..3225588.32 rows=1 width=492) (actual time=32.723..35.924 rows=0 loops=1)
Join Filter: (parent_partner.id = partner.parent_id)
Buffers: shared hit=62 read=2
-&gt;  Gather  (cost=2038882.89..3224182.92 rows=1 width=491) (actual time=32.722..35.922 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=62 read=2
-&gt;  Nested Loop Semi Join  (cost=2037882.89..3223182.82 rows=1 width=491) (actual time=0.723..0.728 rows=0 loops=3)
Buffers: shared hit=62 read=2
-&gt;  Hash Join  (cost=2037812.20..3215337.24 rows=105 width=523) (actual time=0.723..0.726 rows=0 loops=3)
Hash Cond: (order_channel.channel_id = channel.id)
Buffers: shared hit=62 read=2
-&gt;  Hash Left Join  (cost=2037803.88..3206741.25 rows=3271051 width=527) (never executed)
&quot;                                            Hash Cond: (&quot;&quot;order&quot;&quot;.partner_contract_id = partner_contract.id)&quot;
-&gt;  Hash Left Join  (cost=2037462.25..3197812.02 rows=3271051 width=527) (never executed)
&quot;                                                  Hash Cond: (&quot;&quot;order&quot;&quot;.status_id = status.id)&quot;
-&gt;  Parallel Hash Left Join  (cost=2037460.83..3187308.80 rows=3271051 width=475) (never executed)
&quot;                                                        Hash Cond: (&quot;&quot;order&quot;&quot;.partner_id = partner.id)&quot;
-&gt;  Parallel Hash Left Join  (cost=1941872.95..3083133.07 rows=3271051 width=166) (never executed)
&quot;                                                              Hash Cond: (&quot;&quot;order&quot;&quot;.user_id = u.id)&quot;
-&gt;  Parallel Hash Join  (cost=1646731.87..2615342.48 rows=3271051 width=152) (never executed)
&quot;                                                                    Hash Cond: (service_cart.order_id = &quot;&quot;order&quot;&quot;.id)&quot;
-&gt;  Parallel Hash Join  (cost=582600.12..1337818.87 rows=5341282 width=44) (never executed)
Hash Cond: (service_cart.order_id = order_channel.order_id)
-&gt;  Parallel Hash Join  (cost=402211.21..1040928.08 rows=5341282 width=36) (never executed)
Hash Cond: (ticket.id = service_cart.ticket_uid)
-&gt;  Parallel Index Only Scan using ticket_avia_v2_id on ticket_avia_v2 ticket  (cost=0.56..521205.61 rows=6446412 width=16) (never executed)
Heap Fetches: 0
-&gt;  Parallel Hash  (cost=284286.30..284286.30 rows=6423068 width=20) (never executed)
-&gt;  Parallel Seq Scan on service_cart  (cost=0.00..284286.30 rows=6423068 width=20) (never executed)
Filter: (service_table_name = &#39;vw_ticket_avia&#39;::text)
-&gt;  Parallel Hash  (cost=100494.85..100494.85 rows=4869685 width=8) (never executed)
-&gt;  Parallel Seq Scan on order_channel  (cost=0.00..100494.85 rows=4869685 width=8) (never executed)
-&gt;  Parallel Hash  (cost=883642.55..883642.55 rows=6000656 width=116) (never executed)
&quot;                                                                          -&gt;  Parallel Seq Scan on &quot;&quot;order&quot;&quot;  (cost=0.00..883642.55 rows=6000656 width=116) (never executed)&quot;
Filter: ((billing_number)::character(1) &lt;&gt; &#39;4&#39;::bpchar)
-&gt;  Parallel Hash  (cost=221481.48..221481.48 rows=4012048 width=18) (never executed)
&quot;                                                                    -&gt;  Parallel Seq Scan on &quot;&quot;user&quot;&quot; u  (cost=0.00..221481.48 rows=4012048 width=18) (never executed)&quot;
-&gt;  Parallel Hash  (cost=95389.05..95389.05 rows=15906 width=321) (never executed)
-&gt;  Parallel Index Scan using partner_parent_index on partner  (cost=0.29..95389.05 rows=15906 width=321) (never executed)
-&gt;  Hash  (cost=1.19..1.19 rows=19 width=60) (never executed)
-&gt;  Seq Scan on status  (cost=0.00..1.19 rows=19 width=60) (never executed)
-&gt;  Hash  (cost=243.50..243.50 rows=7850 width=8) (never executed)
-&gt;  Seq Scan on partner_contract  (cost=0.00..243.50 rows=7850 width=8) (never executed)
-&gt;  Hash  (cost=8.30..8.30 rows=1 width=4) (actual time=0.484..0.484 rows=0 loops=3)
Buckets: 1024  Batches: 1  Memory Usage: 8kB
Buffers: shared hit=6 read=2
-&gt;  Index Scan using uniq_6cae9c8d77153098 on channel  (cost=0.29..8.30 rows=1 width=4) (actual time=0.483..0.483 rows=0 loops=3)
Index Cond: (code = &#39;channel_code&#39;::text)
Buffers: shared hit=6 read=2
-&gt;  Bitmap Heap Scan on flight  (cost=70.69..74.71 rows=1 width=16) (never executed)
Recheck Cond: ((ticket_uid = service_cart.ticket_uid) AND (carrier_code = &#39;carrier_code&#39;::text))
-&gt;  BitmapAnd  (cost=70.69..70.69 rows=1 width=0) (never executed)
-&gt;  Bitmap Index Scan on flight_ticket_idx  (cost=0.00..4.35 rows=36 width=0) (never executed)
Index Cond: (ticket_uid = service_cart.ticket_uid)
-&gt;  Bitmap Index Scan on idx_flight_carrier_code  (cost=0.00..65.28 rows=3295 width=0) (never executed)
Index Cond: (carrier_code = &#39;carrier_code&#39;::text)
-&gt;  Seq Scan on partner parent_partner  (cost=0.00..1067.40 rows=27040 width=9) (never executed)
-&gt;  Seq Scan on partner contract_parent_partner  (cost=0.00..1067.40 rows=27040 width=9) (never executed)
Planning Time: 48.168 ms
JIT:
Functions: 228
&quot;  Options: Inlining true, Optimization true, Expressions true, Deforming true&quot;
&quot;  Timing: Generation 23.199 ms, Inlining 51.477 ms, Optimization 580.442 ms, Emission 386.265 ms, Total 1041.384 ms&quot;
Execution Time: 1078.301 ms
</details>
# 答案1
**得分**: 1
以下是您要翻译的内容:
### 您所要求的
... 当然是可能的。构建一个函数来循环处理已排序行的补丁,或者数值范围以强制执行某种“查询计划”,Postgres在这方面存在盲点。一个复杂的示例:
- [空间索引是否能帮助“范围 - 按顺序 - 限制”查询][1]
往往更好的解决方案是获取服务器配置、数据模型和列统计信息。如果查询驱动查询的特定索引是最快的方法,Postgres *应该*自行到达该查询计划。
... 但对于给定情况没有任何进展。您添加的查询计划显示`rows=0`。未找到任何符合条件的行。对于带有`ORDER BY`和小的`LIMIT`的查询,如果查询计划一直按照排序顺序遍历直到找到足够的行,通常是最糟糕的情况。您强迫Postgres在100行的小补丁中执行此操作只会使情况变得更糟。
### 服务器配置
您的查询计划不显示任何非标准设置(当使用`SETTINGS`选项调用`EXPLAIN`时会显示)。如果您的输入是可靠的,这意味着您正在运行带有默认设置的Postgres,这仅适用于非常基本的需求。从服务器配置开始。以下是一些入门指南:
- [https://wiki.postgresql.org/wiki/Slow_Query_Questions](https://wiki.postgresql.org/wiki/Slow_Query_Questions)
### 一般改进
您正在连接11个表格(不清楚您是否实际上需要全部参与)。这超出了[`join_collapse_limit`][2]的默认设置。因此,按照编写的连接顺序变得重要。请参见:
- [https://stackoverflow.com/questions/36343860/why-does-a-slight-change-in-the-search-term-slow-down-the-query-so-much/36347677#36347677](https://stackoverflow.com/questions/36343860/why-does-a-slight-change-in-the-search-term-slow-down-the-query-so-much/36347677#36347677)
将实际过滤行的连接移到前面,并将将行数乘以的连接移到最后。这可能会大大提高性能:
```pgsql
SELECT ord.billing_number
--  , ...(许多字段)
FROM   tol.order              ord
JOIN   tol.order_channel               ON order_channel.order_id = ord.id
JOIN   tol.channel                     ON channel.id = order_channel.channel_id  -- 移到顶部 !!
JOIN   tol.service_cart                ON service_cart.order_id = ord.id
JOIN   tol.ticket_avia_v2     ticket   ON ticket.id = service_cart.ticket_uid
AND service_cart.service_table_name = 'vw_ticket_avia'
LEFT   JOIN tol.user          u        ON u.id = ord.user_id
LEFT   JOIN tol.status                 ON status.id = ord.status_id
LEFT   JOIN tol.partner                ON partner.id = ord.partner_id
LEFT   JOIN tol.partner       parent_partner ON parent_partner.id = partner.parent_id
LEFT   JOIN tol.partner_contract       ON partner_contract.id = ord.partner_contract_id
LEFT   JOIN tol.partner       contract_parent_partner ON contract_parent_partner.id = partner_contract.parent_partner_id
WHERE  ord.billing_number ^@ '4'  -- “以...开始”操作符 - 更有效率
AND    EXISTS (
SELECT FROM tol.flight f
WHERE  f.ticket_uid = ticket.id
AND    f.carrier_code = 'carrier_code'
)
AND    channel.code = 'channel_code'
ORDER  BY ord.id DESC
LIMIT  10;

此外,某些LEFT JOIN的实例实际上是[INNER] JOIN。请参见:

通常情况下这并不重要,Postgres无论如何都会做正确的事情。但对于手动优化连接顺序可能很重要。请参见:

假设order.billing_number是字符串类型,请将billing_number::char <> '4') ord替换为ord.billing_number ^@ '4',通常更便宜,并允许使用带有COLLATE "C"的普通索引,例如:

CREATE INDEX text_b ON tol.order (billing_number COLLATE "C");

请参见:

如果它是不同的数据类型,请以不同的方式进行优化...

EXISTS子查询表达式中使用ORDER BY没有意义。我删除了它。

(至少)与contract_parent_partner的连接似乎会导致行数增加。是这样吗?在这种情况下,您是否希望返回符合条件的订单的多行?

如果所有这些仍然无法获得可接受的性能,那么某种递归或循环可能会有所帮助(大大填补您的小的LIMIT 10 - 用于实际找到行的查询)。或者您可以优化一些索引并更新/改进统计信息以获得不同的查询计划。这超出了在SO上的一个问题的范围。

英文:

What you ask for

... is possible, of course. Build a function to loop over patches of sorted rows, or ranges of values to force a certain "query plan", where Postgres has a blind spot. One sophisticated example:

More often than not, the better solution is to get server configuration, data model and column statistics straight. If one particular index driving the query is the fastest approach, Postgres should arrive at that query plan all by itself.

... but that's leading nowhere for the given case. Your added query plan reveals rows=0. No qualifying rows are found. That's often a worst case scenario for a query with ORDER BY and a small LIMIT - if the query plan ends up walking along the sort order until enough rows are found. Your idea of forcing Postgres to do just that in small patches of 100 rows would only make it worse.

Server configuration

Your query plan does not show any non-standard settings (which it would when EXPLAIN is called with the SETTINGS option. If your input is reliable, that means you are running Postgres with default settings, which is only good for very basic needs. Start with server configuration. Here are some starters:

General improvements

You are joining 11 tables (and it's unclear if you actually need to involve all of them). That's beyond the default setting for join_collapse_limit. So the order of joins as written becomes relevant. See:

Move joins that actually filter rows to the front, and move joins that multiply rows to the end. That might already improve performance a lot:

SELECT ord.billing_number
--  , ... (a lot of fields)
FROM   tol.order              ord
JOIN   tol.order_channel               ON order_channel.order_id = ord.id
JOIN   tol.channel                     ON channel.id = order_channel.channel_id  -- move to the top !!
JOIN   tol.service_cart                ON service_cart.order_id = ord.id
JOIN   tol.ticket_avia_v2     ticket   ON ticket.id = service_cart.ticket_uid
AND service_cart.service_table_name = &#39;vw_ticket_avia&#39;
LEFT   JOIN tol.user          u        ON u.id = ord.user_id
LEFT   JOIN tol.status                 ON status.id = ord.status_id
LEFT   JOIN tol.partner                ON partner.id = ord.partner_id
LEFT   JOIN tol.partner       parent_partner ON parent_partner.id = partner.parent_id
LEFT   JOIN tol.partner_contract       ON partner_contract.id = ord.partner_contract_id
LEFT   JOIN tol.partner       contract_parent_partner ON contract_parent_partner.id = partner_contract.parent_partner_id
WHERE  ord.billing_number ^@ &#39;4&#39;  -- &quot;starts with&quot; operator - more efficient
AND    EXISTS (
SELECT FROM tol.flight f
WHERE  f.ticket_uid = ticket.id
AND    f.carrier_code = &#39;carrier_code&#39;
)
AND    channel.code = &#39;channel_code&#39;
ORDER  BY ord.id DESC
LIMIT  10;

Also, some instances of LEFT JOIN are effectively [INNER] JOIN. See:

That does not normally matter much, Postgres does the right thing either way. But it might matter for manually optimizing the sequence of joins. See:

Assuming order.billing_number is a string type, replace billing_number::char &lt;&gt; &#39;4&#39;) ord with ord.billing_number ^@ &#39;4&#39;, that's generally much cheaper, and allows using a plain index with COLLATE &quot;C&quot; Like:

CREATE INDEX text_b ON tol.order (billing_number COLLATE &quot;C&quot;);

See:

If it's a different data type, optimize in a different way ...

ORDER BY makes no sense in an EXISTS subquery expression. I dropped that.

(At least) the joins to contract_parent_partner looks like it might multiply rows. Is that so? And do you want to return multiple rows per qualifying orders in that case?

If all of this still does not get you acceptable performance then, yes, some kind of recursion or loop might help (a lot) to fill your tiny LIMIT 10 - for queries that actually do find rows. Or you can optimize some indexes and update / improve statistics to get a different query plan. That's beyond the scope of a question here on SO.

huangapple
  • 本文由 发表于 2023年5月7日 12:32:49
  • 转载请务必保留本文链接:https://go.coder-hub.com/76192205.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定