2023年6月2日 03:54:14go评论89阅读模式

英文:

Optimizing joins in a postgres/postgis query

问题

我正在尝试优化我的Postgres查询。我在这里的一些连接方面遇到了问题。我的主要问题涉及到过滤条件h.type='inNetwork'和我的几何搜索ST_Intersects(ST_MakeValid(ser.boundaries)::geography, ST_MakeValid(ST_SetSRID(ST_GeomFromGeoJson(<INSERT_GEOMETRY_JSON>))。关于这个特定的过滤条件似乎会使搜索时间增加约10倍。其他过滤条件似乎对搜索速度没有太大影响。另外需要注意的是，还有一些有条件地使用的其他过滤条件，这就是为什么这里的一些连接表似乎无关紧要。

涉及的查询：

SELECT DISTINCT r.id, r.profitability
FROM rolloff_pricing as r
LEFT JOIN service_areas ser on r.service_area_id = ser.id
LEFT JOIN sizes as s on r.size_id = s.id
LEFT JOIN sizes as sa on r.sell_as = sa.id
LEFT JOIN waste_types w on w.id = r.waste_type_id
LEFT JOIN regions reg on reg.id = ser.region_id
LEFT JOIN haulers h on h.id = reg.hauler_id
LEFT JOIN current_availability ca on ca.region_id = reg.id
LEFT JOIN regions_availability ra on ra.region_id = reg.id
LEFT JOIN current_availability_new_deliveries cand on ca.id = cand.current_availability_id and r.size_id = cand.size_id
LEFT JOIN exceptions ex on ex.region_id = reg.id
WHERE ser.active is true
and ST_Intersects(ST_MakeValid(ser.boundaries)::geography, ST_MakeValid(ST_SetSRID(ST_GeomFromGeoJson('{
    "type": "POINT",
    "coordinates": [
        "-95.3595563",
        "29.7634871"
    ]
}'),4326))::geography)
and h.active is true
and ra.delivery_type='newDeliveries'
and h.type='inNetwork'
GROUP BY r.id ORDER BY profitability desc OFFSET 0 ROWS FETCH NEXT 8 ROWS ONLY

这里发生的情况，我理解为我的过滤条件带来了额外的行（h表的绝大多数行都满足条件h.type='inNetwork'），这导致了我的几何查询针对比预期更大的数据集运行。

我尝试过的一些方法是将几何查询放入子查询中（因为几何查询本身实际上运行得相当快），以获取一组可以在where in子句中使用的r.id。不过这似乎效果不佳。以下是我的修改后的查询，但也太慢了：

SELECT DISTINCT r.id, r.profitability
FROM rolloff_pricing as r
LEFT JOIN service_areas ser on r.service_area_id = ser.id
LEFT JOIN sizes as s on r.size_id = s.id
LEFT JOIN sizes as sa on r.sell_as = sa.id
RIGHT JOIN waste_types w on w.id = r.waste_type_id
RIGHT JOIN regions reg on reg.id = ser.region_id
RIGHT JOIN haulers h on h.id = ser.hauler_id
RIGHT JOIN current_availability ca on ca.region_id = reg.id
RIGHT JOIN regions_availability ra on ra.region_id = reg.id
LEFT JOIN current_availability_new_deliveries cand on ca.id = cand.current_availability_id and r.size_id = cand.size_id
RIGHT JOIN exceptions ex on ex.region_id = reg.id
WHERE r.id in (
    select r2.id
    from rolloff_pricing as r2
    LEFT JOIN service_areas ser2 on r2.service_area_id = ser2.id
    WHERE
    ST_Intersects(ST_MakeValid(ser2.boundaries)::geography, ST_MakeValid(ST_SetSRID(ST_GeomFromGeoJson('{
        "type": "POINT",
        "coordinates": [
            "-95.3595563",
            "29.7634871"
        ]
    }'),4326))::geography)
    and ser2.active is true
)
and h.active is true
and ra.delivery_type='newDeliveries'
and h.type='inNetwork'
GROUP BY r.id ORDER BY profitability desc OFFSET 0 ROWS FETCH NEXT 8 ROWS ONLY

这个子查询本身非常快速，但如果将子查询的结果替换到主查询中，整体性能依然很慢。因此，我不确定如何进一步解决这个问题。我的下一个猜测是运行一个完全独立的查询来获取r.id，然后将它们传递给“主”查询。

可能相关的信息：这个查询是在基于Eloquent的API中生成和执行的。

我应该如何继续改进查询性能？

英文:

I'm trying to optimize my Postgres query. I'm running into problems with some of the joins here. My main issue is around the filter h.type='inNetwork' and my geometry search ST_Intersects(ST_MakeValid(ser.boundaries)::geography, ST_MakeValid(ST_SetSRID(ST_GeomFromGeoJson(<INSERT_GEOMETRY_JSON>)). Something about that specific filter increase the search time by ~10x. The other filters don't seem to have much of an effect on the search speed. As a side note, there are additional filters that are used conditionally and that's why some of the join tables here seem irrelevant.

Query in question:

SELECT DISTINCT r.id, r.profitability
FROM rolloff_pricing as r
LEFT JOIN service_areas ser on r.service_area_id = ser.id
LEFT JOIN sizes as s on r.size_id = s.id
LEFT JOIN sizes as sa on r.sell_as = sa.id
LEFT JOIN waste_types w on w.id = r.waste_type_id
LEFT JOIN regions reg on reg.id = ser.region_id
LEFT JOIN haulers h on h.id = reg.hauler_id
LEFT JOIN current_availability ca on ca.region_id = reg.id
LEFT JOIN regions_availability ra on ra.region_id = reg.id
LEFT JOIN current_availability_new_deliveries cand on ca.id = cand.current_availability_id and r.size_id = cand.size_id
LEFT JOIN exceptions ex on ex.region_id = reg.id
WHERE ser.active is true
and ST_Intersects(ST_MakeValid(ser.boundaries)::geography, ST_MakeValid(ST_SetSRID(ST_GeomFromGeoJson(&#39;{
    &quot;type&quot;: &quot;POINT&quot;,
    &quot;coordinates&quot;: [
        &quot;-95.3595563&quot;,
        &quot;29.7634871&quot;
    ]
}&#39;),4326))::geography)
and h.active is true
and ra.delivery_type=&#39;newDeliveries&#39;
and h.type=&#39;inNetwork&#39;
GROUP BY r.id ORDER BY profitability desc OFFSET 0 ROWS FETCH NEXT 8 ROWS ONLY

Here is the EXPLAIN (ANALYZE, BUFFERS):

Limit  (cost=246.23..246.29 rows=8 width=21) (actual time=3711.860..3711.866 rows=8 loops=1)
  Buffers: shared hit=15048
  -&gt;  Unique  (cost=246.23..246.29 rows=8 width=21) (actual time=3711.859..3711.860 rows=8 loops=1)
        Buffers: shared hit=15048
        -&gt;  Sort  (cost=246.23..246.25 rows=8 width=21) (actual time=3711.858..3711.858 rows=8 loops=1)
&quot;              Sort Key: r.profitability DESC, r.id&quot;
              Sort Method: quicksort  Memory: 28kB
              Buffers: shared hit=15048
              -&gt;  Group  (cost=246.07..246.11 rows=8 width=21) (actual time=3711.820..3711.841 rows=48 loops=1)
                    Group Key: r.id
                    Buffers: shared hit=15048
                    -&gt;  Sort  (cost=246.07..246.09 rows=8 width=21) (actual time=3711.817..3711.823 rows=216 loops=1)
                          Sort Key: r.id
                          Sort Method: quicksort  Memory: 41kB
                          Buffers: shared hit=15048
                          -&gt;  Hash Left Join  (cost=154.30..245.95 rows=8 width=21) (actual time=3711.508..3711.745 rows=216 loops=1)
                                Hash Cond: ((reg.id)::text = (ex.region_id)::text)
                                Buffers: shared hit=15048
                                -&gt;  Hash Join  (cost=150.45..242.05 rows=8 width=37) (actual time=3711.490..3711.705 rows=144 loops=1)
                                      Hash Cond: ((ra.region_id)::text = (reg.id)::text)
                                      Buffers: shared hit=15045
                                      -&gt;  Seq Scan on regions_availability ra  (cost=0.00..89.11 rows=643 width=16) (actual time=0.006..0.186 rows=643 loops=1)
                                            Filter: ((delivery_type)::text = &#39;newDeliveries&#39;::text)
                                            Rows Removed by Filter: 1286
                                            Buffers: shared hit=65
                                      -&gt;  Hash  (cost=150.34..150.34 rows=9 width=53) (actual time=3711.461..3711.461 rows=144 loops=1)
                                            Buckets: 1024  Batches: 1  Memory Usage: 21kB
                                            Buffers: shared hit=14980
                                            -&gt;  Hash Right Join  (cost=73.72..150.34 rows=9 width=53) (actual time=3711.218..3711.442 rows=144 loops=1)
                                                  Hash Cond: ((ca.region_id)::text = (reg.id)::text)
                                                  Buffers: shared hit=14980
                                                  -&gt;  Seq Scan on current_availability ca  (cost=0.00..69.02 rows=2002 width=32) (actual time=0.009..0.124 rows=2002 loops=1)
                                                        Buffers: shared hit=49
                                                  -&gt;  Hash  (cost=73.68..73.68 rows=3 width=69) (actual time=3711.173..3711.173 rows=48 loops=1)
                                                        Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                                        Buffers: shared hit=14931
                                                        -&gt;  Nested Loop  (cost=0.84..73.68 rows=3 width=69) (actual time=2262.438..3711.145 rows=48 loops=1)
                                                              Buffers: shared hit=14931
                                                              -&gt;  Nested Loop  (cost=0.55..44.90 rows=1 width=48) (actual time=2262.424..3710.955 rows=7 loops=1)
                                                                    Buffers: shared hit=14877
                                                                    -&gt;  Nested Loop  (cost=0.28..38.20 rows=3 width=16) (actual time=0.012..4.723 rows=609 loops=1)
                                                                          Buffers: shared hit=1418
                                                                          -&gt;  Seq Scan on haulers h  (cost=0.00..21.60 rows=2 width=16) (actual time=0.003..0.698 rows=439 loops=1)
                                                                                Filter: ((active IS TRUE) AND ((type)::text = &#39;inNetwork&#39;::text))
                                                                                Rows Removed by Filter: 89
                                                                                Buffers: shared hit=15
                                                                          -&gt;  Index Scan using regions_hauler_id_idx on regions reg  (cost=0.28..8.29 rows=1 width=32) (actual time=0.006..0.007 rows=1 loops=439)
                                                                                Index Cond: ((hauler_id)::text = (h.id)::text)
                                                                                Buffers: shared hit=1403
                                                                    -&gt;  Index Scan using service_areas_region_id_idx on service_areas ser  (cost=0.28..2.22 rows=1 width=32) (actual time=6.035..6.085 rows=0 loops=609)
                                                                          Index Cond: ((region_id)::text = (reg.id)::text)
&quot;                                                                          Filter: ((active IS TRUE) AND ((st_makevalid(boundaries))::geography &amp;&amp; &#39;0101000020E610000087646DF802D757C0FA6AFDE373C33D40&#39;::geography) AND (_st_distance((st_makevalid(boundaries))::geography, &#39;0101000020E610000087646DF802D757C0FA6AFDE373C33D40&#39;::geography, &#39;0&#39;::double precision, false) &lt; &#39;1.00000000000000008e-05&#39;::double precision))&quot;
                                                                          Rows Removed by Filter: 3
                                                                          Buffers: shared hit=13459
                                                              -&gt;  Index Scan using rolloff_pricing_service_area_id_idx on rolloff_pricing r  (cost=0.29..28.70 rows=8 width=83) (actual time=0.013..0.019 rows=7 loops=7)
                                                                    Index Cond: ((service_area_id)::text = (ser.id)::text)
                                                                    Buffers: shared hit=54
                                -&gt;  Hash  (cost=3.38..3.38 rows=38 width=48) (actual time=0.012..0.012 rows=39 loops=1)
                                      Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                      Buffers: shared hit=3
                                      -&gt;  Seq Scan on exceptions ex  (cost=0.00..3.38 rows=38 width=48) (actual time=0.003..0.007 rows=39 loops=1)
                                            Buffers: shared hit=3
Planning Time: 1.031 ms
Execution Time: 3711.956 ms

My understanding of what's happening here is that my filter is bringing in additional rows (the large majority of the h table is true for the condition h.type='inNetwork', which is making my geometry query run for a much larger set of rows than intended.

Something that I've tried is putting the geometry query into a subquery (because the geometry query actually runs pretty quickly itself) to get a set of r.id's that I could use in an where in clause. This seems to not work as well though. Here is my modified query that also is too slow:

SELECT DISTINCT r.id, r.profitability
FROM rolloff_pricing as r
LEFT JOIN service_areas ser on r.service_area_id = ser.id
LEFT JOIN sizes as s on r.size_id = s.id
LEFT JOIN sizes as sa on r.sell_as = sa.id
RIGHT JOIN waste_types w on w.id = r.waste_type_id
RIGHT JOIN regions reg on reg.id = ser.region_id
RIGHT JOIN haulers h on h.id = ser.hauler_id
RIGHT JOIN current_availability ca on ca.region_id = reg.id
RIGHT JOIN regions_availability ra on ra.region_id = reg.id
LEFT JOIN current_availability_new_deliveries cand on ca.id = cand.current_availability_id and r.size_id = cand.size_id
RIGHT JOIN exceptions ex on ex.region_id = reg.id
WHERE r.id in (
    select r2.id
    from rolloff_pricing as r2
    LEFT JOIN service_areas ser2 on r2.service_area_id = ser2.id
    WHERE
    ST_Intersects(ST_MakeValid(ser2.boundaries)::geography, ST_MakeValid(ST_SetSRID(ST_GeomFromGeoJson(&#39;{
        &quot;type&quot;: &quot;POINT&quot;,
        &quot;coordinates&quot;: [
            &quot;-95.3595563&quot;,
            &quot;29.7634871&quot;
        ]
    }&#39;),4326))::geography)
    and ser2.active is true
)
and h.active is true
and ra.delivery_type=&#39;newDeliveries&#39;
and h.type=&#39;inNetwork&#39;
GROUP BY r.id ORDER BY profitability desc OFFSET 0 ROWS FETCH NEXT 8 ROWS ONLY

It's interesting to me because the subquery here resolves by itself very quickly. And if I sub out the subquery for the returned results, the whole thing resolves very quickly as well. So I'm not sure how to approach this exactly. My next guess is to just run a completely seperate query for the r.ids and then pass them through to the "main" query.

Maybe relevant info: this query is being generated and executed in an eloquent-based api

How can I go about approaching improving the speed here?

答案1

得分: 2

"the large majority of the h table is true for the condition h.type='inNetwork', which is making my geometry query run for a much larger set of rows than intended."
大部分情况下，表h满足条件h.type='inNetwork'，这导致我的几何查询运行的行数比预期要多得多。

"I don't understand. If most of the table meets the condition h.active is true and h.type='inNetwork', then most of the table gets processed. What else could have possibly been intended? The estimate there is pretty horrible (estimated rows 2, actual 439) but that must be just because your stats are horribly out of date. There is really no good reason for the estimate to be off by so much. You should run VACUUM ANALYZE on all tables involved in this query, after making sure there are no transactions being held open. If it doesn't fix the query directly, it will at least produce plans which are easier to understand."
我不明白。如果大多数表满足条件h.active为true且h.type='inNetwork'，那么大多数表将被处理。还可能有什么其他意图呢？估计值相当糟糕（估计行数为2，实际行数为439），但这只是因为您的统计信息已经非常过时。实际上，估算如此不准确真的没有什么充分的理由。您应该在确保没有未完成的事务后，对涉及此查询的所有表运行VACUUM ANALYZE。即使它不能直接修复查询，至少也会生成更容易理解的计划。

"-> Index Scan using service_areas_region_id_idx on service_areas ser (cost=0.28..2.22 rows=1 width=32) (actual time=6.035..6.085 rows=0 loops=609)
Index Cond: ((region_id)::text = (reg.id)::text)
Filter: ((active IS TRUE) AND ((st_makevalid(boundaries))::geography && '0101000020E610000087646DF802D757C0FA6AFDE373C33D40'::geography) AND (_st_distance((st_makevalid(boundaries))::geography, '0101000020E610000087646DF802D757C0FA6AFDE373C33D40'::geography, '0'::double precision, false) < '1.00000000000000008e-05'::double precision))"
在这个地方，几乎占用了所有时间。很难理解它实际在做什么。为什么要执行这个扫描609次需要13459个缓冲区命中？每次循环需要22个缓冲区，但降序索引应该只需要3到4个缓冲区访问。可能是因为过滤条件导致了许多行被过滤掉，但在这种情况下，通过过滤的行数应该远远超过3。也许索引中充斥着死条目，然后被过滤掉但没有在'Rows Removed by Filter'中列入统计。如果没有未完成的事务，运行VACUUM可以解决这个问题。或者可能几何列非常大，因此需要占用多个页面。

"并且要明确，我并不是说大量的缓冲区访问是导致速度变慢的原因，毕竟它们都是'缓冲区命中'，所以应该很快。但这是一个奇怪的现象，值得调查。"
And to be clear here, I'm not saying the large number of buffer accesses are causing the slowness, they are all "buffer hits" after all and so should be fast. But it is an oddity that should be investigated.
要明确的是，我并不是说大量的缓冲区访问是导致速度变慢的原因，毕竟它们都是“缓冲区命中”，所以应该很快。但这是一个奇怪的现象，值得调查。

英文:

> the large majority of the h table is true for the condition h.type='inNetwork', which is making my geometry query run for a much larger set of rows than intended.

I don't understand. If most of the table meets the condition h.active is true and h.type='inNetwork', then most of the table gets processed. What else could have possibly been intended? The estimate there is pretty horrible (estimated rows 2, actual 439) but that must be just because your stats are horribly out of date. There is really no good reason for the estimate to be off by so much. You should run VACUUM ANALYZE on all tables involved in this query, after making sure there are no transactions being held open. If it doesn't fix the query directly, it will at least produce plans which are easier to understand.

-&gt;  Index Scan using service_areas_region_id_idx on service_areas ser  (cost=0.28..2.22 rows=1 width=32) (actual time=6.035..6.085 rows=0 loops=609)
    Index Cond: ((region_id)::text = (reg.id)::text)
    Filter: ((active IS TRUE) AND ((st_makevalid(boundaries))::geography &amp;&amp; &#39;0101000020E610000087646DF802D757C0FA6AFDE373C33D40&#39;::geography) AND (_st_distance((st_makevalid(boundaries))::geography, &#39;0101000020E610000087646DF802D757C0FA6AFDE373C33D40&#39;::geography, &#39;0&#39;::double precision, false) &lt; &#39;1.00000000000000008e-05&#39;::double precision))&quot;
    Rows Removed by Filter: 3
    Buffers: shared hit=13459

This is the one place which takes pretty much all of the time. And it is hard to understand what it is actually doing. Why on earth would it take 13459 buffers hits to run this scan 609 times? that is 22 buffers for each loop, but descending an index should only take 3 to 4 buffer accesses. You could be filtering out a lot of rows due to the filter condition, but in that case Rows Removed by Filter would need to be a lot more than 3. Maybe the index is stuffed full of dead entries, which then get filtered out but don't get tabulated in 'Rows Removed by Filter'. (A VACUUM when there are no transactions being held open would fix that). Or maybe the geometry column is very large and so gets TOASTed over many pages.

And to be clear here, I'm not saying the large number of buffer accesses are causing the slowness, they are all "buffer hits" afterall and so should be fast. But it is an oddity that should be investigated.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在PostgreSQL/PostGIS查询中优化连接操作。

问题

答案1

无法断言时间戳。

在Postgres中设置全局列长度限制。

SQL – 查找循环和持续时间

如何通过Prisma ORM基于现有数据在PostgreSQL中生成种子文件？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。