为什么在这种情况下 PostgreSQL 不使用索引?

huangapple go评论61阅读模式
英文:

Why does PostgreSQL not use the index in this scenario?

问题

PostgreSQL在普通查询中为什么不使用索引,有没有办法重新构造查询以使用索引?我已经尝试了许多组合,但迄今为止没有成功。

PostgreSQL在正常查询中没有使用索引的原因是可能的优化器选择。在您的正常查询中,有一个条件 postcode='M19 1TF',这是一个等值条件,但它出现在连接多个表的查询中。优化器可能认为使用序列扫描更有效率,因为它可以一次性扫描整个表,而不必查找多个索引。

要尝试强制PostgreSQL使用索引,您可以使用以下方法:

SELECT area_id, COUNT(*) 
FROM uprn u 
JOIN geometry_subdivided g ON st_covers(g.geom, st_transform(u.location, 27700))
WHERE u.postcode = 'M19 1TF'
GROUP BY area_id;

这将使用INNER JOIN来连接表格,并且将postcode条件放在WHERE子句中,这有助于优化器更好地选择使用索引。不过,最终PostgreSQL的执行计划取决于数据的分布和索引的性能,所以在某些情况下可能仍然选择使用序列扫描。

如果查询性能对您非常重要,您还可以考虑对表格和索引进行进一步的性能优化,例如使用索引覆盖,调整查询计划,或者考虑使用分区表格等策略。

英文:

I have a couple of tables in PostgreSQL 14.6, with indexes as stated here:

# \d uprn
                        Table "public.uprn"
  Column   |         Type         | Collation | Nullable | Default
-----------+----------------------+-----------+----------+---------
 uprn      | text                 |           |          |
 postcode  | text                 |           |          |
 location  | geometry(Point,4326) |           |          |
Indexes:
    "uprn_location" gist (location)
    "uprn_postcode" btree (postcode)
    "uprn_uprn" btree (uprn)

# \d geometry_subdivided
      Table "public.geometry_subdivided"
 Column  |   Type   | Collation | Nullable | Default
---------+----------+-----------+----------+---------
 area_id | integer  |           |          |
 geom    | geometry |           |          |
Indexes:
    "subdivided_area_id" btree (area_id)
    "subdivided_geom_id" gist (geom)

And this many rows (VACUUM ANALYZE has been done post import):

# select count(*) from uprn;
  count
----------
 32872945
(1 row)

# select count(*) from geometry_subdivided;
 count
--------
 938500
(1 row)

23 of the uprn rows have a postcode of "M19 1TF". If I run a query joining these two tables on that postcode, looking for areas covering the points, it takes a number of seconds, using a sequence scan:

# explain analyze select area_id,count(*) from uprn u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) and postcode='M19 1TF' group by area_id;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=6287745.55..6292743.36 rows=16428 width=12) (actual time=6273.505..6290.003 rows=16 loops=1)
   Group Key: g.area_id
   ->  Gather Merge  (cost=6287745.55..6292414.80 rows=32856 width=12) (actual time=6273.497..6289.990 rows=48 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial GroupAggregate  (cost=6286745.52..6287622.38 rows=16428 width=12) (actual time=6254.442..6254.452 rows=16 loops=3)
               Group Key: g.area_id
               ->  Sort  (cost=6286745.52..6286983.05 rows=95010 width=4) (actual time=6254.431..6254.434 rows=69 loops=3)
                     Sort Key: g.area_id
                     Sort Method: quicksort  Memory: 25kB
                     Worker 0:  Sort Method: quicksort  Memory: 33kB
                     Worker 1:  Sort Method: quicksort  Memory: 25kB
                     ->  Nested Loop  (cost=25.29..6278890.20 rows=95010 width=4) (actual time=4756.836..6254.376 rows=69 loops=3)
                           ->  Parallel Seq Scan on uprn u  (cost=0.00..1264850.55 rows=101 width=32) (actual time=4725.962..6221.730 rows=8 loops=3)
                                 Filter: (postcode = 'M19 1TF'::text)
                                 Rows Removed by Filter: 10957641
                           ->  Index Scan using subdivided_geom_id on geometry_subdivided g  (cost=25.29..49643.02 rows=94 width=2040) (actual time=0.102..0.253 rows=9 loops=23)
                                 Index Cond: (geom ~ st_transform(u.location, 27700))
                                 Filter: st_covers(geom, st_transform(u.location, 27700))
                                 Rows Removed by Filter: 7
 Planning Time: 0.359 ms
 Execution Time: 6290.100 ms
(22 rows)

But if I discourage PostgreSQL from using sequence scans, the same query takes milliseconds:

# set enable_seqscan to off;
SET
# explain analyze select area_id,count(*) from uprn u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) and postcode='M19 1TF' group by area_id;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=12076657.05..12078536.73 rows=17175 width=12) (actual time=7.710..7.766 rows=16 loops=1)
   Group Key: g.area_id
   ->  Sort  (cost=12076657.05..12077226.36 rows=227724 width=4) (actual time=7.681..7.714 rows=207 loops=1)
         Sort Key: g.area_id
         Sort Method: quicksort  Memory: 34kB
         ->  Nested Loop  (cost=31.60..12053278.12 rows=227724 width=4) (actual time=0.203..7.603 rows=207 loops=1)
               ->  Bitmap Heap Scan on uprn u  (cost=6.31..966.53 rows=242 width=32) (actual time=0.035..0.073 rows=23 loops=1)
                     Recheck Cond: (postcode = 'M19 1TF'::text)
                     Heap Blocks: exact=10
                     ->  Bitmap Index Scan on uprn_postcode  (cost=0.00..6.25 rows=242 width=0) (actual time=0.024..0.025 rows=23 loops=1)
                           Index Cond: (postcode = 'M19 1TF'::text)
               ->  Index Scan using subdivided_geom_id on geometry_subdivided g  (cost=25.29..49802.00 rows=94 width=2038) (actual time=0.116..0.322 rows=9 loops=23)
                     Index Cond: (geom ~ st_transform(u.location, 27700))
                     Filter: st_covers(geom, st_transform(u.location, 27700))
                     Rows Removed by Filter: 7
 Planning Time: 0.259 ms
 Execution Time: 7.851 ms
(17 rows)

Why is PostgreSQL not using the index in the normal query, and is there a way I can reword the query to do so? I've tried a number of combinations but without luck so far

答案1

得分: 4

At a shallow level, 它认为几何索引扫描是计划中成本最高的部分,而使用序列扫描是在并行中执行该索引扫描的关键。因此,认为序列扫描的更高成本是值得的,以实现并行化。因此,如果您通常不从并行化中获得很大好处,这个问题的一个简单解决方案是将max_parallel_workers_per_gather设置为0。

On a deeper level, 为什么它认为序列扫描是并行计划的关键呢?'Parallel Bitmap Heap Scan' 在这个版本中是存在的,为什么不使用它呢?我认为这是因为来自该表的预期行数足够低(242),以至于无法轻松地将位图划分为并行化部分,而序列扫描则更容易划分。(我没有查看源代码来验证这个理论)。如果预期行数更高,它将使用'Parallel Bitmap Heap Scan',如果行数更低,它将不认为在首次进行并行化很重要。

所以另一个可能的解决方案是修复估算问题,即它认为邮政编码将找到242行,但实际上只找到了23行。您可以通过增加default_statistics_target,或者仅针对该列增加统计参数,并重新执行ANALYZE来实现这一点。但是我不确定将其减少到23行是否足以放弃并行计划。

问题的另一半是几何索引的每次迭代认为它将找到94行,但实际上只找到了9行。但是对于这种情况,您无法简单地采取措施,因为这种类型的几何索引扫描不查看数据分布,它总是假设它将返回表的1/10000。

至于您尝试的子查询,规划器"看穿"了这种表达方式,因此得出了与否则相同的计划。通常,这种看穿会导致更好的计划,但在这里,它抑制了您试图强制获得更好计划的尝试。为了避免这种情况,您可以使用材料化的CTE(公共表达式),这会抑制这种规划器看穿。

使用材料化(materialized)关键词,创建CTE(公共表达式),如下所示:

with u as materialized (select * from uprn where postcode='M19 1TF')
select area_id,count(*) from u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) group by area_id;

不管价值多少,看起来PostGIS的下一个版本,3.4.0,将改变在这种情况下估算gist索引扫描成本的方式,可能会修复您的问题(提交31bcb7d414c73df8dbc2975c6dd4a269b190c874)。

英文:

On a shallow level, it is easy to see what is going on. It thinks that the geometry index scans are by far the most costly part of the plan, and it thinks that using the seq scan is the key to doing that index scan in parallel. So the higher cost of the seq scan is believed to be worth it to get the parallelization. So an easy solution to this problem, if you generally don't get much benefit from parallelization, is to set max_parallel_workers_per_gather = 0.

On a deeper level, it is harder to see. Why does it think the seq scan is key to the parallel plan? 'Parallel Bitmap Heap Scan' does exist in this version, so why not use that? I think it is because the expected number of rows from that table is low enough (242) that the bitmap can't be easily divided up for parallelization, while a seq scan is easier to divide. (I have not looked into the source code to verify this theory). If the expected number of rows were higher, it would use the 'Parallel Bitmap Heap Scan', if it were much lower, it wouldn't think it were important to parallelize this in the first place.

So another possible solution is to fix the estimation problem where it thinks the postcode will find 242 rows but only does find 23. You should be able to do this by increasing default_statistics_target, or increasing the statistics parameter just for that column, and redoing the ANALYZE. But I'm not sure just getting it down to 23 rows will be enough to forgo the parallel plan.

The other half of the row estimation problem is where each iteration of the geometry index thinks it will find 94 rows, but really only finds 9. But there is nothing simple you can do about this, as this type of scan on a geometry index doesn't look at the data distributions, it just always assumes it will return 1/10000 of the table.

As for your attempt with the subselect, the planner "sees through" this formulation and so comes up with the same plan as it otherwise would. Usually this see-through leads to better plans, but here it inhibits your attempt to force a better plan. To avoid this, you could use a materialized CTE which inhibits such planner see-through.

with u as materialized (select * from uprn where postcode='M19 1TF')
select area_id,count(*) from u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) group by area_id;

For whatever it is worth it looks like the next release of postgis, 3.4.0, will change the way the gist index scan cost is estimated in this case in a way that might fix your problem (commit 31bcb7d414c73df8dbc2975c6dd4a269b190c874)

答案2

得分: 0

以下是您提供的代码部分的中文翻译:

在我尝试的可能性之一是将uprn表查找移动到表定义中(因为单独的select * from uprn where postcode='M19 1TF'使用了索引),但这没有任何区别:

然而,如果我在内部查询中添加ORDER BY,它就会使用索引!

这回答了我的性能问题,这是主要关注的问题,尽管我不明白为什么PostgreSQL首先不使用索引。

英文:

One of the possibilities I had tried was moving the uprn table lookup into the table definition (as select * from uprn where postcode='M19 1TF' on its own used the index), but that made no difference:

# explain analyze select area_id,count(*) from (select * from uprn where postcode='M19 1TF') u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) group by area_id;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=6303711.22..6308902.95 rows=17175 width=12) (actual time=7348.913..7353.635 rows=16 loops=1)
   Group Key: g.area_id
   ->  Gather Merge  (cost=6303711.22..6308559.45 rows=34350 width=12) (actual time=7348.908..7353.626 rows=34 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial GroupAggregate  (cost=6302711.20..6303594.59 rows=17175 width=12) (actual time=7329.626..7329.635 rows=11 loops=3)
               Group Key: g.area_id
               ->  Sort  (cost=6302711.20..6302948.41 rows=94885 width=4) (actual time=7329.615..7329.619 rows=69 loops=3)
                     Sort Key: g.area_id
                     Sort Method: quicksort  Memory: 25kB
                     Worker 0:  Sort Method: quicksort  Memory: 25kB
                     Worker 1:  Sort Method: quicksort  Memory: 33kB
                     ->  Nested Loop  (cost=25.29..6294867.11 rows=94885 width=4) (actual time=5698.604..7329.564 rows=69 loops=3)
                           ->  Parallel Seq Scan on uprn  (cost=0.00..1264770.12 rows=101 width=32) (actual time=5669.212..7298.298 rows=8 loops=3)
                                 Filter: (postcode = 'M19 1TF'::text)
                                 Rows Removed by Filter: 10957641
                           ->  Index Scan using subdivided_geom_id on geometry_subdivided g  (cost=25.29..49802.00 rows=94 width=2038) (actual time=0.109..0.270 rows=9 loops=23)
                                 Index Cond: (geom ~ st_transform(uprn.location, 27700))
                                 Filter: st_covers(geom, st_transform(uprn.location, 27700))
                                 Rows Removed by Filter: 7
 Planning Time: 0.460 ms
 Execution Time: 7353.724 ms

However, if I add an ORDER BY to that inner query, it then does use the index!

# explain analyze select area_id,count(*) from (select * from uprn where postcode='M19 1TF' order by postcode) u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) group by area_id;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=12052047.47..12053927.15 rows=17175 width=12) (actual time=7.878..7.921 rows=16 loops=1)
   Group Key: g.area_id
   ->  Sort  (cost=12052047.47..12052616.78 rows=227724 width=4) (actual time=7.861..7.881 rows=207 loops=1)
         Sort Key: g.area_id
         Sort Method: quicksort  Memory: 34kB
         ->  Nested Loop  (cost=31.60..12028668.54 rows=227724 width=4) (actual time=0.328..7.778 rows=207 loops=1)
               ->  Bitmap Heap Scan on uprn  (cost=6.31..966.53 rows=242 width=104) (actual time=0.064..0.102 rows=23 loops=1)
                     Recheck Cond: (postcode = 'M19 1TF'::text)
                     Heap Blocks: exact=10
                     ->  Bitmap Index Scan on uprn_postcode  (cost=0.00..6.25 rows=242 width=0) (actual time=0.045..0.046 rows=23 loops=1)
                           Index Cond: (postcode = 'M19 1TF'::text)
               ->  Index Scan using subdivided_geom_id on geometry_subdivided g  (cost=25.29..49700.30 rows=94 width=2038) (actual time=0.118..0.327 rows=9 loops=23)
                     Index Cond: (geom ~ st_transform(uprn.location, 27700))
                     Filter: st_covers(geom, st_transform(uprn.location, 27700))
                     Rows Removed by Filter: 7
 Planning Time: 0.390 ms
 Execution Time: 8.027 ms
(17 rows)

This answers my question in terms of making it performant, which is the main concern, though I don't understand why PostgreSQL does not use the index in the first place.

huangapple
  • 本文由 发表于 2023年5月20日 23:38:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/76296024.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定