2023年5月22日 15:09:10go评论75阅读模式

英文:

Why does PostgreSQL use sequential scan instead of index scan?

问题

根据你提供的信息，PG选择使用顺序扫描（Seq Scan）而不是索引扫描（Index Scan）可能是因为在某些情况下，顺序扫描比索引扫描更有效率。可能的原因之一是，当需要检索的行数相对较大时，顺序扫描可能比索引扫描更快速。

另一个可能的原因是，优化器可能认为从索引中检索数据会导致额外的磁盘 I/O 操作，因此选择了顺序扫描。

需要注意的是，优化器的决策受到许多因素的影响，包括表的大小、索引的选择性、查询的复杂性等等。因此，不同情况下的最佳执行计划可能会有所不同。

如果你对优化器的决策感到困惑，可以考虑使用 ANALYZE 命令来更新统计信息，以便优化器可以更好地了解表的特性，从而做出更准确的执行计划选择。

英文:

I'm learning PostgreSQL and I'm trying to understand the details of how to choose how to scan this database.

I have postgtresql 14.2 and run the following code:

DROP TABLE IF EXISTS t_test;
CREATE TABLE t_test (id serial, name text);
INSERT INTO t_test (name) SELECT &#39;hans&#39; FROM generate_series(1, 2000000);
INSERT INTO t_test (name) SELECT &#39;paul&#39; FROM generate_series(1, 2000000);
CREATE INDEX idx_id ON t_test (id);
EXPLAIN ANALYZE SELECT * from t_test where id &gt; 2756021 LIMIT 2;

And I see:

                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.11 rows=2 width=36) (actual time=152.718..152.728 rows=2 loops=1)
   -&gt;  Seq Scan on t_test  (cost=0.00..71622.00 rows=1333333 width=36) (actual time=152.713..152.716 rows=2 loops=1)
         Filter: (id &gt; 2756021)
         Rows Removed by Filter: 2756021
 Planning Time: 0.218 ms
 Execution Time: 152.756 ms
(6 rows)

According to this Query Plan, in case of seq scan PG needs to read 2 756 021 rows until PG reaches the first desired row. As I understand, In case of index scan, PG needs to find only 1 row from index (ID = 2756021) and recieve the next index value (it should be light operation because index is sorted list) and load rows related with these indexes twice. But PG Planner choices Seq Scan.

I turned off seq scan and obtained Execution time is 0.116 ms instead of 152.756 ms in the first case:

DROP TABLE IF EXISTS t_test;
CREATE TABLE t_test (id serial, name text);
INSERT INTO t_test (name) SELECT &#39;hans&#39; FROM generate_series(1, 2000000);
INSERT INTO t_test (name) SELECT &#39;paul&#39; FROM generate_series(1, 2000000);
CREATE INDEX idx_id ON t_test (id);
SET enable_seqscan = FALSE;
EXPLAIN ANALYZE SELECT * from t_test where id &gt; 2756021 LIMIT 2;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..0.62 rows=2 width=36) (actual time=0.060..0.072 rows=2 loops=1)
   -&gt;  Index Scan using idx_id on t_test  (cost=0.43..124449.76 rows=1333333 width=36) (actual time=0.055..0.059 rows=2 loops=1)
         Index Cond: (id &gt; 2756021)
 Planning Time: 0.342 ms
 Execution Time: 0.116 ms
(5 rows)

But to be honest, the following queries wit seq scan have better results that index scan (but I don't understand why):

SET enable_seqscan = TRUE ;
EXPLAIN ANALYZE SELECT * from t_test where id &gt; 2756021 LIMIT 2;
postgres=# EXPLAIN ANALYZE SELECT * from t_test where id &gt; 2756021 LIMIT 2;
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.12 rows=2 width=9) (actual time=0.042..0.054 rows=2 loops=1)
   -&gt;  Seq Scan on t_test  (cost=0.00..71622.00 rows=1242848 width=9) (actual time=0.038..0.041 rows=2 loops=1)
         Filter: (id &gt; 2756021)
         Rows Removed by Filter: 261
 Planning Time: 0.076 ms
 Execution Time: 0.084 ms
(6 rows)
postgres=# EXPLAIN ANALYZE SELECT * from t_test where id &gt; 1758121 LIMIT 2;
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.06 rows=2 width=9) (actual time=0.018..0.032 rows=2 loops=1)
   -&gt;  Seq Scan on t_test  (cost=0.00..71622.00 rows=2233890 width=9) (actual time=0.014..0.018 rows=2 loops=1)
         Filter: (id &gt; 1758121)
 Planning Time: 0.092 ms
 Execution Time: 0.071 ms
(5 rows)

Why does PG use Seq Scan instead of Index Scan?

答案1

得分: 1

以下是您要翻译的内容：

第一个执行计划的明显问题出现在第二行

-> t_test 上的 Seq Scan (成本=0.00..71622.00 行=1333333 宽度=36) (实际时间=152.713..152.716 行=2 循环=1)

PostgreSQL 优化器估计谓词 id > 2756021 将返回 130 万行 (行=1333333)，因此不考虑 LIMIT 2 在估计中，选择了 Seq Scan。

如果您在谓词中提供了上限，则计划将使用索引。

示例

EXPLAIN ANALYZE SELECT * from t_test where id > 2756021 and id <= 2756021+2 LIMIT 2;

Limit (成本=0.43..8.47 行=2 宽度=9) (实际时间=0.041..0.043 行=2 循环=1)
-> t_test 上的索引扫描 idx_id (成本=0.43..8.47 行=2 宽度=9) (实际时间=0.040..0.041 行=2 循环=1)
索引条件: ((id > 2756021) AND (id <= 2756023))
计划时间: 0.130 毫秒
执行时间: 0.066 毫秒

英文:

The apparent problem with the first explain plan is in the second line

 -&gt;  Seq Scan on t_test  (cost=0.00..71622.00 rows=1333333 width=36) (actual time=152.713..152.716 rows=2 loops=1)

PostgreSQL optimizer estimates that the predicate id > 2756021 will return 1.3M rows (rows=1333333), so the LIMIT 2 is not considered in the estimation and Seq Scan is choosen.

If you help PostgreSQL with an upper limit, the plan is fine using an index.

Example

EXPLAIN ANALYZE SELECT * from t_test where id &gt; 2756021 and id &lt;= 2756021+2 LIMIT 2;
Limit  (cost=0.43..8.47 rows=2 width=9) (actual time=0.041..0.043 rows=2 loops=1)
  -&gt;  Index Scan using idx_id on t_test  (cost=0.43..8.47 rows=2 width=9) (actual time=0.040..0.041 rows=2 loops=1)
        Index Cond: ((id &gt; 2756021) AND (id &lt;= 2756023))
Planning Time: 0.130 ms
Execution Time: 0.066 ms

答案2

得分: 0

我相信规划器不使用索引扫描，因为您正在从表中请求大量记录（id > 2756201表示“大约是您表的1/3”）。规划器决定索引扫描（然后通过索引访问1/3的行）比读取整个表要慢。
要确认，请尝试在where子句中使用更大的id值：
id > 3256201，id > 3756201等...

英文:

I believe planner doesn't using index scan because you are requesting significant amount of records from the table (id > 2756201 means "about 1/3 of your table"). Planner decides that index scan (and access to 1/3 of rows by index after that) will be slower than reading all the table.
To confirm, try to use greater values for id in where clause:
id > 3256201, id > 3756201 etc...

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

PostgreSQL为什么使用顺序扫描而不是索引扫描？

问题

答案1

答案2

错误：在 SSL 请求之后收到了未加密的数据。

在事务中，有什么是不能做的（然后可以安全地回滚）吗？

Golang postgresql query with transactions and squirrel

有没有一个支持PostgreSQL多个模式的Golang ORM库？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。