英文:
Using CTE in where statement is very slow
问题
以下是您的SQL查询的翻译:
在 PostgreSQL v13.10 中,我有以下运行中的 SQL:
WITH stuckable_statuses AS (
SELECT status_id FROM status_descriptions
WHERE (tags @> ARRAY['stuckable']::varchar[])
)
SELECT jobs.* FROM jobs
WHERE jobs.status = ANY(select status_id from stuckable_statuses)
并且在将 ANY(select status_id from stuckable_statuses)
替换为 id 数组(例如 (1,2,3)
)后,查询速度非常快。
这是该查询的解释分析:
Gather (cost=1005.64..5579003.45 rows=1563473 width=2518) (actual time=45.495..40138.515 rows=303 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Semi Join (cost=5.64..5421656.15 rows=651447 width=2518) (actual time=44.533..40126.793 rows=101 loops=3)
Hash Cond: (jobs.status = status_descriptions.status_id)
-> Parallel Seq Scan on jobs (cost=0.00..5378777.15 rows=13571815 width=2518) (actual time=0.892..38662.091 rows=10537079 loops=3)
-> Hash (cost=5.56..5.56 rows=6 width=4) (actual time=0.377..0.378 rows=11 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on status_descriptions (cost=0.00..5.56 rows=6 width=4) (actual time=0.310..0.370 rows=11 loops=3)
Filter: (tags @> '{stuckable}'::character varying[])
Rows Removed by Filter: 146
Planning Time: 0.711 ms
Execution Time: 40138.654 ms
以下是表定义(从 Rails 的 schema.rb
中获取):
create_table "jobs", id: :serial, force: :cascade do |t|
t.string "filename"
t.string "sandbox"
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
t.integer "status", default: 0, null: false
t.integer "provider_id"
t.integer "lang_id"
t.integer "profile_id"
t.datetime "extra_date"
t.datetime "main_date"
t.datetime "performer_id"
t.index ["provider_id", "status", "extra_date"], name: "jobs_on_media_provider_id__status__extra_date"
t.index ["provider_id", "status", "main_date"], name: "jobs_on_media_provider_id_and_status_and_due_date"
t.index ["profile_id", "status", "extra_date"], name: "index_jobs_on_profile_id__status__extra_date"
t.index ["profile_id", "status", "main_date"], name: "index_transcription_jobs_on_profile_id_and_status_and_due_date"
t.index ["status", "sandbox", "lang_id", "extra_date"], name: "index_jobs_on_status__sandbox__lang_id__extra_date"
t.index ["status", "sandbox", "lang_id", "main_date"], name: "index_jobs_on_status_and_sandbox_and_lang_id_and_due_date"
t.index ["performer_id", "status", "extra_date"], name: "index_jobs_on_performer_id__status__extra_date"
t.index ["performer_id", "status", "main_date"], name: "index_jobs_on_performer_id_and_status_and_due_date"
end
create_table "status_descriptions", id: :serial, force: :cascade do |t|
t.integer "status_id"
t.string "title"
t.string "tags", array: true
t.index ["status_id"], name: "index_status_descriptions_on_status_id"
end
与具有数组的相同 SQL 进行比较,我可以看到它不使用 jobs.status 的索引,可能是因为 jobs 表非常庞大(约 1500 万行),而 status_descriptions 表只有约 200 行。
如果可能的话,您能否帮助我优化此 SQL。谢谢!
更新:
以下是带有硬编码数组的查询:
SELECT jobs.* FROM transcription_jobs
WHERE jobs.status IN (2, 3, 4, 291, 290, 46, 142, 260, 6, 7, 270)
这是它的解释分析:
Index Scan using index_jobs_on_status__sandbox__lang_id__current_stage_due_date on jobs (cost=0.56..98661.05 rows=26541 width=2518) (actual time=0.032..63.266 rows=483 loops=1)
Index Cond: (status = ANY ('{2,3,4,291,290,46,142,260,6,7,270}'::integer[]))
Planning Time: 0.356 ms
Execution Time: 63.337 ms
英文:
I have a following SQL running in PostgreSQL v13.10:
WITH stuckable_statuses AS (
SELECT status_id FROM status_descriptions
WHERE (tags @> ARRAY['stuckable']::varchar[])
)
SELECT jobs.* FROM jobs
WHERE jobs.status = ANY(select status_id from stuckable_statuses)
And it is running really slow while replacing ANY(select status_id from stuckable_statuses)
with array of ids ex. (1,2,3)
is running really fast.
Here is the explain analyze for this query:
Gather (cost=1005.64..5579003.45 rows=1563473 width=2518) (actual time=45.495..40138.515 rows=303 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Semi Join (cost=5.64..5421656.15 rows=651447 width=2518) (actual time=44.533..40126.793 rows=101 loops=3)
Hash Cond: (jobs.status = status_descriptions.status_id)
-> Parallel Seq Scan on jobs (cost=0.00..5378777.15 rows=13571815 width=2518) (actual time=0.892..38662.091 rows=10537079 loops=3)
-> Hash (cost=5.56..5.56 rows=6 width=4) (actual time=0.377..0.378 rows=11 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on status_descriptions (cost=0.00..5.56 rows=6 width=4) (actual time=0.310..0.370 rows=11 loops=3)
Filter: (tags @> '{stuckable}'::character varying[])
Rows Removed by Filter: 146
Planning Time: 0.711 ms
Execution Time: 40138.654 ms
Here is table definitions (taken form rails' schema.rb
):
create_table "jobs", id: :serial, force: :cascade do |t|
t.string "filename"
t.string "sandbox"
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
t.integer "status", default: 0, null: false
t.integer "provider_id"
t.integer "lang_id"
t.integer "profile_id"
t.datetime "extra_date"
t.datetime "main_date"
t.datetime "performer_id"
t.index ["provider_id", "status", "extra_date"], name: "jobs_on_media_provider_id__status__extra_date"
t.index ["provider_id", "status", "main_date"], name: "jobs_on_media_provider_id_and_status_and_due_date"
t.index ["profile_id", "status", "extra_date"], name: "index_jobs_on_profile_id__status__extra_date"
t.index ["profile_id", "status", "main_date"], name: "index_transcription_jobs_on_profile_id_and_status_and_due_date"
t.index ["status", "sandbox", "lang_id", "extra_date"], name: "index_jobs_on_status__sandbox__lang_id__extra_date"
t.index ["status", "sandbox", "lang_id", "main_date"], name: "index_jobs_on_status_and_sandbox_and_lang_id_and_due_date"
t.index ["performer_id", "status", "extra_date"], name: "index_jobs_on_performer_id__status__extra_date"
t.index ["performer_id", "status", "main_date"], name: "index_jobs_on_performer_id_and_status_and_due_date"
end
create_table "status_descriptions", id: :serial, force: :cascade do |t|
t.integer "status_id"
t.string "title"
t.string "tags", array: true
t.index ["status_id"], name: "index_status_descriptions_on_status_id"
end
Comparing to same SQL with array I can see that it is not using index by jobs.status maybe because jobs table is really huge (~15kk rows) while status_descriptions is ~200 rows.
Could you please help me to optimize this SQL if it is possible.
Thank you!
UPDATE:
Here is the query with hardcoded array:
SELECT jobs.* FROM transcription_jobs
WHERE jobs.status IN (2, 3, 4, 291, 290, 46, 142, 260, 6, 7, 270)
And here its explain analyze:
Index Scan using index_jobs_on_status__sandbox__lang_id__current_stage_due_date on jobs (cost=0.56..98661.05 rows=26541 width=2518) (actual time=0.032..63.266 rows=483 loops=1)
Index Cond: (status = ANY ('{2,3,4,291,290,46,142,260,6,7,270}'::integer[]))
Planning Time: 0.356 ms
Execution Time: 63.337 ms
答案1
得分: 1
PostgreSQL 在执行 subquery
或 CTE
中的 ANY()
时不使用索引。相反,您可以使用 inner join
:
WITH stuckable_statuses AS (
SELECT status_id FROM status_descriptions
WHERE (tags @> ARRAY['stuckable']::varchar[])
)
SELECT jobs.* FROM jobs
INNER JOIN stuckable_statuses ON jobs.status = stuckable_statuses.status_id;
确保您在 status_descriptions.status_id
和 jobs.status
上创建了索引。索引可以极大地提高连接操作的性能。
英文:
Postgresql doesn't use indexes when executing subquery
or CTE
with ANY()
.
Instead, you can use inner join
:
WITH stuckable_statuses AS (
SELECT status_id FROM status_descriptions
WHERE (tags @> ARRAY['stuckable']::varchar[])
)
SELECT jobs.* FROM jobs
INNER JOIN stuckable_statuses ON jobs.status = stuckable_statuses.status_id;
make sure you have index on status_descriptions.status_id
and jobs.status
. Indexes can greatly improve the performance for join operation.
答案2
得分: 1
主要问题是它认为会找到1563473行,但实际上只找到了303行。如果实际上找到了1563473行,那么基于序列扫描的哈希连接可能真的比基于ANY的索引扫描更快。
不幸的是,根据您当前的数据模型和现有的PostgreSQL版本,您可能无法解决这个问题。看起来在作业表中,可卡住的状态要比不可卡住的状态要稀少得多,但规划器无法知道这一点。
要强制使用更快的计划,您可以在运行此查询之前临时关闭enable_hashjoin或enable_seqscan。这绝对是一个不太美观的解决方案,但应该是可靠的。如果您关闭并行查询(设置max_parallel_workers_per_gather=0),这可能足以切换到更快的计划。如果并行查询对您来说没有太多好处,那么这将是一个不太美观但也不太可靠的解决方案。或者,您可以尝试添加计划提示。
最可靠的解决方案可能只是将其作为两个查询运行,在一个查询中获取status_id的数组/列表,然后将该数组/列表插入第二个查询中。这样,规划器实际上可以看到将要使用的值,并可以相应地进行规划。(我注意到您的硬编码计划仍然被严重错误估计,但不像其他计划那样错误,也不足以驱动计划选择。这可能是一个单独的问题,与您目前面临的问题无关。)
对于您的数据模型,我直觉认为给定状态的可卡性不太可能经常变化,如果有的话。如果您可以直接将该值记录为作业表中的新列,并在状态本身更改时进行更改,而不需要对其他表进行间接操作,那么这几乎可以立即解决这个问题。
英文:
The main problem is that it thinks it will find 1563473 rows but actually finds 303. If there actually were 1563473 rows found, the hash join over the seq scan probably really would be faster than the ANY-driven index scan.
Unfortunately there probably isn't anything you can do about that with your current data model and the existing versions of PostgreSQL. It looks like the stuckable statuses are substantially rarer in the jobs table than the non-stuckable ones, but the planner has no way to know that.
To force the faster plan, you could temporarily turn off either enable_hashjoin or enable_seqscan just before running this query. That is definitely an ugly solution, but it should be a reliable one. If you turn off parallel query (set max_parallel_workers_per_gather=0), that might be enough to switch the plan to the faster one. If you don't get much benefit from parallel query anyway, then this would be a less ugly solution, but also less reliable. Or, you could try to add plan hints.
The most robust solution is probably just to run it as two queries, fetching the array/list of status_id in one, and then stuffing that array/list into the 2nd one. That way the planner can actually see the values which are going to be used, and can plan accordingly. (I notice that your hard-coded plan is still substantially misestimated, but not by nearly as much as the other one also not by enough to drive the plan choice. This is likely a separate issue, unrelated to the one you currently face.)
For your data model, I'd intuit that the stickability of a given status is not something that would change very often, if ever. If you could record that value directly as a new column in the jobs table and change it whenever the status itself changes, rather than needing to do indirection to a separate table, that would pretty much instantly solve this problem.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论