连接的列也是两侧分区键,仅修剪左侧分区。

huangapple go评论50阅读模式
英文:

Joined column that is also a partition key on both sides only prunes partitions on the left side

问题

I'll provide translations for the code portions you've mentioned without addressing your specific questions:

(请查看 https://dbfiddle.uk/00lhcAA6 上的这些示例。我在版本131415上得到了相同的结果。)

考虑两个一起分区并共享相同概念分区键(`created_at`)的表。

CREATE TABLE part_main (
   main_id serial,
   create_time timestamptz,
   main_val int,
   primary key (main_id, create_time)
 )
 PARTITION BY RANGE (create_time);

CREATE TABLE part_other (
   other_id serial,
   create_time timestamptz,
   main_id int,
   other_val text,
   primary key (other_id, create_time),
   foreign key (main_id, create_time) references part_main(main_id, create_time)
 )
 PARTITION BY RANGE (create_time);

CREATE TABLE part_main_y2023m02 PARTITION OF part_main
    FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');
CREATE TABLE part_main_y2023m03 PARTITION OF part_main
    FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');
CREATE TABLE part_main_y2023m04 PARTITION OF part_main
    FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');

CREATE TABLE part_other_y2023m02 PARTITION OF part_other
    FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');
CREATE TABLE part_other_y2023m03 PARTITION OF part_other
    FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');
CREATE TABLE part_other_y2023m04 PARTITION OF part_other
    FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');

And some test data:

插入到 part_maincreate_time, main_val)值为('2023-04-02'10;
 part_main 选择 main_idcreate_time'foo' 选择 main_idcreate_time'foo' 的其他值插入到 part_other

When joining these tables by (main_id, create_time) and filtering on create_time, partitions are only pruned from the left table:

解释分析 
选择 * from part_main join part_other using (main_id, create_time)
其中 create_time >= '2023-04-01';

Filtering on the left and right side created_at columns individually produces the expected plan:

解释分析 
选择 * from part_main join part_other using (main_id, create_time)
其中 part_main.create_time >= '2023-04-01' and part_other.create_time >= '2023-04-01';

I hope this helps! If you have further questions, please feel free to ask.

英文:

(See these examples at https://dbfiddle.uk/00lhcAA6. I got the same results on versions 13, 14, and 15.)

Take two tables that are partitioned together and share the same conceptual partition key (created_at).

CREATE TABLE part_main (
   main_id serial,
   create_time timestamptz,
   main_val int,
   primary key (main_id, create_time)
 )
 PARTITION BY RANGE (create_time);

CREATE TABLE part_other (
   other_id serial,
   create_time timestamptz,
   main_id int,
   other_val text,
   primary key (other_id, create_time),
   foreign key (main_id, create_time) references part_main(main_id, create_time)
 )
 PARTITION BY RANGE (create_time);

CREATE TABLE part_main_y2023m02 PARTITION OF part_main
    FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');
CREATE TABLE part_main_y2023m03 PARTITION OF part_main
    FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');
CREATE TABLE part_main_y2023m04 PARTITION OF part_main
    FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');

CREATE TABLE part_other_y2023m02 PARTITION OF part_other
    FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');
CREATE TABLE part_other_y2023m03 PARTITION OF part_other
    FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');
CREATE TABLE part_other_y2023m04 PARTITION OF part_other
    FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');

And some test data

insert into part_main (create_time, main_val) values ('2023-04-02', 10);
insert into part_other (main_id, create_time, other_val) select main_id, create_time, 'foo' from part_main;

When joining these tables by (main_id, create_time) and filtering on create_time, partitions are only pruned from the left table. In the plan below, see that only one part_main partition is scanned vs. all three from part_other.

explain analyze 
select * from part_main join part_other using (main_id, create_time)
where create_time >= '2023-04-01';
Merge Join  (cost=326.83..356.03 rows=50 width=52) (actual time=0.132..0.135 rows=1 loops=1)
  Merge Cond: ((part_main.main_id = part_other.main_id) AND (part_main.create_time = part_other.create_time))
  ->  Sort  (cost=61.72..63.26 rows=617 width=16) (actual time=0.083..0.083 rows=1 loops=1)
        Sort Key: part_main.main_id, part_main.create_time
        Sort Method: quicksort  Memory: 25kB
        ->  Seq Scan on part_main_y2023m04 part_main  (cost=0.00..33.12 rows=617 width=16) (actual time=0.012..0.013 rows=1 loops=1)
              Filter: (create_time >= '2023-04-01 00:00:00+01'::timestamp with time zone)
  ->  Sort  (cost=265.11..273.13 rows=3210 width=48) (actual time=0.044..0.045 rows=1 loops=1)
        Sort Key: part_other.main_id, part_other.create_time
        Sort Method: quicksort  Memory: 25kB
        ->  Append  (cost=0.00..78.15 rows=3210 width=48) (actual time=0.019..0.021 rows=1 loops=1)
              ->  Seq Scan on part_other_y2023m02 part_other_1  (cost=0.00..20.70 rows=1070 width=48) (actual time=0.004..0.004 rows=0 loops=1)
              ->  Seq Scan on part_other_y2023m03 part_other_2  (cost=0.00..20.70 rows=1070 width=48) (actual time=0.003..0.003 rows=0 loops=1)
              ->  Seq Scan on part_other_y2023m04 part_other_3  (cost=0.00..20.70 rows=1070 width=48) (actual time=0.011..0.011 rows=1 loops=1)

Filtering on the left and right side created_at columns individually produces the expected plan.

explain analyze 
select * from part_main join part_other using (main_id, create_time)
where part_main.create_time >= '2023-04-01' and part_other.create_time >= '2023-04-01';
Merge Join  (cost=100.23..107.60 rows=6 width=52) (actual time=0.086..0.088 rows=1 loops=1)
  Merge Cond: ((part_main.main_id = part_other.main_id) AND (part_main.create_time = part_other.create_time))
  ->  Sort  (cost=61.72..63.26 rows=617 width=16) (actual time=0.026..0.026 rows=1 loops=1)
        Sort Key: part_main.main_id, part_main.create_time
        Sort Method: quicksort  Memory: 25kB
        ->  Seq Scan on part_main_y2023m04 part_main  (cost=0.00..33.12 rows=617 width=16) (actual time=0.014..0.015 rows=1 loops=1)
              Filter: (create_time >= '2023-04-01 00:00:00+01'::timestamp with time zone)
  ->  Sort  (cost=38.51..39.40 rows=357 width=48) (actual time=0.056..0.056 rows=1 loops=1)
        Sort Key: part_other.main_id, part_other.create_time
        Sort Method: quicksort  Memory: 25kB
        ->  Seq Scan on part_other_y2023m04 part_other  (cost=0.00..23.38 rows=357 width=48) (actual time=0.011..0.011 rows=1 loops=1)
              Filter: (create_time >= '2023-04-01 00:00:00+01'::timestamp with time zone)

So it looks like I have a workaround with the second query style, but I've got a couple of questions.

  1. Is there some semantic difference between the first and the second styles that I'm not aware of? I.e., is there ever a time when where create_time >= '2023-04-01' and where part_main.create_time >= '2023-04-01' and part_other.create_time >= '2023-04-01' could produce different results?
  2. If not, is there anything I can try to get the planner to produce the desired plan with the more succinct style?

答案1

得分: 1

1)
有没有区别。 using (main_id, create_time) 连接谓词是 on part_main.main_id = part_other.main_id and part_main.create_time = part_other.create_time 的缩写。由于这是一个 inner 连接,根据逻辑规则,意味着 part_main.create_time >= '2023-03-01' 等同于 part_other.create_time >= '2023-03-01',因此添加它不会改变结果。

2)
但是没有必要添加第二个条件。正如文档所说,enable_partitionwise_join 的默认值是 false,这可能是您的情况。在设置为 SET enable_partitionwise_join = true; 后,您将看到执行计划只选择相关的分区。

请注意,如果您需要进一步的翻译或解释,请告诉我。
1: https://www.postgresql.org/docs/current/runtime-config-query.html

英文:

1)
There is no difference. The join predicate using (main_id, create_time) is a shorthand for

on part_main.main_id = part_other.main_id and  
part_main.create_time = part_other.create_time

As this is an inner join from part_main.create_time = part_other.create_timeusing the rules of logic implies that

part_main.create_time >= '2023-03-01' is equivalent to part_other.create_time >= '2023-03-01'

So adding it it will not change the result.

2)
But there is no need to add the second condition.

As the documentation says the default for the enable_partitionwise_join is false which is probably your case.

After setting to SET enable_partitionwise_join = true; you will see the execution plan selecting only the relavent partitions

explain analyze 
select * from part_main join part_other using (main_id, create_time)
where create_time >= '2023-04-01';
Merge Join  (cost=136.26..149.08 rows=17 width=52) (actual time=0.032..0.033 rows=1 loops=1)
Merge Cond: ((part_main.main_id = part_other.main_id) AND (part_main.create_time = part_other.create_time))
->  Sort  (cost=61.72..63.26 rows=617 width=16) (actual time=0.012..0.013 rows=1 loops=1)
Sort Key: part_main.main_id, part_main.create_time
Sort Method: quicksort  Memory: 25kB
->  Seq Scan on part_main_y2023m04 part_main  (cost=0.00..33.12 rows=617 width=16) (actual time=0.007..0.008 rows=1 loops=1)
Filter: (create_time >= '2023-04-01 00:00:00+02'::timestamp with time zone)
->  Sort  (cost=74.54..77.21 rows=1070 width=48) (actual time=0.015..0.016 rows=1 loops=1)
Sort Key: part_other.main_id, part_other.create_time
Sort Method: quicksort  Memory: 25kB
->  Seq Scan on part_other_y2023m04 part_other  (cost=0.00..20.70 rows=1070 width=48) (actual time=0.007..0.008 rows=1 loops=1)
Planning Time: 0.216 ms
Execution Time: 0.052 ms 

So there is no need to extend the predicate and additionaly you get with the partition-wise join a better performance (as only the corresponding partitions are joined one to one; instead of pruning partitions, appending them and joining them).

Finaly let me make a remark about your primary key (other_id, create_time) -apparently only the first column is a PK, the second column is added to enable the partitioning. This opens the question if the PK is meaningfull at all.

答案2

得分: 0

两个查询之间在逻辑上没有区别,但有人可能会主张存在语义上的区别。第一个查询是通过主_id和create_time连接两个表,并过滤其中一个表的create_time列(因为对create_time的引用没有特定指明,PostgreSQL可以选择任一表)。PostgreSQL对于主_id和create_time匹配的一般情况执行连接,但不利用哪个第二个表的分区是可行的这一传递逻辑限制。

第二个查询是通过主_id和create_time连接两个表,同时明确过滤了两者的create_time。最终结果相同,但达到结果的指令不同。

没有实际的办法来强制规划器选择特定的计划。作为一个一般原则,不应该试图强制规划器的选择。目标应该是设计一个能够促使规划器找到高效(尽管不一定是最优)计划的环境。

英文:

There's no logical difference between the two queries, but one could argue that there is a semantic difference. The first query says to join the two tables by main_id and create_time and filter on one of the table's create_time column (because the reference to create_time is unqualified, PostgreSQL can choose either table). PostgreSQL performs the join for the general case of main_id and create_time matching but doesn't take advantage of the transitive logical restrictions on which of the second table's partitions are viable.

The second query says to join the two tables by main_id and create_time, but also explicity filters both by create_time. The end result is the same, but the instructions about how to get there are different.

There isn't any practical way to coerce the planner to select a specific plan. As a general princple, one shouldn't try to force the planner's choices. The goal should be to design an environment that facilitates the planner's ability to find an efficient, although not necessarily optimal, plan.

huangapple
  • 本文由 发表于 2023年4月4日 05:17:05
  • 转载请务必保留本文链接:https://go.coder-hub.com/75923839.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定