Best practice for joining huge tables using partitions (postgres)?

Question

I'm doing identity resolution on two huge customer tables. I want to join them on the customer's last name, as well as first name up to nickname variations, which I have already prepared:

```sql
SELECT * FROM t1
INNER JOIN t2
ON t1.lname = t2.lname
AND t1.fname ~ t2.fname_regex
```

The tables are very large, so I figured I would split them up alphabetically. But even after narrowing down t1 and t2 to last names starting with the letter "a", I have 671k rows in t1 and 203k in t2, and the above query fails to complete in many hours.

Basically, what I would like to do is partition the tables into chunks of, say, 67k / 20k rows (a query with such limits runs in reasonable time) and join chunk to chunk. This can be done because we're working with names alphabetically.

For example, the t1 chunk with last names in, say, (Aaa-Abg) only needs to be checked against the t2 chunk with the same alphabetical name range.
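
Concretely, a single chunk-to-chunk pass might look like the sketch below; the range boundaries ('aaa' / 'abh') are illustrative placeholders and would in practice be derived from the sorted last names:

```sql
-- One manual chunk-to-chunk pass; the boundaries are placeholders, and the
-- sketch assumes lname is stored lower-cased and indexed on both tables.
SELECT *
FROM   t1
JOIN   t2
  ON   t1.lname = t2.lname
 AND   t1.fname ~ t2.fname_regex
WHERE  t1.lname >= 'aaa' AND t1.lname < 'abh'
  AND  t2.lname >= 'aaa' AND t2.lname < 'abh';
```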

**It seems to me like this might be a pretty common scenario, so I'm wondering if there are any best practices regarding such a partition/join.**

Is there any way to indicate to Postgres that I want to partition my tables into such alphabetical chunks on some column, so that it can prepare the partitions and indices in an automated way, and perform the join chunk by chunk?
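
As far as I can tell, the closest built-in mechanism on 9.4 is inheritance-based partitioning with CHECK constraints (declarative partitioning only appeared in Postgres 10). A rough sketch with made-up child-table names follows, though it seems to only prune children for constant predicates rather than join chunk to chunk:

```sql
-- Pre-10 "partitioning" via table inheritance; names and boundaries
-- are placeholders, one child table per alphabetical chunk.
CREATE TABLE t1_aaa_abg (
    CHECK (lname >= 'aaa' AND lname < 'abh')
) INHERITS (t1);

CREATE INDEX ON t1_aaa_abg (lname);

-- Lets the planner skip children whose CHECK range cannot match a constant
-- lname predicate; 'partition' is already the default setting, shown here
-- only to make the dependency explicit.
SET constraint_exclusion = partition;
```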

*P.S. I'm on Postgres 9.4.26 and can't do anything about it*

EDIT:

Since the problem seems to be mainly with the regex, let me explain what I'm trying to achieve here. I want to match by first name, but up to nicknames, so that "Robert" matches "Bob" etc. So for each fname I built an fname_regex string which looks like `"(^robert$|^bob$|^rob$)"`, using an external nicknames dataset. I didn't expect it to hurt performance this badly. If this approach can't be salvaged, what would be the recommended way of implementing such logic?
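
For reference, the regex strings are built roughly like in the sketch below; the nicknames(canonical, variant) table is a hypothetical stand-in for the external dataset, and the column names are made up:

```sql
-- Rough sketch of how fname_regex is populated on t2. The nicknames table
-- and its columns are hypothetical; it is assumed to list the canonical
-- name itself among the variants, so "^robert$" ends up in the pattern too.
UPDATE t2
SET    fname_regex = v.rx
FROM  (
    SELECT canonical,
           '(' || string_agg('^' || variant || '$', '|') || ')' AS rx
    FROM   nicknames
    GROUP  BY canonical
) AS v
WHERE  lower(t2.fname) = v.canonical;
```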

EDIT 2:

I made a new question here:
https://stackoverflow.com/questions/75545437/joining-two-tables-by-column-with-several-possible-values-improving-performance


<details>
<summary>英文:</summary>

I&#39;m doing identity resolution on two huge customer tables. I want to join them on the customer&#39;s last name, as well as first name up to nickname variations, which I have already prepared:

SELECT * FROM t1
INNER JOIN t2
ON t1.lname = t2.lname
AND t1.fname ~ t2.fname_regex


The tables are very large, so I figured I would split them up alphabetically. But even after narrowing down t1 and t2 to last names starting with the letter &quot;a&quot;, I have 671k rows in t1, and 203k in t2 - the above query fails to complete in many hours.

Basically, what I would like to do is partition the tables into chunks of, say, 67k / 20k (a query with such limits runs in reasonable time) and join chunk to chunk. This can be done because we&#39;re working with names alphabetically. 

For example, the t1 chunk with last names in, say, (Aaa-Abg) only needs to be checked against the t2 chunk with the same alphabetical name range.

**It seems to me like this might be a pretty common scenario, so I&#39;m wondering if there are any best practices regarding such a partition/join.** 

Is there any way to indicate to postgres that I want to partition my tables into such alphabetical chunks on some column, so that it can prepare the partitions and indices in an automated way, and perform the join chunk by chunk?

*P.S. I&#39;m on Postgres 9.4.26 and can&#39;t do anything about it*

EDIT: 

Since the problem seems to be mainly with the regex, let me explain what I tried to achieve here. I want to match by first name, but up to nicknames, so that &quot;Robert&quot; matches with &quot;Bob&quot; etc. So for each fname I built a fname_regex string which looks like `&quot;(^robert$|^bob$|^rob$)&quot;` using an external nicknames dataset. Didn&#39;t expect it to axe performance this badly. If this approach can&#39;t be  salvaged, then what would be the recommended way of implementing such logic? 

EDIT 2:

I made a new question here: 
https://stackoverflow.com/questions/75545437/joining-two-tables-by-column-with-several-possible-values-improving-performance

</details>


# Answer 1
**Score**: 1

If the regular expression is your bottleneck, all you can hope for is parallel query (available in supported versions of PostgreSQL), which would allow the query to run on several cores in parallel.
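
For illustration, on a version that supports parallel query (9.6 or later, not the 9.4 from the question), allowing more workers might look like the sketch below; the worker count is an arbitrary example, not a tuned recommendation:

```sql
-- Only applies to PostgreSQL 9.6+; 9.4 has no parallel query at all.
-- The value 4 is an example, not a recommendation.
SET max_parallel_workers_per_gather = 4;

EXPLAIN ANALYZE
SELECT *
FROM   t1
JOIN   t2
  ON   t1.lname = t2.lname
 AND   t1.fname ~ t2.fname_regex;
```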

<details>
<summary>英文:</summary>

If the regular expression is your bottleneck, all you can hope for is parallel query (available with supported versions of PostgreSQL), which will allow you to have the query run on several cores in parallel.

</details>


