Best practice for joining huge tables using partitions (postgres)?


Question

I'm doing identity resolution on two huge customer tables. I want to join them on the customer's last name, as well as on first name up to nickname variations, which I have already prepared:

```sql
SELECT * FROM t1
INNER JOIN t2
ON t1.lname = t2.lname
AND t1.fname ~ t2.fname_regex;
```

The tables are very large, so I figured I would split them up alphabetically. But even after narrowing t1 and t2 down to last names starting with the letter "a", t1 has 671,000 rows and t2 has 203,000, and the query above fails to complete after many hours.

Basically, what I would like to do is partition the tables into chunks of roughly 67,000 / 20,000 rows (a query with such limits runs in reasonable time) and join chunk to chunk. This is possible because we are working through the names alphabetically.

For example, the t1 chunk with last names in, say, (Aaa-Abg) only needs to be checked against the t2 chunk covering the same alphabetical range.
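For reference, a hand-driven version of such a chunk-to-chunk join might look like the sketch below. The range boundaries ('aaa' / 'abh') and index names are purely illustrative, not from the original post, and nothing beyond plain btree indexes on `lname` is assumed:

```sql
-- Sketch: btree indexes on lname keep each per-chunk scan small.
CREATE INDEX t1_lname_idx ON t1 (lname);
CREATE INDEX t2_lname_idx ON t2 (lname);

-- One chunk; repeat (or generate) the statement for each alphabetical range.
SELECT *
FROM t1
INNER JOIN t2
   ON t1.lname = t2.lname
  AND t1.fname ~ t2.fname_regex
WHERE t1.lname >= 'aaa' AND t1.lname < 'abh'
  AND t2.lname >= 'aaa' AND t2.lname < 'abh';
```

The range predicate on t2 is redundant given the equality join, but repeating it on both sides gives the planner a chance to use an index range scan on each table instead of scanning one of them in full.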

**It seems to me like this might be a pretty common scenario, so I'm wondering if there are any best practices for such a partition/join.**

Is there any way to tell Postgres that I want to partition my tables into such alphabetical chunks on some column, so that it can prepare the partitions and indexes automatically and perform the join chunk by chunk?
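For what it's worth, declarative partitioning (`PARTITION BY RANGE`) only arrived in PostgreSQL 10, so on 9.4 the built-in option is table inheritance with CHECK constraints. A rough sketch, with made-up child-table names and bounds:

```sql
-- One inheritance "partition" per alphabetical range (names and bounds are illustrative).
CREATE TABLE t1_aaa_abg (
    CHECK (lname >= 'aaa' AND lname < 'abh')
) INHERITS (t1);

CREATE INDEX ON t1_aaa_abg (lname);

-- Rows have to be routed into the children explicitly, e.g. with a bulk copy
-- (followed by deleting the copied rows FROM ONLY t1) or an INSERT trigger.
INSERT INTO t1_aaa_abg
SELECT * FROM ONLY t1
WHERE lname >= 'aaa' AND lname < 'abh';

-- Lets the planner skip children whose CHECK constraint rules them out.
SET constraint_exclusion = partition;
```

Note that constraint exclusion only prunes children against constants in the WHERE clause, not against join keys coming from the other table, so the join itself would still have to be driven chunk by chunk as in the earlier sketch.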

*P.S. I'm on Postgres 9.4.26 and can't do anything about it.*

EDIT:

Since the problem seems to be mainly with the regex, let me explain what I'm trying to achieve here. I want to match by first name, but up to nicknames, so that "Robert" matches "Bob", and so on. So for each first name I built an fname_regex string that looks like `"(^robert$|^bob$|^rob$)"`, using an external nicknames dataset. I didn't expect it to hurt performance this badly. If this approach can't be salvaged, what would be the recommended way to implement such logic?
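For context, such an fname_regex column could be built from a nicknames dataset roughly as below. This is only a sketch: `nicknames(canonical, variant)` is a hypothetical layout, assumed to list each name as a variant of itself:

```sql
-- Build "(^robert$|^bob$|^rob$)"-style patterns, one per t2 row (hypothetical nicknames table).
UPDATE t2
SET fname_regex = COALESCE(
    (SELECT '(' || string_agg('^' || variant || '$', '|') || ')'
     FROM nicknames
     WHERE canonical = lower(t2.fname)),
    '(^' || lower(t2.fname) || '$)'   -- no nickname entry: fall back to an exact-match pattern
);
```

The cost then shows up at join time: every pair of rows with equal last names triggers a fresh regex evaluation, and no index can help with `t1.fname ~ t2.fname_regex`, which is why the join degrades so badly.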

EDIT 2:

I asked a follow-up question here:
https://stackoverflow.com/questions/75545437/joining-two-tables-by-column-with-several-possible-values-improving-performance




# Answer 1
**Score**: 1

If the regular expression is your bottleneck, all you can hope for is parallel query (available in supported versions of PostgreSQL), which lets the query run on several cores in parallel.
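For illustration only: parallel query appeared in PostgreSQL 9.6, so none of this is available on 9.4, but on a supported version it is controlled by a handful of settings such as the one below.

```sql
-- Sketch (PostgreSQL 9.6+): allow up to 4 parallel workers per Gather node.
SET max_parallel_workers_per_gather = 4;

EXPLAIN ANALYZE
SELECT *
FROM t1
INNER JOIN t2
   ON t1.lname = t2.lname
  AND t1.fname ~ t2.fname_regex;
-- A parallel plan shows a "Gather" node and reports "Workers Launched: N".
```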



