May 13, 2023 13:26:38


What's the best way to index a text array with large text?

Question


I have some arbitrarily-sized text in a text array, and I would like to index it for contains lookups.

It looks like the text in the array is too large for a traditional index.

This is also quite a large table - it has a few billion rows.

Using a standard GIN index gives me the error:

ERROR:  index row size 6648 exceeds maximum 2712 for index "index"

After looking it up, it looks like GIN defaults to using BTREE which is probably not the right thing to use for these types of huge columns.

What's a good alternate index I can use without having to resort to complex type conversion? (A simple one could do).

Answer 1

Score: 1


I would not index the strings, but hashes thereof. For that, create a function like

CREATE FUNCTION hashtestarray(text[]) RETURNS integer[]
IMMUTABLE PARALLEL SAFE
BEGIN ATOMIC
   SELECT array_agg(hashtext(t)) FROM unnest($1) AS a(t);
END;

That applies the text hash function to each element of the array.
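As a quick illustration of what the function returns, the queries below call it on made-up sample arrays. This is only a sketch: it assumes the function above has already been created (which needs PostgreSQL 14 or later for `BEGIN ATOMIC`), and since `hashtext` is not part of the documented API, the concrete hash values are best treated as opaque.

```sql
-- Sample values are invented for illustration.
-- Returns one integer hash per array element.
SELECT hashtestarray(ARRAY['first element', 'second element']) AS hashes;

-- The number of hashes equals the number of elements,
-- no matter how long each element is.
SELECT cardinality(hashtestarray(ARRAY[repeat('x', 100000), 'short'])) AS n;
```

Because each element is reduced to a 4-byte integer, the indexed values stay far below the index row size limit from the error message in the question.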

Then index like

CREATE INDEX ON tab USING gin (hashtestarray(arraycol));

and search like

SELECT ... FROM tab
WHERE hashtestarray(arraycol) @> ARRAY[hashtext('searchstring')]
  AND arraycol @> ARRAY['searchstring'];

The first condition can use the index, and the second condition will remove false positives.
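To check that the hash condition is really served by the index, one could try the approach on a small test table and look at the plan. The following is a minimal sketch under some assumptions: `tab` and `arraycol` are the placeholder names used above, the `id` column, the index name `tab_hash_idx`, and the generated sample data are hypothetical, and `hashtestarray()` from above must already exist.

```sql
-- Hypothetical demo table; adjust names and types to the real schema.
CREATE TABLE tab (
    id       bigint GENERATED ALWAYS AS IDENTITY,
    arraycol text[]
);

-- Each row gets one long element and one short, searchable element.
INSERT INTO tab (arraycol)
SELECT ARRAY[repeat(md5(i::text), 200), 'tag' || i]
FROM generate_series(1, 100000) AS i;

-- Index the array of hashes rather than the strings themselves.
CREATE INDEX tab_hash_idx ON tab USING gin (hashtestarray(arraycol));
ANALYZE tab;

-- Both sides of @> must be arrays, hence the ARRAY[...] constructors.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM tab
WHERE hashtestarray(arraycol) @> ARRAY[hashtext('tag42')]
  AND arraycol @> ARRAY['tag42'];
```

If the planner picks the index, the plan should contain a bitmap index scan on `tab_hash_idx`, with the condition on `arraycol` applied afterwards as a filter. On a table with billions of rows, the index build itself will take considerable time, so it is worth trying this on a sample first.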
