What's the best way to index a text array with large text?
Question
I have arbitrarily sized text values in a text array, and I would like to index the column for "contains" lookups.
It looks like the text in the array is too large for a traditional index.
This is also quite a large table - it has a few billion rows.
Using a standard GIN index gives me the error:
ERROR: index row size 6648 exceeds maximum 2712 for index "index"
After looking it up, it seems that GIN defaults to using BTREE, which is probably not the right choice for columns this large.
What's a good alternate index I can use without having to resort to complex type conversion? (A simple one could do).
Answer 1
Score: 1
I would not index the strings themselves, but hashes of them. For that, create a function like this:
CREATE FUNCTION hashtestarray(text[]) RETURNS integer[]
IMMUTABLE PARALLEL SAFE
BEGIN ATOMIC
SELECT array_agg(hashtext(t)) FROM unnest($1) AS a(t);
END;
That applies the hash function for text, hashtext, to each element of the array, so the index stores small integers instead of the long strings.
Then index like
CREATE INDEX ON tab USING gin (hashtestarray(arraycol));
and search like
SELECT ... FROM tab
WHERE hashtestarray(arraycol) @> ARRAY[hashtext('searchstring')]
AND arraycol @> ARRAY['searchstring'];
The first condition can use the index, and the second condition will remove false positives.
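To try the approach end to end, here is a minimal sketch, assuming PostgreSQL 14 or later (needed for the BEGIN ATOMIC function body). The function is repeated from above so the script runs on its own. The id column, the sample rows, and the index name tab_arraycol_hash_idx are made up for illustration. Both operands of @> are written as arrays, since the operator compares an array with an array.

```sql
-- Function as in the answer, repeated so the script is self-contained.
CREATE FUNCTION hashtestarray(text[]) RETURNS integer[]
   IMMUTABLE PARALLEL SAFE
BEGIN ATOMIC
   SELECT array_agg(hashtext(t)) FROM unnest($1) AS a(t);
END;

-- Hypothetical demo table; only the names tab and arraycol come from the answer.
CREATE TABLE tab (
   id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   arraycol text[] NOT NULL
);

INSERT INTO tab (arraycol) VALUES
   (ARRAY['searchstring', repeat('x', 10000)]),  -- one very long element
   (ARRAY['something else']);

-- The index stores integer hashes, so the length of the strings does not matter.
CREATE INDEX tab_arraycol_hash_idx ON tab USING gin (hashtestarray(arraycol));

-- Demo only: discourage a sequential scan on this tiny table
-- so that the plan shows whether the index is usable.
SET enable_seqscan = off;

EXPLAIN (COSTS OFF)
SELECT id FROM tab
WHERE hashtestarray(arraycol) @> ARRAY[hashtext('searchstring')]  -- indexable
  AND arraycol @> ARRAY['searchstring'];                          -- recheck on the real strings
```

If the plan mentions tab_arraycol_hash_idx, the first condition is being answered from the index and the second one is applied as a filter on the rows that survive. On a table with billions of rows the enable_seqscan setting should not be needed; it is only there because the planner tends to prefer a sequential scan on a two-row table.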