Databases: Effectively implement string contains query
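Question
I need a way to efficiently run a string-contains query such as:
# In SQL
LIKE '%some-string%'
# In MongoDB
{ $regex: /some-string/ }
But it is very slow when the dataset is large. For example, I tried it on a dummy database (with and without an index; surprisingly, no index was faster on MongoDB) with 100 million generated rows (the real dataset has more). Using ElasticSearch seems reasonable, but I am wondering whether there is a database, or a way to structure my data, that optimizes this use case. I asked, and I really do need contains rather than a prefix match ...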
# Answer 1
**Score**: 1
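PostgreSQL offers so-called [trigram indexes][1]. These indexes can speed up SQL `col LIKE '%search%'` predicates well enough for most purposes. Note that on every make of database server, an ordinary index can already speed up `col LIKE 'string%'` (no leading wildcard character).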
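To make the trigram idea concrete, here is a minimal Python sketch of how such an index can answer a contains query. It is a toy model under my own assumptions, not PostgreSQL's implementation: the class name `TrigramIndex` and its methods are hypothetical, and the real extension differs in details such as how it normalizes and pads text. The point is only the two-step shape: use the trigrams to narrow down candidate rows, then recheck each candidate with a real substring test.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return the set of 3-character substrings of text (lower-cased)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy inverted index (hypothetical): trigram -> ids of rows containing it."""

    def __init__(self) -> None:
        self.rows: list[str] = []
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, value: str) -> int:
        """Store a row and register each of its trigrams."""
        row_id = len(self.rows)
        self.rows.append(value)
        for gram in trigrams(value):
            self.postings[gram].add(row_id)
        return row_id

    def contains(self, needle: str) -> list[str]:
        """Rows containing needle as a substring (case-insensitive)."""
        grams = trigrams(needle)
        if not grams:
            # Needle shorter than 3 characters: the index cannot help, so scan.
            candidates = set(range(len(self.rows)))
        else:
            # A row can match only if it contains every trigram of the needle.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        # Recheck: sharing all trigrams does not guarantee a real match.
        needle = needle.lower()
        return [self.rows[i] for i in sorted(candidates)
                if needle in self.rows[i].lower()]
```

Note the fallback for needles shorter than three characters: in this sketch the index has nothing to intersect, so the search degrades to a full scan.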
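MySQL / MariaDB have [FULLTEXT indexes][2], which use their own SQL syntax. That feature works word by word, unlike `LIKE`, which works character by character. Microsoft SQL Server has a [similar feature][3] with a different syntax; it also works word by word.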
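The word-by-word versus character-by-character difference matters for a contains requirement, so here is a small Python sketch of it. The sample rows, the `\w+` tokenizer, and the function names are my own assumptions for illustration; real full-text engines use their own tokenizers, stopword lists, and minimum word lengths.

```python
import re
from collections import defaultdict

# Hypothetical sample rows, for illustration only.
rows = ["an awesome-string value", "some string here", "nothing relevant"]

# Word-level inverted index: token -> ids of rows containing that whole word.
word_index: dict[str, set[int]] = defaultdict(set)
for row_id, text in enumerate(rows):
    for token in re.findall(r"\w+", text.lower()):
        word_index[token].add(row_id)


def word_search(term: str) -> list[int]:
    """Ids of rows containing term as a complete word (full-text style)."""
    return sorted(word_index.get(term.lower(), set()))


def substring_search(term: str) -> list[int]:
    """Ids of rows containing term anywhere, like LIKE '%term%' (full scan)."""
    return [i for i, text in enumerate(rows) if term.lower() in text.lower()]
```

With these rows, `word_search("some")` finds only the row where "some" is a whole word, while `substring_search("some")` also finds the row containing "awesome". If your users expect the second behavior, a word-based index alone will not give it to you.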
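So there is no SQL-standard way to do this efficiently, and different makes of database server do it differently.
If you haven't yet chosen a particular database server, you should figure out whether one of the full-text schemes will serve your purpose. If you must get good performance from `LIKE`, PostgreSQL's trigram indexing is the way to go.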
[1]: https://www.postgresql.org/docs/current/pgtrgm.html
[2]: https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html
[3]: https://learn.microsoft.com/en-us/sql/relational-databases/search/full-text-search?view=sql-server-ver16
# Answer 2
**Score**: 0
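I don't think there is a general solution to this that works for all database systems. As another answer already explains, many popular database systems have full-text search extensions that, while far from matching what tools like Lucene/ElasticSearch can do, should be enough to massively speed up your use case.
Let me explain this from a database-internals perspective. Suppose your predicate is highly selective, i.e. only a very small percentage of your tuples actually match the condition. In that case you generally want some kind of index structure. The index structure you would **need** for this kind of query is some kind of radix tree/trie, but that is not a standard data structure implemented in all SQL databases. The only data structure implemented in almost all SQL databases is the B-tree, and a B-tree can only serve prefix queries such as `LIKE 'test%'`. If your database has no index suited to substring search, your only chance for `LIKE '%test%'` is a very fast runtime system, which none of the traditional (open source) database systems has...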
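A rough Python sketch of why ordered indexes handle prefixes but not contains. A sorted list with binary search stands in for a B-tree here; that substitution, the sample values, and the function names are my assumptions. The second half shows one common way a trie-like structure can serve contains queries, namely indexing every suffix so that "contains" becomes "some suffix starts with". The answer does not spell out that construction, so treat it as an illustration and not as what any particular database does.

```python
import bisect

# Hypothetical sample column values.
values = ["contest", "latest", "test", "testing", "toast"]

# Stand-in for a B-tree: keys kept in sorted order.
sorted_values = sorted(values)


def prefix_search(prefix: str) -> list[str]:
    """Keys starting with prefix: one binary search, then a short range scan."""
    i = bisect.bisect_left(sorted_values, prefix)
    matches = []
    # All keys sharing the prefix are contiguous in sorted order.
    while i < len(sorted_values) and sorted_values[i].startswith(prefix):
        matches.append(sorted_values[i])
        i += 1
    return matches


# Suffix index: every suffix of every value, sorted, pointing back to its row.
# Naive storage cost grows quadratically with string length.
suffixes = sorted((value[i:], value) for value in values for i in range(len(value)))
suffix_keys = [suffix for suffix, _ in suffixes]


def contains_search(needle: str) -> list[str]:
    """Values containing needle: a prefix search over the indexed suffixes."""
    i = bisect.bisect_left(suffix_keys, needle)
    found = set()
    while i < len(suffix_keys) and suffix_keys[i].startswith(needle):
        found.add(suffixes[i][1])
        i += 1
    return sorted(found)
```

Here `prefix_search("test")` can only reach "test" and "testing", because "contest" and "latest" sort somewhere else entirely; `contains_search("test")` finds all four because each of them has a suffix beginning with "test". The price is a much larger index, which is part of why this is not a default structure in most databases.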