Databases: Efficiently implement a string contains query

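Question

I need a way to efficiently run a string contains query, such as:

In SQL: `LIKE '%some-string%'`

In Mongo: `{ $regex: /some-string/ }`

But it is very slow when the dataset is large. For example, I tried it on a dummy database with 100 million generated rows (the real dataset is larger), both with and without an index. Surprisingly, the query was faster without an index on Mongo. Using ElasticSearch seems reasonable, but I am wondering whether there is a database, or a way to structure my data, that is optimized for this use case. I have asked, and I really do need a contains match rather than a prefix match.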



# Answer 1
**Score**: 1

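PostgreSQL offers so-called [trigram indexes][1]. These indexes can accelerate SQL predicates such as `col LIKE '%search%'` efficiently. Note that on every database server, an ordinary index can already speed up `col LIKE 'string%'` (no leading wildcard character).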
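To make the idea concrete, here is a minimal Python sketch of how a trigram index can answer a contains query. It is a toy illustration of the general technique, not PostgreSQL's implementation; the names `trigrams` and `TrigramIndex` are made up for this example. In PostgreSQL itself the index would typically be created with something like `CREATE EXTENSION pg_trgm;` followed by `CREATE INDEX ... USING gin (col gin_trgm_ops);`, where the table and column names are whatever your schema uses.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy inverted index (hypothetical): trigram -> ids of rows containing it."""

    def __init__(self) -> None:
        self.rows: list[str] = []
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, value: str) -> None:
        row_id = len(self.rows)
        self.rows.append(value)
        for gram in trigrams(value):
            self.postings[gram].add(row_id)

    def contains(self, needle: str) -> list[str]:
        grams = trigrams(needle)
        if not grams:
            # Needles shorter than 3 characters have no trigrams: fall back to a full scan.
            candidates = range(len(self.rows))
        else:
            # A matching row must contain every trigram of the needle.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        # Recheck each candidate: sharing all trigrams does not guarantee a match.
        return [self.rows[i] for i in sorted(candidates) if needle in self.rows[i]]


index = TrigramIndex()
for value in ["awesome-string-1", "nothing here", "some-string", "abcd xbcde"]:
    index.add(value)

print(index.contains("some-string"))
```

The final recheck matters: `"abcd xbcde"` contains every trigram of `"abcde"` without containing the string itself, so the index only narrows down the candidates and the actual comparison still runs on the few rows that remain.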

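MySQL and MariaDB have [FULLTEXT indexes][2], which are queried with their own distinctive SQL syntax. That feature works word by word, unlike `LIKE`, which works character by character. Microsoft SQL Server has a [similar feature][3] with a different syntax; it also works word by word.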
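The word-by-word versus character-by-character difference can be sketched in a few lines of Python. This is only an illustration of the concept: real full-text engines add much more (stop words, stemming, ranking) that is ignored here, and the helper names `build_word_index`, `word_search`, and `substring_search` are hypothetical.

```python
import re
from collections import defaultdict


def build_word_index(rows: list[str]) -> dict[str, set[int]]:
    """Map each lowercased word to the ids of the rows that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for row_id, text in enumerate(rows):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(row_id)
    return index


def word_search(index: dict[str, set[int]], term: str) -> list[int]:
    """Word-level lookup: only rows where term appears as a whole word."""
    return sorted(index.get(term.lower(), set()))


def substring_search(rows: list[str], needle: str) -> list[int]:
    """Character-level scan, like LIKE '%needle%': any position in the text."""
    return [row_id for row_id, text in enumerate(rows) if needle in text]


rows = ["fast string search", "stringent rules", "no match"]
word_index = build_word_index(rows)

print(word_search(word_index, "string"))
print(substring_search(rows, "string"))
```

Here the word lookup finds only the first row, while the substring scan also finds `"stringent rules"`. If you need matches in the middle of words, a word-based index alone will not give you `LIKE '%...%'` semantics.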

So there is no SQL-standard way to do this efficiently, and each database server does it differently.

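If you haven't yet chosen a particular database server, work out whether one of the full-text schemes will serve your purpose. If you must get good performance from `LIKE`, PostgreSQL's trigram indexing is the way to go.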

  [1]: https://www.postgresql.org/docs/current/pgtrgm.html
  [2]: https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html
  [3]: https://learn.microsoft.com/en-us/sql/relational-databases/search/full-text-search?view=sql-server-ver16


# Answer 2
**Score**: 0

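I don't think there is a general solution that works for all database systems. As the other answer explains, many popular database systems have full-text search extensions. They are far from being able to do everything that tools like Lucene/ElasticSearch can do, but they should be enough to speed up your use case massively.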

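Let me explain this from a database-internals perspective. Suppose your predicate is highly selective, that is, only a very small percentage of your tuples actually match the condition. In that case you generally want some kind of index structure. The index structure you would **need** for this kind of query is some kind of radix tree/trie, but that is not a standard data structure implemented in all SQL databases. The only data structure implemented in almost all SQL databases is the B-tree, and a B-tree can only serve prefix queries such as `LIKE 'test%'`. If your database has no suitable index, your only chance for `LIKE '%test%'` is a runtime system fast enough to scan every row quickly, which none of the traditional (open source) database systems has.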
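The prefix limitation, and one way around it, can be sketched in Python. The first function mimics how an ordered index such as a B-tree serves a prefix query with a range scan. The second part is an assumption on my side about how a trie-like structure could serve contains queries: index every suffix of every row, so that a contains query becomes a prefix query over suffixes. This is not how any particular database implements it, and all function names are hypothetical.

```python
from bisect import bisect_left


def prefix_search(sorted_keys: list[str], prefix: str) -> list[str]:
    """Range scan over sorted keys, as an ordered (B-tree-like) index allows."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


def build_suffix_index(rows: list[str]) -> list[tuple[str, int]]:
    """Sorted list of (suffix, row id) pairs for every suffix of every row."""
    return sorted(
        (text[i:], row_id)
        for row_id, text in enumerate(rows)
        for i in range(len(text))
    )


def contains_search(
    suffix_index: list[tuple[str, int]], rows: list[str], needle: str
) -> list[str]:
    """A row contains needle exactly when one of its suffixes starts with needle."""
    # (needle,) sorts before every (needle..., row_id) pair, so this finds the first candidate.
    start = bisect_left(suffix_index, (needle,))
    row_ids = set()
    for suffix, row_id in suffix_index[start:]:
        if not suffix.startswith(needle):
            break
        row_ids.add(row_id)
    return [rows[i] for i in sorted(row_ids)]


keys = sorted(["test", "testing", "attest", "contest", "tester"])
print(prefix_search(keys, "test"))

rows = ["attest", "contest", "testing", "other"]
suffix_index = build_suffix_index(rows)
print(contains_search(suffix_index, rows, "test"))
```

The trade-off is space: storing every suffix grows quadratically with the length of each string, which is one reason general-purpose databases tend to offer n-gram or full-text indexes instead.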



huangapple
  • Published on February 18, 2023, 14:55:01
  • Source: https://go.coder-hub.com/75491707.html