2023年2月8日 21:51:40go评论61阅读模式

英文:

Simple MySQL query with ORDER BY too slow

问题

编辑：查询普通单词的所需时间实际上是1.78秒。原始帖子中提到的4.5秒是在查询特殊单词（如'.vnet'）时的情况。（我知道REGEXP '\\b.vnet\\b'不会找到'.vnet'的完整单词匹配。我可能会稍后使用更复杂的正则表达式来修复这个问题，或者如果太耗时，可能会取消对'.vnet'的支持。）此外，我在下面添加了解决方案5。

我有以下MySQL查询来实现整词匹配。

SELECT source, target
    FROM tm
    WHERE source REGEXP '\\bword\\b'
      AND customer = 'COMPANY X'
      AND language = 'YYY'
    ORDER BY CHAR_LENGTH(source)
    LIMIT 5;

目前有两个客户和两种语言。

我的目标是在数十万个英语句子中找到短语的前5个最接近的匹配项。之所以按CHAR_LENGTH排序获取的记录，是因为长度越短，匹配比例越高，因为REGEXP '\\bword\\b'确保source已经包含'word'。

tm表：

CREATE TABLE tm (
    id INT AUTO_INCREMENT PRIMARY KEY,
    source TEXT(7000) NOT NULL,
    target TEXT(6000) NOT NULL,
    language CHAR(3),
    customer VARCHAR(10),
    INDEX src_cus_lang (source(755), customer, language)

上面的查询花了大约4.5秒才完成，对于我和我配备Intel Core i5-10400F、16GB RAM和SSD的PC来说非常慢。

EXPLAIN命令显示如下结果：

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tm
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1117154
     filtered: 1.00
        Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)

我尝试删除src_cus_lang索引并创建了一个新的索引(customer, language, source(755))，但没有任何改善。

我可以想到一些解决方案：

重新创建tm表，在过程中按CHAR_LENGTH(source)排序。这对我来说不是理想的选择，因为我想保留表的原始顺序。
创建一个名为src_len的新列，即源的长度。然而，按src_len排序仍然非常慢。
将tm表按customer和language分成4个单独的表。对我来说也不是理想的选择。
对source列创建索引。仍然非常慢。
使用INDEX(customer, language)。对普通单词和像'.vnet'这样的特殊单词都多花了1.4秒。

有没有办法将执行时间缩短到不到0.5秒？

英文:

Edit:The time spent for querying a normal word is actually 1.78 seconds. The 4.5 seconds mentioned in the original post below was when querying special words like '.vnet'. (I know REGEXP '\\b.vnet\\b' won't find the whole word match for '.vnet'. I might use a more complex regex to fix this later, or drop the support for '.vnet' if it's too time-consuming.) Also I added solution 5 below.

I have the following MySQL query to achieve whole word matching.

SELECT  source, target
    FROM  tm
    WHERE  source REGEXP &#39;\\bword\\b&#39;
      AND  customer = &#39;COMPANY X&#39;
      AND  language = &#39;YYY&#39;
    ORDER BY  CHAR_LENGTH(source)
    LIMIT  5;

There are 2 customers and 2 languages currently.

My goal is to find the top 5 closest matches of a phrase among hundreds of thousands of English sentences. The reason the fetched records are ordered by CHAR_LENGTH is because the shorter the length, the higher the match ratio, since REGEXP '\\bword\\b' makes sure source has word already.

The tm table:

CREATE TABLE tm(
    id INT AUTO_INCREMENT PRIMARY KEY,
    source TEXT(7000) NOT NULL,
    target TEXT(6000) NOT NULL,
    language CHAR(3),
    customer VARCHAR(10),
    INDEX src_cus_lang (source(755), customer, language)

The query above took about 4.5 seconds to finish, which is very slow for me and my PC that has an Intel Core i5-10400F, 16GB RAM and an SSD.

The EXPLAIN command showed the below result:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tm
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1117154
     filtered: 1.00
        Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)

I tried to delelte the src_cus_lang index and created a new one (customer, language, source(755)), but no improvement at all.

I can think of a few solutions:

Recreate the tm table, ordering by CHAR_LENGTH(source) in the process. This is not ideal for me as I'd like to keep the original order of the table.
Create a new column named src_len, i.e. the length of the source. However, ORDER BY src_len is still very slow.
Split the tm table into 4 separate ones by customer and language. Not ideal for me.
Index the source column. Still very slow.
Use INDEX(customer, language). Took 1.4 seconds longer for both normal words and special words like '.vnet'.

Is there a way to cut the execution time down to less than 0.5 seconds?

答案1

得分: 2

这基本上是没有用的：

INDEX src_cus_lang (source(755), customer, language)

前缀部分会让其他列变得不太有用。 REGEXP 需要检查全部的 1.1M 行。

这样会更好：

INDEX(customer, language)

它至少会筛选这两列，然后较少地应用 REGEXP。

由于通常希望在考虑 ORDER BY 之前完成 WHERE，所以对于 src_len 等的尝试没有帮助。

如果只有 4 种不同的 customer 和 language 组合，那么不太能做什么。

但是，你应该考虑一个 FULLTEXT(source) 索引。有了这个，

MATCH(source) AGAINST('word' IN BOOLEAN MODE)
AND ...

将会运行得更快。

还可以尝试 IN NATURAL LANGUAGE MODE。

英文:

This is essentially useless:

INDEX src_cus_lang (source(755), customer, language)

The prefixing keeps the rest of the columns from being very useful. REGEXP requires checking all 1.1M rows.

This would be better:

INDEX(customer, language)

It will at least filter on those two columns, then apply the REGEXP fewer times.

Since it usually wants to finish with the WHERE before considering the ORDER BY, your attempts at src_len, etc, did not help.

If there are only 4 different combinations of customer and language, not much can be done.

However, you should consider a FULLTEXT(source) index. With such,

MATCH(source) AGAINST(&#39;+word&#39; IN BOOLEAN MODE)
AND ...

will work much faster.

Also try IN NATURAL LANGUAGE MODE.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

简单的MySQL查询，带有ORDER BY，速度太慢。

问题

答案1

在Matlab中处理内存限制下的矩阵乘法策略。

在 Laravel 中将用户存储在一个表中还是多个表中。

在什么情况下应该初始化一个新变量，而在什么情况下不应该初始化？

GORM无法保存has-many结构体。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论