简单的MySQL查询,带有ORDER BY,速度太慢。

huangapple go评论54阅读模式
英文:

Simple MySQL query with ORDER BY too slow

问题

编辑:查询普通单词的所需时间实际上是1.78秒。原始帖子中提到的4.5秒是在查询特殊单词(如'.vnet')时的情况。(我知道REGEXP '\\b.vnet\\b'不会找到'.vnet'的完整单词匹配。我可能会稍后使用更复杂的正则表达式来修复这个问题,或者如果太耗时,可能会取消对'.vnet'的支持。)此外,我在下面添加了解决方案5。


我有以下MySQL查询来实现整词匹配。

SELECT source, target
    FROM tm
    WHERE source REGEXP '\\bword\\b'
      AND customer = 'COMPANY X'
      AND language = 'YYY'
    ORDER BY CHAR_LENGTH(source)
    LIMIT 5;

目前有两个客户和两种语言。

我的目标是在数十万个英语句子中找到短语的前5个最接近的匹配项。之所以按CHAR_LENGTH排序获取的记录,是因为长度越短,匹配比例越高,因为REGEXP '\\bword\\b'确保source已经包含'word'。

tm表:

CREATE TABLE tm (
    id INT AUTO_INCREMENT PRIMARY KEY,
    source TEXT(7000) NOT NULL,
    target TEXT(6000) NOT NULL,
    language CHAR(3),
    customer VARCHAR(10),
    INDEX src_cus_lang (source(755), customer, language)

上面的查询花了大约4.5秒才完成,对于我和我配备Intel Core i5-10400F、16GB RAM和SSD的PC来说非常慢。

EXPLAIN命令显示如下结果:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tm
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1117154
     filtered: 1.00
        Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)

我尝试删除src_cus_lang索引并创建了一个新的索引(customer, language, source(755)),但没有任何改善。

我可以想到一些解决方案:

  1. 重新创建tm表,在过程中按CHAR_LENGTH(source)排序。这对我来说不是理想的选择,因为我想保留表的原始顺序。
  2. 创建一个名为src_len的新列,即源的长度。然而,按src_len排序仍然非常慢。
  3. tm表按customerlanguage分成4个单独的表。对我来说也不是理想的选择。
  4. source列创建索引。仍然非常慢。
  5. 使用INDEX(customer, language)。对普通单词和像'.vnet'这样的特殊单词都多花了1.4秒。

有没有办法将执行时间缩短到不到0.5秒?

英文:

Edit:The time spent for querying a normal word is actually 1.78 seconds. The 4.5 seconds mentioned in the original post below was when querying special words like '.vnet'. (I know REGEXP '\\b.vnet\\b' won't find the whole word match for '.vnet'. I might use a more complex regex to fix this later, or drop the support for '.vnet' if it's too time-consuming.) Also I added solution 5 below.


I have the following MySQL query to achieve whole word matching.

SELECT  source, target
    FROM  tm
    WHERE  source REGEXP '\\bword\\b'
      AND  customer = 'COMPANY X'
      AND  language = 'YYY'
    ORDER BY  CHAR_LENGTH(source)
    LIMIT  5;

There are 2 customers and 2 languages currently.

My goal is to find the top 5 closest matches of a phrase among hundreds of thousands of English sentences. The reason the fetched records are ordered by CHAR_LENGTH is because the shorter the length, the higher the match ratio, since REGEXP '\\bword\\b' makes sure source has word already.

The tm table:

CREATE TABLE tm(
    id INT AUTO_INCREMENT PRIMARY KEY,
    source TEXT(7000) NOT NULL,
    target TEXT(6000) NOT NULL,
    language CHAR(3),
    customer VARCHAR(10),
    INDEX src_cus_lang (source(755), customer, language)

The query above took about 4.5 seconds to finish, which is very slow for me and my PC that has an Intel Core i5-10400F, 16GB RAM and an SSD.

The EXPLAIN command showed the below result:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tm
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1117154
     filtered: 1.00
        Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)

I tried to delelte the src_cus_lang index and created a new one (customer, language, source(755)), but no improvement at all.

I can think of a few solutions:

  1. Recreate the tm table, ordering by CHAR_LENGTH(source) in the process. This is not ideal for me as I'd like to keep the original order of the table.
  2. Create a new column named src_len, i.e. the length of the source. However, ORDER BY src_len is still very slow.
  3. Split the tm table into 4 separate ones by customer and language. Not ideal for me.
  4. Index the source column. Still very slow.
  5. Use INDEX(customer, language). Took 1.4 seconds longer for both normal words and special words like '.vnet'.

Is there a way to cut the execution time down to less than 0.5 seconds?

答案1

得分: 2

这基本上是没有用的:

INDEX src_cus_lang (source(755), customer, language)

前缀部分会让其他列变得不太有用。 REGEXP 需要检查全部的 1.1M 行。

这样会更好:

INDEX(customer, language)

它至少会筛选这两列,然后较少地应用 REGEXP

由于通常希望在考虑 ORDER BY 之前完成 WHERE,所以对于 src_len 等的尝试没有帮助。

如果只有 4 种不同的 customerlanguage 组合,那么不太能做什么。

但是,你应该考虑一个 FULLTEXT(source) 索引。有了这个,

MATCH(source) AGAINST('word' IN BOOLEAN MODE)
AND ...

将会运行得更快。

还可以尝试 IN NATURAL LANGUAGE MODE

英文:

This is essentially useless:

INDEX src_cus_lang (source(755), customer, language)

The prefixing keeps the rest of the columns from being very useful. REGEXP requires checking all 1.1M rows.

This would be better:

INDEX(customer, language)

It will at least filter on those two columns, then apply the REGEXP fewer times.

Since it usually wants to finish with the WHERE before considering the ORDER BY, your attempts at src_len, etc, did not help.

If there are only 4 different combinations of customer and language, not much can be done.

However, you should consider a FULLTEXT(source) index. With such,

MATCH(source) AGAINST('+word' IN BOOLEAN MODE)
AND ...

will work much faster.

Also try IN NATURAL LANGUAGE MODE.

huangapple
  • 本文由 发表于 2023年2月8日 21:51:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/75386772.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定