英文:
Simple MySQL query with ORDER BY too slow
问题
编辑:查询普通单词的所需时间实际上是1.78秒。原始帖子中提到的4.5秒是在查询特殊单词(如'.vnet')时的情况。(我知道REGEXP '\\b.vnet\\b'
不会找到'.vnet'的完整单词匹配。我可能会稍后使用更复杂的正则表达式来修复这个问题,或者如果太耗时,可能会取消对'.vnet'的支持。)此外,我在下面添加了解决方案5。
我有以下MySQL查询来实现整词匹配。
SELECT source, target
FROM tm
WHERE source REGEXP '\\bword\\b'
AND customer = 'COMPANY X'
AND language = 'YYY'
ORDER BY CHAR_LENGTH(source)
LIMIT 5;
目前有两个客户和两种语言。
我的目标是在数十万个英语句子中找到短语的前5个最接近的匹配项。之所以按CHAR_LENGTH
排序获取的记录,是因为长度越短,匹配比例越高,因为REGEXP '\\bword\\b'
确保source
已经包含'word'。
tm表:
CREATE TABLE tm (
id INT AUTO_INCREMENT PRIMARY KEY,
source TEXT(7000) NOT NULL,
target TEXT(6000) NOT NULL,
language CHAR(3),
customer VARCHAR(10),
INDEX src_cus_lang (source(755), customer, language)
上面的查询花了大约4.5秒才完成,对于我和我配备Intel Core i5-10400F、16GB RAM和SSD的PC来说非常慢。
EXPLAIN命令显示如下结果:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: tm
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1117154
filtered: 1.00
Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)
我尝试删除src_cus_lang
索引并创建了一个新的索引(customer, language, source(755))
,但没有任何改善。
我可以想到一些解决方案:
- 重新创建
tm
表,在过程中按CHAR_LENGTH(source)
排序。这对我来说不是理想的选择,因为我想保留表的原始顺序。 - 创建一个名为
src_len
的新列,即源的长度。然而,按src_len
排序仍然非常慢。 - 将
tm
表按customer
和language
分成4个单独的表。对我来说也不是理想的选择。 - 对
source
列创建索引。仍然非常慢。 - 使用
INDEX(customer, language)
。对普通单词和像'.vnet'这样的特殊单词都多花了1.4秒。
有没有办法将执行时间缩短到不到0.5秒?
英文:
Edit:The time spent for querying a normal word is actually 1.78 seconds. The 4.5 seconds mentioned in the original post below was when querying special words like '.vnet'. (I know REGEXP '\\b.vnet\\b'
won't find the whole word match for '.vnet'. I might use a more complex regex to fix this later, or drop the support for '.vnet' if it's too time-consuming.) Also I added solution 5 below.
I have the following MySQL query to achieve whole word matching.
SELECT source, target
FROM tm
WHERE source REGEXP '\\bword\\b'
AND customer = 'COMPANY X'
AND language = 'YYY'
ORDER BY CHAR_LENGTH(source)
LIMIT 5;
There are 2 customers and 2 languages currently.
My goal is to find the top 5 closest matches of a phrase among hundreds of thousands of English sentences. The reason the fetched records are ordered by CHAR_LENGTH
is because the shorter the length, the higher the match ratio, since REGEXP '\\bword\\b'
makes sure source
has word
already.
The tm table:
CREATE TABLE tm(
id INT AUTO_INCREMENT PRIMARY KEY,
source TEXT(7000) NOT NULL,
target TEXT(6000) NOT NULL,
language CHAR(3),
customer VARCHAR(10),
INDEX src_cus_lang (source(755), customer, language)
The query above took about 4.5 seconds to finish, which is very slow for me and my PC that has an Intel Core i5-10400F, 16GB RAM and an SSD.
The EXPLAIN command showed the below result:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: tm
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1117154
filtered: 1.00
Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)
I tried to delelte the src_cus_lang
index and created a new one (customer, language, source(755))
, but no improvement at all.
I can think of a few solutions:
- Recreate the
tm
table, ordering by CHAR_LENGTH(source) in the process. This is not ideal for me as I'd like to keep the original order of the table. - Create a new column named
src_len
, i.e. the length of the source. However,ORDER BY src_len
is still very slow. - Split the
tm
table into 4 separate ones bycustomer
andlanguage
. Not ideal for me. - Index the
source
column. Still very slow. - Use
INDEX(customer, language)
. Took 1.4 seconds longer for both normal words and special words like '.vnet'.
Is there a way to cut the execution time down to less than 0.5 seconds?
答案1
得分: 2
这基本上是没有用的:
INDEX src_cus_lang (source(755), customer, language)
前缀部分会让其他列变得不太有用。 REGEXP
需要检查全部的 1.1M 行。
这样会更好:
INDEX(customer, language)
它至少会筛选这两列,然后较少地应用 REGEXP
。
由于通常希望在考虑 ORDER BY
之前完成 WHERE
,所以对于 src_len 等的尝试没有帮助。
如果只有 4 种不同的 customer
和 language
组合,那么不太能做什么。
但是,你应该考虑一个 FULLTEXT(source)
索引。有了这个,
MATCH(source) AGAINST('word' IN BOOLEAN MODE)
AND ...
将会运行得更快。
还可以尝试 IN NATURAL LANGUAGE MODE
。
英文:
This is essentially useless:
INDEX src_cus_lang (source(755), customer, language)
The prefixing keeps the rest of the columns from being very useful. REGEXP
requires checking all 1.1M rows.
This would be better:
INDEX(customer, language)
It will at least filter on those two columns, then apply the REGEXP
fewer times.
Since it usually wants to finish with the WHERE
before considering the ORDER BY
, your attempts at src_len, etc, did not help.
If there are only 4 different combinations of customer
and language
, not much can be done.
However, you should consider a FULLTEXT(source)
index. With such,
MATCH(source) AGAINST('+word' IN BOOLEAN MODE)
AND ...
will work much faster.
Also try IN NATURAL LANGUAGE MODE
.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论