Regex performance issue when trying to get snippet of match from long text
Question
I am trying to get a matching word along with up to 5 words before and after it, using the regex below:
<!-- begin snippet: js hide: false console: true babel: false -->
<!-- language: lang-js -->
const start = Date.now();
console.log("There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc. It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).".match(/(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu));
console.log(Date.now() - start);
<!-- end snippet -->
This seems to be very slow, though: it takes hundreds of milliseconds. Am I missing something, and can the performance be improved?
Answer 1
Score: 1
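It seems the OP wants to capture up to 5 words before and after lorem. If we change \S*\s* to \S+\s+, the regex becomes roughly 100x faster. Both \S* and \s* can match the empty string, so the engine has many ways to split the same text across the repetitions and backtracks through them; requiring at least one character in each part removes that ambiguity. The OP's regex also fails with single quotes:
and a search for 'lorem ipsum' will uncover many web
My regex failed on this too, so I added \S*lorem\S*.
We can also skip capturing the words by using non-capturing groups.
So the final regex:
/(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu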
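To reuse the final regex for other search words, it could be wrapped in a small helper. This is only a sketch: the name findSnippets and its parameters are hypothetical and not part of the original answer, and it assumes the search word should be matched literally, so regex metacharacters in it are escaped.

```javascript
// Hypothetical helper (not from the original answer): returns every
// occurrence of `word` with up to `maxWords` words of context on each side.
function findSnippets(text, word, maxWords = 5) {
  // Escape regex metacharacters so `word` is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Same shape as the final regex above, with the word and count filled in.
  const pattern = new RegExp(
    `(?:\\S+\\s+){0,${maxWords}}\\S*${escaped}\\S*(?:\\s+\\S+){0,${maxWords}}`,
    "giu"
  );
  // With the g flag, match() returns null when nothing matches.
  return text.match(pattern) ?? [];
}

const snippets = findSnippets("a b c d e f 'Lorem ipsum' g h i j k l", "lorem");
console.log(snippets);
// → [ "b c d e f 'Lorem ipsum' g h i j" ]
```

The surrounding \S* on both sides of the word is what lets the quoted 'Lorem count as one word while still leaving room for 5 full words before it.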
<!-- begin snippet: js hide: false console: true babel: false -->
<!-- language: lang-html -->
<script benchmark data-count="100">
const str = "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc. It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like). and lorem again.";
// @benchmark original
str.match(/(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu)
// @benchmark Alexander
str.match(/(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu)
</script>
<script src="https://cdn.jsdelivr.net/gh/silentmantra/benchmark/loader.js"></script>
<!-- end snippet -->
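The snippet above relies on an external benchmark loader. As a rough alternative, the two regexes can be timed directly with performance.now(). This is a minimal sketch: the timeMatch helper and the filler text are made up for illustration, and absolute timings will vary with the engine and machine.

```javascript
// Made-up filler text (not the text from the question) ending with a quoted match.
const filler = "The readable content of a page distracts a reader looking at its layout. ";
const sample = filler.repeat(10) + "A search for 'lorem ipsum' will uncover many web sites.";

// Hypothetical helper: average time per run of text.match(regex), plus the last result.
function timeMatch(regex, text, runs = 3) {
  const start = performance.now();
  let result = null;
  for (let i = 0; i < runs; i++) {
    result = text.match(regex); // match() with the g flag starts from index 0 each time
  }
  return { ms: (performance.now() - start) / runs, result };
}

const original = timeMatch(/(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu, sample);
const rewritten = timeMatch(/(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu, sample);

console.log(`original:  ${original.ms.toFixed(3)} ms per run`);
console.log(`rewritten: ${rewritten.ms.toFixed(3)} ms per run`);
console.log(original.result, rewritten.result);
```

Printing both results also makes it easy to see how each regex treats the words in front of the quoted 'lorem.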