Regex performance issue when trying to get snippet of match from long text

Question


I am trying to get a matching word along with some words ahead of and after it (at most 5) with the below regex:

const start = Date.now();
console.log("There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc. It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).".match(/(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu));

console.log(Date.now() - start);

This seems to be very slow, though: it takes hundreds of milliseconds. Is there something I am missing, and can the performance be improved?

Answer 1

Score: 1


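It seems the OP wants to capture up to 5 words before and after lorem. If we change \S*\s* to \S+\s+, the regex becomes roughly 100x faster: because \S* and \s* can each match the empty string, the engine can split the same stretch of text across the {0,5} repetitions in many different ways and backtracks through all of them at every position where lorem does not follow, whereas \S+\s+ forces each repetition to consume one word up to its end plus the whitespace after it. The OP's regex also fails with single quotes:

and a search for 'lorem ipsum' will uncover many web

My regex failed there too, so I added \S*lorem\S*.

We could also omit capturing the words by using non-capturing groups, (?:...).

So the final regex is: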

/(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu
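To illustrate why \S*lorem\S* matters, here is a minimal sketch that runs two variants on the quoted fragment from the question's text. The variable names are hypothetical, and the "words only" pattern is an assumed reconstruction of the intermediate attempt the answer mentions, not something the answer shows verbatim.

```javascript
// Fragment of the question's text where "lorem" is glued to a single quote.
const sample = "and a search for 'lorem ipsum' will uncover many web sites still in their infancy.";

// Assumed intermediate attempt: every preceding word must end in whitespace,
// so the quote in front of "lorem" prevents any leading context from matching.
const wordsOnly = /(?:\S+\s+){0,5}lorem(?:\s+\S+){0,5}/giu;

// Final regex from the answer: \S*lorem\S* tolerates characters attached to the word.
const finalRegex = /(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu;

const wordsOnlyMatches = sample.match(wordsOnly);
const finalMatches = sample.match(finalRegex);

console.log(wordsOnlyMatches); // [ "lorem ipsum' will uncover many web" ]
console.log(finalMatches);     // [ "and a search for 'lorem ipsum' will uncover many web" ]
```

With the first pattern the match still succeeds, but the leading words are lost because the quote sits between the last whitespace and lorem; the final regex keeps them.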


<script benchmark data-count="100">

const str = "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc. It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like). and lorem again.";

// @benchmark original
str.match(/(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu)

// @benchmark Alexander
str.match(/(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu)

</script>
<script src="https://cdn.jsdelivr.net/gh/silentmantra/benchmark/loader.js"></script>
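The benchmark above depends on an external loader script. A rough, self-contained way to compare the two patterns might look like the sketch below. It is not a rigorous benchmark: the `timeIt` helper and the shortened, repeated `text` are assumptions made for illustration, and it relies on the global `performance.now()` available in browsers and recent Node.js versions.

```javascript
// Shortened stand-in for the long text (assumption: one sentence repeated 5 times).
const sentence = "It uses a dictionary of over 200 Latin words to generate Lorem Ipsum which looks reasonable. ";
const text = sentence.repeat(5);

const original = /(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu;
const improved = /(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu;

// Hypothetical helper: runs String.prototype.match a few times and
// returns the average duration per run together with the last result.
function timeIt(regex, input, runs = 3) {
  const start = performance.now();
  let matches = null;
  for (let i = 0; i < runs; i++) {
    matches = input.match(regex); // match() with a /g regex starts from index 0 each time
  }
  return { ms: (performance.now() - start) / runs, matches };
}

const originalResult = timeIt(original, text);
const improvedResult = timeIt(improved, text);

console.log(`original: ${originalResult.ms.toFixed(2)} ms per run`);
console.log(`improved: ${improvedResult.ms.toFixed(2)} ms per run`);
```

Absolute numbers depend on the engine and machine, so only the ratio between the two lines is worth reading.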

huangapple
  • Posted on July 20, 2023 at 17:31:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76728505.html