2023-07-20 17:31:04

Regex performance issue when trying to get snippet of match from long text

Question

I am trying to get a matching word along with some words before and after it (at most 5 on each side) with the regex below:

const start = Date.now();
console.log("There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc. It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).".match(/(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu));

console.log(Date.now() - start);

This seems to be very slow, though: it takes hundreds of milliseconds. Is there something I am missing, and could the performance be improved?

Answer 1

Score: 1


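It seems the OP wants to take up to 5 words before and after lorem. In the original pattern both \S* and \s* can match the empty string, so the engine can split the same text in many different ways and backtracks heavily. If we change \S*\s* to \S+\s+, the regex becomes roughly 100x faster. The OP's regex also fails with single quotes: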

and a search for 'lorem ipsum' will uncover many web

My regex failed on that too, so I added \S*lorem\S*.

We can also skip capturing the words by using non-capturing groups.

So the final regex:

/(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu
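As a minimal sketch of how the final regex could be used, the helper below builds the same pattern for an arbitrary keyword. The names `escapeRegExp`, `snippets`, and `maxWords` are hypothetical and not part of the original answer; only the pattern itself comes from the answer.

```javascript
// Hypothetical helper: escape regex syntax characters in the search word
// so that it is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns every snippet containing `word`, with up to
// `maxWords` whitespace-separated words on each side.
function snippets(text, word, maxWords = 5) {
  const w = escapeRegExp(word);
  // Same shape as the answer's regex:
  // (?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}
  const re = new RegExp(
    `(?:\\S+\\s+){0,${maxWords}}\\S*${w}\\S*(?:\\s+\\S+){0,${maxWords}}`,
    "giu"
  );
  // String.prototype.match returns null when nothing matches.
  return text.match(re) || [];
}

const example =
  "and a search for 'lorem ipsum' will uncover many web sites still in their infancy.";
console.log(snippets(example, "lorem"));
// → [ "and a search for 'lorem ipsum' will uncover many web" ]
```

The \S* on both sides of the keyword is what lets the quoted 'lorem match: the leading quote is not whitespace, so it is treated as part of the same word.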

<script benchmark data-count="100">
const str = "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc. It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like). and lorem again.";
// @benchmark original
str.match(/(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu)
// @benchmark Alexander
str.match(/(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu)
</script>
<script src="https://cdn.jsdelivr.net/gh/silentmantra/benchmark/loader.js"></script>
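The benchmark above depends on an external loader script. As a rough, self-contained alternative, the two patterns can be timed with Date.now(), the same way the question does. This is only a sketch: `timeIt` is a hypothetical helper, the sample text is just the first sentence of the question's input, and the numbers will vary by engine and machine.

```javascript
// Hypothetical timing helper: runs `fn` `runs` times and reports the total
// elapsed time. Date.now() only has millisecond resolution.
function timeIt(fn, runs) {
  const start = Date.now();
  let result = null;
  for (let i = 0; i < runs; i++) {
    result = fn();
  }
  return { ms: Date.now() - start, result };
}

// Shortened sample (assumption): the first sentence of the question's text.
const text =
  "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable.";

// The question's pattern: every piece may match the empty string.
const original = timeIt(
  () => text.match(/(\S*\s*){0,5}lorem(\s*\S*){0,5}/giu),
  10
);

// The answer's pattern: each word and each gap must be non-empty.
const improved = timeIt(
  () => text.match(/(?:\S+\s+){0,5}\S*lorem\S*(?:\s+\S+){0,5}/giu),
  10
);

console.log("original:", original.ms, "ms", original.result);
console.log("improved:", improved.ms, "ms", improved.result);
```

On a short input the gap may be small; the difference reported in the answer was measured on the full text.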
