在JavaScript中的低效正则表达式

huangapple go评论93阅读模式
英文:

inefficient regular expression in javascript

问题

在我们的下面的代码中,使用CodeQL扫描时收到了一个警告,警告内容为:“正则表达式的这部分可能会在以'0'开头且包含许多'0'重复的字符串上引发指数级回溯。”

const validateUrl = str => {
  var pattern = new RegExp(
    "^(https?:\\/\\/)?" + // 协议
    "((([a-z\\d]([a-z\\d-]*[a-z\\d])*)\\.)+[a-z]{2,}|" + // 域名

请告诉我如何解决这个问题,问题的具体原因是什么?

英文:

Hi in our below code using codeql scanning got an alert that "This part of the regular expression may cause exponential backtracking on strings starting with '0' and containing many repetitions of '0'."

const validateUrl = str => {
  var pattern = new RegExp(
    "^(https?:\\/\\/)?" + // protocol
    "((([a-z\\d]([a-z\\d-]*[a-z\\d])*)\\.)+[a-z]{2,}|" + // domain name

please let me know how can we resolve this and what exactly the problem?

答案1

得分: 1

正如其他答案正确识别的那样,问题出在([a-z\d-]*[a-z\d])*。这个正则表达式识别了([a-z\d]+)*的一个超集。两个量词的嵌套使得一个天真的回溯正则表达式实现尝试的选项太多:考虑一个由a(或0、或[a-z\d-]中的任何其他字符)组成的长字符串,后面跟着$。正则表达式引擎首先尝试匹配单个a,然后根据*进一步匹配。如果这失败了,它将稍后尝试匹配两个a,依此类推。之后将重复相同的过程。最终,你的正则表达式引擎将尝试所有将a分割成长度大于0的子字符串的选项,这种选项超指数增多。

请注意,封闭的(...\.)+不是问题,因为点充当分隔符。

你的第一个选项,不需要更改正则表达式,只需切换到具有线性时间复杂度保证的正则表达式引擎即可。这个缺陷只在回溯正则表达式引擎中出现,而不是使用DFA(确定性有限自动机)的正则表达式引擎。似乎有一些NPM包可以将正则表达式转换为DFA。

你的第二个选项是重写正则表达式。[a-z\d]([a-z\d-]*[a-z\d])*匹配任何字母数字和短横线的序列,只要短横线不在开头或结尾;因此,我们可以将外部的*量词转换为?量词:[a-z\d]([a-z\d-]*[a-z\d])?。或者,我们也可以将其写为[a-z\d]+(\-+[a-z\d]+)*。这嵌套了量词,但短横线充当分隔符,因此这也不应该出现灾难性的回溯。

你没有展示完整的正则表达式,但很可能不足以验证URL。一个正确的、符合规范的用于验证URL的正则表达式相当长。查看Wikipedia上的语法图表;你遗漏了整个部分(如userinfo、端口、查询、片段)。Peter Seliger的评论是正确的:

楼主可以考虑将这样的验证任务委托给Web API URL 兼容的 node实现

英文:

As the other answer has correctly identified, the problem is ([a-z\d-]*[a-z\d])*. This RegEx recognizes a superset of ([a-z\d]+)*. The nesting of the two quantifiers gives a naive backtracking RegEx implementation too many options to try: Consider a long string consisting of a's (or 0's, or any other character in [a-z\d-]) followed by $. The RegEx engine will first try matching a single a, then matching further according to the *. If this fails, it will later try matching two a's and so on. The same will be repeated afterwards. Ultimately your RegEx engine will have tried all options of partitioning the a's into substrings of length > 0, of which there are superexponentially many.

Note that the enclosing (...\.)+ is not an issue, since the dot serves as a delimiter.

Your first option, which wouldn't require changing the RegEx at all, would simply be to switch to a RegEx engine with guaranteed linear time complexity. This flaw is only exhibited by backtracking RegEx engines and not by RegEx engines which use DFAs (Deterministic Finite Automatons). There seem to be NPM packages which can convert RegEx to DFA.

Your second option is to rewrite the RegEx. [a-z\d]([a-z\d-]*[a-z\d])* matches any sequence of alphanumerics and dashes as long as the dashes aren't at the end or beginning; thus we can convert the outer * quantifier into a ? quantifier: [a-z\d]([a-z\d-]*[a-z\d])?.
Alternatively, we could also write it as [a-z\d]+(\-+[a-z\d]+)*. This nests quantifiers, but the dashes serve as a delimiter, so this shouldn't exhibit catastrophic backtracking either.

You're not showing the full RegEx, but it is most likely insufficient to validate URLs. A proper, spec-compliant RegEx to validate URLs is rather lengthy. Take a look at the syntax diagram on Wikipedia; you're missing entire parts of it (like userinfo, port, query, fragment). Peter Seliger's comment is spot on:

> The OP might consider delegating such a validation task to an Web-Api URL compatible node implementation.

答案2

得分: 0

问题在于您正在使用的正则表达式具有一种模式,可能会导致某些字符串上的指数回溯。这意味着对于某些字符串,正则表达式引擎可能会尝试通过模式的大量可能路径,导致性能下降,并且如果攻击者可以控制输入字符串,则可能发生拒绝服务攻击。

触发此警告的表达式部分可能是这部分:

([a-z\d]([a-z\d-]*[a-z\d])*)

该模式试图匹配一系列字母数字字符和破折号,但其结构方式可能会引发问题。嵌套重复(*内部包含另一个*)是导致指数回溯风险的原因。

通常可以重写模式以避免嵌套重复。在这种情况下,您可以将模式重写为以下内容:

const validateUrl = str => {
  var pattern = new RegExp(
    "^(https?:\\/\\/)?" + // 协议
    "((([a-z\\d]+[-])*[a-z\\d]+\\.)+[a-z]{2,}|" // 域名
    // ... 其余的模式
  );
  // ... 代码的其余部分
}

在这里,我已将模式更改为使用([a-z\\d]+[-])*来匹配一个或多个字母数字字符,后跟一个破折号,重复零次或更多次,然后后跟[a-z\\d]+,这会匹配一个或多个字母数字字符。

这个新模式应该仍然可以匹配有效的域名,就像旧模式一样,但不会有指数回溯的风险。您可能需要在各种输入上进行测试,以确保它仍然按预期运行。

英文:

The problem is that the regular expression you're using has a pattern that can cause exponential backtracking on certain strings. This means that for some strings, the regular expression engine may end up trying a huge number of possible paths through the pattern, resulting in slow performance and a possible denial of service if an attacker can control the input string.

The part of the expression that's triggering this warning is likely this bit:

([a-z\\d]([a-z\\d-]*[a-z\\d])*)

This pattern is trying to match a series of alphanumeric characters and dashes, but the way it's structured can cause issues. The nested repetition (* inside of another *) is what creates the risk of exponential backtracking.

You can usually rewrite the pattern to avoid the nested repetition. In this case, you might rewrite the pattern to something like this:

const validateUrl = str => {
  var pattern = new RegExp(
    "^(https?:\\/\\/)?" + // protocol
    "((([a-z\\d]+[-])*[a-z\\d]+\\.)+[a-z]{2,}|" // domain name
    // ... rest of pattern
  );
  // ... rest of code
}

Here, I've changed the pattern to use ([a-z\\d]+[-])* to match one or more alphanumeric characters followed by a dash, repeated zero or more times, and then followed by [a-z\\d]+, which matches one or more alphanumeric characters.

This new pattern should still match valid domain names like the old pattern but without the risk of exponential backtracking. You may want to test it on various inputs to make sure it still behaves as intended.

huangapple
  • 本文由 发表于 2023年8月10日 18:19:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/76874811.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定