Simplify this regex to reduce its complexity

Question


SonarLint is currently yelling at me for the complexity of some regex I am using for tax ID numbers. I need to reduce the complexity from 21 to 20 and I simply cannot think of a way to do that. All of the solutions I've tried have only made it more complex...

The regex in question:

    ^\d{9}$|^\d{3}-\d{2}-\d{4}$|^\d{2}-\d{7}$|^\d{3}-\d{3}-\d{3}$

It needs to match tax ID numbers in these specific formats, and nothing else:

555555555

555-55-5555

55-5555555

555-555-555

Answer 1

Score: 2


Sonar's discussion of regex complexity suggests considering replacing the regex with regular code (which I infer might involve simpler regexes in some cases). Here, it looks like it would be straightforward to do that by testing each alternative via a separate regex instead of cramming them all into one, so do consider that.
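As a rough sketch of that "separate regex per alternative" idea (the class and method names here are made up for illustration, and `List.of` assumes Java 9 or later), each format gets its own small pattern and the input is accepted if any of them matches in full:

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not from the question: one small pattern per format.
final class TaxIdFormats {
    // matches() requires the whole input to match, so the ^...$ anchors
    // from the original regex are not needed here.
    private static final List<Pattern> FORMATS = List.of(
            Pattern.compile("\\d{9}"),               // xxxxxxxxx
            Pattern.compile("\\d{3}-\\d{2}-\\d{4}"), // xxx-xx-xxxx
            Pattern.compile("\\d{2}-\\d{7}"),        // xx-xxxxxxx
            Pattern.compile("\\d{3}-\\d{3}-\\d{3}")  // xxx-xxx-xxx
    );

    static boolean isValidTaxId(String candidate) {
        if (candidate == null) {
            return false;
        }
        // Accept the input as soon as any single format matches it entirely.
        return FORMATS.stream().anyMatch(p -> p.matcher(candidate).matches());
    }
}
```

Each pattern on its own stays far below the complexity threshold, at the cost of running up to four matches instead of one.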

The same discussion describes how Sonar computes regex complexity:

> Each of the following operators increases the complexity by an amount
> equal to the current nesting level and also increases the current
> nesting level by one for its arguments:
>
> - | - when multiple | operators are used together, the subsequent ones only increase the complexity by 1
> - && (inside character classes) - when multiple && operators are used together, the subsequent ones only increase the complexity by 1
> - Quantifiers (*, +, ?, {n,m}, {n,} or {n})
> - Non-capturing groups that set flags (such as (?i:some_pattern) or (?i)some_pattern)
> - Lookahead and lookbehind assertions
>
> Additionally, each use of the following features increase the
> complexity by 1 regardless of nesting:
>
> - character classes
> - back references

For your regex, that adds up to 21: the three | operators sit at nesting level 1 and contribute 1 each, and the nine quantifiers sit inside the alternation at nesting level 2 and contribute 2 each, so 2 * 9 + 3 == 21.
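To double-check that arithmetic, here is a deliberately naive tally. It is not Sonar's implementation; it only handles the shape of this particular regex (a flat top-level alternation whose only quantifiers have the `{n}` form), and the class name is invented for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified tally of the quoted rules; NOT Sonar's algorithm.
final class ComplexityTally {
    private static final Pattern FIXED_QUANTIFIER = Pattern.compile("\\{\\d+\\}");

    static int score(String regex) {
        // Count the | operators: the first costs the nesting level (1),
        // and each subsequent one costs 1, so the total equals the count.
        int alternations = (int) regex.chars().filter(c -> c == '|').count();

        // Count the {n} quantifiers.
        int quantifiers = 0;
        Matcher m = FIXED_QUANTIFIER.matcher(regex);
        while (m.find()) {
            quantifiers++;
        }

        // Quantifiers inside an alternation sit at nesting level 2,
        // otherwise at level 1.
        int nestingLevel = alternations > 0 ? 2 : 1;
        return alternations + nestingLevel * quantifiers;
    }
}
```

Feeding it the regex from the question gives 3 + 2 * 9, which is the 21 SonarLint reports.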

I doubt that a less complex regex exists for this job, especially one that achieves the spirit of this Sonar rule: to make regex usages easy to read and maintain. However, you should still be able to get Sonar to accept it without complaint by leveraging this:

> If a regular expression is split among multiple variables, the
> complexity is calculated for each variable individually, not for the
> whole regular expression. If a regular expression is split over
> multiple lines, each line is treated individually if it is accompanied
> by a comment (either a Java comment or a comment within the regular
> expression), otherwise the regular expression is analyzed as a whole.

I think that means that something along these lines would be acceptable to Sonar:

    // Backslashes are doubled because these are Java string literals.
    Pattern p = Pattern.compile(
            // xxxxxxxxx
            "^\\d{9}$"
            // xxx-xx-xxxx
            + "|^\\d{3}-\\d{2}-\\d{4}$"
            // xx-xxxxxxx
            + "|^\\d{2}-\\d{7}$"
            // xxx-xxx-xxx
            + "|^\\d{3}-\\d{3}-\\d{3}$"
            );

The docs lead me to believe that there would be other viable variations along similar lines of expressing the regex pieces separately.
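One such variation, based on the "split among multiple variables" sentence in the quote, might look like the sketch below. The constant and class names are my own, and I have not run this through Sonar, so treat the claim that each constant is scored separately as an assumption drawn from the documentation:

```java
import java.util.regex.Pattern;

// Hypothetical layout: one named constant per tax ID format.
final class TaxIdPattern {
    private static final String NINE_DIGITS = "\\d{9}";                     // xxxxxxxxx
    private static final String THREE_TWO_FOUR = "\\d{3}-\\d{2}-\\d{4}";    // xxx-xx-xxxx
    private static final String TWO_SEVEN = "\\d{2}-\\d{7}";                // xx-xxxxxxx
    private static final String THREE_THREE_THREE = "\\d{3}-\\d{3}-\\d{3}"; // xxx-xxx-xxx

    // The anchors are applied once around a non-capturing group,
    // instead of being repeated in every alternative.
    static final Pattern TAX_ID = Pattern.compile(
            "^(?:" + NINE_DIGITS
            + "|" + THREE_TWO_FOUR
            + "|" + TWO_SEVEN
            + "|" + THREE_THREE_THREE + ")$");
}
```

The named constants also document what each alternative is for, which is arguably closer to the intent of the rule than the single-line original.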

huangapple
  • Posted on April 6, 2023, 22:33:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75950746.html