Simplify this regex to reduce its complexity
Question
SonarLint is currently yelling at me for the complexity of some regex I am using for tax ID numbers. I need to reduce the complexity from 21 to 20 and I simply cannot think of a way to do that. All of the solutions I've tried have only made it more complex...
The regex in question:
^\d{9}$|^\d{3}-\d{2}-\d{4}$|^\d{2}-\d{7}$|^\d{3}-\d{3}-\d{3}$
It needs to match tax ID numbers in these specific formats, and nothing else:
555555555
555-55-5555
55-5555555
555-555-555
Answer 1
Score: 2
Sonar's discussion of regex complexity suggests considering replacing the regex with regular code (which I infer might involve simpler regexes in some cases). Here, it looks like it would be straightforward to do that by testing each alternative via a separate regex instead of cramming them all into one, so do consider that.
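As a rough illustration of that "plain code plus simpler regexes" idea, here is a minimal sketch. The class and method names (TaxIdValidator, isValidTaxId) are made up for this example, and it assumes Java 9+ for List.of. It uses Matcher.matches(), which requires the whole input to match, so the ^ and $ anchors are dropped.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: one small pattern per accepted tax ID format.
final class TaxIdValidator {
    // matches() must consume the entire input, so no ^ or $ anchors are needed.
    private static final List<Pattern> FORMATS = List.of(
        Pattern.compile("\\d{9}"),               // xxxxxxxxx
        Pattern.compile("\\d{3}-\\d{2}-\\d{4}"), // xxx-xx-xxxx
        Pattern.compile("\\d{2}-\\d{7}"),        // xx-xxxxxxx
        Pattern.compile("\\d{3}-\\d{3}-\\d{3}")  // xxx-xxx-xxx
    );

    private TaxIdValidator() {}

    // Returns true if the candidate matches any one of the accepted formats.
    static boolean isValidTaxId(String candidate) {
        if (candidate == null) {
            return false;
        }
        return FORMATS.stream().anyMatch(p -> p.matcher(candidate).matches());
    }
}
```

Each pattern here contains at most three quantifiers and no alternation, so under the rules quoted next each one should score well below the threshold. A call such as TaxIdValidator.isValidTaxId("555-55-5555") would be used wherever the single combined regex was applied before.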
The same discussion describes how Sonar computes regex complexity:
> Each of the following operators increases the complexity by an amount
> equal to the current nesting level and also increases the current
> nesting level by one for its arguments:
>
> - | - when multiple | operators are used together, the subsequent ones only increase the complexity by 1
> - && (inside character classes) - when multiple && operators are used together, the subsequent ones only increase the complexity by 1
> - Quantifiers (*, +, ?, {n,m}, {n,} or {n})
> - Non-capturing groups that set flags (such as (?i:some_pattern) or (?i)some_pattern)
> - Lookahead and lookbehind assertions
>
> Additionally, each use of the following features increase the
> complexity by 1 regardless of nesting:
>
> - character classes
> - back references
For your regex, that yields 21: nine quantifiers, each at nesting level 2 because they sit inside the alternation, plus three | operators that each add 1: 2 * 9 + 3 == 21.
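To make that arithmetic concrete, here is a toy calculation. It is not Sonar's analyzer, only a sketch of the quoted rules for the narrow case of a flat regex: top-level alternation, brace quantifiers only, and no groups, character classes, or escaped pipes. The class name FlatRegexComplexity is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy estimate of the quoted scoring rules for flat regexes only.
final class FlatRegexComplexity {
    // Finds {n}, {n,} and {n,m} quantifiers in the regex source text.
    private static final Pattern BRACE_QUANTIFIER =
        Pattern.compile("\\{\\d+(,\\d*)?\\}");

    private FlatRegexComplexity() {}

    static int estimate(String regexSource) {
        // The first | adds the current nesting level (1) and each later | adds 1,
        // so for a flat regex every | contributes exactly 1.
        int alternations = 0;
        for (char c : regexSource.toCharArray()) {
            if (c == '|') {
                alternations++;
            }
        }

        int quantifiers = 0;
        Matcher m = BRACE_QUANTIFIER.matcher(regexSource);
        while (m.find()) {
            quantifiers++;
        }

        // Quantifiers inside an alternation sit one nesting level deeper.
        int quantifierLevel = alternations > 0 ? 2 : 1;
        return alternations + quantifierLevel * quantifiers;
    }
}
```

For the regex in the question this counts three | operators and nine quantifiers at level 2, giving 3 + 2 * 9 = 21, while a single alternative on its own, such as ^\d{3}-\d{2}-\d{4}$, comes out at 3.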
I doubt that a less complex regex exists for this job, especially one that achieves the spirit of this Sonar rule: to make regex usages easy to read and maintain. However, you should still be able to get Sonar to accept it without complaint by leveraging this:
> If a regular expression is split among multiple variables, the
> complexity is calculated for each variable individually, not for the
> whole regular expression. If a regular expression is split over
> multiple lines, each line is treated individually if it is accompanied
> by a comment (either a Java comment or a comment within the regular
> expression), otherwise the regular expression is analyzed as a whole.
I think that means that something along these lines would be acceptable to Sonar:
Pattern p = Pattern.compile(
    // xxxxxxxxx
    "^\\d{9}$"
    // xxx-xx-xxxx
    + "|^\\d{3}-\\d{2}-\\d{4}$"
    // xx-xxxxxxx
    + "|^\\d{2}-\\d{7}$"
    // xxx-xxx-xxx
    + "|^\\d{3}-\\d{3}-\\d{3}$"
);
The docs lead me to believe that there would be other viable variations along similar lines of expressing the regex pieces separately.
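One such variation, going by the "split among multiple variables" sentence in the quoted docs, might look like the sketch below. The class and constant names are hypothetical, and I have not verified how Sonar treats String.join here, so take it as an assumption to check against your own analysis run.

```java
import java.util.regex.Pattern;

// Hypothetical holder class: each format lives in its own variable.
final class TaxIdPatterns {
    private static final String NINE_DIGITS = "^\\d{9}$";                    // xxxxxxxxx
    private static final String THREE_TWO_FOUR = "^\\d{3}-\\d{2}-\\d{4}$";   // xxx-xx-xxxx
    private static final String TWO_SEVEN = "^\\d{2}-\\d{7}$";               // xx-xxxxxxx
    private static final String THREE_THREE_THREE = "^\\d{3}-\\d{3}-\\d{3}$"; // xxx-xxx-xxx

    // The combined pattern is the same alternation as the original regex.
    static final Pattern TAX_ID = Pattern.compile(
        String.join("|", NINE_DIGITS, THREE_TWO_FOUR, TWO_SEVEN, THREE_THREE_THREE));

    private TaxIdPatterns() {}

    // Returns true if the whole candidate string is in one of the four formats.
    static boolean matches(String candidate) {
        return candidate != null && TAX_ID.matcher(candidate).matches();
    }
}
```

The compiled pattern behaves like the original one-line regex; only the way the source text is laid out changes.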