Why does this Java regex match an underscore?

Question

I am trying to match the URL pattern string.string. with any number of string. segments. As a first attempt I used ^([^\\W_]+.)([^\\W_]+.)$, which works for exactly two consecutive segments. But when I generalize it to ^([^\\W_]+.)+$, it stops working and also matches the wrong input "string.str_ing.".
What is wrong with the second version?

Answer 1

Score: 0

With ^([^\\W_]+.)([^\\W_]+.)$ you match exactly two words made of the restricted character set, each followed by one arbitrary character. Although you have not escaped the ., it still works for your input: the first group matches string, the unescaped . matches any single character (here the literal dot), and the second group does the same again.

In the latter pattern the unescaped dot (.) is part of a capturing group that repeats one or more times (because of the +), so it accepts any character as a separator. In other words, string.str_ing. is read as:

  • string as the 1st word
  • str as the 2nd word
  • ing as the 3rd word

... because the unescaped dot (.) accepts any separator, both the literal . and the _.
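The split described above can be made visible with a small sketch. It is only an approximation of what the anchored pattern does: it applies a single repetition of the group with find(), capturing the word and the character consumed by the unescaped dot separately. The class and method names are hypothetical and not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class UnescapedDotSplit {

    // One repetition of the group from the question, with the word and the
    // character consumed by the unescaped dot captured in separate groups.
    private static final Pattern REPETITION = Pattern.compile("([^\\W_]+)(.)");

    // Returns one entry per repetition, formatted as word|separator.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = REPETITION.matcher(input);
        while (m.find()) {
            parts.add(m.group(1) + "|" + m.group(2));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("string.str_ing."));
    }
}
```

For the input string.str_ing. this lists three word and separator pairs, with the underscore appearing as the separator after str.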

Escape the dot to make the regex work as intended:

^([^\\W_]+\\.)+$
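A minimal sketch comparing the two patterns, assuming they are written as Java string literals (hence the doubled backslashes). The class name and the matches helper are hypothetical; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class EscapedDotDemo {

    // Pattern from the question: the dot is unescaped and matches any character.
    static final Pattern UNESCAPED = Pattern.compile("^([^\\W_]+.)+$");

    // Corrected pattern: \\. in the Java literal is a literal dot in the regex.
    static final Pattern ESCAPED = Pattern.compile("^([^\\W_]+\\.)+$");

    // True when the whole input matches the pattern.
    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"string.string.", "string.str_ing.", "a.b.c."};
        for (String input : inputs) {
            System.out.println(input
                    + " unescaped=" + matches(UNESCAPED, input)
                    + " escaped=" + matches(ESCAPED, input));
        }
    }
}
```

With the escaped dot, string.str_ing. is rejected, while inputs made only of word-and-dot segments are still accepted.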

Answer 2

Score: 0

You need to escape the . character; otherwise it matches any character, including _.

```regexp
^([^\\W_]+\\.?)+$
```

This can be your generalized regex.
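A short sketch of how the pattern with the optional escaped dot behaves, again assuming a Java string literal. The class and method names are hypothetical. Because \\.? makes each dot optional, this variant also accepts input without a trailing dot, which may or may not be what you want.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class OptionalDotDemo {

    // Escaped dot followed by ?, so each segment may or may not end with a dot.
    static final Pattern OPTIONAL_DOT = Pattern.compile("^([^\\W_]+\\.?)+$");

    // True when the whole input matches the pattern.
    static boolean matches(String input) {
        return OPTIONAL_DOT.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"string.string.", "string.string", "string", "string.str_ing."};
        for (String input : inputs) {
            System.out.println(input + " -> " + matches(input));
        }
    }
}
```

The underscore input is still rejected, since neither the character class nor the escaped dot can consume _.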



Answer 3

Score: 0

[^\W] seems a weird choice: it matches "not a non-word character". I haven't thought it through, but that sounds equivalent to \w, i.e., matching a word character.

Either way, both [^\W] and \w match underscores, so a pattern built on either one would match the string with the underscore. "Word characters" are uppercase letters, lowercase letters, digits, **and underscore**.

You probably want [a-z]+ or maybe [A-Za-z0-9]+.
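The character classes mentioned above can be checked directly. This is a minimal sketch with a hypothetical class and helper name; it uses Pattern.matches to test a single character against each class. Note that the pattern in the question uses [^\\W_], which lists _ inside the negated class, so it is included here for comparison.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class WordClassCheck {

    // True when the single-character string is accepted by the character class.
    static boolean accepts(String charClass, String ch) {
        return Pattern.matches(charClass, ch);
    }

    public static void main(String[] args) {
        String[] classes = {"[^\\W]", "\\w", "[^\\W_]", "[A-Za-z0-9]"};
        String[] samples = {"a", "7", "_"};
        for (String charClass : classes) {
            for (String sample : samples) {
                System.out.println(charClass + " accepts '" + sample + "': "
                        + accepts(charClass, sample));
            }
        }
    }
}
```

With Java's default (ASCII) definition of \w, [^\\W] and \\w accept the underscore, while [^\\W_] and [A-Za-z0-9] do not.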



huangapple
  • Posted on June 29, 2020, 02:26:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/62626603.html