Why does a Java regex match an underscore?
Question
I was trying to match the URL pattern `string.string.`, for any number of repetitions of `string.`. As a first attempt I used `^([^\\W_]+.)([^\\W_]+.)$`, and it works for matching two consecutive segments. But when I generalize it to `^([^\\W_]+.)+$`, it stops working and wrongly matches "string.str_ing.".
Do you know what is incorrect with the second version?
Answer 1
Score: 0
With `^([^\\W_]+.)([^\\W_]+.)$` you match exactly two words built from the restricted character set. Although you have not escaped the `.`, it still works, because the pattern has to match one word (`string`), then any single character (that is what an unescaped `.` means), then a second word followed by one more character.
In the generalized pattern, the unescaped dot (`.`) is part of a capturing group that repeats one or more times (because of the `+`), so it allows any character as a separator. In other words, `string.str_ing.` is understood as:
- `string` as the 1st word
- `str` as the 2nd word
- `ing` as the 3rd word

...because the unescaped dot (`.`) accepts any separator (both a literal `.` and `_`).
Escape the dot to make the regex work as intended:
`^([^\\W_]+\\.)+$`
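To see the difference described above, here is a minimal sketch that runs both patterns against the two inputs from the question. The class and method names (`DotEscapeDemo`, `matchesUnescaped`, `matchesEscaped`) are made up for illustration; only `java.util.regex.Pattern` comes from the JDK.

```java
import java.util.regex.Pattern;

public class DotEscapeDemo {
    // Pattern from the question: the dot is unescaped, so it matches any character.
    static final Pattern UNESCAPED = Pattern.compile("^([^\\W_]+.)+$");
    // Corrected pattern: \\. in the Java string literal matches only a literal dot.
    static final Pattern ESCAPED = Pattern.compile("^([^\\W_]+\\.)+$");

    static boolean matchesUnescaped(String s) {
        return UNESCAPED.matcher(s).matches();
    }

    static boolean matchesEscaped(String s) {
        return ESCAPED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"string.string.", "string.str_ing."};
        for (String in : inputs) {
            System.out.println(in
                    + " unescaped=" + matchesUnescaped(in)
                    + " escaped=" + matchesEscaped(in));
        }
        // Prints:
        // string.string. unescaped=true escaped=true
        // string.str_ing. unescaped=true escaped=false
    }
}
```

With the unescaped dot, the `_` in `str_ing` is consumed as a "separator", which is why the second input slips through.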
Answer 2
Score: 0
You need to escape the `.` character; otherwise it matches any character, including `_`.
```regexp
^([^\\W_]+\\.?)+$
```
This can serve as your generalised regex.
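Note that the `?` after the escaped dot changes what is accepted: the dot after each word becomes optional, so an input without the trailing dot also matches. A small sketch comparing the two variants (the names `OptionalDotDemo`, `mandatory` and `optional` are hypothetical):

```java
import java.util.regex.Pattern;

public class OptionalDotDemo {
    // Literal dot required after every word.
    static final Pattern MANDATORY_DOT = Pattern.compile("^([^\\W_]+\\.)+$");
    // Literal dot optional after every word, as suggested in this answer.
    static final Pattern OPTIONAL_DOT = Pattern.compile("^([^\\W_]+\\.?)+$");

    static boolean mandatory(String s) {
        return MANDATORY_DOT.matcher(s).matches();
    }

    static boolean optional(String s) {
        return OPTIONAL_DOT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"string.string.", "string.string", "string.str_ing."};
        for (String in : inputs) {
            System.out.println(in
                    + " mandatory=" + mandatory(in)
                    + " optional=" + optional(in));
        }
        // Prints:
        // string.string. mandatory=true optional=true
        // string.string mandatory=false optional=true
        // string.str_ing. mandatory=false optional=false
    }
}
```

If the trailing dot must always be present, as in the pattern `string.string.` from the question, the variant without `?` is probably the closer fit.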
# Answer 3
**Score**: 0
`[^\W]` seems a weird choice: it matches "not a non-word character". I haven't thought it through, but that sounds equivalent to `\w`, i.e., matching a word character.
Either way, with `[^\W]` and `\w` you're asking to match underscores, which is why it matches the string with the underscore. "Word characters" are uppercase letters, lowercase letters, digits, **and underscore**.
You probably want `[a-z]+` or maybe `[A-Za-z0-9]+`.
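It may be worth checking the classes directly, since the class in the question is `[^\W_]`, with the `_` inside the negation, rather than `[^\W]`. The sketch below (the helper name `matchesClass` is hypothetical) tests each class, repeated one or more times, against a word with and without an underscore, using Java's default (non-Unicode) meaning of `\w`:

```java
import java.util.regex.Pattern;

public class WordClassDemo {
    // True if the whole input consists of one or more characters from charClass.
    static boolean matchesClass(String charClass, String input) {
        return Pattern.matches(charClass + "+", input);
    }

    public static void main(String[] args) {
        String[] classes = {"\\w", "[^\\W]", "[^\\W_]", "[A-Za-z0-9]"};
        for (String c : classes) {
            System.out.println(c
                    + " string=" + matchesClass(c, "string")
                    + " str_ing=" + matchesClass(c, "str_ing"));
        }
        // Prints:
        // \w string=true str_ing=true
        // [^\W] string=true str_ing=true
        // [^\W_] string=true str_ing=false
        // [A-Za-z0-9] string=true str_ing=false
    }
}
```

So `[^\W]` does behave like `\w` and accepts the underscore, while `[^\W_]` already excludes it and behaves like `[A-Za-z0-9]` here. Under that reading, the underscore match in the question would come from the unescaped dot rather than from the character class.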