Why does '³' (superscript 3) match the Python regex for alphabetic characters?

Question

I use the pattern "[^\W\d_]+" to match a sequence of Unicode letters.

But when I run:

import re
re.match(r"[^\W\d_]+", '³')

I get:

<re.Match object; span=(0, 1), match='³'>

Why?

Answer 1

Score: 3

I think [^\W\d_]+ matches word characters other than decimal digits and the underscore:

  • \W: "Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_]. If the LOCALE flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore." (from Python's re docs)
  • \d matches decimal digits.
  • _ matches the underscore.
  • [^...] matches any single character that is not in the set between the brackets.
  • [^\W\d_] therefore matches anything except \W, \d and _, that is, anything except non-word characters, decimal digits, and the underscore. In other words, it matches word characters other than decimal digits and the underscore. For str patterns, \w covers every character that str.isalnum() accepts, while \d covers only Unicode decimal digits (category Nd). '³' is numeric but not a decimal digit (its category is No), so it is a word character, not a decimal digit, and not an underscore, and it matches.
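
The reasoning above can be checked directly. The following is a minimal sketch, not part of the original answer; the helper name explain and the sample characters are illustrative assumptions. It reports, for each character, its Unicode category and whether \w, \d, and the full pattern match it.

```python
import re
import unicodedata

# The pattern from the question, written as a raw string.
PATTERN = re.compile(r"[^\W\d_]+")

SUPERSCRIPT_THREE = "\u00b3"  # the character '³'


def explain(ch):
    """Hypothetical helper: show why a single character does or does not match."""
    return {
        "category": unicodedata.category(ch),               # Unicode general category
        "is_word": re.fullmatch(r"\w", ch) is not None,      # matched by \w?
        "is_decimal": re.fullmatch(r"\d", ch) is not None,   # matched by \d?
        "matches": PATTERN.fullmatch(ch) is not None,        # matched by the full pattern?
    }


for ch in ["a", "3", SUPERSCRIPT_THREE, "_"]:
    print(repr(ch), explain(ch))
```

Here '3' is excluded because \d matches it, '_' is excluded explicitly, and '³' passes because it counts as a word character without being a decimal digit. If only letters are wanted, str.isalpha() may be a stricter check, since it returns False for '³'.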

huangapple
  • Published on July 20, 2023, 11:29:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76726497.html