Match subgroup with same name on separate conditions in Python
Question
I'm trying to create a regex to parse a cookie header I'm getting back that the cookies package complains about. I've come up with the following regex that works:
((?P<name>\w+)(=("(?P<value1>[\w;,#\.\/\\=]+)"|(?P<value2>[\w#\.\/\\]+)))?)[;,]?
For an example, see this link to regex101. Essentially, this works but I'd like to use value in place of both value1 and value2. This shouldn't be an issue because it's impossible to generate a match on both subgroups at the same time (one is quoted and the other isn't). Is there a way I can do this?
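For context, here is a minimal sketch of the situation described above. It assumes Python's built-in re module, which rejects a pattern that defines the same group name twice, so the two groups have to be merged after matching. The helper name parse and the sample header are made up for illustration.

```python
import re

# Reusing one name for both alternatives is rejected at compile time.
try:
    re.compile(r'(?P<value>a)|(?P<value>b)')
    duplicate_allowed = True
except re.error:
    duplicate_allowed = False

# The pattern from the question, unchanged.
ORIGINAL_RE = re.compile(
    r'((?P<name>\w+)(=("(?P<value1>[\w;,#\.\/\\=]+)"|(?P<value2>[\w#\.\/\\]+)))?)[;,]?'
)


def parse(header):
    # Hypothetical helper: pick whichever of the two value groups matched.
    return {
        m.group('name'): m.group('value1') or m.group('value2')
        for m in ORIGINAL_RE.finditer(header)
    }


print(duplicate_allowed)
print(parse('a=1; b="x;y=z"'))
```

The `value1 or value2` step is the workaround the question is trying to get rid of.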
Answer 1
Score: 2
Yes, you can match precisely what you need with a lookbehind:
(
  (?P<name>\w+)
  (
    =
    (?P<quote>"?)            # Match an optional quote, which we capture,
    (?P<value>               # then either
      (?<=")[\w;,#./\\=]+    # a quoted value, which follows a quote,
      |                      # or
      [\w#./\\]+             # an unquoted value, which is the fallback,
    )                        # and
    (?P=quote)               # the quote we captured or an empty string.
  )?
)
[;,]?
The value group starts right after either = or the opening quote, so (?<=") succeeds if and only if the value is quoted.
Note that you don't have to escape / (it has no special meaning in Python's regex flavor), . (inside a character class), or " (if you use ', ''' or """ as string delimiters).
Try it on regex101.com.
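As a rough sketch of how the lookbehind pattern might be used from Python, the snippet below compiles it with re.VERBOSE so the layout and comments can stay in the pattern. The names COOKIE_RE and parse_cookies and the sample header are assumptions for illustration, and the outer capturing groups are replaced with a non-capturing one.

```python
import re

# A raw triple-quoted string means neither " nor / needs escaping.
COOKIE_RE = re.compile(r'''
    (?P<name>\w+)
    (?:
        =
        (?P<quote>"?)             # optional opening quote (may match nothing)
        (?P<value>
            (?<=")[\w;,#./\\=]+   # quoted value: only tried right after a quote
          |
            [\w#./\\]+            # unquoted value
        )
        (?P=quote)                # the same quote again, or the empty string
    )?
    [;,]?
''', re.VERBOSE)


def parse_cookies(header):
    # Hypothetical helper: a cookie without "=..." gets the value None.
    return {m.group('name'): m.group('value') for m in COOKIE_RE.finditer(header)}


print(parse_cookies('a=1; b="x;y=z"; c'))
```

Both kinds of value end up in the single group value, with the surrounding quotes left out.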
Alternatively, you can use the conditional group syntax:
(
  (?P<name>\w+)
  (
    =
    (
      (?P<quote>")?          # Match an optional group that contains a quote,
      (?P<value>             # then,
        (?(quote)            # if group <quote> was captured,
          [\w;,#./\\=]+      # a quoted value
          |                  # or
          [\w#./\\]+         # an unquoted value,
        )
      )                      # followed by
      (?(quote)(?P=quote))   # the same quote, if the group is present.
    )
  )?
)
[;,]?
Try it on regex101.com.
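The conditional version depends on where the ? sits: (?P<quote>")? leaves the group unset when there is no quote, whereas (?P<quote>"?) always sets it (possibly to an empty string), so the conditional would always take the quoted branch. The sketch below compares the two under Python's re module. The names build, STRICT_RE, LOOSE_RE and parse are hypothetical.

```python
import re


def build(quote_group):
    # quote_group is spliced in verbatim; the rest follows the pattern above,
    # with the outer capturing groups made non-capturing.
    return re.compile(r'''
        (?P<name>\w+)
        (?:
            =
            ''' + quote_group + r'''
            (?P<value>
                (?(quote)
                    [\w;,#./\\=]+       # quoted value
                  |
                    [\w#./\\]+          # unquoted value
                )
            )
            (?(quote)(?P=quote))        # closing quote only if one was opened
        )?
        [;,]?
    ''', re.VERBOSE)


STRICT_RE = build(r'(?P<quote>")?')   # group is unset when there is no quote
LOOSE_RE = build(r'(?P<quote>"?)')    # group is always set, possibly empty


def parse(regex, header):
    return {m.group('name'): m.group('value') for m in regex.finditer(header)}


print(parse(STRICT_RE, 'a=x;y'))   # two cookies
print(parse(LOOSE_RE, 'a=x;y'))    # one cookie whose value swallows ";y"
```

With the loose variant an unquoted value is matched by the quoted character class, which includes ; and =, so neighbouring cookies run together.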
Answer 2
Score: 2
Just place the optional quantifier, ?, after each of the quotation marks.
(?P<name>\w+)(=(\"?(?P<value>[\w;,#\.\/\\=]+)\"?|)?)[;,]?
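Note that, with both quotes optional, an unquoted value is matched by the same character class, which includes ; , and =, so it can run into the next cookie.
Furthermore, I don't believe the additional capture groups are necessary.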
You could also relax the character classes by using a lazy quantifier, ?.
(?P<name>[^ ]+?)=\"?(?P<value>.+?)\"?[;,]
An alternative would be to keep the quotation marks in the match and strip them in the program.
(?P<name>[^ ]+?)=(?P<value>\".+?\"|.+?)[;,]
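A minimal sketch of that last approach in Python follows. One deliberate change from the pattern above: the trailing [;,] is widened to (?:[;,]|$) so that a final cookie without a trailing delimiter still matches. PAIR_RE and parse_pairs are hypothetical names.

```python
import re

# Same idea as the pattern above, but the delimiter may also be end of input.
PAIR_RE = re.compile(r'(?P<name>[^ ]+?)=(?P<value>".+?"|.+?)(?:[;,]|$)')


def parse_pairs(header):
    result = {}
    for m in PAIR_RE.finditer(header):
        value = m.group('value')
        # Strip one pair of surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        result[m.group('name')] = value
    return result


print(parse_pairs('a=1; b="x;y"; c=3'))
```

Keeping the regex loose and cleaning up in code trades strictness for readability. Unlike the character-class versions, this accepts almost any character in a value.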
For reference, there is an RFC for cookies that defines their syntax.
RFC 6265 – HTTP State Management Mechanism.
It appears that the delimiter is a semicolon, ;.
cookie-header = "Cookie:" OWS cookie-string OWS
cookie-string = cookie-pair *( ";" SP cookie-pair )
cookie-pair = cookie-name "=" cookie-value
cookie-name = token
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
token = <token, defined in [RFC2616], Section 2.2>
token = 1*<any CHAR except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
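If the header really follows this grammar, a regex may not be needed at all. The sketch below assumes a strictly conformant cookie-string (pairs separated by ";" and one space, no semicolons inside values) and does not validate the token or cookie-octet character sets. Stripping the surrounding DQUOTEs is a choice made here to match the goal in the question, not something the grammar requires. The function name and the sample input are made up for illustration.

```python
def parse_cookie_string(cookie_string):
    cookies = {}
    # cookie-string = cookie-pair *( ";" SP cookie-pair )
    for pair in cookie_string.split('; '):
        # cookie-pair = cookie-name "=" cookie-value
        name, sep, value = pair.partition('=')
        if not sep:
            continue  # not a cookie-pair under the grammar; skipped in this sketch
        # cookie-value may be wrapped in DQUOTEs
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        cookies[name] = value
    return cookies


print(parse_cookie_string('SID=31d4d96e407aad42; lang="en-US"'))
```

A header like the one in the question, with semicolons inside quoted values, falls outside this grammar, which is presumably why a stricter parser rejects it.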