Match subgroup with same name on separate conditions in Python
Question
I'm trying to create a regex to parse a cookie header I'm getting back that the cookies package complains about. I've come up with the following regex that works:
((?P<name>\w+)(=("(?P<value1>[\w;,#\.\/\\=]+)"|(?P<value2>[\w#\.\/\\]+)))?)[;,]?
For an example, see this link to regex101. Essentially, this works, but I'd like to use value in place of both value1 and value2. This shouldn't be an issue because it's impossible to match both subgroups at the same time (one is quoted and the other isn't). Is there a way I can do this?
Answer 1
Score: 2
Yes, you can match precisely what you need with a lookbehind:
(
  (?P<name>\w+)
  (
    =
    (?P<quote>"?)            # Match an optional quote, which we capture,
    (?P<value>               # then either
      (?<=")[\w;,#./\\=]+    # a quoted value, which follows a quote,
      |                      # or
      [\w#./\\]+             # an unquoted value, which is the fallback,
    )                        # and
    (?P=quote)               # the quote we captured or an empty string.
  )?
)
[;,]?
A value always follows =, so (?<=") matches iff it's a quoted value.
Note that you don't have to escape / (it doesn't mean anything special in Python flavor), . (inside a character class), nor " (if you use ', ''' or """ as string delimiters).
Try it on regex101.com.
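As a rough illustration, the lookbehind pattern above can be tried from Python. This is only a sketch: the sample header and the names COOKIE_RE and parse_pairs are made up for the example, and re.VERBOSE is assumed because the pattern is spread over several lines.

```python
import re

# The lookbehind pattern from the answer, compiled in verbose mode so that
# whitespace and line breaks inside the pattern are ignored.
COOKIE_RE = re.compile(r'''
    (
      (?P<name>\w+)
      (
        =
        (?P<quote>"?)
        (?P<value>
          (?<=")[\w;,#./\\=]+
          |
          [\w#./\\]+
        )
        (?P=quote)
      )?
    )
    [;,]?
''', re.VERBOSE)


def parse_pairs(header):
    """Return (name, value) tuples; value is None for a bare name."""
    return [(m.group("name"), m.group("value"))
            for m in COOKIE_RE.finditer(header)]


# Hypothetical header, chosen to exercise quoted, unquoted and bare names.
print(parse_pairs('sid="a=b;c"; theme=dark; flag'))
```

Both the quoted and the unquoted form end up in the single value group, and a name without a value reports None for it.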
Alternatively, you can use the conditional group syntax:
(
  (?P<name>\w+)
  (
    =
    (
      (?P<quote>")?            # Match an optional group which contains a quote
      (?P<value>               # then
        (?(quote)              # if group <quote> was captured
          [\w;,#./\\=]+        # a quoted value
          |                    # or
          [\w#./\\]+           # an unquoted value
        )
      )                        # followed by
      (?(quote)(?P=quote))     # the same quote, if the group is present.
    )
  )?
)
[;,]?
Try it on regex101.com.
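The conditional version can be exercised the same way. Again a minimal sketch with made-up sample strings; COND_RE is a name chosen for the example. Note that the quote has to be written as an optional group, (?P<quote>")?, so that the group is unset (rather than empty) when no quote is present.

```python
import re

# The conditional-group pattern from the answer, in verbose mode.
COND_RE = re.compile(r'''
    (
      (?P<name>\w+)
      (
        =
        (
          (?P<quote>")?
          (?P<value>
            (?(quote)
              [\w;,#./\\=]+
              |
              [\w#./\\]+
            )
          )
          (?(quote)(?P=quote))
        )
      )?
    )
    [;,]?
''', re.VERBOSE)

quoted = COND_RE.match('token="x;y=z"')
unquoted = COND_RE.match('token=abc;')

# The quote group is '"' for the quoted input and None for the unquoted one.
print(quoted.group("value"), quoted.group("quote"))
print(unquoted.group("value"), unquoted.group("quote"))
```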
Answer 2
Score: 2
Just place the optional quantifier, ?, after each of the quotation marks.
(?P<name>\w+)(=(\"?(?P<value>[\w;,#\.\/\\=]+)\"?|)?)[;,]?
Furthermore, I don't believe the additional capture groups are necessary.
You could also relax the character classes by using a lazy quantifier, ?.
(?P<name>[^ ]+?)=\"?(?P<value>.+?)\"?[;,]
An alternative is to keep the quotation marks in the match and remove them in the program.
(?P<name>[^ ]+?)=(?P<value>\".+?\"|.+?)[;,]
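A minimal sketch of that last approach, assuming Python's re module. The helper names strip_quotes and parse_cookies and the sample header are hypothetical; the pattern is the one shown above, which requires a trailing ; or , after every pair, so the sample ends with one.

```python
import re

# Pattern from the answer: the value keeps its surrounding quotes, if any.
PAIR_RE = re.compile(r'(?P<name>[^ ]+?)=(?P<value>\".+?\"|.+?)[;,]')


def strip_quotes(value):
    """Remove one pair of surrounding double quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] == '"':
        return value[1:-1]
    return value


def parse_cookies(header):
    """Map each cookie name to its value with the quotes removed."""
    return {m.group("name"): strip_quotes(m.group("value"))
            for m in PAIR_RE.finditer(header)}


# Hypothetical header; note the trailing semicolon the pattern needs.
print(parse_cookies('sid="a=b;c"; theme=dark;'))
```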
For reference, there is an RFC for cookies that defines their syntax: RFC 6265, HTTP State Management Mechanism.
It appears that the delimiter is a semicolon, ;.
cookie-header = "Cookie:" OWS cookie-string OWS
cookie-string = cookie-pair *( ";" SP cookie-pair )
cookie-pair   = cookie-name "=" cookie-value
cookie-name   = token
cookie-value  = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
token         = <token, defined in [RFC2616], Section 2.2>
token         = 1*<any CHAR except CTLs or separators>
separators    = "(" | ")" | "<" | ">" | "@"
              | "," | ";" | ":" | "\" | <">
              | "/" | "[" | "]" | "?" | "="
              | "{" | "}" | SP | HT
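For a header that actually conforms to this grammar, a regex may not be needed at all. The following is a sketch under that assumption: it splits cookie-string on "; " and each cookie-pair on the first "=", which is only safe because RFC 6265 does not allow semicolons inside cookie-octet. It would not handle the non-conforming quoted values from the question. The function name parse_cookie_string and the sample values are made up for the example.

```python
def parse_cookie_string(cookie_string):
    """Parse a conforming cookie-string into a dict of name -> value."""
    pairs = {}
    for pair in cookie_string.split("; "):
        name, sep, value = pair.partition("=")
        if not sep:
            continue  # no "=", so this piece is not a cookie-pair
        # cookie-value may be wrapped in DQUOTEs; drop them if present.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        pairs[name] = value
    return pairs


# Hypothetical cookie-string with one bare and one quoted value.
print(parse_cookie_string('SID=31d4d96e407aad42; lang="en-US"'))
```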