Match subgroup with same name on separate conditions in Python

Question

I'm trying to create a regex to parse a cookie header I'm getting back that the cookies package complains about. I've come up with the following regex that works:

((?P<name>\w+)(=("(?P<value1>[\w;,#\.\/\\=]+)"|(?P<value2>[\w#\.\/\\]+)))?)[;,]?

For an example, see this link to regex101. Essentially, this works but I'd like to use value in place of both value1 and value2. This shouldn't be an issue because it's impossible to generate a match on both subgroups at the same time (one is quoted and the other isn't). Is there a way I can do this?
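For context, here is a minimal sketch of the situation described above, assuming Python's standard `re` module. It shows that `re` refuses to compile a pattern that defines the same group name twice, and how the two-name pattern can be used in the meantime by picking whichever group took part in the match. The `cookie_value` helper and the sample header are hypothetical, added only for illustration.

```python
import re

# The question's pattern, unchanged: value1 is the quoted form, value2 the bare form.
PATTERN = re.compile(
    r'((?P<name>\w+)(=("(?P<value1>[\w;,#\.\/\\=]+)"|(?P<value2>[\w#\.\/\\]+)))?)[;,]?'
)


def cookie_value(match):
    """Return whichever of value1/value2 took part in the match, else None."""
    return match.group("value1") or match.group("value2")


# Defining the same group name twice is rejected when the pattern is compiled.
try:
    re.compile(r'(?P<value>a)|(?P<value>b)')
    duplicate_names_allowed = True
except re.error:
    duplicate_names_allowed = False

header = 'session="a;b=c", theme=dark'  # hypothetical sample input
pairs = [(m.group("name"), cookie_value(m)) for m in PATTERN.finditer(header)]
```

This works, but every caller has to know about both group names, which is the inconvenience the question wants to remove.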

Answer 1

Score: 2

Yes, you can match precisely what you need with a lookbehind:

(
  (?P<name>\w+)
  (
    =
    (?P<quote>"?)            # Match an optional quote, which we capture,
    (?P<value>               # then either
      (?<=")[\w;,#./\\=]+    # a quoted value, which follows a quote,
    |                        # or
      [\w#./\\]+             # an unquoted value, which is the fallback,
    )                        # and
    (?P=quote)               # the quote we captured or an empty string.
  )?
)
[;,]?

An unquoted value directly follows =, so (?<=") succeeds if and only if the value is quoted.

Note that you don't have to escape / (it has no special meaning in the Python flavor), . (inside a character class), or " (if you use ', ''' or """ as string delimiters).

Try it on regex101.com.
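As a rough sketch of how the lookbehind pattern could be used from Python, the block below compiles it with `re.VERBOSE` (so the layout and comments can stay) and collects name/value pairs with `finditer`. The `parse` helper and the sample header are hypothetical, not part of the answer.

```python
import re

COOKIE = re.compile(r'''
(
  (?P<name>\w+)
  (
    =
    (?P<quote>"?)            # optional opening quote, always captured (possibly empty)
    (?P<value>
      (?<=")[\w;,#./\\=]+    # quoted value: only tried right after a quote
    |
      [\w#./\\]+             # unquoted value
    )
    (?P=quote)               # closing quote, or the empty string
  )?
)
[;,]?
''', re.VERBOSE)


def parse(header):
    """Return (name, value) pairs; value is None for a bare name."""
    return [(m.group("name"), m.group("value")) for m in COOKIE.finditer(header)]


result = parse('session="a;b=c", theme=dark; flag')  # hypothetical sample input
```

With this input, `result` should hold `session` with the unquoted text `a;b=c`, `theme` with `dark`, and `flag` with `None`, all read from the single `value` group.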

Alternatively, you can use the conditional group syntax:

(
  (?P<name>\w+)
  (
    =
    (
      (?P<quote>")?          # Match an optional group which contains a quote,
      (?P<value>             # then,
        (?(quote)            # if group <quote> was captured,
          [\w;,#./\\=]+      # a quoted value
        |                    # or
          [\w#./\\]+         # an unquoted value
        )                    #
      )                      # followed by
      (?(quote)(?P=quote))   # the same quote, if the group matched.
    )
  )?
)
[;,]?

Try it on regex101.com.
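One detail of the conditional version is easy to get wrong: the test `(?(quote)...)` asks whether the group took part in the match, not whether it matched a non-empty string. The sketch below, assuming Python's `re` module, compares `(?P<quote>")?` (the group itself is optional) with `(?P<quote>"?)` (the group always matches, possibly empty). The `build` and `pairs` helpers are hypothetical, and the outer capture groups are replaced by a non-capturing group to keep the example short.

```python
import re


def build(quote_group):
    """Compile the conditional pattern around the given <quote> group."""
    return re.compile(
        r'''
        (?P<name>\w+)
        (?:
          =
          ''' + quote_group + r'''
          (?P<value>
            (?(quote) [\w;,#./\\=]+ | [\w#./\\]+ )   # quoted class if <quote> is set, else bare class
          )
          (?(quote)(?P=quote))                       # closing quote only when one was opened
        )?
        [;,]?
        ''',
        re.VERBOSE,
    )


def pairs(pattern, text):
    """Return (name, value) pairs found in text."""
    return [(m.group("name"), m.group("value")) for m in pattern.finditer(text)]


optional_group = build(r'(?P<quote>")?')   # group may be skipped entirely
always_set = build(r'(?P<quote>"?)')       # group always participates, maybe empty

good = pairs(optional_group, 'a=1;b=2')
bad = pairs(always_set, 'a=1;b=2')
quoted = pairs(optional_group, 'a="x;y",b=2')
```

With the always-participating group, the quoted character class is used even for bare values, so the first value swallows the `;` separator and the rest of the header.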

Answer 2

Score: 2

Just place the optional quantifier, ?, after each of the quotation marks.

(?P<name>\w+)(=(\"?(?P<value>[\w;,#\.\/\\=]+)\"?|)?)[;,]?

Furthermore, I don't believe the additional capture groups are necessary.
You could also relax the character classes by using a lazy quantifier, ?.

(?P<name>[^ ]+?)=\"?(?P<value>.+?)\"?[;,]

An alternative is to keep the quotation marks in the match and remove them from within the program.

(?P<name>[^ ]+?)=(?P<value>\".+?\"|.+?)[;,]
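A minimal sketch of that last idea, assuming Python's `re` module: the pattern is used as written, and the surrounding quotes are stripped afterwards in ordinary code. The `unquote` helper and the sample header are hypothetical. As written, the pattern requires every pair to be followed by `;` or `,`, so the sample header ends with a `;`.

```python
import re

# The pattern from the answer: the quotes stay inside the captured value.
PAIR = re.compile(r'(?P<name>[^ ]+?)=(?P<value>\".+?\"|.+?)[;,]')


def unquote(value):
    """Drop one pair of surrounding double quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] == '"':
        return value[1:-1]
    return value


header = 'a="x;y"; b=2;'  # hypothetical sample input
raw = [(m.group("name"), m.group("value")) for m in PAIR.finditer(header)]
cleaned = [(name, unquote(value)) for name, value in raw]
```

Because the quoted alternative comes first, a `;` inside the quotes stays part of the value. A final pair without a trailing delimiter would not be matched by this pattern.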

For reference, there is an RFC for cookies that defines their syntax:
RFC 6265, HTTP State Management Mechanism.

The delimiter between pairs appears to be a semicolon, ;.

cookie-header     = "Cookie:" OWS cookie-string OWS
cookie-string     = cookie-pair *( ";" SP cookie-pair )

cookie-pair       = cookie-name "=" cookie-value
cookie-name       = token
cookie-value      = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )

token             = <token, defined in [RFC2616], Section 2.2>

token             = 1*<any CHAR except CTLs or separators>
separators        = "(" | ")" | "<" | ">" | "@"
                  | "," | ";" | ":" | "\" | <">
                  | "/" | "[" | "]" | "?" | "="
                  | "{" | "}" | SP | HT
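For headers that do follow this grammar, a regex may not be needed at all. The sketch below is a different technique from the patterns above: it splits the cookie-string on `;`, splits each pair on the first `=`, and drops the optional DQUOTE pair. It assumes values never contain `;`, which holds for the grammar shown but not for the quoted values in the question. `parse_cookie_string` and the sample string are hypothetical.

```python
def parse_cookie_string(cookie_string):
    """Split a grammar-conforming cookie-string into (name, value) pairs."""
    pairs = []
    for chunk in cookie_string.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        name, sep, value = chunk.partition("=")
        if not sep:
            continue  # not a cookie-pair; skipped in this sketch
        # cookie-value may be wrapped in one pair of double quotes (DQUOTE)
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        pairs.append((name, value))
    return pairs


parsed = parse_cookie_string('SID=31d4d96e407aad42; lang="en-US"')  # hypothetical input
```

This only covers the `cookie-string` part; any leading `Cookie:` field name would have to be removed first.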

huangapple
  • Published on June 15, 2023 at 10:58:14
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76478791.html