Regex that optionally catches all within quotes

Question

I have comma-separated tuples as input. Each tuple contains three elements separated by commas: a number, a text, and a number. Each element may be absent (empty string).

The text is optionally wrapped in quotation marks; it is wrapped whenever it contains a comma or whitespace. An example of the input:

(1,Item1,1),(,"Item 2",2),(3,"With,comma",3)

Is it possible to write a regex that extracts the elements from each tuple?

One of my attempts is \((.*?),"?(.*?)"?,(.*?)\) but it fails on the tuple whose text contains a comma.

Here's where I test it: regex101.com/r/2RetnU/1
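
The failure described above can be reproduced with a short sketch. Python and the names ATTEMPT_RE, sample and attempt_rows are assumptions made for illustration; the question does not name a language.

```python
import re

# The attempted pattern from the question, unchanged.
ATTEMPT_RE = re.compile(r'\((.*?),"?(.*?)"?,(.*?)\)')

# The example input from the question.
sample = '(1,Item1,1),(,"Item 2",2),(3,"With,comma",3)'

# findall returns one tuple of the three capture groups per match.
attempt_rows = ATTEMPT_RE.findall(sample)
for row in attempt_rows:
    print(row)

# The last row comes out as ('3', 'With', 'comma",3'): the lazy second
# group stops at the first comma, even though that comma is inside quotes.
```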

Answer 1

Score: 3

\((\d*),(([^()",]|\\[(),"])*|"([^"]|\\")*"),(\d*)\)

Element 1 = \1, element 2 = \2, element 3 = \5 (groups \3 and \4 are inner helper groups of the two text variants)
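
As a minimal sketch of how these groups might be read out, the following applies the regex to the example input from the question. Python, the names TUPLE_RE and extract_tuples, and the choice of finditer are assumptions for illustration, not part of the original answer.

```python
import re

# The regex from this answer, unchanged.
TUPLE_RE = re.compile(r'\((\d*),(([^()",]|\\[(),"])*|"([^"]|\\")*"),(\d*)\)')

def extract_tuples(text):
    """Return (first number, text, second number) for every tuple found."""
    # Groups 1, 2 and 5 hold the three elements; 3 and 4 are helper groups.
    return [(m.group(1), m.group(2), m.group(5)) for m in TUPLE_RE.finditer(text)]

sample = '(1,Item1,1),(,"Item 2",2),(3,"With,comma",3)'
for row in extract_tuples(sample):
    print(row)
# ('1', 'Item1', '1')
# ('', '"Item 2"', '2')
# ('3', '"With,comma"', '3')
```

Note that the quoted texts keep their surrounding double quotes in group 2, and absent elements come back as empty strings.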

Tested on:

(1,"Item(with\"escaped\"quotes and(parentheses)",3),(2,escaped\, non-\"quoted\" text \, separated and with \(parentheses\)
,88),(1,Item1,1),(,"Item 2",2),(3,"With,comma",3),
(23,Item1\, Item2\,\"Item 3\",99),(2,,88),(,,)
  • This regex allows the text element to be either backslash-escaped or enclosed in double quotes.
    Because further processing needs to know which variant was matched, the captured quoted string includes its double quotes.
  • It requires that even the minimal tuple (all elements empty) contains the separators: (,,)
  1. To be a set of tuples, the string must contain these characters in this order: (,,) with possibly more characters in between.
  2. Catching the digits in elements 1 and 3 is trivial: \d*
  3. The text in element 2 comes in two variants:
    1. Escaped string: ([^()",]|\\[(),"])*
      The meta characters ,()" may not appear in the string unless preceded by the \ escape character.
    2. Quoted string: "([^"]|\\")*"
      Here only the double quote " is a meta character, and it may only appear escaped.
    3. Escaping the meta character \ itself is not required, because the tuple definition is strict enough; the only exception is a very special construction spanning multiple tuples that is syntactically invalid for a tuple set anyway.
      Accepting this, the regex is robust against some escaping faults such as (1,"\",3) or (1,unquoted\,3), but sensitive to faults in the tuple-set skeleton.
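
Since the captured text keeps its quotes and escape characters, further processing has to decode it. The following is a minimal sketch of one way to do that; the helper name decode_text and the exact unescaping rules are assumptions derived from the two variants above, not something the answer specifies.

```python
import re

def decode_text(raw):
    """Hypothetical helper: turn captured group 2 into plain text."""
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        # Quoted variant: drop the surrounding quotes, then unescape \" only.
        return raw[1:-1].replace('\\"', '"')
    # Escaped variant: a backslash before one of ( ) , " is an escape.
    return re.sub(r'\\([(),"])', r'\1', raw)

print(decode_text('"With,comma"'))                # With,comma
print(decode_text(r'Item1\, Item2\,\"Item 3\"'))  # Item1, Item2,"Item 3"
```

A backslash that does not precede a meta character is left untouched, which mirrors the regex's tolerance for unescaped backslashes.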

As there are no ^ ... $ anchors, the regex will match across multiple lines unless \n is used as a delimiter.
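
The multi-line behaviour can be checked with a small sketch. The stricter variant STRICT_RE, which adds \n to the negated character classes so that a line break ends the text, is an assumption about what "using \n as a delimiter" could look like; it is not part of the original answer.

```python
import re

# The regex from this answer, unchanged.
TUPLE_RE = re.compile(r'\((\d*),(([^()",]|\\[(),"])*|"([^"]|\\")*"),(\d*)\)')

# Assumed stricter variant: a newline is excluded from both text variants.
STRICT_RE = re.compile(r'\((\d*),(([^()",\n]|\\[(),"])*|"([^"\n]|\\")*"),(\d*)\)')

text = '(2,first line\nsecond line,88)'

m = TUPLE_RE.search(text)
print(repr(m.group(2)))        # 'first line\nsecond line'
print(STRICT_RE.search(text))  # None
```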

huangapple
  • Posted on April 19, 2023, 18:11:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76053272.html