Split blocks of data with the same title using regex

Question

I have a long string that is built like this:

[[title]]
a = "1"
b = "1"
c = "1"
d = "1"
e = [
 "1",
 "1",
]

[[title]]
a = "2"
b = "2"
c = "2"
d = "2"
e = [
 "2",
]

[[title]]
a = "a3"
b = "3"
c = "3"

[[title]]
a = "a4"
b = "4"
c = "4"
e = [
 "4",
]

My goal is to extract the text under each title (excluding the title itself) and put it into a slice.
I've tried anchoring on the attribute keys (like d and e), but sometimes they don't exist.

You can see my regex below:

(?m)(((\[\[title]]\s*\n)(?:^.+$\n)+?)(d.*?$)(\s*e(.|\n)*?])?)

I want to find a way to extract the data after each title, up to a blank line (\n) or the end of the string.

Edit:

I'm using Go, so I can't use lookaround (lookahead or lookbehind) syntax.

Thanks!

Answer 1

Score: 1

You can use the following pattern that matches from [[title]] to an empty line.

`\[\[title]](.*?)^$`gms

Explanation

  • \[\[title]] Match [[title]]
  • ( Capturing group
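    • .*? Non-greedy match of any characters (including newlines, because of the s flag) up to the first point where the rest of the pattern matches
  • ) Close group
  • ^$ With the m (multiline) flag, this matches an empty line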

See the demo with the Golang regex engine
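
As a minimal sketch of how this pattern could be used from Go: Go has no trailing-flag syntax, so the m and s flags are written inline as (?ms), and FindAllStringSubmatch plays the role of the g flag. The function name splitByEmptyLine and the sample input are illustrative assumptions, not part of the answer. Note that the last block is only found if the input ends with a newline.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// The answer's pattern with the m and s flags written inline.
var emptyLinePattern = regexp.MustCompile(`(?ms)\[\[title]](.*?)^$`)

// splitByEmptyLine (hypothetical helper) returns the text under each
// [[title]] header, with surrounding whitespace trimmed.
func splitByEmptyLine(input string) []string {
	var blocks []string
	for _, m := range emptyLinePattern.FindAllStringSubmatch(input, -1) {
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

// Shortened sample input; it ends with a newline on purpose.
const emptyLineSample = `[[title]]
a = "1"
e = [
 "1",
]

[[title]]
a = "a3"
b = "3"
`

func main() {
	for i, b := range splitByEmptyLine(emptyLineSample) {
		fmt.Printf("--- block %d\n%s\n", i, b)
	}
}
```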

Answer 2

Score: 1

This seems to work. It's not as simple or elegant as @ArtyomVancyan's answer, but it has the small advantage of not requiring a newline at the end of the input:

[Demo]

(?m)(?:\[\[title]]\n((?:.*\n)+?(?:\]|^$)))+

Explanation:

  • (?m): multiline modifier.
  • (?:\[\[title]]\n(<text until next closing square bracket or blank line>))+: find one or more blocks starting with [[title]]\n and followed by <text until next closing square bracket or blank line>, and capture that text.
  • (?:.*\n)+?(?:\]|^$): two consecutive non-capturing subgroups. The first is a run of lines, (?:.*\n)+, made non-greedy by the trailing ?. The second is either a closing square bracket, ], or an empty line, ^$. That is, a run of lines ending either at the first line that starts with a closing square bracket or at a blank line.
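
A rough Go sketch of this pattern in use follows. The helper name splitByBracketOrBlank and the sample string are assumptions made for illustration. The sample deliberately has no trailing newline and ends in a block that closes with ], which is the case this answer handles.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// The answer's pattern, unchanged.
var bracketOrBlankPattern = regexp.MustCompile(`(?m)(?:\[\[title]]\n((?:.*\n)+?(?:\]|^$)))+`)

// splitByBracketOrBlank (hypothetical helper) collects capture group 1 of
// every match and trims the trailing newline left before a blank line.
func splitByBracketOrBlank(input string) []string {
	var blocks []string
	for _, m := range bracketOrBlankPattern.FindAllStringSubmatch(input, -1) {
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

// Shortened sample input with no newline after the final "]".
const bracketSample = "[[title]]\na = \"a3\"\nb = \"3\"\n\n" +
	"[[title]]\na = \"a4\"\ne = [\n \"4\",\n]"

func main() {
	for i, b := range splitByBracketOrBlank(bracketSample) {
		fmt.Printf("--- block %d\n%s\n", i, b)
	}
}
```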

Answer 3

Score: 0

You could use a pattern that repeatedly matches the possible formats of the lines under the title.

Each line starts with word characters followed by = and then either a "..." part or a [...] part.

\[\[title]]((?:\r?\n\w+\s*=\s*(?:"[^"]*"|\[[^\]\[]*]))*)

Explanation

  • \[\[title]] Match [[title]]
  • ( Capture group 1
    • (?: Non capture group
      • \r?\n Match a newline
      • \w+\s*=\s* Match 1+ word chars, then = between optional whitespace chars
      • (?: Non capture group for the alternatives
        • "[^"]*" Match "..."
        • | Or
        • \[[^\]\[]*] Match [...]
      • ) Close non capture group
    • )* Close non capture group and optionally repeat
  • ) Close group 1

Regex demo
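
The following is a small Go sketch of this approach, under the assumption that every line really has the key = "..." or key = [...] shape. The name splitByLineFormat and the sample input are made up for illustration. Because the pattern describes the lines themselves, it needs neither a blank line between blocks nor a trailing newline.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// The answer's pattern, unchanged; a raw string avoids double escaping.
var lineFormatPattern = regexp.MustCompile(
	`\[\[title]]((?:\r?\n\w+\s*=\s*(?:"[^"]*"|\[[^\]\[]*]))*)`)

// splitByLineFormat (hypothetical helper) returns capture group 1 of each
// match, minus the leading newline that the group always starts with.
func splitByLineFormat(input string) []string {
	var blocks []string
	for _, m := range lineFormatPattern.FindAllStringSubmatch(input, -1) {
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

// Shortened sample: no blank line between blocks, no trailing newline.
const lineFormatSample = "[[title]]\na = \"1\"\ne = [\n \"1\",\n]\n" +
	"[[title]]\nb = \"2\""

func main() {
	for i, b := range splitByLineFormat(lineFormatSample) {
		fmt.Printf("--- block %d\n%s\n", i, b)
	}
}
```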

Answer 4

Score: 0

Rather than writing a regex, which seems fraught with peril, you'll probably be better served by building a custom parser for your custom format, or you may find you can repurpose an INI configparser implementation.

If the titles are always enclosed in [[ ]] pairs and appear at the start of a block, you could use a regex to find them, but only to separate the blocks.

If you're not interested in the content (surely the next step is that you are) and you're sure the structure is as simple as you show, you could also just split twice on these delimiters directly:

>>> long_string_config = """ """  # input data omitted for brevity
>>> for block in filter(None, (a.split("]]")[-1].strip() for a in long_string_config.split("[["))):
...    print("---")
...    print(block)
...
---
a = "1"
b = "1"
c = "1"
d = "1"
e = [
 "1",
 "1",
]
---
a = "2"
b = "2"
c = "2"
d = "2"
e = [
 "2",
]
---
a = "a3"
b = "3"
c = "3"
---
a = "a4"
b = "4"
c = "4"
e = [
 "4",
]
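
Since the question is about Go, here is one way the same double split might look there. This is a sketch rather than a port endorsed by the answer: splitTwice and the sample input are hypothetical, and it assumes, like the Python version, that "[[" and "]]" never appear inside a block's content.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTwice (hypothetical helper) mirrors the Python one-liner: split on
// "[[", keep what follows the last "]]" in each piece, trim it, and drop
// pieces that end up empty.
func splitTwice(input string) []string {
	var blocks []string
	for _, part := range strings.Split(input, "[[") {
		if i := strings.LastIndex(part, "]]"); i >= 0 {
			part = part[i+len("]]"):]
		}
		if part = strings.TrimSpace(part); part != "" {
			blocks = append(blocks, part)
		}
	}
	return blocks
}

// Shortened sample input.
const splitSample = "[[title]]\na = \"1\"\ne = [\n \"1\",\n]\n\n" +
	"[[title]]\na = \"a3\"\n"

func main() {
	for _, b := range splitTwice(splitSample) {
		fmt.Println("---")
		fmt.Println(b)
	}
}
```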

huangapple
  • Posted on June 8, 2022, 23:34:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/72548491.html