Split blocks of data with the same title using regex
Question
I have a long string that is built like this:
[[title]]
a = "1"
b = "1"
c = "1"
d = "1"
e = [
"1",
"1",
]
[[title]]
a = "2"
b = "2"
c = "2"
d = "2"
e = [
"2",
]
[[title]]
a = "a3"
b = "3"
c = "3"
[[title]]
a = "a4"
b = "4"
c = "4"
e = [
"4",
]
My goal is to extract the text under each title (excluding the title itself) and put it into a slice.
I've tried anchoring on the attribute keys (such as d and e), but they are sometimes absent.
Here is my regex so far:
(?m)(((\[\[title]]\s*\n)(?:^.+$\n)+?)(d.*?$)(\s*e(.|\n)*?])?)
I want to find a way to extract the data between titles, up to a \n or the end of the string.
Edit:
I'm using Go, so I can't use lookahead or lookbehind syntax.
Thanks!
Answer 1
Score: 1
You can use the following pattern, which matches from `[[title]]` to an empty line.
`\[\[title]](.*?)^$`gms
Explanation
`\[\[title]]`: match `[[title]]`
`(`: open the capturing group
`.*?`: non-greedy match up to the next part of the pattern
`)`: close the group
`^$`: with the `m` (multiline) flag, this means an empty line
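Since the question is about Go, here is a minimal sketch of how this pattern might be applied there. Go's regexp package has no trailing-flag syntax, so `m` and `s` are written inline as `(?ms)`, and the role of `g` is played by `FindAllStringSubmatch`. The `sample` input and the helper name `blocksUntilBlankLine` are made up for illustration, and the sample assumes blocks are separated by blank lines and that the text ends with a newline.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sample is hypothetical input: blocks separated by blank lines,
// with a trailing newline after the last block.
const sample = `[[title]]
a = "1"
e = [
"1",
]

[[title]]
a = "2"

[[title]]
a = "a3"
`

// titleToBlankLine is the answer's pattern with the m and s flags inlined.
var titleToBlankLine = regexp.MustCompile(`(?ms)\[\[title]](.*?)^$`)

// blocksUntilBlankLine returns the text captured after each [[title]]
// header, with surrounding whitespace trimmed.
func blocksUntilBlankLine(input string) []string {
	var blocks []string
	for _, m := range titleToBlankLine.FindAllStringSubmatch(input, -1) {
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

func main() {
	for i, b := range blocksUntilBlankLine(sample) {
		fmt.Printf("block %d: %q\n", i, b)
	}
}
```

The last block is only found because the sample ends with a newline, which lets `^$` match at the very end of the text. Without a blank line or trailing newline after a block, this pattern would not capture it.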
Answer 2
Score: 1
This seems to work. It's not as simple or elegant as @ArtyomVancyan's answer, but it has the small advantage of not requiring a newline at the end of the input:
(?m)(?:\[\[title]]\n((?:.*\n)+?(?:\]|^$)))+
Explanation:
`(?m)`: multiline modifier.
`(?:\[\[title]]\n(<text until next closing square bracket or blank line>))+`: find one or more blocks starting with `[[title]]\n` and followed by <text until next closing square bracket or blank line>, and capture those texts.
`(?:.*\n)+?(?:\]|^$)`: two consecutive non-capturing subgroups. The first is a run of lines, `(?:.*\n)+`, made non-greedy by `?`; the second is either a closing square bracket, `]`, or an empty line, `^$`. That is, a run of lines ending either at the first line that starts with a closing square bracket or at a blank line.
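As a rough sketch, the pattern above can be used verbatim in Go with `FindAllStringSubmatch`. The `sample` input and the helper name `blocksUntilBracketOrBlank` are assumptions for illustration; the sample deliberately has no trailing newline, to exercise the advantage mentioned in the answer.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sample is hypothetical input with blank lines between blocks and
// no newline after the final closing bracket.
const sample = `[[title]]
a = "1"
e = [
"1",
]

[[title]]
a = "a3"
b = "3"

[[title]]
a = "a4"
e = [
"4",
]`

// bracketOrBlank is the answer's pattern, unchanged.
var bracketOrBlank = regexp.MustCompile(`(?m)(?:\[\[title]]\n((?:.*\n)+?(?:\]|^$)))+`)

// blocksUntilBracketOrBlank returns each captured block, trimmed.
func blocksUntilBracketOrBlank(input string) []string {
	var blocks []string
	for _, m := range bracketOrBlank.FindAllStringSubmatch(input, -1) {
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

func main() {
	for i, b := range blocksUntilBracketOrBlank(sample) {
		fmt.Printf("block %d: %q\n", i, b)
	}
}
```

One thing to keep in mind: because the match stops at the first line that begins with `]`, any keys that follow a bracketed list within the same block would be cut off.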
Answer 3
Score: 0
You might use a pattern that repeats the possible format of the lines under the title.
Each line starts with word characters followed by `=` and then either a `"..."` part or a `[...]` part.
\[\[title]]((?:\r?\n\w+\s*=\s*(?:"[^"]*"|\[[^\]\[]*]))*)
Explanation
`\[\[title]]`: match `[[title]]`
`(`: capture group 1
`(?:`: non-capture group
`\r?\n`: match a newline
`\w+\s*=\s*`: match 1+ word chars and `=` between optional whitespace chars
`(?:`: non-capture group for the alternatives
`"[^"]*"`: match from `"` to the next `"`
`|`: or
`\[[^\]\[]*]`: match from `[` to the next `]`
`)`: close the non-capture group
`)*`: close the non-capture group and optionally repeat it
`)`: close group 1
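A minimal Go sketch of this approach follows. The pattern uses no lookarounds, so it compiles with Go's regexp package as is. The `sample` constant reproduces the data from the question (without blank lines between blocks), and the helper name `blocksByLineFormat` is an assumption for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sample reproduces the input from the question.
const sample = `[[title]]
a = "1"
b = "1"
c = "1"
d = "1"
e = [
"1",
"1",
]
[[title]]
a = "2"
b = "2"
c = "2"
d = "2"
e = [
"2",
]
[[title]]
a = "a3"
b = "3"
c = "3"
[[title]]
a = "a4"
b = "4"
c = "4"
e = [
"4",
]`

// keyValueLines matches a [[title]] header followed by any number of
// `key = "..."` or `key = [...]` entries, captured in group 1.
var keyValueLines = regexp.MustCompile(`\[\[title]]((?:\r?\n\w+\s*=\s*(?:"[^"]*"|\[[^\]\[]*]))*)`)

// blocksByLineFormat returns the body of every title block, trimmed.
func blocksByLineFormat(input string) []string {
	var blocks []string
	for _, m := range keyValueLines.FindAllStringSubmatch(input, -1) {
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

func main() {
	for i, b := range blocksByLineFormat(sample) {
		fmt.Printf("block %d:\n%s\n---\n", i, b)
	}
}
```

Because the pattern describes the lines themselves rather than a terminator, it does not depend on blank lines or on which keys are present.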
Answer 4
Score: 0
Rather than writing a regex, which seems fraught with peril, you'll probably be better served by building a custom parser for your custom format; you may also find you can repurpose an implementation of an INI configparser.
If the titles are always enclosed in a `[[` `]]` pair and sit at the start of a block, you could use a regex to find them, but only to separate the blocks.
If you're not interested in the content (surely the next step is that you are) and you're sure the structure is as simple as you show, you could also just split directly on these delimiters twice instead:
>>> long_string_config = """ """ # input data omitted for brevity
>>> for block in filter(None, (a.split("]]")[-1].strip() for a in long_string_config.split("[["))):
... print("---")
... print(block)
...
---
a = "1"
b = "1"
c = "1"
d = "1"
e = [
"1",
"1",
]
---
a = "2"
b = "2"
c = "2"
d = "2"
e = [
"2",
]
---
a = "a3"
b = "3"
c = "3"
---
a = "a4"
b = "4"
c = "4"
e = [
"4",
]
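The session above is Python; since the question is about Go, a comparable sketch there could split directly on the header. This version assumes every header is the literal `[[title]]` (unlike the Python snippet, which accepts any `[[...]]` name), and the `sample` input and the helper name `splitOnTitle` are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// sample is a shortened, hypothetical version of the question's input.
const sample = `[[title]]
a = "1"
e = [
"1",
]
[[title]]
a = "a3"
b = "3"
`

// splitOnTitle splits the input on the literal header "[[title]]" and
// returns the non-empty, whitespace-trimmed pieces.
func splitOnTitle(input string) []string {
	var blocks []string
	for _, part := range strings.Split(input, "[[title]]") {
		if block := strings.TrimSpace(part); block != "" {
			blocks = append(blocks, block)
		}
	}
	return blocks
}

func main() {
	for _, b := range splitOnTitle(sample) {
		fmt.Println("---")
		fmt.Println(b)
	}
}
```

The empty piece before the first header is dropped by the `block != ""` check, which mirrors the `filter(None, ...)` call in the Python version.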