June 8, 2022, 23:34:54 · go

Split blocks of data with the same title using regex

Question

I have a long string that is built like this:

[[title]]
a = "1"
b = "1"
c = "1"
d = "1"
e = [
 "1",
 "1",
]

[[title]]
a = "2"
b = "2"
c = "2"
d = "2"
e = [
 "2",
]

[[title]]
a = "a3"
b = "3"
c = "3"

[[title]]
a = "a4"
b = "4"
c = "4"
e = [
 "4",
]

My goal is to extract the text under each title (excluding the title itself) and put it into a slice.
I've tried anchoring on the attribute keys (such as d and e), but they are sometimes missing.

Here is my regex:

(?m)(((\[\[title]]\s*\n)(?:^.+$\n)+?)(d.*?$)(\s*e(.|\n)*?])?)

I want a way to extract the data that follows each title, up to the next blank line or the end of the string.

Edit:

I'm using Go, so I can't use lookaround (lookahead or lookbehind) syntax.

Thanks!

Answer 1

Score: 1

You can use the following pattern that matches from [[title]] to an empty line.

`\[\[title]](.*?)^$`gms

Explanation

\[\[title]] Match [[title]]
( Capturing group
- .*? Non-greedy match till next match
) Close group
^$ Using m (multiline) flag this means an empty line

See the demo with the Golang regex engine

Answer 2

Score: 1

This seems to work. It's not as simple or elegant as @ArtyomVancyan's answer, but it has the small advantage that it doesn't need a newline at the end of the text being matched:

[Demo]

(?m)(?:\[\[title]]\n((?:.*\n)+?(?:\]|^$)))+

Explanation:

(?m): multi line modifier.
(?:\[\[title]]\n(<text until next closing square bracket or blank line>))+: find one or more blocks starting with [[title]]\n and followed by <text until next closing square bracket or blank line>, and capture those texts.
(?:.*\n)+?(?:\]|^$): two consecutive non-capturing subgroups; the first one is a bunch of lines, (?:.*\n)+, made non-greedy by the trailing ?; the second one is either a closing square bracket, ], or an empty line, ^$. That is, a bunch of lines ending either at the first line that starts with a closing square bracket or at a blank line.

Answer 3

Score: 0

You might use a pattern to repeat the possible format of the lines under the title part.

The lines start with word characters followed by = and then either a part "..." or [...]

\[\[title]]((?:\r?\n\w+\s*=\s*(?:"[^"]*"|\[[^\]\[]*]))*)

Explanation

\[\[title]] Match [[title]]
( Capture group 1
- (?: Non capture group
  - \r?\n Match a newline
  - \w+\s*=\s* Match 1+ word chars, then = between optional whitespace chars
  - (?: Non capture group for the alternatives
    - "[^"]*" Match "..."
    - | Or
    - \[[^\]\[]*] Match [...]
  - ) Close non capture group
- )* Close non capture group and optionally repeat
) Close group 1

Regex demo

Answer 4

Score: 0

Rather than writing a regex, which seems fraught with peril, you'll probably be better served by building a custom parser for your custom format, or you may find that you can repurpose an existing INI configparser implementation.

If the titles are always enclosed in [[ ]] pairs and always sit at the start of a block, you could use a regex to find them, but only to separate the blocks.

If you're not interested in parsing the content (though surely that is the next step) and you're sure the structure is as simple as you show, you could also just split twice on these delimiters directly:

>>> long_string_config = """ """  # input data omitted for brevity
>>> for block in filter(None, (a.split("]]")[-1].strip() for a in long_string_config.split("[["))):
...    print("---")
...    print(block)
...
---
a = "1"
b = "1"
c = "1"
d = "1"
e = [
 "1",
 "1",
]
---
a = "2"
b = "2"
c = "2"
d = "2"
e = [
 "2",
]
---
a = "a3"
b = "3"
c = "3"
---
a = "a4"
b = "4"
c = "4"
e = [
 "4",
]
