2020-09-02 10:31:40

Java regex to match multiline sections with subsections

Question

As an extension of a simpler StackOverflow question: is there a Java regex that can extract, in one pass, each section and subsection from a multiline text document with a structure like this:

<Irrelevant line>
...
<Irrelevant line>
####<section_title>
OVERVIEW
...
...
INTRODUCTION
...
...
DETAILS
...
...
####<section_title>
OVERVIEW
...
...
INTRODUCTION
...
...
DETAILS
...
...

The section_title can be anything, and it, like each subsection title (OVERVIEW, INTRODUCTION, DETAILS), is the only text on its line. All other lines can contain any text, from nothing to thousands of characters, spread over multiple lines.

Alternatively, of course, the document can be processed using a BufferedReader and reading line by line, but a regex would offer a more elegant solution.
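
For comparison, the line-by-line alternative mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the class name LineByLineParser, the method names, and the String[] {section, subsection, content} result shape are all hypothetical, and content lines are joined with "\n" regardless of the original line separator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class LineByLineParser {
    // Subsection titles taken from the question; each is the only text on its line.
    private static final Set<String> SUBSECTION_TITLES =
            new HashSet<>(Arrays.asList("OVERVIEW", "INTRODUCTION", "DETAILS"));

    // Returns one {sectionTitle, subsectionTitle, content} triple per subsection.
    static List<String[]> parse(BufferedReader reader) throws IOException {
        List<String[]> result = new ArrayList<>();
        String section = null;      // title of the current #### section, if any
        String subsection = null;   // title of the subsection being collected
        StringBuilder content = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            boolean isSection = line.startsWith("####");
            boolean isSubsection = SUBSECTION_TITLES.contains(line);
            if (isSection || isSubsection) {
                // Any heading closes the subsection collected so far.
                if (subsection != null) {
                    result.add(new String[] {section, subsection, content.toString()});
                }
                content.setLength(0);
                if (isSection) {
                    section = line.substring(4);
                    subsection = null;
                } else {
                    subsection = line;
                }
            } else if (subsection != null) {
                // Lines outside any subsection (the irrelevant lines) are skipped.
                content.append(line).append('\n');
            }
        }
        if (subsection != null) {
            result.add(new String[] {section, subsection, content.toString()});
        }
        return result;
    }

    // Convenience overload for in-memory text (hypothetical helper).
    static List<String[]> parseText(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "junk\n####Title A\nOVERVIEW\na\nb\nDETAILS\nc\n";
        for (String[] entry : parseText(doc)) {
            System.out.println(entry[0] + " / " + entry[1] + " / " + entry[2].trim());
        }
    }
}
```

The state machine needs three variables and an explicit flush at the end of input, which is the bookkeeping a single regex would avoid.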

Answer 1

Score: 1

The following regex returns one subsection per match when iterating. For the first subsection of each section, the match also captures the section header.

(?m)(?:^####(.*)\R)?^(OVERVIEW|INTRODUCTION|DETAILS)\R(?s:(.*?))(?=^####|^(?:OVERVIEW|INTRODUCTION|DETAILS)$|\z)

(?m) means that ^ and $ match the beginning and end of each line, respectively, in the rest of the regex, so \z is used to match the end of input, which is what $ would match without (?m).
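
A small sketch of the anchor behavior described above may help. The class name AnchorDemo and the helper count are hypothetical names used only for this illustration; the patterns are simplified stand-ins, not the answer's regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorDemo {
    // Counts how many times the regex matches in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "one\ntwo\nthree\n";

        // Without (?m), ^ only matches at the start of the input,
        // so no single line can match from ^ to $.
        System.out.println(count("^\\w+$", input));      // prints 0

        // With (?m), ^ and $ match at every line boundary.
        System.out.println(count("(?m)^\\w+$", input));  // prints 3

        // \z matches only at the very end of the input, even under (?m).
        System.out.println(count("(?m)\\z", input));     // prints 1

        // Without (?m), $ also matches before a final line terminator; \z does not.
        System.out.println(count("three$", input));      // prints 1
        System.out.println(count("three\\z", input));    // prints 0
    }
}
```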

(?s:XXX) makes . match any character within the XXX pattern, including line separator characters (\r, \n).
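
To see that the DOTALL flag applies only inside the (?s:...) group, here is a minimal sketch. DotallDemo and firstGroup are hypothetical names, and the HEAD/END markers are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotallDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "HEAD\nline 1\nline 2\nEND";

        // A plain . stops at line terminators, so the content cannot span lines.
        System.out.println(firstGroup("HEAD\\n(.*)END", input));      // prints null

        // Inside (?s:...) the . also matches \n, so group 1 spans both lines.
        System.out.println(firstGroup("HEAD\\n(?s:(.*))END", input)); // prints the two lines

        // The . outside the group is unaffected and still refuses to match \n.
        System.out.println(firstGroup("HEAD.(?s:(.*))END", input));   // prints null
    }
}
```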

\R matches any line break sequence, such as \r\n, \n, or \r, so the regex works regardless of which OS produced the text (Windows vs. Linux).
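
As a quick illustration of \R, the sketch below splits a string with mixed line endings. LineBreakDemo and lines are hypothetical names; \R itself requires Java 8 or later.

```java
import java.util.Arrays;

final class LineBreakDemo {
    // Splits text on any line break sequence; \r\n counts as a single break.
    static String[] lines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        String mixed = "a\r\nb\nc\rd";

        // \R handles Windows, Unix, and old Mac line endings alike.
        System.out.println(Arrays.toString(lines(mixed)));  // prints [a, b, c, d]

        // Splitting on \n alone misses the lone \r and leaves a stray \r behind.
        System.out.println(mixed.split("\n").length);       // prints 3
    }
}
```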

Using .*? (reluctant) matching followed by (?=XXX) will make the regex match text up to but excluding the XXX pattern.
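
The reluctant-plus-lookahead idiom can be tried in isolation with a simplified pattern. In this sketch, LookaheadDemo, chunks, and the single-character "#" delimiter are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Collects capture group 1 of every match of the regex in the input.
    static List<String> chunks(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "#one#two#three";

        // Reluctant .*? grows one character at a time and stops as soon as the
        // lookahead sees the next "#" or the end of input. The lookahead does not
        // consume the "#", so it is still available to start the next match.
        System.out.println(chunks("(?s)#(.*?)(?=#|\\z)", input)); // prints [one, two, three]

        // Greedy .* runs to the end of input first, so everything lands in one match.
        System.out.println(chunks("(?s)#(.*)(?=#|\\z)", input));  // prints [one#two#three]
    }
}
```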

Demo
(also available on regex101.com)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo {
    public static void main(String[] args) {
        // Group 1: optional section title, group 2: subsection title, group 3: subsection content
        String regex = "(?m)(?:^####(.*)\\R)?^(OVERVIEW|INTRODUCTION|DETAILS)\\R(?s:(.*?))(?=^####|^(?:OVERVIEW|INTRODUCTION|DETAILS)$|\\z)";

        String input = "<Irrelevant line>\r\n" +
                       "...\r\n" +
                       "<Irrelevant line>\r\n" +
                       "####<section_title>\r\n" +
                       "OVERVIEW\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "INTRODUCTION\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "DETAILS\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "####<section_title>\r\n" +
                       "OVERVIEW\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "INTRODUCTION\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "DETAILS\r\n" +
                       "...\r\n" +
                       "...";

        // Each find() call advances to the next subsection.
        for (Matcher m = Pattern.compile(regex).matcher(input); m.find(); ) {
            String sectionTitle = m.group(1); // null unless this is the first subsection of a section
            String subSectionTitle = m.group(2);
            String content = m.group(3);
            if (sectionTitle != null)
                System.out.println("sectionTitle: " + sectionTitle);
            System.out.println("subSectionTitle: " + subSectionTitle);
            // Indent every line after the first so it aligns under "content: ".
            System.out.println("content: " + content.replaceAll("(?ms)(?<=.)^", "         "));
        }
    }
}

Output

sectionTitle: <section_title>
subSectionTitle: OVERVIEW
content: ...
         ...

subSectionTitle: INTRODUCTION
content: ...
         ...

subSectionTitle: DETAILS
content: ...
         ...

sectionTitle: <section_title>
subSectionTitle: OVERVIEW
content: ...
         ...

subSectionTitle: INTRODUCTION
content: ...
         ...

subSectionTitle: DETAILS
content: ...
         ...
