Java regex to match multiline sections with subsections

Question

As an expansion of a simpler StackOverflow question, is there a Java regex that can extract, in one pass, each section and subsection from a multiline text document with a structure like this:

<Irrelevant line>
...
<Irrelevant line>
####<section_title>
OVERVIEW
...
...
INTRODUCTION
...
...
DETAILS
...
...
####<section_title>
OVERVIEW
...
...
INTRODUCTION
...
...
DETAILS
...
...

The section_title can be anything. It, like each subsection title (OVERVIEW, INTRODUCTION, DETAILS), is the only text on its line. All other lines can contain any text, from empty to thousands of characters, spread over multiple lines.

Alternatively, of course, the document can be processed using a BufferedReader and reading line by line, but a regex would offer a more elegant solution.

Answer 1

Score: 1

The following regex returns one sub-section per match when iterating, and also captures the section header on the first sub-section of each section.

(?m)(?:^####(.*)\R)?^(OVERVIEW|INTRODUCTION|DETAILS)\R(?s:(.*?))(?=^####|^(?:OVERVIEW|INTRODUCTION|DETAILS)$|\z)

(?m) means that ^ and $ match the beginning and end of each line (respectively) in the rest of the regex, so we use \z to match the end of input, which is what $ matches without (?m).
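To make the effect of (?m) concrete, here is a minimal sketch. The class name MultilineAnchors, the helper count, and the sample input are illustrative assumptions, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineAnchors {
    // Counts how many times the regex is found in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "a\nb\nc";
        // Without (?m), ^ and $ anchor to the whole input, so no single-letter line matches.
        System.out.println(count("^\\w$", input));       // prints 0
        // With (?m), ^ and $ anchor to each line.
        System.out.println(count("(?m)^\\w$", input));   // prints 3
        // \z still means "end of input", even with (?m) enabled.
        System.out.println(count("(?m)\\w\\z", input));  // prints 1
    }
}
```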

(?s:XXX) makes . match any character within the XXX pattern, including line separator characters (\r, \n).
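The scoping of (?s:XXX) can be checked with a small sketch like the one below. The class name ScopedDotall, the helper matchesAll, and the test strings are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class ScopedDotall {
    // True if the entire input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String input = "x\ny\nz";
        // By default, . does not match a line terminator.
        System.out.println(matchesAll("x.y.z", input));       // prints false
        // Only the dot inside the (?s:...) group may match \n; the second dot is outside it.
        System.out.println(matchesAll("(?s:x.y).z", input));  // prints false
        // Both dots are inside the group, so both may match \n.
        System.out.println(matchesAll("(?s:x.y.z)", input));  // prints true
    }
}
```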

\R matches any line break sequence, such as \r\n, \n, or \r, i.e. it matches a line separator regardless of OS (Windows vs. Linux).
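As a quick illustration of \R (available since Java 8), the sketch below splits a string that mixes the three line endings. The class name LineBreaks and the helper splitLines are hypothetical names chosen for this example.

```java
import java.util.Arrays;

public class LineBreaks {
    // Splits text on any line break; \r\n is consumed as a single separator.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        // Windows (\r\n), Unix (\n), and old Mac (\r) line endings in one string.
        String mixed = "a\r\nb\nc\rd";
        System.out.println(Arrays.toString(splitLines(mixed)));  // prints [a, b, c, d]
    }
}
```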

Using .*? (reluctant) matching followed by (?=XXX) will make the regex match text up to but excluding the XXX pattern.
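The difference between reluctant and greedy matching in front of a lookahead can be seen in a small sketch. The marker END, the class name UpToMarker, and the helper firstGroup are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UpToMarker {
    // Returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "foo END bar END";
        // Reluctant: stops at the first position where the lookahead succeeds.
        System.out.println("[" + firstGroup("(.*?)(?=END)", input) + "]");  // prints [foo ]
        // Greedy: runs to the last position where the lookahead succeeds.
        System.out.println("[" + firstGroup("(.*)(?=END)", input) + "]");   // prints [foo END bar ]
    }
}
```

In both cases the END marker itself is not part of the match, because the lookahead does not consume input.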

Demo
(also available on regex101.com)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regex = "(?m)(?:^####(.*)\\R)?^(OVERVIEW|INTRODUCTION|DETAILS)\\R(?s:(.*?))(?=^####|^(?:OVERVIEW|INTRODUCTION|DETAILS)$|\\z)";

String input = "<Irrelevant line>\r\n" + 
               "...\r\n" + 
               "<Irrelevant line>\r\n" + 
               "####<section_title>\r\n" + 
               "OVERVIEW\r\n" + 
               "...\r\n" + 
               "...\r\n" + 
               "INTRODUCTION\r\n" + 
               "...\r\n" + 
               "...\r\n" + 
               "DETAILS\r\n" + 
               "...\r\n" + 
               "...\r\n" + 
               "####<section_title>\r\n" + 
               "OVERVIEW\r\n" + 
               "...\r\n" + 
               "...\r\n" + 
               "INTRODUCTION\r\n" + 
               "...\r\n" + 
               "...\r\n" + 
               "DETAILS\r\n" + 
               "...\r\n" + 
               "...";

// Each find() yields one sub-section
for (Matcher m = Pattern.compile(regex).matcher(input); m.find(); ) {
	String sectionTitle = m.group(1);     // null unless a "####" header directly precedes this sub-section
	String subSectionTitle = m.group(2);
	String content = m.group(3);
	if (sectionTitle != null)
		System.out.println("sectionTitle: " + sectionTitle);
	System.out.println("subSectionTitle: " + subSectionTitle);
	// Indent every line after the first so it lines up under "content: "
	System.out.println("content: " + content.replaceAll("(?ms)(?<=.)^", "         "));
}

Output

sectionTitle: <section_title>
subSectionTitle: OVERVIEW
content: ...
         ...

subSectionTitle: INTRODUCTION
content: ...
         ...

subSectionTitle: DETAILS
content: ...
         ...

sectionTitle: <section_title>
subSectionTitle: OVERVIEW
content: ...
         ...

subSectionTitle: INTRODUCTION
content: ...
         ...

subSectionTitle: DETAILS
content: ...
         ...
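
If the matches need to be grouped by section rather than printed, one possible approach is sketched below. It reuses the regex from this answer unchanged. The class SectionParser, the holder type Section, and the sample input are hypothetical and not part of the original answer. A list is used instead of a map keyed by title because section titles may repeat, as they do in the demo input.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionParser {
    // Same regex as in the answer above, split across two string literals.
    private static final Pattern SUBSECTION = Pattern.compile(
        "(?m)(?:^####(.*)\\R)?^(OVERVIEW|INTRODUCTION|DETAILS)\\R(?s:(.*?))"
        + "(?=^####|^(?:OVERVIEW|INTRODUCTION|DETAILS)$|\\z)");

    // Hypothetical holder: a section title plus its sub-sections in document order.
    static final class Section {
        final String title;
        final Map<String, String> subsections = new LinkedHashMap<>();

        Section(String title) {
            this.title = title;
        }
    }

    static List<Section> parse(String input) {
        List<Section> sections = new ArrayList<>();
        Section current = null;
        Matcher m = SUBSECTION.matcher(input);
        while (m.find()) {
            // Group 1 is non-null only when a "####" header directly precedes the sub-section.
            // If the input starts with a sub-section and no header, the title stays null.
            if (m.group(1) != null || current == null) {
                current = new Section(m.group(1));
                sections.add(current);
            }
            current.subsections.put(m.group(2), m.group(3));
        }
        return sections;
    }

    public static void main(String[] args) {
        String input = "junk\n####First\nOVERVIEW\na\nb\nDETAILS\nc\n####Second\nINTRODUCTION\nd";
        for (Section s : parse(input)) {
            System.out.println(s.title + " -> " + s.subsections.keySet());
        }
        // prints:
        // First -> [OVERVIEW, DETAILS]
        // Second -> [INTRODUCTION]
    }
}
```

Note that each captured content string keeps its trailing line break, except for the last sub-section when the input does not end with one.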

huangapple
  • Posted on September 2, 2020, 10:31:40
  • Please keep the link to this article when reposting: https://go.coder-hub.com/63697849.html