Java regex to match multiline sections with subsections
Question
As an expansion of a simpler StackOverflow question, is there a Java regex that can extract, in one pass, each section and subsection from a multiline text document with a structure like this:
<Irrelevant line>
...
<Irrelevant line>
####<section_title>
OVERVIEW
...
...
INTRODUCTION
...
...
DETAILS
...
...
####<section_title>
OVERVIEW
...
...
INTRODUCTION
...
...
DETAILS
...
...
The section_title can be anything, and it, like each subsection title (OVERVIEW, INTRODUCTION, DETAILS), is the only text on its line. All other lines can contain any text, from empty to thousands of characters, spread over multiple lines.
Alternatively, of course, the document can be processed with a BufferedReader, reading line by line, but a regex would offer a more elegant solution.
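For comparison, the line-by-line alternative mentioned above could look roughly like the sketch below. The class name LineByLineParser, the parseText helper, and the choice to return {section, subsection, content} triples are assumptions made for illustration; it also assumes Java 9+ for Set.of and normalizes line endings to \n.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class LineByLineParser {
    // Subsection titles from the question; each is the only text on its line.
    private static final Set<String> SUBSECTIONS = Set.of("OVERVIEW", "INTRODUCTION", "DETAILS");

    /** Returns one {sectionTitle, subsectionTitle, content} triple per subsection. */
    public static List<String[]> parse(BufferedReader reader) throws IOException {
        List<String[]> result = new ArrayList<>();
        String section = null;     // title of the current "####" section, if any
        String subsection = null;  // title of the subsection being collected
        StringBuilder content = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            boolean isSection = line.startsWith("####");
            if (isSection || SUBSECTIONS.contains(line)) {
                // Any header line closes the subsection collected so far.
                if (subsection != null) {
                    result.add(new String[] { section, subsection, content.toString() });
                }
                content.setLength(0);
                if (isSection) {
                    section = line.substring(4);
                    subsection = null;
                } else {
                    subsection = line;
                }
            } else if (subsection != null) {
                content.append(line).append('\n');
            }
            // Lines outside any subsection (the irrelevant lines) are skipped.
        }
        if (subsection != null) {
            result.add(new String[] { section, subsection, content.toString() });
        }
        return result;
    }

    /** Convenience wrapper for in-memory text (hypothetical helper for this sketch). */
    public static List<String[]> parseText(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String input = "<Irrelevant line>\n####<section_title>\nOVERVIEW\n...\n...\nDETAILS\n...\n";
        for (String[] entry : parseText(input)) {
            System.out.println(entry[0] + " / " + entry[1] + " / " + entry[2].replace("\n", "\\n"));
        }
    }
}
```

This version streams the input, so it does not need the whole document in memory, at the cost of explicit state handling.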
Answer 1
Score: 1
The following regex returns one subsection per match when iterating, optionally including the section title for the first subsection of a section.
(?m)(?:^####(.*)\R)?^(OVERVIEW|INTRODUCTION|DETAILS)\R(?s:(.*?))(?=^####|^(?:OVERVIEW|INTRODUCTION|DETAILS)$|\z)
(?m) means that ^ and $ match the beginning and end of a line (respectively) in the rest of the regex, so we then use \z to match end of input, which is what $ normally matches.
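A small sketch of how the anchors behave with and without (?m). The class name MultilineAnchors, the count helper, and the sample input are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineAnchors {
    /** Counts how many times the regex is found in the input. */
    public static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "OVERVIEW\ntext\nDETAILS\nmore";
        // Without (?m), ^ only matches at the start of the input, so no full line is found.
        System.out.println(count("^[A-Z]+$", input));     // prints 0
        // With (?m), ^ and $ match at every line boundary.
        System.out.println(count("(?m)^[A-Z]+$", input)); // prints 2
        // With (?m), $ matches at the end of each of the four lines...
        System.out.println(count("(?m)$", input));        // prints 4
        // ...whereas \z still matches only at the very end of the input.
        System.out.println(count("(?m)\\z", input));      // prints 1
    }
}
```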
(?s:XXX) makes . match any character within the XXX pattern, including line separator characters (\r, \n).
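To illustrate that the s flag applies only inside the group, here is a minimal sketch. The class name ScopedDotall, the firstGroup helper, and the sample input are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScopedDotall {
    /** Returns capture group 1 of the first match, or null if nothing matches. */
    public static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "OVERVIEW\nline one\nline two\n";
        // Plain . stops at the first line break: captures "line one".
        System.out.println(firstGroup("OVERVIEW\\n(.*)", input));
        // Inside (?s:...) the . also crosses line breaks: captures the rest of the input.
        System.out.println(firstGroup("OVERVIEW\\n(?s:(.*))", input));
        // Outside the group, . still refuses to match \n, so this finds nothing: prints null.
        System.out.println(firstGroup("OVERVIEW.(?s:(.*))", input));
    }
}
```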
\R matches \r, \n, or \r\n (as well as a few other Unicode line-break characters), i.e. it matches a line separator regardless of OS (Windows vs. Linux).
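As a quick illustration, splitting on \R handles mixed line endings in one go. The class name LineBreakSplit and the sample string are invented for this sketch; \R requires Java 8 or later.

```java
import java.util.Arrays;

class LineBreakSplit {
    /** Splits text into lines, treating \r\n, \n and \r each as one separator. */
    public static String[] lines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        String mixed = "a\r\nb\nc\rd";
        // \R consumes \r\n as a single unit, so no empty or \r-tainted entries appear.
        System.out.println(Arrays.toString(lines(mixed)));      // prints [a, b, c, d]
        // Splitting on \n alone yields 3 parts, with stray \r characters left inside them.
        System.out.println(mixed.split("\n").length);           // prints 3
    }
}
```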
Using .*? (reluctant) matching followed by (?=XXX) will make the regex match text up to but excluding the XXX pattern.
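The difference between reluctant and greedy matching in front of the lookahead can be seen in a reduced version of the pattern. The class name ReluctantLookahead, the content helper, and the two-subsection input are assumptions made for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantLookahead {
    /** Returns capture group 1 of the first match, or null if nothing matches. */
    public static String content(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "OVERVIEW\nfirst\nINTRODUCTION\nsecond\n";
        String stop = "(?=^INTRODUCTION$|\\z)";
        // Reluctant: stops at the first position where the lookahead succeeds.
        String reluctant = content("(?m)^OVERVIEW\\R(?s:(.*?))" + stop, input);
        // Greedy: runs to the end of input, where \z satisfies the lookahead.
        String greedy = content("(?m)^OVERVIEW\\R(?s:(.*))" + stop, input);
        System.out.println(reluctant.replace("\n", "\\n")); // prints first\n
        System.out.println(greedy.replace("\n", "\\n"));    // prints first\nINTRODUCTION\nsecond\n
    }
}
```

Because the lookahead consumes nothing, the next find() can start right at the following header line.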
Demo
(also available on regex101.com)
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo {
    public static void main(String[] args) {
        String regex = "(?m)(?:^####(.*)\\R)?^(OVERVIEW|INTRODUCTION|DETAILS)\\R(?s:(.*?))(?=^####|^(?:OVERVIEW|INTRODUCTION|DETAILS)$|\\z)";
        String input = "<Irrelevant line>\r\n" +
                       "...\r\n" +
                       "<Irrelevant line>\r\n" +
                       "####<section_title>\r\n" +
                       "OVERVIEW\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "INTRODUCTION\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "DETAILS\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "####<section_title>\r\n" +
                       "OVERVIEW\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "INTRODUCTION\r\n" +
                       "...\r\n" +
                       "...\r\n" +
                       "DETAILS\r\n" +
                       "...\r\n" +
                       "...";
        // Each find() yields one subsection; group 1 is non-null only when
        // a "####" line directly precedes the subsection title.
        for (Matcher m = Pattern.compile(regex).matcher(input); m.find(); ) {
            String sectionTitle = m.group(1);
            String subSectionTitle = m.group(2);
            String content = m.group(3);
            if (sectionTitle != null)
                System.out.println("sectionTitle: " + sectionTitle);
            System.out.println("subSectionTitle: " + subSectionTitle);
            // Prefix every continuation line of the content with a space.
            System.out.println("content: " + content.replaceAll("(?ms)(?<=.)^", " "));
        }
    }
}
Output
sectionTitle: <section_title>
subSectionTitle: OVERVIEW
content: ...
...
subSectionTitle: INTRODUCTION
content: ...
...
subSectionTitle: DETAILS
content: ...
...
sectionTitle: <section_title>
subSectionTitle: OVERVIEW
content: ...
...
subSectionTitle: INTRODUCTION
content: ...
...
subSectionTitle: DETAILS
content: ...
...
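If the matches should end up in a data structure rather than on the console, one option is a nested map keyed by section title and subsection title. This is only a sketch: the class name SectionCollector and the collect method are hypothetical, subsections found before any section header are filed under an empty title, and sections that share a title (as in the sample input above) would be merged into one entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionCollector {
    // Same regex as in the answer, compiled once.
    private static final Pattern SUBSECTION = Pattern.compile(
            "(?m)(?:^####(.*)\\R)?^(OVERVIEW|INTRODUCTION|DETAILS)\\R(?s:(.*?))"
            + "(?=^####|^(?:OVERVIEW|INTRODUCTION|DETAILS)$|\\z)");

    /** Maps section title -> (subsection title -> content), in document order. */
    public static Map<String, Map<String, String>> collect(String input) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        String current = ""; // assumption: empty title for subsections before any header
        Matcher m = SUBSECTION.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                current = m.group(1); // a new section starts with this subsection
            }
            sections.computeIfAbsent(current, k -> new LinkedHashMap<>())
                    .put(m.group(2), m.group(3));
        }
        return sections;
    }

    public static void main(String[] args) {
        String input = "junk\n####First\nOVERVIEW\na\nDETAILS\nb\n####Second\nOVERVIEW\nc";
        collect(input).forEach((section, subs) ->
                System.out.println(section + " -> " + subs.keySet()));
    }
}
```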