February 14, 2023 01:15:43

Grab groups of matches based on strings that start a line

Question

I am trying to build a regex that captures groups of lines, each running from a line that starts with INS through the next line that starts with DMG, inclusive. I am able to exclude the block from INS*Y*G8 through DMG. However, I keep getting catastrophic backtracking.

INS*Y*G8*030**A***AC~
REF*0F*XXXXXXXX~
NM1*IL*1*JWHWWWW*TWDD****34*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
DMG*D8*19700101*M~
**INS*Y*01*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
**INS*Y*19*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
AMT*D2*100~
AMT*FK*100~
AMT*R*50~
AMT*C1*30~
AMT*P3*31~
AMT*B9*32~
NM1*31*1~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~

In general, I need a regex that can capture groups of lines given a string that starts a line, up to and including the line that ends the capture group based on another given start of a line.

I have tried this (INS\*Y\*[^G8]+.*)(.*?)(?=DMG) and other variations unsuccessfully. What I am expecting is...

Group 1 should be:

**INS*Y*01*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**

Group 2 should be:

**INS*Y*19*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**

Thank you for your help.

Answer 1

Score: 5

With your shown samples and attempts, please try the following regex:

(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)

Here is the online demo for the regex: https://regex101.com/r/xTlX0i/1

Here is the complete code, written and tested in Python 3 using re.findall:

import re
re.findall(r"(?m)(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)", var)  # var holds the input text

Output will be as follows with your shown samples:

['INS*Y*01*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\nDMG*D8*20000101*F~**', 'INS*Y*19*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\nDMG*D8*20000101*F~**']

Explanation of the regex used:

(?:^|\n)     ## In a non-capturing group, match the start of the string (or of a line with (?m)) OR a newline.
(            ## Start a capturing group here.
  INS\*      ## Match INS followed by a literal *.
  (?!Y\*G8)  ## Negative lookahead to make sure Y*G8 does not come next.
  (?:.*\n)+? ## In a non-capturing group, match whole lines including their newline, one or more times, lazily (as few as possible).
  DMG\S+     ## Match DMG followed by one or more non-whitespace characters.
)            ## Close the capturing group.
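
As a quick check, the pattern above can be run against a shortened version of the sample. This is only a minimal sketch: it assumes the ** markers in the question are bold markup and not part of the data, and the names `text` and `blocks` are made up for illustration.

```python
import re

# Shortened sample, assuming the ** markers are not part of the real data.
text = "\n".join([
    "INS*Y*G8*030**A***AC~",
    "REF*0F*XXXXXXXX~",
    "DMG*D8*19700101*M~",
    "INS*Y*01*030**A***AC~",
    "REF*0F*XXXXXXXXX~",
    "DMG*D8*20000101*F~",
    "INS*Y*19*030**A***AC~",
    "N3*45874 WHYYWYW WTWYXW~",
    "DMG*D8*20000101*F~",
    "AMT*D2*100~",
])

# findall returns the contents of the single capturing group for each match.
blocks = re.findall(r"(?m)(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)", text)

for block in blocks:
    print(block)
    print("---")
```

Note that the lazy (?:.*\n)+? simply stops at the first line beginning with DMG, so this relies on every INS block containing a DMG line; an INS block without one would run on into the following block.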

Answer 2

Score: 4

Edit

If the ** around the INS and DMG lines in the question is only bold-text markup and not part of the data, then you could write the pattern as:

^INS\*(?!Y\*G8).*(?:\n(?!INS\*|DMG\*).*)*\nDMG\*.*

See a regex101 demo

<hr>

You could use a single match starting with **INS* and asserting that it is not directly followed by Y*G8.

Then match all following lines that do not start with either **INS or **DMG, and finally the **DMG line itself.

^\*\*INS\*(?!Y\*G8).*(?:\n(?!\*\*(?:INS|DMG)).*)*\n\*\*DMG.*

Explanation

^ Start of a line (with the multiline flag)
\*\*INS\*(?!Y\*G8) Match **INS* and assert that it is not directly followed by Y*G8
.* Match the rest of the line
(?: Non-capturing group to repeat as a whole
- \n Match a newline
- (?!\*\*(?:INS|DMG)) Negative lookahead, assert not **INS or **DMG directly to the right
- .* Match the whole line
)* Close the non-capturing group and optionally repeat it to match all lines
\n\*\*DMG.* Match a newline, then **DMG and the rest of the line

See a regex demo.

Example code

import re

# re.M makes ^ match at the start of every line, not only at the start of the string.
pattern = r"^\*\*INS\*(?!Y\*G8).*(?:\n(?!\*\*(?:INS|DMG)).*)*\n\*\*DMG.*"
s = ("INS*Y*G8*030**A***AC~\n"
    "REF*0F*XXXXXXXX~\n"
    "NM1*IL*1*JWHWWWW*TWDD****34*XXXXXXXXX~\n"
    "N3*45874 WHYYWYW WTWYXW~\n"
    "N4*DYXWHXVYW*NY*88980~\n"
    "DMG*D8*19700101*M~\n"
    "**INS*Y*01*030**A***AC~**\n"
    "REF*0F*XXXXXXXXX~\n"
    "NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\n"
    "PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\n"
    "N3*45874 WHYYWYW WTWYXW~\n"
    "N4*DYXWHXVYW*NY*88980~\n"
    "**DMG*D8*20000101*F~**\n"
    "**INS*Y*19*030**A***AC~**\n"
    "REF*0F*XXXXXXXXX~\n"
    "NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\n"
    "PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\n"
    "N3*45874 WHYYWYW WTWYXW~\n"
    "N4*DYXWHXVYW*NY*88980~\n"
    "**DMG*D8*20000101*F~**\n"
    "AMT*D2*100~\n"
    "AMT*FK*100~\n"
    "AMT*R*50~\n"
    "AMT*C1*30~\n"
    "AMT*P3*31~\n"
    "AMT*B9*32~\n"
    "NM1*31*1~\n"
    "N3*45874 WHYYWYW WTWYXW~\n"
    "N4*DYXWHXVYW*NY*88980~")
print(re.findall(pattern, s, re.M))

Output

[
'**INS*Y*01*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\n**DMG*D8*20000101*F~**',
'**INS*Y*19*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\n**DMG*D8*20000101*F~**'
]
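
The question also asks for a general form: given one string that starts a line and another that starts the closing line, capture everything in between, inclusive. One possible way to generalize this answer's line-by-line approach is sketched below. The helper name `block_pattern` and its `exclude` parameter are hypothetical, and the sample assumes the ** markers are not part of the data.

```python
import re

def block_pattern(start, end, exclude=None):
    """Hypothetical helper: build a regex matching from a line that begins
    with `start` through the next line that begins with `end`.

    If `exclude` is given, a start line is skipped when `exclude`
    immediately follows `start`.
    """
    s, e = re.escape(start), re.escape(end)
    guard = f"(?!{re.escape(exclude)})" if exclude else ""
    # Each inner line is consumed only if it begins with neither marker,
    # so every line can be matched in exactly one way.
    return re.compile(
        rf"^{s}{guard}.*(?:\n(?!(?:{s}|{e})).*)*\n{e}.*",
        re.MULTILINE,
    )

text = "\n".join([
    "INS*Y*G8*030**A***AC~",
    "REF*0F*XXXXXXXX~",
    "DMG*D8*19700101*M~",
    "INS*Y*01*030**A***AC~",
    "REF*0F*XXXXXXXXX~",
    "DMG*D8*20000101*F~",
    "INS*Y*19*030**A***AC~",
    "N3*45874 WHYYWYW WTWYXW~",
    "DMG*D8*20000101*F~",
    "AMT*D2*100~",
])

blocks = block_pattern("INS*", "DMG*", exclude="Y*G8").findall(text)
print(len(blocks))
```

Because the lookahead decides up front whether a line belongs to the block, the engine has little to retry when a match fails, which should keep this shape clear of the catastrophic backtracking described in the question.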

Answer 3

Score: 3

You were a little vague on just what your desired stop & start patterns should be, so I assumed a start pattern of <begin line> **INS and an end pattern of <begin line> **DMG. If that's not exactly correct you can adjust the regex.

The key to this solution is setting the right flags so that the pattern scans across multiple lines correctly. They are:

"g" - global - continue to scan for all matches, not just the first
"s" - single line - . matches newline, so multiple lines can be matched with .+
"m" - multi-line - ^ and $ match at line breaks as well as at the beginning and end of the string
"i" - ignore case - may not be necessary, but it can make the patterns shorter

So, this solution is

r"^\*\*INS.*?^\*\*DMG.*?$"gmsi

...pretty simple. Find a **INS at the beginning of a line, continue to scan any character across multiple lines until you see a **DMG at the beginning of a line, then scan everything up to the next line end.
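
The gmsi suffix above is regex101 notation and is not valid Python syntax. A rough Python equivalent, offered as a sketch only, passes the m, s and i flags as re.MULTILINE, re.DOTALL and re.IGNORECASE and gets the "global" behaviour from findall. The shortened sample and the names `text` and `blocks` are assumptions for illustration, and the ** markers are treated as literal data, as this answer assumes.

```python
import re

# Shortened sample where the ** markers are literally part of the data.
text = "\n".join([
    "INS*Y*G8*030**A***AC~",
    "REF*0F*XXXXXXXX~",
    "DMG*D8*19700101*M~",
    "**INS*Y*01*030**A***AC~**",
    "REF*0F*XXXXXXXXX~",
    "**DMG*D8*20000101*F~**",
    "**INS*Y*19*030**A***AC~**",
    "N3*45874 WHYYWYW WTWYXW~",
    "**DMG*D8*20000101*F~**",
    "AMT*D2*100~",
])

pattern = re.compile(
    r"^\*\*INS.*?^\*\*DMG.*?$",
    re.MULTILINE | re.DOTALL | re.IGNORECASE,  # the m, s and i flags
)

# findall plays the role of the g flag: it returns every match.
blocks = pattern.findall(text)
print(len(blocks))
```

This pattern has no explicit test for Y*G8; the first block is skipped here only because its lines do not begin with **.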
