Grab groups of matches based on strings that start a line
Question
I am trying to build a regex that captures groups of lines, each group running from a line that starts with INS through the next line that starts with DMG, inclusive. I am able to exclude the INS*Y*G8 through DMG block. However, I keep getting catastrophic backtracking.
INS*Y*G8*030**A***AC~
REF*0F*XXXXXXXX~
NM1*IL*1*JWHWWWW*TWDD****34*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
DMG*D8*19700101*M~
**INS*Y*01*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
**INS*Y*19*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
AMT*D2*100~
AMT*FK*100~
AMT*R*50~
AMT*C1*30~
AMT*P3*31~
AMT*B9*32~
NM1*31*1~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
In general, I need a regex that can capture groups of lines given a string that starts a line, up to and including the line that ends the capture group based on another given start of a line.
I have tried this: (INS\*Y\*[^G8]+.*)(.*?)(?=DMG)
and other variations, unsuccessfully. What I am expecting is:
Group 1 should be:
**INS*Y*01*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
Group 2 should be:
**INS*Y*19*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
Thank you for your help.
Answer 1
Score: 5
With your shown samples and attempts, please try the following regex:
(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)
Here is the online demo for this regex: https://regex101.com/r/xTlX0i/1
Here is the complete code, written and tested in Python 3, using re.findall from the re module:
import re
# var holds the whole input text as a single string
re.findall(r"(?m)(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)", var)
Output will be as follows with your shown samples:
['INS*Y*01*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\nDMG*D8*20000101*F~**', 'INS*Y*19*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\nDMG*D8*20000101*F~**']
Explanation of Used regex:
(?:^|\n) ##In a non-capturing group, match the start of a line OR a newline.
( ##Start a capturing group here.
INS\* ##Match INS followed by a literal *.
(?!Y\*G8) ##Negative lookahead to make sure Y*G8 does not follow.
(?:.*\n)+? ##In a non-capturing group, match whole lines including their newline, one or more times, lazily (as few lines as possible).
DMG\S+ ##Match DMG followed by one or more non-whitespace characters.
) ##Closing capturing group here.
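For reference, here is a minimal self-contained sketch of how this pattern could be run. The shortened sample text and the names text and blocks are made up for illustration (the answer's var is not shown), and the sample assumes the ** markers are not part of the data.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's data.
text = (
    "INS*Y*G8*030**A***AC~\n"
    "REF*0F*A~\n"
    "DMG*D8*19700101*M~\n"
    "INS*Y*01*030**A***AC~\n"
    "REF*0F*B~\n"
    "DMG*D8*20000101*F~\n"
    "INS*Y*19*030**A***AC~\n"
    "REF*0F*C~\n"
    "DMG*D8*20000101*F~\n"
    "AMT*D2*100~\n"
)

# Same pattern as in the answer; findall returns the single capturing group.
pattern = r"(?m)(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)"
blocks = re.findall(pattern, text)

print(len(blocks))  # 2
for block in blocks:
    print(block.splitlines()[0])  # first line of each captured block
```

Because (?:.*\n)+? is lazy, each match stops at the first line that begins with DMG. If an INS block had no DMG line at all, the match would run on into the following block, so this approach relies on every INS having its own DMG.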
Answer 2
Score: 4
Edit
If the leading ** in the example **INS text are markers for bold text in the question, then you could write the pattern as:
^INS\*(?!Y\*G8).*(?:\n(?!INS\*|DMG\*).*)*\nDMG\*.*
See a regex101 demo
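The question also asks for a general way to do this for any pair of line prefixes. A rough sketch of how the pattern above might be built from arbitrary prefixes follows; capture_blocks and its parameters are hypothetical names, not part of the re module, and the sample text is shortened and assumes no ** markers in the data.

```python
import re

def capture_blocks(text, start, end, exclude=None):
    """Return blocks running from a line that starts with `start` through
    the next line that starts with `end`. A block is skipped when `exclude`
    appears directly after `start` on its first line."""
    s, e = re.escape(start), re.escape(end)
    guard = f"(?!{re.escape(exclude)})" if exclude else ""
    # Same shape as the pattern above: first line, then any lines that start
    # with neither prefix, then the closing line.
    pattern = rf"^{s}{guard}.*(?:\n(?!{s}|{e}).*)*\n{e}.*"
    return re.findall(pattern, text, re.M)

text = (
    "INS*Y*G8*030**A***AC~\n"
    "REF*0F*A~\n"
    "DMG*D8*19700101*M~\n"
    "INS*Y*01*030**A***AC~\n"
    "REF*0F*B~\n"
    "DMG*D8*20000101*F~\n"
    "INS*Y*19*030**A***AC~\n"
    "REF*0F*C~\n"
    "DMG*D8*20000101*F~\n"
    "AMT*D2*100~\n"
)

blocks = capture_blocks(text, "INS*", "DMG*", exclude="Y*G8")
print(len(blocks))  # 2
```

Each line is consumed once and the lookahead only inspects the start of the next line, so there is little room for the engine to backtrack, which is likely why this shape avoids the catastrophic backtracking mentioned in the question.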
<hr>
You could use a single match starting with **INS* and asserting that it is not directly followed by Y*G8.
Then match all following lines that do not start with either **INS or **DMG:
^\*\*INS\*(?!Y\*G8).*(?:\n(?!\*\*(?:INS|DMG)).*)*\n\*\*DMG.*
Explanation
- ^ Start of a line (with the multiline flag)
- \*\*INS\*(?!Y\*G8) Match **INS* and assert that it is not directly followed by Y*G8. The lookahead sits after the \* because the text following **INS is *Y*..., so a lookahead placed before the \* would never see Y*G8.
- .* Match the rest of the line
- (?: Non-capturing group, repeated as a whole
- \n Match a newline
- (?!\*\*(?:INS|DMG)) Negative lookahead, assert that neither **INS nor **DMG is directly to the right
- .* Match the whole line
- )* Close the non-capturing group and optionally repeat it to match all such lines
- \n\*\*DMG.* Match a newline, then **DMG and the rest of the line
Example code
import re
pattern = r"^\*\*INS\*(?!Y\*G8).*(?:\n(?!\*\*(?:INS|DMG)).*)*\n\*\*DMG.*"
s = ("INS*Y*G8*030**A***AC~\n"
"REF*0F*XXXXXXXX~\n"
"NM1*IL*1*JWHWWWW*TWDD****34*XXXXXXXXX~\n"
"N3*45874 WHYYWYW WTWYXW~\n"
"N4*DYXWHXVYW*NY*88980~\n"
"DMG*D8*19700101*M~\n"
"**INS*Y*01*030**A***AC~**\n"
"REF*0F*XXXXXXXXX~\n"
"NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\n"
"PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\n"
"N3*45874 WHYYWYW WTWYXW~\n"
"N4*DYXWHXVYW*NY*88980~\n"
"**DMG*D8*20000101*F~**\n"
"**INS*Y*19*030**A***AC~**\n"
"REF*0F*XXXXXXXXX~\n"
"NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\n"
"PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\n"
"N3*45874 WHYYWYW WTWYXW~\n"
"N4*DYXWHXVYW*NY*88980~\n"
"**DMG*D8*20000101*F~**\n"
"AMT*D2*100~\n"
"AMT*FK*100~\n"
"AMT*R*50~\n"
"AMT*C1*30~\n"
"AMT*P3*31~\n"
"AMT*B9*32~\n"
"NM1*31*1~\n"
"N3*45874 WHYYWYW WTWYXW~\n"
"N4*DYXWHXVYW*NY*88980~")
print(re.findall(pattern, s, re.M))
Output
[
'**INS*Y*01*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\n**DMG*D8*20000101*F~**',
'**INS*Y*19*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\n**DMG*D8*20000101*F~**'
]
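One detail worth checking when adapting this: a lookahead is tested at the position right after whatever the pattern has consumed so far. In the marked-up data the text after **INS is *Y*..., so a guard meant to reject Y*G8 only works when it comes after the \*. A small sketch with a made-up one-block sample (the variable names are illustrative):

```python
import re

# Hypothetical block that carries the ** markers AND the Y*G8 code.
sample = "**INS*Y*G8*030~\nREF*0F*X~\n**DMG*D8*19700101*M~"

tail = r".*(?:\n(?!\*\*(?:INS|DMG)).*)*\n\*\*DMG.*"
guard_before_star = r"^\*\*INS(?!Y\*G8)" + tail    # lookahead sees "*Y*G8..."
guard_after_star = r"^\*\*INS\*(?!Y\*G8)" + tail   # lookahead sees "Y*G8..."

before = re.findall(guard_before_star, sample, re.M)
after = re.findall(guard_after_star, sample, re.M)

print(len(before))  # 1: the G8 block slips through
print(len(after))   # 0: the G8 block is rejected
```

With the sample data from the question both variants give the same two matches, because the G8 block there has no ** markers and is already rejected by the ^\*\*INS prefix.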
Answer 3
Score: 3
You were a little vague on just what your desired start and stop patterns should be, so I assumed a start pattern of <begin line> **INS and an end pattern of <begin line> **DMG. If that's not exactly correct, you can adjust the regex.
The key to this solution is to set the right flags so that the scan works across multiple lines. They are:
- "g" - global - continue to scan for all matches, not just the first
- "s" - single line - . matches newline, so multiple lines can be matched with .+
- "m" - multi-line - ^ and $ match at each <newline> as well as at the beginning and end of the string
- "i" - ignore case - may not be necessary, but it can make the patterns shorter
So, this solution is
r"^\*\*INS.*?^\*\*DMG.*?$"gmsi
...pretty simple. Find a **INS at the beginning of a line, continue to scan any character across multiple lines until you see a **DMG at the beginning of a line, then scan everything up to the next line end.
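The trailing gmsi is regex101-style flag notation, not Python syntax. A rough Python equivalent, assuming the ** markers really are part of the data, could look like the sketch below: re.findall plays the role of "g", and the other three flags map to re.MULTILINE, re.DOTALL and re.IGNORECASE. The sample text and variable names are made up for illustration.

```python
import re

# Shortened, hypothetical sample that keeps the ** markers as literal text.
text = (
    "INS*Y*G8*030~\n"
    "REF*0F*A~\n"
    "DMG*D8*19700101*M~\n"
    "**INS*Y*01*030~**\n"
    "REF*0F*B~\n"
    "**DMG*D8*20000101*F~**\n"
    "**INS*Y*19*030~**\n"
    "REF*0F*C~\n"
    "**DMG*D8*20000101*F~**\n"
    "AMT*D2*100~\n"
)

# m -> MULTILINE, s -> DOTALL, i -> IGNORECASE; findall covers "g".
flags = re.MULTILINE | re.DOTALL | re.IGNORECASE
matches = re.findall(r"^\*\*INS.*?^\*\*DMG.*?$", text, flags)

print(len(matches))  # 2
```

Note that this pattern has no check for Y*G8; the first block is skipped here only because it lacks the ** prefix.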