Grab groups of matches based on strings that start a line

Question

I am trying to build a regex that captures groups of lines, each running from a line that starts with INS (^INS) through the next line that starts with DMG (^DMG), inclusive. I am able to exclude the group from INS*Y*G8 through DMG. However, I keep getting catastrophic backtracking. Sample input:

INS*Y*G8*030**A***AC~
REF*0F*XXXXXXXX~
NM1*IL*1*JWHWWWW*TWDD****34*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
DMG*D8*19700101*M~
**INS*Y*01*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
**INS*Y*19*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
AMT*D2*100~
AMT*FK*100~
AMT*R*50~
AMT*C1*30~
AMT*P3*31~
AMT*B9*32~
NM1*31*1~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~

In general, I need a regex that can capture groups of lines given a string that starts a line, up to and including the line that ends the capture group based on another given start of a line.

I have tried this (INS\*Y\*[^G8]+.*)(.*?)(?=DMG) and other variations unsuccessfully. What I am expecting is...

Group 1 should be:

**INS*Y*01*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**

Group 2 should be:

**INS*Y*19*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**

Thank you for your help.

Answer 1

Score: 5

With your shown samples and attempts, please try the following regex:

(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)

Here is the online demo for the regex used: https://regex101.com/r/xTlX0i/1

Here is the complete Python 3 code, written and tested in Python 3 using the re.findall function of the re module (var holds the input text):

re.findall(r"(?m)(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)",var)

Output will be as follows with your shown samples:

['INS*Y*01*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\nDMG*D8*20000101*F~**', 'INS*Y*19*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\nDMG*D8*20000101*F~**']

Explanation of Used regex:

(?:^|\n)     ##In a non-capturing group, match the start of the input (or of a line, with the m flag) OR a newline.
(            ##Start a capturing group here.
  INS\*      ##Match INS followed by a literal *.
  (?!Y\*G8)  ##Negative lookahead: make sure Y*G8 does not come next.
  (?:.*\n)+? ##Non-capturing group matching a whole line including its newline, repeated lazily 1 or more times.
  DMG\S+     ##Match DMG followed by one or more non-whitespace characters.
)            ##Close the capturing group here.
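A minimal runnable sketch of the regex above, assuming the ** in the question are only bold markers and the real data has none. The shortened sample and the names text and groups are illustrative assumptions, not part of the original answer:

```python
import re

# Shortened sample without the bold markers (an assumption about the real data).
text = (
    "INS*Y*G8*030**A***AC~\n"
    "REF*0F*XXXXXXXX~\n"
    "DMG*D8*19700101*M~\n"
    "INS*Y*01*030**A***AC~\n"
    "REF*0F*XXXXXXXXX~\n"
    "DMG*D8*20000101*F~\n"
    "INS*Y*19*030**A***AC~\n"
    "N3*45874 WHYYWYW WTWYXW~\n"
    "DMG*D8*20000101*F~\n"
    "AMT*D2*100~\n"
)

# Same pattern as in the answer; findall returns the contents of capture group 1.
pattern = r"(?m)(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)"
groups = re.findall(pattern, text)

for number, block in enumerate(groups, start=1):
    print(f"Group {number}:")
    print(block)
```

Because the leading newline sits outside the capturing group, each returned block starts directly at INS. The INS*Y*G8 block is skipped by the negative lookahead.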

Answer 2

Score: 4

Edit

If the ** around the INS and DMG lines in the question are only markers for bold text, then you could write the pattern as:

^INS\*(?!Y\*G8).*(?:\n(?!INS\*|DMG\*).*)*\nDMG\*.*

See a regex101 demo

<hr>

You could use a single match starting with **INS followed by a literal *, and asserting that it is not directly followed by Y*G8

Then match all following lines that do not start with either **INS or **DMG

^\*\*INS\*(?!Y\*G8).*(?:\n(?!\*\*(?:INS|DMG)).*)*\n\*\*DMG.*

Explanation

  • ^ Start of a line (the pattern is used with the multiline flag, re.M)
  • \*\*INS\*(?!Y\*G8) Match **INS and a literal *, then assert that Y*G8 does not directly follow
  • .* Match the rest of the line
  • (?: Non capture group to repeat as a whole part
    • \n Match a newline
    • (?!\*\*(?:INS|DMG)) Negative lookahead, assert not **INS or **DMG directly to the right
    • .* Match the whole line
  • )* Close the non capture group and optionally repeat it to match all lines
  • \n\*\*DMG.* Match a newline, then **DMG and the rest of the line

Regex demo

Example code

import re

pattern = r"^\*\*INS\*(?!Y\*G8).*(?:\n(?!\*\*(?:INS|DMG)).*)*\n\*\*DMG.*"

s = ("INS*Y*G8*030**A***AC~\n"
    "REF*0F*XXXXXXXX~\n"
    "NM1*IL*1*JWHWWWW*TWDD****34*XXXXXXXXX~\n"
    "N3*45874 WHYYWYW WTWYXW~\n"
    "N4*DYXWHXVYW*NY*88980~\n"
    "DMG*D8*19700101*M~\n"
    "**INS*Y*01*030**A***AC~**\n"
    "REF*0F*XXXXXXXXX~\n"
    "NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\n"
    "PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\n"
    "N3*45874 WHYYWYW WTWYXW~\n"
    "N4*DYXWHXVYW*NY*88980~\n"
    "**DMG*D8*20000101*F~**\n"
    "**INS*Y*19*030**A***AC~**\n"
    "REF*0F*XXXXXXXXX~\n"
    "NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\n"
    "PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\n"
    "N3*45874 WHYYWYW WTWYXW~\n"
    "N4*DYXWHXVYW*NY*88980~\n"
    "**DMG*D8*20000101*F~**\n"
    "AMT*D2*100~\n"
    "AMT*FK*100~\n"
    "AMT*R*50~\n"
    "AMT*C1*30~\n"
    "AMT*P3*31~\n"
    "AMT*B9*32~\n"
    "NM1*31*1~\n"
    "N3*45874 WHYYWYW WTWYXW~\n"
    "N4*DYXWHXVYW*NY*88980~")

print(re.findall(pattern, s, re.M))

Output

[
'**INS*Y*01*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\n**DMG*D8*20000101*F~**',
'**INS*Y*19*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\n**DMG*D8*20000101*F~**'
]
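The question also asks for the general case: any start-of-line string through any other start-of-line string. One possible sketch, following the line-by-line technique of the Edit pattern above, builds the regex from arbitrary prefixes with re.escape. The helper name build_block_pattern, its parameters, and the shortened sample are hypothetical and not part of the original answer:

```python
import re

def build_block_pattern(start, end, exclude=None):
    """Compile a regex matching blocks that run from a line starting with
    `start` through the next line starting with `end`, inclusive.

    `exclude` is an optional literal that must not directly follow `start`.
    """
    s, e = re.escape(start), re.escape(end)
    # Optional negative lookahead placed right after the start prefix.
    guard = f"(?!{re.escape(exclude)})" if exclude else ""
    # Start line, then any lines that begin with neither prefix, then the end line.
    return re.compile(rf"^{s}{guard}.*(?:\n(?!{s}|{e}).*)*\n{e}.*", re.M)

# Shortened sample without bold markers (an assumption about the real data).
text = "\n".join([
    "INS*Y*G8*030~",
    "REF*0F*A~",
    "DMG*D8*19700101*M~",
    "INS*Y*01*030~",
    "REF*0F*B~",
    "DMG*D8*20000101*F~",
    "INS*Y*19*030~",
    "N3*45874 X~",
    "DMG*D8*20000101*F~",
    "AMT*D2*100~",
])

blocks = build_block_pattern("INS*", "DMG*", exclude="Y*G8").findall(text)
for block in blocks:
    print(block)
    print("---")
```

Each line is consumed once by .* and the repeated group never overlaps with itself, which is what keeps this shape away from the catastrophic backtracking mentioned in the question.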

Answer 3

Score: 3

You were a little vague on just what your desired stop & start patterns should be, so I assumed a start pattern of <begin line> **INS and an end pattern of <begin line> **DMG. If that's not exactly correct you can adjust the regex.

The key to this solution is to set the right flags so the scan works correctly across multiple lines. They are:

  • "g" - global - continue to scan for all matches, not just the first
  • "s" - single line - . matches newline, so multiple lines can be matched with .+
  • "m" - multi-line - ^ and $ match at the start and end of every line, not only at the beginning and end of the string
  • "i" - ignore case - may not be necessary, but it can make the patterns shorter.

So, this solution is

r"^\*\*INS.*?^\*\*DMG.*?$"gmsi

...pretty simple. Find a **INS at the beginning of a line, continue to scan any character across multiple lines until you see a **DMG at the beginning of a line, then scan everything up to the next line end.
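The gmsi suffix above is regex101-style flag notation, not Python syntax. A hedged sketch of how the same pattern might be run in Python follows, assuming re.M, re.S and re.I stand in for m, s and i, and re.findall stands in for g. The shortened sample and the variable names are illustrative assumptions:

```python
import re

# m -> re.M, s -> re.S, i -> re.I; Python has no "g" flag, findall returns all matches.
pattern = re.compile(r"^\*\*INS.*?^\*\*DMG.*?$", re.M | re.S | re.I)

# Shortened sample that keeps the ** markers, as this answer assumes.
text = (
    "INS*Y*G8*030**A***AC~\n"
    "DMG*D8*19700101*M~\n"
    "**INS*Y*01*030**A***AC~**\n"
    "REF*0F*XXXXXXXXX~\n"
    "**DMG*D8*20000101*F~**\n"
    "**INS*Y*19*030**A***AC~**\n"
    "N3*45874 WHYYWYW WTWYXW~\n"
    "**DMG*D8*20000101*F~**\n"
    "AMT*D2*100~\n"
)

blocks = pattern.findall(text)
for block in blocks:
    print(block)
    print("---")
```

Note that this pattern has no check for Y*G8. It skips the first block in the sample only because that block's lines do not start with **.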

Here's a screenshot in Regex101:
[Image: Grab groups of matches based on strings that start a line]

huangapple
  • Posted on 2023-02-14 01:15:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75439134.html