Match everything until the whole regex matches again

Question


This is the string I'm trying to match on (the real one is much longer).

VC1000 Venture Capital 4 cr.
This is a class about venture capital
and more description, that could mention a future course like
VC2000 but might not
VC2000 venture capital II 4 cr.
Another description about blah
VC 3000 venture capital III 4-6 cr.
back again

I'm trying to get groups that would look like

  • [VC1000]
  • [Venture Capital]
  • [4]
  • [This is a class about venture capital and more description, that could mention a future course like VC2000 but might not]

I almost got it but I'm not sure how to get the description between class listings. Right now I have:

(^\*?[A-Z]{2}\s?[0-9]{4}) (.*?)([0-9]|[0-9]-[0-9]+)\s?cr\.

But I'm not sure how to proceed. Adding .* at the end matches too much, and following .* with a copy of the first group consumes the next course header, so the first group is only captured on every other match.

What's the trick I'm missing?
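
The "every other match" symptom described above can be reproduced with a small sketch. It assumes a simplified header pattern (the name header and the shortened sample text are illustrative, not the asker's exact regex) and compares a terminator that consumes the next header with one that only looks ahead for it:

```python
import re

text = """\
VC1000 Venture Capital 4 cr.
This is a class about venture capital
VC2000 but might not
VC2000 venture capital II 4 cr.
Another description about blah
VC 3000 venture capital III 4-6 cr.
back again"""

# Simplified header line (an assumption for this sketch): code, title, credits, "cr."
header = r"^[A-Z]{2}\s?[0-9]{4} [^\n]*?[0-9](?:-[0-9]+)?\s?cr\.$"

# Terminator consumes the next header, so the scan resumes after that header.
consuming = re.compile(rf"({header})(.*?)(?:{header}|\Z)", re.S | re.M)

# Terminator is a lookahead: the next header is checked but left unconsumed.
lookahead = re.compile(rf"({header})(.*?)(?={header}|\Z)", re.S | re.M)

consumed_headers = [m.group(1) for m in consuming.finditer(text)]
lookahead_headers = [m.group(1) for m in lookahead.finditer(text)]

print(consumed_headers)
print(lookahead_headers)
```

With the consuming version, the VC2000 header is swallowed as the end of the first match and never starts a match of its own. The lookahead version leaves it in place for the next match.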

Answer 1

Score: 2

Try (regex101):

import re

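# Groups: 1 = course code, 2 = title, 3 = credits, 4 = description.
# The lookahead ends the description at the next course header (or at the end
# of the text) without consuming that header, so it can start the next match.
pat = r'^([A-Z]{2}\s*\d{4})\s+([^\n]+?)(\d+-?\d*\s+cr\.)$(.*?)(?=^[A-Z]{2}\s*\d{4}\s+[^\n]+?\d+-?\d*\s+cr\.$|\Z)'
# re.S lets . cross newlines in the description; re.M makes ^ and $ match at every line.
pat = re.compile(pat, flags=re.S|re.M)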

text = '''\
VC1000 Venture Capital 4 cr.
This is a class about venture capital
and more description, that could mention a future course like
VC2000 but might not
VC2000 venture capital II 4 cr.
Another description about blah
VC 3000 venture capital III 4-6 cr.
back again'''

for a, b, c, d in pat.findall(text):
    print(a)
    print(b)
    print(c)
    print(d)
    print('-' * 80)

Prints:

VC1000
Venture Capital 
4 cr.

This is a class about venture capital
and more description, that could mention a future course like
VC2000 but might not

--------------------------------------------------------------------------------
VC2000
venture capital II 
4 cr.

Another description about blah

--------------------------------------------------------------------------------
VC 3000
venture capital III 
4-6 cr.

back again
--------------------------------------------------------------------------------
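
The output above still carries a trailing space on the title, the "cr." suffix on the credits, and line breaks inside the description. A possible variant that gets closer to the groups listed in the question is sketched below. The names HEADER, COURSE and parse_courses are made up for this example, and it assumes that single spaces separate the fields on a header line:

```python
import re

# Header line such as "VC 3000 venture capital III 4-6 cr." (no capturing groups).
HEADER = r"^[A-Z]{2}[ ]?\d{4}[ ]+[^\n]+?[ ]+\d+(?:-\d+)?[ ]?cr\.$"

COURSE = re.compile(
    r"^(?P<code>[A-Z]{2}[ ]?\d{4})[ ]+(?P<title>[^\n]+?)[ ]+"
    r"(?P<credits>\d+(?:-\d+)?)[ ]?cr\.$"
    r"(?P<description>.*?)"   # lazy, may span several lines (re.S)
    rf"(?={HEADER}|\Z)",      # stop before the next header or at the end of the text
    re.S | re.M,
)


def parse_courses(text):
    """Return (code, title, credits, description) tuples, one per course."""
    return [
        (
            m["code"],
            m["title"],
            m["credits"],
            " ".join(m["description"].split()),  # collapse newlines and extra spaces
        )
        for m in COURSE.finditer(text)
    ]


sample = """\
VC1000 Venture Capital 4 cr.
This is a class about venture capital
and more description, that could mention a future course like
VC2000 but might not
VC2000 venture capital II 4 cr.
Another description about blah
VC 3000 venture capital III 4-6 cr.
back again"""

courses = parse_courses(sample)
for course in courses:
    print(course)
```

Keeping the spaces between fields outside the named groups is what drops the trailing space from the title, and capturing only the digits leaves "cr." out of the credits group.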
