Regex to remove captions with condition not to overlap second match
Question
I have the following string, which I extracted from a PDF:
This is
Fig. 13: John holding his present and
the flowers
Source: official photographer
a beautiful
Table: a table of some kind
and fully
complete
Table: John holding his present and
Source: official photographer
sentence
The text includes figures and tables. Most of them have a caption above and a source line below, but some have no source line. The text I want to be left with is:
This is
a beautiful
and fully
complete
sentence
I have tried the following:
s = re.sub(r'(Fig|Table)[\s\S]+?Source:.*\n', '', mystring, flags=re.MULTILINE)
Unfortunately, it returns:
This is
a beautiful
sentence
With my limited knowledge of regex, I cannot figure out how to express this condition:
The match should stop at the first \n after Source, but only if no new Fig or Table appears in between; if one does, it should stop at the first \n after the start of the caption.
Any ideas? Thank you.
Answer 1
Score: 4
What you need to match is Fig or Table followed by either:
1. characters up to and including a line starting with Source, with no Fig or Table between the original one and Source; or
2. characters up to the end of the line.
You can achieve the first case by using a tempered greedy token, which ensures that no character consumed before Source is found is the start of Fig or Table. This regex will do what you want:
(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)
This matches:
- (?:Fig|Table): the word Fig or Table; and then either
- (?:(?!Fig|Table)[\s\S])+?: a minimal number of characters, none of which starts the word Fig or Table, followed by Source[^\n]*\n: the word Source and any number of characters up to and including the newline; or
- [^\n]*\n: any number of characters up to and including the newline
Regex demo on regex101
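The breakdown above can also be written as a commented pattern. This is a minimal sketch using re.VERBOSE; the name CAPTION and the two short test strings are made up for illustration and are not part of the question.

```python
import re

# Commented form of the same pattern; re.VERBOSE ignores whitespace and
# comments outside character classes.
CAPTION = re.compile(r"""
    (?:Fig|Table)                   # a caption starts with Fig or Table
    (?:
        (?:(?!Fig|Table)[\s\S])+?   # as few characters as possible, none starting a new Fig/Table
        Source[^\n]*\n              # then the Source line, including its newline
      |
        [^\n]*\n                    # otherwise only the rest of the caption line
    )
""", re.VERBOSE)

# Hypothetical inputs: one caption with a Source line, one without.
with_source = "Fig. 1: a caption\nSource: someone\nkept\n"
without_source = "Table: a caption\nkept\nTable: another\nSource: someone\nend\n"

print(repr(CAPTION.sub("", with_source)))
print(repr(CAPTION.sub("", without_source)))
```

In the second string the first Table has no Source before the next Table, so only its caption line is removed and the line after it survives.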
In Python:
import re

s = re.sub(r'(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)', '', mystring)
print(s)
Output:
This is
a beautiful
and fully
complete
sentence
Note that this leaves any newlines at the start and end of the original string in place; they can be removed with strip().
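For reference, here is a self-contained run on the text from the question. The leading and trailing newlines in mystring are an assumption about how the extracted text might look, added only to show the effect of strip(); the names pattern, cleaned and stripped are illustrative.

```python
import re

# Sample text from the question. The blank first and last lines are an
# assumption, added to show what strip() removes.
mystring = """
This is
Fig. 13: John holding his present and
the flowers
Source: official photographer
a beautiful
Table: a table of some kind
and fully
complete
Table: John holding his present and
Source: official photographer
sentence
"""

pattern = r'(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)'

cleaned = re.sub(pattern, '', mystring)   # captions and source lines removed
stripped = cleaned.strip()                # leading/trailing newlines removed
print(stripped)
```

The pattern is case-sensitive, so the lowercase "table" inside the second caption does not count as the start of a new caption.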