Regex to remove captions with condition not to overlap second match
Question
I have the following string, which I extract from a pdf:
This is
Fig. 13: John holding his present and
the flowers
Source: official photographer
a beautiful
Table: a table of some kind
and fully
complete
Table: John holding his present and
Source: official photographer
sentence
The text includes figs and tables, most of which have a caption on top and a source on bottom, but some don't. Fundamentally, the text I want to be left with should be:
This is
a beautiful
and fully
complete
sentence
I have tried the following:
s = re.sub(r'(Fig|Table)[\s\S]+?Source:.*\n', '', mystring, flags=re.MULTILINE)
But unfortunately it returns:
This is
a beautiful
sentence
With my limited knowledge of regex, I cannot figure out how to express this condition: the match should stop at the first \n after Source only if there is no new Fig|Table in between; otherwise it should stop at the first \n after the start of the match.
Any idea? Thank you.
Answer 1
Score: 4
What you need to match is a Fig or Table followed by either
- Characters up to and including a line starting with Source, with no Fig or Table in between the original one and Source; or
- Characters up to the end of the line
You can achieve the first case above by using a tempered greedy token, which ensures that no character consumed before Source is found is the start of another Fig or Table. This regex will do what you want:
(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)
This matches:
- (?:Fig|Table) : the word Fig or Table; and then either
- (?:(?!Fig|Table)[\s\S])+? : a minimal number of characters, none of which starts the word Fig or Table, followed by Source[^\n]*\n : the word Source followed by some number of characters up to and including the newline; or
- [^\n]*\n : some number of characters up to and including the newline
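To see what the tempered token changes, here is a minimal sketch comparing it with a plain lazy match. The sample text is made up for illustration and is not taken from the question.

```python
import re

# Hypothetical input: a caption without a source, a line to keep,
# then a caption that does have a source.
text = "Table: first\nkeep me\nTable: second\nSource: someone\n"

# Plain lazy match: runs from the first Table to the first Source,
# swallowing "keep me" and the second caption on the way.
plain = re.search(r'Table[\s\S]+?Source[^\n]*\n', text)

# Tempered match: (?!Fig|Table) is checked before each consumed character,
# so a match cannot cross into another caption. The first Table therefore
# fails to reach Source, and the search only succeeds at the second Table.
tempered = re.search(r'Table(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n', text)

print(repr(plain.group()))
print(repr(tempered.group()))
```

In the full regex, the first Table is then handled by the second alternative, which removes only its own line.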
Regex demo on regex101
In Python:
import re
s = re.sub(r'(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)', '', mystring)
print(s)
Output:
This is
a beautiful
and fully
complete
sentence
Note that this leaves any newlines at the start and end of the string (if present in the original string); they can be removed with strip.
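For a self-contained run, the following sketch applies the regex to the sample text from the question. The helper name remove_captions and the constant CAPTION are made up for illustration, and the sample is assumed to end without a trailing newline.

```python
import re

# Sample text from the question (assumed to have no trailing newline).
mystring = (
    "This is\n"
    "Fig. 13: John holding his present and\n"
    "the flowers\n"
    "Source: official photographer\n"
    "a beautiful\n"
    "Table: a table of some kind\n"
    "and fully\n"
    "complete\n"
    "Table: John holding his present and\n"
    "Source: official photographer\n"
    "sentence"
)

# The regex from the answer, compiled once for reuse.
CAPTION = re.compile(
    r'(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)'
)

def remove_captions(text):
    """Hypothetical helper: drop caption/source blocks, then strip
    leading and trailing whitespace left over from the removal."""
    return CAPTION.sub('', text).strip()

cleaned = remove_captions(mystring)
print(cleaned)
```

One thing to keep in mind: the pattern is not anchored to the start of a line, so a body line that merely mentions Fig or Table (for example "See Table 2 for details") would also be cut from that word onward. A caption on the very last line with no trailing newline and no Source line would not be removed, since both alternatives require a \n.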