Regex to remove captions with condition not to overlap second match

Question

I have the following string, which I extract from a pdf:

This is
Fig. 13: John holding his present and
the flowers
Source: official photographer
a beautiful
Table: a table of some kind
and fully
complete
Table: John holding his present and
Source: official photographer
sentence

The text includes figs and tables, most of which have a caption on top and a source on bottom, but some don't. Fundamentally, the text I want to be left with should be:

This is
a beautiful
and fully
complete
sentence

I have tried the following:

s = re.sub(r'(Fig|Table)[\s\S]+?Source:.*\n', '', mystring, flags=re.MULTILINE)

But unfortunately it returns:

This is
a beautiful
sentence

With my limited knowledge of regex I cannot figure out how to put such a condition:

It should stop at the first \n after Source, only if there is no new fig|table in between, in which case it should have stopped at the first \n from start.

Any idea? Thank you.
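
The behaviour described above can be reproduced with a short script. This is only a sketch: the sample text is typed in by hand from the excerpt, so the exact newlines are an assumption. It shows that the lazy `[\s\S]+?` started at the second caption keeps going until the next `Source:`, swallowing the lines in between.

```python
import re

# Sample text retyped from the excerpt above (exact newlines are assumed).
mystring = (
    "This is\n"
    "Fig. 13: John holding his present and\n"
    "the flowers\n"
    "Source: official photographer\n"
    "a beautiful\n"
    "Table: a table of some kind\n"
    "and fully\n"
    "complete\n"
    "Table: John holding his present and\n"
    "Source: official photographer\n"
    "sentence"
)

# The attempted pattern: nothing stops the lazy run from crossing a new caption.
s = re.sub(r'(Fig|Table)[\s\S]+?Source:.*\n', '', mystring, flags=re.MULTILINE)
print(s)
```

The second match begins at "Table: a table of some kind" and only ends at the last Source line, which is why "and fully" and "complete" disappear.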

Answer 1

Score: 4

What you need to match is a Fig or Table followed by either

  1. Characters up to and including a line starting with Source, with no Fig or Table in between the original one and Source; or
  2. Characters up to the end of line

You can achieve #1 above by using a tempered greedy token, which ensures that no character consumed before Source is found begins the word Fig or Table. This regex will do what you want:

(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)

This matches:

  • (?:Fig|Table) : the word Fig or Table; and then either
  • (?:(?!Fig|Table)[\s\S])+? : a minimal number of characters, none of which starts the word Fig or Table, followed by
  • Source[^\n]*\n : the word Source followed by any characters up to and including the newline; or
  • [^\n]*\n : any characters up to and including the newline
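
The breakdown above can also be written as a commented pattern. The following is a minimal sketch that uses `re.VERBOSE` to lay out the same regex; the names `CAPTION`, `sample` and `short` are illustrative, and the sample text is retyped from the question, so its exact newlines are an assumption.

```python
import re

# Same regex as above, spread out with re.VERBOSE so each part can be commented.
CAPTION = re.compile(r"""
    (?:Fig|Table)                  # caption keyword
    (?:
        (?:(?!Fig|Table)[\s\S])+?  # tempered token: any char not starting Fig/Table
        Source[^\n]*\n             # then the Source line, through its newline
      |
        [^\n]*\n                   # fallback: only the rest of the caption line
    )
""", re.VERBOSE)

sample = (
    "This is\n"
    "Fig. 13: John holding his present and\n"
    "the flowers\n"
    "Source: official photographer\n"
    "a beautiful\n"
    "Table: a table of some kind\n"
    "and fully\n"
    "complete\n"
    "Table: John holding his present and\n"
    "Source: official photographer\n"
    "sentence"
)

cleaned = CAPTION.sub("", sample)
print(cleaned)

# A caption followed by another caption before any Source: only one line is taken.
short = CAPTION.match("Table: a\nb\nTable: c\nSource: d\n").group(0)
print(repr(short))
```

In the second check the first alternative fails as soon as the tempered token reaches the next "Table", so the engine falls back to the second alternative and removes just the caption line.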

Regex demo on regex101

In Python:

import re

s = re.sub(r'(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)', '', mystring)
print(s)

Output:

This is
a beautiful
and fully
complete
sentence

Note that this leaves any newlines that were present at the start and end of the original string; they can be removed with strip.
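
As a small illustration of that last point, the substitution and the strip can be wrapped together. This is a sketch only: `remove_captions`, `CAPTION_RE` and the sample string are hypothetical names and data, not part of the original answer.

```python
import re

CAPTION_RE = re.compile(
    r'(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)'
)

def remove_captions(text: str) -> str:
    """Drop caption/source blocks, then trim leading and trailing whitespace."""
    return CAPTION_RE.sub('', text).strip()

# Hypothetical input with a newline on each end.
sample = "\nFig. 1: a chart\nSource: someone\nbody text\n"

without_strip = CAPTION_RE.sub('', sample)
print(repr(without_strip))            # surrounding newlines are still there
print(repr(remove_captions(sample)))  # surrounding newlines are gone
```

Called with no arguments, `str.strip` removes all leading and trailing whitespace, not only newlines; `strip('\n')` could be used instead if spaces at the ends must be kept.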

huangapple
  • Posted on May 25, 2023, 10:08:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76328441.html