2023-05-25 10:08:11

Regex to remove captions with condition not to overlap second match

Question

I have the following string, which I extract from a pdf:

This is
Fig. 13: John holding his present and
the flowers
Source: official photographer
a beautiful
Table: a table of some kind
and fully
complete
Table: John holding his present and
Source: official photographer
sentence

The text includes figs and tables, most of which have a caption on top and a source on bottom, but some don't. Fundamentally, the text I want to be left with should be:

This is
a beautiful
and fully
complete
sentence

I have tried the following:

s = re.sub(r'(Fig|Table)[\s\S]+?Source:.*\n', '', mystring, flags=re.MULTILINE)

But unfortunately it returns:

This is
a beautiful
sentence
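
The over-matching can be reproduced with a short script. This is only a sketch: the string literal below is an assumption about how mystring looks after PDF extraction (one line per row, each ending in \n). A lazy quantifier is lazy only about where the match ends, so starting from the first Table it still runs to the nearest Source: further down and swallows the lines in between.

```python
import re

# Assumed shape of the extracted text: every line ends with "\n".
mystring = (
    "This is\n"
    "Fig. 13: John holding his present and\n"
    "the flowers\n"
    "Source: official photographer\n"
    "a beautiful\n"
    "Table: a table of some kind\n"
    "and fully\n"
    "complete\n"
    "Table: John holding his present and\n"
    "Source: official photographer\n"
    "sentence\n"
)

# The attempt from the question: nothing stops [\s\S]+? from crossing
# the second "Table" on its way to the next "Source:".
naive = re.sub(r'(Fig|Table)[\s\S]+?Source:.*\n', '', mystring)
print(naive)
```

With this input, the lines "and fully" and "complete" are removed along with both tables, which is the unwanted result shown above.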

With my limited knowledge of regex I cannot figure out how to put such a condition:

The match should stop at the first \n after Source only if no new Fig|Table occurs in between; otherwise it should stop at the first \n after the start of the match.

Any idea? Thank you.

Answer 1

Score: 4

What you need to match is Fig or Table followed by either:

1. characters up to and including a line starting with Source, with no Fig or Table between the original one and Source; or
2. characters up to the end of the line.

You can achieve #1 by using a tempered greedy token, which checks before consuming each character that Fig or Table does not begin at that position, so the match cannot run past a new caption on its way to Source. This regex will do what you want:

(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)

This matches:

(?:Fig|Table) : the word Fig or Table; and then either
(?:(?!Fig|Table)[\s\S])+? : a minimal number of characters, none of which starts the word Fig or Table; followed by
Source[^\n]*\n : the word Source followed by any characters up to and including the newline; or
[^\n]*\n : any characters up to and including the newline
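
The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is a sketch of the same pattern, not a different technique; the names CAPTION_RE and remove_captions are made up for illustration.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each part carries a comment.
# CAPTION_RE and remove_captions are hypothetical names used only in this sketch.
CAPTION_RE = re.compile(
    r"""
    (?:Fig|Table)                  # start of a caption
    (?:
        (?:(?!Fig|Table)[\s\S])+?  # as few characters as possible, refusing to
                                   # step onto the start of another Fig or Table
        Source[^\n]*\n             # the Source line, through its newline
      |
        [^\n]*\n                   # fallback: only the rest of the caption line
    )
    """,
    re.VERBOSE,
)


def remove_captions(text: str) -> str:
    """Delete each caption, plus its Source line when one belongs to it."""
    return CAPTION_RE.sub("", text)


print(remove_captions("Table: a\nbody\nTable: b\nSource: c\nend\n"))
```

In the sample call, the first Table has no Source before the next Table begins, so only its own line is removed; the second Table is removed together with its Source line.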

Regex demo on regex101

In Python:

import re

s = re.sub(r'(?:Fig|Table)(?:(?:(?!Fig|Table)[\s\S])+?Source[^\n]*\n|[^\n]*\n)', '', mystring)
print(s)

Output:

This is
a beautiful
and fully
complete
sentence

Note that this leaves any newlines that were present at the start and end of the original string; they can be removed with strip().
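
One caveat worth checking against real data: the pattern matches Fig or Table anywhere, so a sentence such as "See Table 2 for details" would lose its tail. If captions and Source: lines always begin a line (an assumption about the PDF extraction that the question does not state), a line-anchored variant avoids this. The sketch below also assumes every line ends with \n; ANCHORED_CAPTION_RE and strip_captions are hypothetical names.

```python
import re

# Hypothetical line-anchored variant. Assumes captions and "Source:" lines
# always start a line and that every line ends with "\n".
ANCHORED_CAPTION_RE = re.compile(
    r"^(?:Fig|Table).*\n"                        # the caption line itself
    r"(?:(?:(?!Fig|Table).*\n)*?Source:.*\n)?",  # optional body lines, then the Source line
    re.MULTILINE,
)


def strip_captions(text: str) -> str:
    """Remove line-anchored captions, then trim leading and trailing whitespace."""
    return ANCHORED_CAPTION_RE.sub("", text).strip()


print(strip_captions("See Table 2 for details\nTable: x\nSource: y\ndone\n"))
```

Because each piece consumes whole lines, the lookahead only needs to be tested once per line rather than once per character. The optional group gives the same fallback as the alternation in the original pattern: when a new caption starts before any Source: line is found, only the caption line is removed.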
