June 25, 2023 22:31:54


How to remove subsets from a string in Python

Question


I have the following type of text:

dbt とは何をするツールなのか？2022.02.09data build tool|dbt|Tech|こんにちは、ソフトウェアエンジニアの冨田です。

I want to remove everything from each YYYY.MM.DD date through 'Tech|', inclusive.

So I want the final string to look like this:

dbt とは何をするツールなのか？こんにちは、ソフトウェアエンジニアの冨田です。

I made the following code but it fails to remove 'Tech|':

text = re.sub(r'\d{4}\.\d{2}\.\d{2}(?=.*Tech|)', '', text)

I would appreciate your kind suggestions.

Answer 1

Score: 1


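The lookahead (?=...) only asserts that text follows; it consumes nothing, so only the date is removed. Match Tech| as part of the pattern instead, escape the pipe so it is matched literally, and use a non-greedy .*? so the match stops at the first occurrence of Tech|.

The word boundary \b prevents a partial-word match, such as Tech at the end of a longer word.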

import re

s = "dbt とは何をするツールなのか？2022.02.09data build tool|dbt|Tech|こんにちは、ソフトウェアエンジニアの冨田です。"
pattern = r"\d{4}\.\d{2}\.\d{2}.*?\bTech\|"
print(re.sub(pattern, '', s))

Output

dbt とは何をするツールなのか？こんにちは、ソフトウェアエンジニアの冨田です。
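
To see why the attempt in the question left Tech| behind, here is a minimal sketch comparing the two patterns on the sample text. The variable names are only illustrative. A lookahead checks what follows without consuming it, and the unescaped | adds an empty alternative, so the lookahead succeeds even when Tech never appears.

```python
import re

s = "dbt とは何をするツールなのか？2022.02.09data build tool|dbt|Tech|こんにちは、ソフトウェアエンジニアの冨田です。"

# Pattern from the question: the lookahead is zero-width,
# so only the date itself is removed.
lookahead_result = re.sub(r"\d{4}\.\d{2}\.\d{2}(?=.*Tech|)", "", s)
print(lookahead_result)

# The unescaped | gives the lookahead an empty alternative,
# so the date is removed even when "Tech" is absent.
no_tech_result = re.sub(r"\d{4}\.\d{2}\.\d{2}(?=.*Tech|)", "", "2022.02.09abc")
print(no_tech_result)  # abc

# Consuming pattern from this answer: everything through Tech| is removed.
consumed_result = re.sub(r"\d{4}\.\d{2}\.\d{2}.*?\bTech\|", "", s)
print(consumed_result)
```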


Answer 2

Score: 1


Tech| is being used as a delimiter, so the match covers everything between the date and that delimiter. This is risky if Tech| is missing or if the literal could change. Any imperfect or missing delimiter causes the match to run through the Unicode text that follows until it reaches the next delimiter, wiping that text out as well.

If this is a one-off, that is no problem. Otherwise, the run of ASCII characters following the date appears to be a better delimiter. An alternative is a pattern that removes the date together with the ASCII text that follows it:

\d+\.\d+\.\d+(?:\s*[\x21-\x7e]+)*\s?

Code:

text = re.sub(r'\d+\.\d+\.\d+(?:\s*[\x21-\x7e]+)*\s?', '', text)
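
The overshoot risk can be illustrated with a made-up input in which the first delimiter is misspelled (Tec| instead of Tech|). The sample string and variable names below are assumptions for illustration and do not come from the question.

```python
import re

# Hypothetical input: two dated segments; the first delimiter is misspelled.
s = "あ2022.02.09data|Tec|い。う2022.03.01sql|Tech|え。"

# Delimiter-based pattern: runs past the broken delimiter to the next Tech|,
# removing the Japanese text in between as well.
by_delimiter = re.sub(r"\d{4}\.\d{2}\.\d{2}.*?\bTech\|", "", s)
print(by_delimiter)  # あえ。

# ASCII-run pattern: stops at the first non-ASCII character after each date.
by_ascii_run = re.sub(r"\d+\.\d+\.\d+(?:\s*[\x21-\x7e]+)*\s?", "", s)
print(by_ascii_run)  # あい。うえ。
```

The trade-off is that the ASCII-run pattern removes any ASCII text directly after a date, whether or not it ends in Tech|.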

You could also keep the Tech| delimiter but replace .*? with a subpattern that matches only ASCII characters and whitespace, so the match does not overshoot if the text changes:

\d+\.\d+\.\d+(?:\s*[\x21-\x7e]+)*?\s*Tech\|
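
A brief sketch of how this bounded variant behaves, using the sample text from the question plus a hypothetical string whose first delimiter is misspelled. Because only ASCII and whitespace may sit between the date and Tech|, a date with no reachable Tech| is left untouched rather than matched through later text.

```python
import re

# Bounded variant: only ASCII and whitespace allowed between the date and Tech|.
bounded = re.compile(r"\d+\.\d+\.\d+(?:\s*[\x21-\x7e]+)*?\s*Tech\|")

sample = "dbt とは何をするツールなのか？2022.02.09data build tool|dbt|Tech|こんにちは、ソフトウェアエンジニアの冨田です。"
cleaned = bounded.sub("", sample)
print(cleaned)

# Hypothetical input: the first delimiter is misspelled (Tec| instead of Tech|).
broken = "あ2022.02.09data|Tec|い。う2022.03.01sql|Tech|え。"
partly_cleaned = bounded.sub("", broken)
print(partly_cleaned)  # あ2022.02.09data|Tec|い。うえ。
```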
