2023-03-07 18:02:20

How to remove possible suffix repetitions from a str column?

Question

Consider the following dataframe, where the suffix in a str column might be repeating itself:

    Book
0   Book1.pdf
1   Book2.pdf.pdf
2   Book3.epub
3   Book4.mobi.mobi
4   Book5.epub.epub

Desired output (suffixes removed where needed):

    Book
0   Book1.pdf
1   Book2.pdf
2   Book3.epub
3   Book4.mobi
4   Book5.epub

I have tried splitting on the . character and then counting occurrences of the last item to check whether there is duplication.

I have used file paths only to illustrate my point! The contents of the column could be something different than paths!

Answer 1

Score: 4

Use a regex with a capturing group and a backreference, together with str.replace:

df['Book'] = df['Book'].str.replace(r'(\.[^.]+)\1$', r'\1', regex=True)

# or, with a lookahead so that only the first copy is matched and removed
df['Book'] = df['Book'].str.replace(r'(\.[^.]+)(?=\1$)', '', regex=True)

Output:

         Book
0   Book1.pdf
1   Book2.pdf
2  Book3.epub
3  Book4.mobi
4  Book5.epub
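
For reference, here is a self-contained sketch that rebuilds the sample frame from the question and applies the backreference pattern. The extra row Book6.pdf.pdf.pdf and the `\1+` variant are assumptions added for illustration and are not part of the original question. They show that `\1$` collapses only one repetition, while `\1+$` collapses any number of trailing copies.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row with a triple suffix.
df = pd.DataFrame({"Book": [
    "Book1.pdf",
    "Book2.pdf.pdf",
    "Book3.epub",
    "Book4.mobi.mobi",
    "Book5.epub.epub",
    "Book6.pdf.pdf.pdf",  # hypothetical, not in the original question
]})

# (\.[^.]+) captures a dot followed by non-dot characters;
# \1$ requires exactly one more copy of that capture at the end of the string.
once = df["Book"].str.replace(r"(\.[^.]+)\1$", r"\1", regex=True)

# \1+$ allows one or more further copies, so longer runs collapse too.
many = df["Book"].str.replace(r"(\.[^.]+)\1+$", r"\1", regex=True)

print(pd.DataFrame({"original": df["Book"], "once": once, "many": many}))
```

If the column can never contain more than two copies of the suffix, the two patterns behave the same.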

Generalization

If you want something generic that doesn't depend on the . character:

df['Book'] = df['Book'].str.replace(r'(.+)\1$', r'\1', regex=True)
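
One thing to keep in mind with a fully generic pattern is that it treats any repeated trailing substring as a duplicate, including a single doubled character. The sketch below uses made-up sample values (not from the question) to show this, and then tries a possible guard, `(.{2,})\1$`, which only considers repeats of at least two characters. The minimum length of 2 is an arbitrary assumption and would need tuning for real data.

```python
import pandas as pd

# Hypothetical values chosen to illustrate the behaviour of the generic pattern.
s = pd.Series(["Book2.pdf.pdf", "v1_final_final", "class.css"])

# Generic form: any trailing substring that is immediately repeated.
generic = s.str.replace(r"(.+)\1$", r"\1", regex=True)

# Possible guard (an assumption, not from the original answer):
# only collapse repeats whose unit is at least two characters long.
guarded = s.str.replace(r"(.{2,})\1$", r"\1", regex=True)

print(generic.tolist())
print(guarded.tolist())
```

In this sample the generic pattern shortens class.css to class.cs because of the trailing "ss", while the guarded pattern leaves it unchanged.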
