How to remove possible suffix repetitions from a str column?

Question

Consider the following dataframe, where the suffix in a str column might be repeating itself:

    Book
0   Book1.pdf
1   Book2.pdf.pdf
2   Book3.epub
3   Book4.mobi.mobi
4   Book5.epub.epub

Desired output (repeated suffixes removed where needed):

    Book
0   Book1.pdf
1   Book2.pdf
2   Book3.epub
3   Book4.mobi
4   Book5.epub

I have tried splitting on the . character and then counting occurrences of the last item to check whether it is duplicated.

I have used file paths only to illustrate my point! The contents of the column could be something other than paths!
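
For reference, the split-and-compare idea described above could look roughly like the sketch below. The function name `drop_repeated_suffix` and the `sep` parameter are made up for illustration; it assumes the suffix is whatever follows the last separator, and it leaves values with no stem before the repeated piece untouched.

```python
def drop_repeated_suffix(value, sep="."):
    """Drop trailing pieces that merely repeat the piece before them.

    Hypothetical helper, not part of pandas.
    """
    parts = value.split(sep)
    # Keep at least a stem plus one suffix; stop once the last two pieces differ.
    while len(parts) > 2 and parts[-1] and parts[-1] == parts[-2]:
        parts.pop()
    return sep.join(parts)


books = ["Book1.pdf", "Book2.pdf.pdf", "Book3.epub", "Book4.mobi.mobi", "Book5.epub.epub"]
cleaned = [drop_repeated_suffix(b) for b in books]
print(cleaned)
```

On a dataframe this could be applied with `df['Book'].map(drop_repeated_suffix)`.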

Answer 1

Score: 4

Use a regex with a capturing group and a backreference, together with str.replace:

df['Book'] = df['Book'].str.replace(r'(\.[^.]+)\1$', r'\1', regex=True)

# or, with a lookahead, remove only the first of the two copies
df['Book'] = df['Book'].str.replace(r'(\.[^.]+)(?=\1$)', '', regex=True)

Output:

         Book
0   Book1.pdf
1   Book2.pdf
2  Book3.epub
3  Book4.mobi
4  Book5.epub
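
To see what the backreference does without a dataframe, here is a small sketch using Python's `re` module, whose pattern syntax `str.replace(..., regex=True)` relies on. The variable names are illustrative, and the third pattern (with `\1+`) is an extra variant for values where the suffix appears more than twice, which the question does not show.

```python
import re

books = ["Book1.pdf", "Book2.pdf.pdf", "Book3.epub", "Book4.mobi.mobi", "Book5.epub.epub"]

# (\.[^.]+) captures a dot plus everything up to the next dot;
# \1$ requires that exact captured text to occur again at the end of the string.
BACKREF = re.compile(r"(\.[^.]+)\1$")

# Lookahead variant: match only the first copy, provided a second copy ends the string.
LOOKAHEAD = re.compile(r"(\.[^.]+)(?=\1$)")

# Assumed extension: \1+ also swallows three or more consecutive copies.
MULTI = re.compile(r"(\.[^.]+)\1+$")

cleaned_backref = [BACKREF.sub(r"\1", b) for b in books]
cleaned_lookahead = [LOOKAHEAD.sub("", b) for b in books]
triple = MULTI.sub(r"\1", "Book6.pdf.pdf.pdf")

print(cleaned_backref)
print(cleaned_lookahead)
print(triple)
```

Note that the two-copy patterns remove only one repetition per call, so `Book6.pdf.pdf.pdf` would become `Book6.pdf.pdf` with them.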

Generalization

If you want something generic that doesn't depend on the . character:

df['Book'] = df['Book'].str.replace(r'(.+)\1$', r'\1', regex=True)
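
One thing to keep in mind with the generic pattern: it collapses any trailing substring that is immediately repeated, even a single character. The sketch below uses plain `re` to show this; the helper name `collapse_trailing_repeat` and the sample strings are made up for illustration.

```python
import re

# Any non-empty text followed immediately by itself at the end of the string.
GENERIC = re.compile(r"(.+)\1$")


def collapse_trailing_repeat(value):
    """Hypothetical helper: replace a doubled ending with a single copy."""
    return GENERIC.sub(r"\1", value)


samples = ["Book2.pdf.pdf", "Book4-v2-v2", "Book11", "coffee"]
results = {s: collapse_trailing_repeat(s) for s in samples}

for original, cleaned in results.items():
    print(f"{original!r} -> {cleaned!r}")
```

Here `Book11` loses its second `1` and `coffee` loses an `e`, so the generic version is only safe when a doubled ending can never be legitimate data in the column.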

huangapple
  • Posted on March 7, 2023, 18:02:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75660485.html