How to remove possible suffix repetitions from a str column?
Question
Consider the following dataframe, where the suffix in a str column might be repeated:
Book
0 Book1.pdf
1 Book2.pdf.pdf
2 Book3.epub
3 Book4.mobi.mobi
4 Book5.epub.epub
Desired output (repeated suffixes removed where needed):
Book
0 Book1.pdf
1 Book2.pdf
2 Book3.epub
3 Book4.mobi
4 Book5.epub
I have tried splitting on the . character and then counting occurrences of the last item to check whether it is duplicated.
I have used file paths only to illustrate my point! The contents of the column could be something different than paths!
Answer 1
Score: 4
Use a regex with a capturing group and a backreference, together with str.replace:
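# match ".suffix" immediately followed by the same ".suffix" at the end, keep one copy
df['Book'] = df['Book'].str.replace(r'(\.[^.]+)\1$', r'\1', regex=True)
# or: drop a ".suffix" only when the same text follows it at the end (lookahead)
df['Book'] = df['Book'].str.replace(r'(\.[^.]+)(?=\1$)', '', regex=True)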
Output:
Book
0 Book1.pdf
1 Book2.pdf
2 Book3.epub
3 Book4.mobi
4 Book5.epub
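For reference, here is a self-contained sketch of the backreference approach, with the pattern written out in full as (\.[^.]+)\1$. The DataFrame construction is reconstructed from the printout in the question. The \1+ variant, which collapses more than one repetition, and the Book6 sample value are assumptions added for illustration; they are not part of the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question's printout
df = pd.DataFrame({'Book': ['Book1.pdf', 'Book2.pdf.pdf', 'Book3.epub',
                            'Book4.mobi.mobi', 'Book5.epub.epub']})

# (\.[^.]+) captures a ".suffix"; \1 requires the identical text again,
# and $ anchors the pair at the end of the string. One copy is kept.
deduped = df['Book'].str.replace(r'(\.[^.]+)\1$', r'\1', regex=True)
print(deduped.tolist())

# Assumption (not in the original answer): \1+ collapses any number of
# trailing repetitions, where \1 alone would remove only one.
triple = pd.Series(['Book6.pdf.pdf.pdf'])  # hypothetical extra value
collapsed = triple.str.replace(r'(\.[^.]+)\1+$', r'\1', regex=True)
print(collapsed.tolist())
```

Rows without a repeated suffix do not match the pattern and are returned unchanged.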
Generalization
If you want something generic that doesn't depend on the . character:
df['Book'] = df['Book'].str.replace(r'(.+)\1$', r'\1', regex=True)
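A short sketch of how a delimiter-free pattern of the form (.+)\1$ behaves, using made-up sample values (assumptions, not data from the question). Because it collapses any text repeated at the end of the string, it can also shorten values that merely end in a doubled character, so it is probably only safe when such values cannot occur in the column.

```python
import pandas as pd

# Hypothetical values chosen to show both the intended effect and a pitfall
s = pd.Series(['Book2.pdf.pdf', 'report_v2_v2', 'Book1.pdf', 'Access'])

# (.+)\1$ matches any non-empty text that appears twice in a row at the end
generic = s.str.replace(r'(.+)\1$', r'\1', regex=True)
print(generic.tolist())
# 'Access' becomes 'Acces': the trailing "ss" counts as a repeated "s"
```

If the suffix always starts with a known delimiter, the delimiter-specific pattern avoids this kind of false positive.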