Find similarities between strings within a DataFrame column
Question
I have similar client names that I want to group together, for example:
| A header |
|---|
| schwabstsoct2022 |
| schwabsts |
| schwabregionaloct2022 |
| schwabregional2 |
| flagstar-2022 |
| flagstar-2021 |
Some names contain a delimiter character I can split on to classify them, but others don't. Is there a similarity score between rows that I can use to classify them quickly, with the result written to another column?
Thanks!
Answer 1
Score: 2
I hope I've understood your question correctly. To compute a similarity score between strings, you can use the built-in difflib module:
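from difflib import SequenceMatcher

def similar(a, b):
    # Ratio in [0, 1]; 1.0 means the two strings are identical
    return SequenceMatcher(None, a, b).ratio()

# Add one column per name, holding its similarity to the name in each row
for s1 in df['A header']:
    df[s1] = [similar(s1, s2) for s2 in df['A header']]

print(df)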
Prints:
                A header  schwabstsoct2022  schwabsts  schwabregionaloct2022  schwabregional2  flagstar-2022  flagstar-2021
0       schwabstsoct2022          1.000000   0.720000               0.702703         0.516129       0.482759       0.413793
1              schwabsts          0.720000   1.000000               0.466667         0.500000       0.272727       0.272727
2  schwabregionaloct2022          0.702703   0.466667               1.000000         0.833333       0.352941       0.294118
3        schwabregional2          0.516129   0.500000               0.833333         1.000000       0.142857       0.142857
4          flagstar-2022          0.482759   0.272727               0.411765         0.285714       1.000000       0.923077
5          flagstar-2021          0.413793   0.272727               0.352941         0.285714       0.923077       1.000000
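
The question also asks for the classification in another column. One possible way to get there from the pairwise scores, as a rough sketch rather than a definitive method: greedily assign each name to the first group whose first member is similar enough. The helper name group_by_similarity, the "group" column and the 0.5 threshold are all assumptions made for illustration; none of them come from difflib or pandas, and the threshold would need tuning on real data.

```python
from difflib import SequenceMatcher

import pandas as pd


def similar(a, b):
    # Same helper as in the answer: ratio in [0, 1]
    return SequenceMatcher(None, a, b).ratio()


def group_by_similarity(names, threshold=0.5):
    """Greedy grouping (hypothetical helper, not part of difflib or pandas).

    Each name joins the first group whose representative (the first name
    seen in that group) scores at least `threshold`; otherwise it starts
    a new group. Returns one representative label per input name.
    """
    representatives = []
    labels = []
    for name in names:
        for rep in representatives:
            if similar(rep, name) >= threshold:
                labels.append(rep)
                break
        else:
            # No existing group is close enough: start a new one
            representatives.append(name)
            labels.append(name)
    return labels


# Sample data from the question
df = pd.DataFrame({"A header": [
    "schwabstsoct2022",
    "schwabsts",
    "schwabregionaloct2022",
    "schwabregional2",
    "flagstar-2022",
    "flagstar-2021",
]})

# threshold=0.5 is an assumed cut-off, not a recommended value
df["group"] = group_by_similarity(df["A header"], threshold=0.5)
print(df)
```

On the six sample names, a threshold of 0.5 puts all the schwab names in one group and the flagstar names in another. Raising it to about 0.71 also separates schwabsts from schwabregional, but only just: the relevant scores are 0.72 and 0.70, so the result is sensitive to the cut-off. The grouping also depends on row order, because each group is represented by the first name it sees.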