Find similarities between strings within a DataFrame column

Question

I have similar names for clients that I want to group into one, for example:

A header
schwabstsoct2022
schwabsts
schwabregionaloct2022
schwabregional2
flagstar-2022
flagstar-2021

Some names contain a character I can use to split the string and then classify it, but others don't. Is there a similarity score between rows that I can use to classify them quickly, with the output in another column?

Thanks!

Answer 1

Score: 2

I hope I've understood your question correctly. To compute a similarity score between strings, you can use the built-in difflib module:

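import pandas as pd
from difflib import SequenceMatcher

# Sample data from the question
df = pd.DataFrame({'A header': ['schwabstsoct2022', 'schwabsts', 'schwabregionaloct2022',
                                'schwabregional2', 'flagstar-2022', 'flagstar-2021']})

def similar(a, b):
    # Ratio in [0, 1]; 1.0 means the two strings are identical
    return SequenceMatcher(None, a, b).ratio()

# Add one column per name holding its similarity to the name in each row
for s1 in df['A header']:
    df[s1] = [similar(s1, s2) for s2 in df['A header']]

print(df)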

Prints:

                A header  schwabstsoct2022  schwabsts  schwabregionaloct2022  schwabregional2  flagstar-2022  flagstar-2021
0       schwabstsoct2022          1.000000   0.720000               0.702703         0.516129       0.482759       0.413793
1              schwabsts          0.720000   1.000000               0.466667         0.500000       0.272727       0.272727
2  schwabregionaloct2022          0.702703   0.466667               1.000000         0.833333       0.352941       0.294118
3        schwabregional2          0.516129   0.500000               0.833333         1.000000       0.142857       0.142857
4          flagstar-2022          0.482759   0.272727               0.411765         0.285714       1.000000       0.923077
5          flagstar-2021          0.413793   0.272727               0.352941         0.285714       0.923077       1.000000
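
The matrix above gives the scores, but the question also asks for a classification in a separate column. One possible way to get there, which is not part of the original answer, is to treat every pair of names whose score reaches a cutoff as "linked" and put all linked names into the same group (single-linkage grouping). The sketch below assumes a cutoff of 0.6, which is chosen only because it separates the sample names cleanly; the function name group_names and the threshold parameter are hypothetical. It compares every pair, so it scales quadratically with the number of names.

```python
from difflib import SequenceMatcher


def similar(a, b):
    # Same helper as in the answer: similarity ratio in [0, 1]
    return SequenceMatcher(None, a, b).ratio()


def group_names(names, threshold=0.6):
    """Label each name with the first member of its group.

    Two names are linked when their ratio is >= threshold (an assumed
    cutoff); groups are the connected components of those links.
    """
    parent = list(range(len(names)))

    def find(i):
        # Follow parent pointers up to the root of i's group
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similar(names[i], names[j]) >= threshold:
                ri, rj = find(i), find(j)
                # Keep the smaller index as root so the label is the
                # earliest name in the group
                parent[max(ri, rj)] = min(ri, rj)

    return [names[find(i)] for i in range(len(names))]


names = ['schwabstsoct2022', 'schwabsts', 'schwabregionaloct2022',
         'schwabregional2', 'flagstar-2022', 'flagstar-2021']
for name, label in zip(names, group_names(names)):
    print(name, '->', label)
```

With the DataFrame from the answer, the labels could be stored with df['group'] = group_names(df['A header'].tolist()). For the sample names and this cutoff, the four schwab names end up in one group and the two flagstar names in another, which is consistent with the scores printed above. Note that SequenceMatcher.ratio() can return slightly different values depending on argument order (visible in the matrix, for example 0.352941 versus 0.411765), so a cutoff close to real scores may give order-dependent groups.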

huangapple
  • Posted on March 31, 2023, 22:12:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75899539.html