Question

I have similar client names that I want to group together, for example:

A header
schwabstsoct2022
schwabsts
schwabregionaloct2022
schwabregional2
flagstar-2022
flagstar-2021

Some names contain a character I can split on and then classify, but others don't. Is there a similarity score between rows that I can use to classify them quickly, with the output placed in another column?

Thanks!

Answer 1

Score: 2

I hope I've understood your question correctly. To compute a similarity score, you can use the built-in difflib module:

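import pandas as pd
from difflib import SequenceMatcher

# Sample data from the question
df = pd.DataFrame({'A header': ['schwabstsoct2022', 'schwabsts', 'schwabregionaloct2022',
                                'schwabregional2', 'flagstar-2022', 'flagstar-2021']})

def similar(a, b):
    # Ratio in [0, 1]; 1.0 means the two strings are identical
    return SequenceMatcher(None, a, b).ratio()

# Add one column per name, holding its similarity to the name in every row
for s1 in df['A header']:
    df[s1] = [similar(s1, s2) for s2 in df['A header']]

print(df)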

Prints:

                A header  schwabstsoct2022  schwabsts  schwabregionaloct2022  schwabregional2  flagstar-2022  flagstar-2021
0       schwabstsoct2022          1.000000   0.720000               0.702703         0.516129       0.482759       0.413793
1              schwabsts          0.720000   1.000000               0.466667         0.500000       0.272727       0.272727
2  schwabregionaloct2022          0.702703   0.466667               1.000000         0.833333       0.352941       0.294118
3        schwabregional2          0.516129   0.500000               0.833333         1.000000       0.142857       0.142857
4          flagstar-2022          0.482759   0.272727               0.411765         0.285714       1.000000       0.923077
5          flagstar-2021          0.413793   0.272727               0.352941         0.285714       0.923077       1.000000
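
The matrix above gives pairwise scores, but the question also asks for a single group label in another column. One possible way to get there, as a rough sketch rather than part of the original answer, is a greedy pass that puts each name into the first group whose first member scores above a cutoff. The function name `group_similar` and the `threshold` value of 0.8 are assumptions made for illustration; the cutoff in particular would need tuning on real data.

```python
from difflib import SequenceMatcher


def similar(a, b):
    # Same score as in the answer: 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()


def group_similar(names, threshold=0.8):
    """Greedily assign each name to the first group whose representative
    (the group's first member) scores at least `threshold` against it.

    `group_similar` and `threshold` are hypothetical names for this sketch.
    """
    representatives = []  # first name seen in each group
    labels = []           # group number for each input name, in order
    for name in names:
        for group_id, rep in enumerate(representatives):
            if similar(rep, name) >= threshold:
                labels.append(group_id)
                break
        else:
            # No existing group is close enough, so start a new one.
            representatives.append(name)
            labels.append(len(representatives) - 1)
    return labels


names = [
    "schwabstsoct2022", "schwabsts", "schwabregionaloct2022",
    "schwabregional2", "flagstar-2022", "flagstar-2021",
]
labels = group_similar(names)
print(labels)
```

With the sample names and a cutoff of 0.8 this should print [0, 1, 2, 2, 3, 3]: the two schwabregional names share a group, as do the two flagstar names, while schwabstsoct2022 and schwabsts (score 0.72) stay apart. On a DataFrame, the labels could be stored with something like df['group'] = group_similar(df['A header']), where 'group' is an assumed column name. A lower cutoff merges more names, and the result depends on row order because only the first member of each group is compared.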
