2023-05-11 03:37:18

Python - Fuzzy matching result in new column for category based on ratio over 80

Question

I would like to scan a folder, pick up all the files ending with '.txt', and then build a data frame with a new column that puts similar file names into the same category (fuzzy ratio score >= 80).

import os
path = '../../../files'
text_files = [f for f in os.listdir(path) if f.endswith('.txt')]
text_files

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

s1 = "programmi.txt"
s2 = "programmi-2.txt"
fuzz.ratio(s1, s2)

The result I expect to see is like below:

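Answer 1

Score: 1

Here's a solution that uses two nested for loops to compare each text with all the others and assigns categories based on the fuzz ratio.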

import pandas as pd
from fuzzywuzzy import fuzz
txt_list = [
    "programmi.txt",
    "readl-001.txt",
    "dict_class124.txt",
    "readl-002.txt",
    "programmi-2.txt",
    "programmi-re.txt",
    "readl-003.txt",
    "dict_class125.txt",
    "dict_class1264.txt",
    "hello world"
]
list_categorised_texts = []
txt_category = []
category_index = 0
threshold = 80
# two for loops since we need to compare each text to all the others
for txt_1 in txt_list:
    if txt_1 in list_categorised_texts:  # already categorised: reuse its category for any new matches
        current_category = txt_category[list_categorised_texts.index(txt_1)]
    else:  # not yet categorised: start a new category
        category_index += 1
        current_category = category_index
        list_categorised_texts.append(txt_1)
        txt_category.append(current_category)
    for txt_2 in txt_list:
        if txt_1 == txt_2:  # we don't want to compare a text with itself
            continue
        elif txt_2 in list_categorised_texts:  # skip already classified texts
            continue
        else:  # if txt_2 is similar, add to the list of classified texts with the corresponding category
            similarity = fuzz.ratio(txt_1, txt_2)
            if similarity >= threshold:
                list_categorised_texts.append(txt_2)
                txt_category.append(current_category)
data = {
    'texts': list_categorised_texts,
    'category': txt_category
}
df = pd.DataFrame(data)
print(df.to_markdown())  # to_markdown() requires the optional 'tabulate' package

Result:

|    | texts              |   category |
|---:|:-------------------|-----------:|
|  0 | programmi.txt      |          1 |
|  1 | programmi-2.txt    |          1 |
|  2 | programmi-re.txt   |          1 |
|  3 | readl-001.txt      |          2 |
|  4 | readl-002.txt      |          2 |
|  5 | readl-003.txt      |          2 |
|  6 | dict_class124.txt  |          3 |
|  7 | dict_class125.txt  |          3 |
|  8 | dict_class1264.txt |          3 |
|  9 | hello world        |          4 |
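
To tie this back to the folder scan in the question, the grouping can be wrapped in a small function. The following is only a sketch under stated assumptions: it uses the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio (scores may differ slightly from fuzzywuzzy depending on its backend), and it is a simpler seed-based variant of the loops above, where each name is compared only with the first member of each existing category. The names similarity, categorise and categorise_folder are hypothetical.

```python
import os
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (assumed stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def categorise(names, threshold=80):
    """Greedily assign each name to the first category whose seed it matches."""
    seeds = []  # first member of each category, in order of appearance
    rows = []   # (name, category) pairs, in input order
    for name in names:
        for category, seed in enumerate(seeds, start=1):
            if similarity(seed, name) >= threshold:
                rows.append((name, category))
                break
        else:
            # no existing category is close enough, so start a new one
            seeds.append(name)
            rows.append((name, len(seeds)))
    return rows


def categorise_folder(path, threshold=80):
    """Categorise all '.txt' files found directly in `path`."""
    text_files = sorted(f for f in os.listdir(path) if f.endswith(".txt"))
    return categorise(text_files, threshold)
```

The returned rows keep the input order rather than being grouped by category. They can be turned into the data frame from the question with pd.DataFrame(rows, columns=['texts', 'category']).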

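Warning:

Please note that this approach is order-dependent: in the example below, comparing dict_cl.txt with the other names yields only one match, while comparing dict_class12.txt with all the other names yields three. For your use case, where we assume the groups are very distinct from each other, this should not be a problem. However, the example shows that pairwise comparisons get tricky in more complex situations.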

print(fuzz.ratio('dict_cl.txt', 'dict_class125.txt'))  # 79 -> not same category
print(fuzz.ratio('dict_cl.txt', 'dict_class1264.txt'))  # 76 -> not same category
print(fuzz.ratio('dict_cl.txt', 'dict_class12.txt'))  # 81 -> same category
print("###")
print(fuzz.ratio('dict_class12.txt', 'dict_cl.txt'))  # 81 -> same category
print(fuzz.ratio('dict_class12.txt', 'dict_class125.txt'))  # 97 -> same category
print(fuzz.ratio('dict_class12.txt', 'dict_class1264.txt'))  # 94 -> same category
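
One possible way to remove the order dependency is to treat every pair that scores at or above the threshold as a link and to group names that are connected through any chain of links. The sketch below is an assumption-laden illustration, not part of the original answer: it again uses difflib.SequenceMatcher as a stand-in for fuzz.ratio, and similarity and group_connected are hypothetical names. Be aware that this chains matches: dict_cl.txt lands in the same group as dict_class125.txt via dict_class12.txt, even though their direct score is below 80.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0-100 similarity score (assumed stand-in for fuzz.ratio)."""
    # SequenceMatcher results can depend on argument order,
    # so always compare the pair in sorted order
    a, b = sorted((a, b))
    return round(100 * SequenceMatcher(None, a, b).ratio())


def group_connected(names, threshold=80):
    """Group names linked by any chain of pairs scoring >= threshold."""
    parent = {name: name for name in names}

    def find(name):
        # follow parent links to the group representative, shortening the path
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # link every pair that is similar enough
    for a, b in combinations(names, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    # a set of frozensets, so the result does not depend on input order
    return {frozenset(members) for members in groups.values()}
```

Because every pair is checked, the grouping is the same for any ordering of the input list, at the cost of n*(n-1)/2 comparisons.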
