June 8, 2023 05:11:34

Aggregate rows by their text pattern using Python

Question

I am working on an interesting text mining (possibly text pattern recognition) problem. The dataset has two columns:

column1 (string)                  column2 (integer)
'ABC'                             3
'DEF'                             4
'abc'                             1
'abc:very specific message'       1
...

The first column is the input message and the second is its count. Many of the input messages are in fact the same message and differ only in their level of detail. For example, the third and fourth rows deliver the same message, but the fourth row carries more specific information than the third.

How can I create an algorithm to aggregate rows with high similarity? In the end, the aggregated data should look something like this:

column1 (string)                  column2 (integer)
'ABC'                             3
'DEF'                             4
'abc'                             2

Answer 1

Score: 1

I think I am heading in the right direction. Here is a simple implementation:

from difflib import SequenceMatcher

def similar(a, b):
    # Similarity ratio in [0, 1]; 1.0 means the two strings are identical.
    return SequenceMatcher(None, a, b).ratio()

text1 = "Yesterday:I had a breakfast:with my boyfriend"
text2 = "Yesterday:I had a breakfast:with my boyfriend:it is coffee and bagel"
print(f"{similar(text1, text2):.2f}")

The output is 0.80 (rounded).

We can use the similarity ratio to decide whether any two rows are similar enough to be aggregated. Thanks for the discussion.
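
One possible way to turn this idea into an aggregation step is sketched below. It is only an illustration under some assumptions, not a definitive solution. `ratio()` is computed as 2*M/T (matched characters over the combined length), so it penalizes length differences: for 'abc' and 'abc:very specific message' it is only about 0.21. For that reason the sketch scores pairs with a hypothetical helper, `overlap`, which measures how much of the shorter string is covered by matching blocks. The names `overlap` and `aggregate`, the threshold of 0.8, and the greedy "first matching group wins" strategy are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def overlap(a, b):
    """Share of the shorter string covered by matching blocks (0.0 to 1.0)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    return sum(block.size for block in blocks) / shorter


def aggregate(rows, score=overlap, threshold=0.8):
    """Greedily merge each row into the first existing group it resembles."""
    groups = []  # each entry is [representative message, total count]
    # Visit short messages first so the least specific text becomes
    # the representative of its group.
    for message, count in sorted(rows, key=lambda row: len(row[0])):
        for group in groups:
            if score(group[0], message) >= threshold:
                group[1] += count
                break
        else:
            # No existing group was similar enough: start a new one.
            groups.append([message, count])
    return [tuple(group) for group in groups]


rows = [("ABC", 3), ("DEF", 4), ("abc", 1), ("abc:very specific message", 1)]
print(aggregate(rows))
# [('ABC', 3), ('DEF', 4), ('abc', 2)]
```

The comparison is case-sensitive, which keeps 'ABC' and 'abc' apart as in the expected result from the question. Very short messages can match many longer ones under this score, so the threshold and the scoring function would need tuning on real data. The loop compares each row against every existing group, so it may be slow on large datasets.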
