Aggregate rows by their text pattern using Python

Question

I am working on an interesting text mining (maybe text pattern recognition) problem. The dataset has two columns as follows:

column1 (string)            column2 (integer)
'ABC'                             3
'DEF'                             4
'abc'                             1
'abc:very specific message'       1
...

The first column is the input message and the second is its count. Many of the input messages are in fact the same message and differ only in their level of detail. For example, the third and fourth rows deliver the same message, but the fourth row carries more specific information than the third.

How can I create an algorithm to aggregate rows that are highly similar? In the end, the aggregated data should look something like this:

column1 (string)            column2 (integer)
'ABC'                             3
'DEF'                             4
'abc'                             2

Answer 1

Score: 1

I think I am on the right track. Here is a simple implementation:

from difflib import SequenceMatcher

def similar(a, b):
    # Ratio in [0, 1]; 1.0 means the two strings are identical.
    return SequenceMatcher(None, a, b).ratio()

text1 = "Yesterday:I had a breakfast:with my boyfriend"
text2 = "Yesterday:I had a breakfast:with my boyfriend:it is coffee and bagel"
print(f"{similar(text1, text2):.2f}")

The output is: 0.80 (rounded)

We can use the similarity ratio to decide whether any two rows are similar enough to be merged. Thanks for the discussion.
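To connect the ratio back to the original question, here is a minimal sketch of how it could drive the aggregation itself. The `aggregate` helper, the greedy grouping strategy, the `0.75` threshold, and the sample counts are all assumptions made for illustration; they are not part of the original answer and would need tuning on real data.

```python
from difflib import SequenceMatcher


def similar(a, b):
    """Similarity ratio in [0, 1], as defined in the answer."""
    return SequenceMatcher(None, a, b).ratio()


def aggregate(rows, threshold=0.75):
    """Greedily merge (message, count) rows whose messages are similar.

    Shorter messages are visited first so that the least specific message
    becomes the representative of its group. The threshold is an assumed
    value, not something given in the question.
    """
    groups = []  # each entry is [representative_message, total_count]
    for message, count in sorted(rows, key=lambda row: len(row[0])):
        for group in groups:
            if similar(group[0], message) >= threshold:
                group[1] += count  # fold this row into an existing group
                break
        else:
            groups.append([message, count])  # no match: start a new group
    return [(message, total) for message, total in groups]


# Hypothetical rows built from the strings used in the answer.
rows = [
    ("Yesterday:I had a breakfast:with my boyfriend", 1),
    ("Yesterday:I had a breakfast:with my boyfriend:it is coffee and bagel", 1),
    ("DEF", 4),
]
print(aggregate(rows))
```

With these rows the two breakfast messages end up in one group with a count of 2, while 'DEF' stays on its own. Two caveats apply. The comparison is case-sensitive, so 'ABC' and 'abc' are not merged, which matches the expected output in the question. The ratio also depends on string length: 'abc' against 'abc:very specific message' scores only about 0.21, so for short messages with long suffixes a prefix check such as `str.startswith` may be a better fit than a single ratio threshold.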

huangapple
  • Posted on 2023-06-08 05:11:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76427133.html