Aggregate rows by text pattern using Python
Question
I am working on an interesting text mining (or perhaps text pattern recognition) problem. The dataset has two columns:
column1 (string) column2 (integer)
'ABC' 3
'DEF' 4
'abc' 1
'abc:very specific message' 1
...
The first column is the input message and the second is its count. Many of the input messages are in fact the same message and differ only in their level of detail. For example, the third and fourth rows deliver the same message, but the fourth row carries more specific information than the third.
How can I write an algorithm that aggregates rows with high similarity? In the end, the aggregated data should look something like this:
column1 (string) column2 (integer)
'ABC' 3
'DEF' 4
'abc' 2
Answer 1
Score: 1
I think I am heading in the right direction. Here is a simple implementation:
from difflib import SequenceMatcher

def similar(a, b):
    # Similarity ratio in [0, 1]; 1.0 means the two strings are identical
    return SequenceMatcher(None, a, b).ratio()

text1 = "Yesterday:I had a breakfast:with my boyfriend"
text2 = "Yesterday:I had a breakfast:with my boyfriend:it is coffee and bagel"
print(round(similar(text1, text2), 2))
The output is 0.80 (rounded).
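To see where that number comes from: difflib documents the ratio as 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes it by hand for the same two strings; the variable names are only illustrative.

```python
from difflib import SequenceMatcher

text1 = "Yesterday:I had a breakfast:with my boyfriend"
text2 = "Yesterday:I had a breakfast:with my boyfriend:it is coffee and bagel"

matcher = SequenceMatcher(None, text1, text2)

# M: total size of all matching blocks (the final dummy block has size 0)
matched = sum(block.size for block in matcher.get_matching_blocks())
# T: combined length of both strings
total = len(text1) + len(text2)

manual_ratio = 2 * matched / total
print(matched, total, round(manual_ratio, 2))
```

Here all of text1 matches the start of text2, so the ratio is pulled below 1.0 only by the extra detail appended to text2.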
We can use the similarity ratio to decide whether any two rows are similar enough to be merged. Thanks for the discussion.
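One way to turn the ratio into an actual aggregation step is sketched below. This is not a definitive implementation, and several parts are my own assumptions: the helper names same_message and aggregate, the 0.75 threshold, and the extra prefix check. The prefix check is there because a very short message compared with a much longer one (such as 'abc' against 'abc:very specific message') gets a low ratio even though one is just a more detailed version of the other.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def same_message(a, b, threshold=0.75):
    # Assumed rule: a prefix relationship or a high ratio counts as a match.
    # The threshold is a placeholder and should be tuned on real data.
    shorter, longer = sorted((a, b), key=len)
    return longer.startswith(shorter) or similar(a, b) >= threshold

def aggregate(rows, threshold=0.75):
    """Greedily merge (message, count) rows into groups."""
    groups = []  # each entry is [representative message, summed count]
    # Visit shorter messages first so the least specific one represents the group
    for message, count in sorted(rows, key=lambda row: len(row[0])):
        for group in groups:
            if same_message(group[0], message, threshold):
                group[1] += count
                break
        else:
            # No existing group matched, so start a new one
            groups.append([message, count])
    return [tuple(group) for group in groups]

rows = [("ABC", 3), ("DEF", 4), ("abc", 1), ("abc:very specific message", 1)]
print(aggregate(rows))
```

The comparison is case-sensitive, so 'ABC' and 'abc' stay separate, as in the expected output in the question. Each row is compared against every existing group, so this is quadratic in the worst case and may need a smarter approach for large datasets.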