Categorize rows per their similarity in Python
Question
I am here to look for input for a data manipulation problem related to natural language processing.
To make life easier, I am using a mock dataset posted several years ago from <https://stackoverflow.com/questions/47159996/how-to-group-text-data-based-on-document-similarity>.
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({'Questions': [
    'What are you doing?', 'What are you doing tonight?', 'What are you doing now?',
    'What is your name?', 'What is your nick name?', 'What is your full name?',
    'Shall we meet?', 'How are you doing?',
]})

def similarity_score(s1, s2):
    # Ratio in [0, 1]; 1.0 means the two strings are identical.
    return SequenceMatcher(None, s1, s2).ratio()

def similarity(x, df):
    # Score x against every question in the frame, including itself.
    sim_score = []
    for i in df['Questions']:
        sim_score.append(similarity_score(x, i))
    return sim_score

# Note: .astype(str) stores each list as its string representation.
df['similarity'] = df['Questions'].apply(lambda x: similarity(x, df)).astype(str)
print(df)
The output is as follows:
Questions \
0 What are you doing?
1 What are you doing tonight?
2 What are you doing now?
3 What is your name?
4 What is your nick name?
5 What is your full name?
6 Shall we meet?
7 How are you doing?
similarity
0 [1.0, 0.8260869565217391, 0.9047619047619048, ...
1 [0.8260869565217391, 1.0, 0.84, 0.533333333333...
2 [0.9047619047619048, 0.84, 1.0, 0.585365853658...
3 [0.6486486486486487, 0.5333333333333333, 0.585...
4 [0.5714285714285714, 0.52, 0.5217391304347826,...
5 [0.5714285714285714, 0.52, 0.5652173913043478,...
6 [0.36363636363636365, 0.34146341463414637, 0.3...
7 [0.8108108108108109, 0.6666666666666666, 0.731...
The logic is that I go through each row in the data frame and compare it to all other rows (including itself) to compute their similarity. I then store the similarity scores as a list in another column called "similarity".
Next, I want to categorize the questions in the first column. If the similarity score between two rows is greater than 0.9, those rows should be assigned to the same group. How can I achieve this?
Answer 1
Score: 1
A solution is to iterate row-wise over your similarity scores, create a binary mask based on some threshold, and then use the mask to extract only the questions that exceed the threshold.
Note that this solution presumes that the "groups" you desire are the questions themselves (i.e. for each question, you want a list of similar questions associated with it). I made up similarity scores for the rest of the array to create this minimal example.
Solution
import pandas as pd

orig_data = {
    "Questions": [
        "What are you doing?",
        "What are you doing tonight?",
        "What are you doing now?",
        "What is your name?",
        "What is your nick name?",
        "What is your full name?",
        "Shall we meet?",
        "How are you doing?",
    ],
    # Scores beyond those shown in the question are made up for this example.
    "similarity": [
        [1.0, 0.826, 0.905, 0.234, 0.544, 0.673, 0.411, 0.45],
        [0.826, 1.0, 0.84, 0.533, 0.444, 0.525, 0.641, 0.62],
        [0.905, 0.84, 1.0, 0.585, 0.861, 0.685, 0.455, 0.65],
        [0.649, 0.533, 0.585, 1.0, 0.901, 0.902, 0.642, 0.234],
        [0.571, 0.52, 0.522, 0.901, 1.0, 0.905, 0.753, 0.786],
        [0.571, 0.52, 0.565, 0.902, 0.903, 1.0, 0.123, 0.586],
        [0.364, 0.341, 0.3, 0.674, 0.584, 0.421, 1.0, 0.544],
        [0.811, 0.667, 0.731, 0.345, 0.764, 0.242, 0.55, 1.0],
    ],
}
df = pd.DataFrame(orig_data)

results = []
for idx, sim_row in enumerate(df["similarity"]):
    # True wherever the score exceeds the threshold.
    bin_mask = [score > 0.9 for score in sim_row]
    curr_q = df["Questions"][idx]
    # Keep the questions selected by the mask, excluding the current question itself.
    sim_quests = [q for q, b in zip(df["Questions"], bin_mask) if b and q != curr_q]
    results.append(sim_quests)

df["similar-questions"] = results
print(df)
Output
Questions ... similar-questions
0 What are you doing? ... [What are you doing now?]
1 What are you doing tonight? ... []
2 What are you doing now? ... [What are you doing?]
3 What is your name? ... [What is your nick name?, What is your full na...
4 What is your nick name? ... [What is your name?, What is your full name?]
5 What is your full name? ... [What is your name?, What is your nick name?]
6 Shall we meet? ... []
7 How are you doing? ... []
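If a single group label per row is wanted instead of a list of similar questions, one possible reading of the question is to treat every pair scoring above the threshold as linked and give each connected set of rows the same label. The sketch below does that with a small union-find and real SequenceMatcher scores. The function name assign_groups and the constant THRESHOLD are hypothetical names chosen for this illustration; they come from neither the question nor the answer above.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.9  # cut-off taken from the question

questions = [
    "What are you doing?",
    "What are you doing tonight?",
    "What are you doing now?",
    "What is your name?",
    "What is your nick name?",
    "What is your full name?",
    "Shall we meet?",
    "How are you doing?",
]


def assign_groups(texts, threshold=THRESHOLD):
    """Label texts so that any pair scoring above threshold shares a label."""
    parent = list(range(len(texts)))  # union-find forest, one node per text

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if SequenceMatcher(None, texts[i], texts[j]).ratio() > threshold:
                parent[find(j)] = find(i)  # merge the two components

    # Renumber the roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(texts))]


groups = assign_groups(questions)
for label, text in zip(groups, questions):
    print(label, text)
```

With a DataFrame like the one above, the labels could be attached with df["group"] = assign_groups(df["Questions"].tolist()). Note that this grouping is transitive: if A is similar to B and B is similar to C, all three share a label even when A and C score below the threshold. If that is not the intended behavior, the per-question lists from the answer above may be a better fit.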