Categorize rows per their similarity in Python

Question

I am here to look for input for a data manipulation problem related to natural language processing.

To make life easier, I am using a mock dataset posted several years ago from <https://stackoverflow.com/questions/47159996/how-to-group-text-data-based-on-document-similarity>.

import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({'Questions': ['What are you doing?', 'What are you doing tonight?', 'What are you doing now?', 'What is your name?', 'What is your nick name?', 'What is your full name?', 'Shall we meet?', 'How are you doing?']})

def similarity_score(s1, s2):
    # Ratio in [0, 1]; 1.0 means the two strings are identical
    return SequenceMatcher(None, s1, s2).ratio()

def similarity(x, df):
    # Compare one question against every question in the frame (itself included)
    sim_score = []
    for i in df['Questions']:
        sim_score.append(similarity_score(x, i))
    return sim_score

# astype(str) stores each list of scores as its string representation
df['similarity'] = df['Questions'].apply(lambda x: similarity(x, df)).astype(str)
print(df)

The output is as follows:

                     Questions                                         similarity
0          What are you doing?  [1.0, 0.8260869565217391, 0.9047619047619048, ...
1  What are you doing tonight?  [0.8260869565217391, 1.0, 0.84, 0.533333333333...
2      What are you doing now?  [0.9047619047619048, 0.84, 1.0, 0.585365853658...
3           What is your name?  [0.6486486486486487, 0.5333333333333333, 0.585...
4      What is your nick name?  [0.5714285714285714, 0.52, 0.5217391304347826,...
5      What is your full name?  [0.5714285714285714, 0.52, 0.5652173913043478,...
6               Shall we meet?  [0.36363636363636365, 0.34146341463414637, 0.3...
7           How are you doing?  [0.8108108108108109, 0.6666666666666666, 0.731...

The logic is that I go through each row in the data frame and compare it to all other rows (including itself) to compute their similarity. I then store the similarity scores as a list in another column called "similarity".
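The comparison described above can also be sketched so that the scores stay numeric, which makes thresholding easier later. Note that `.astype(str)` in the code above turns each list into a string. This is only a minimal sketch: the helper name `similarity_matrix` and the choice to return a square DataFrame are assumptions for illustration, not part of the original code.

```python
from difflib import SequenceMatcher

import pandas as pd

questions = [
    "What are you doing?",
    "What are you doing tonight?",
    "What are you doing now?",
    "What is your name?",
    "What is your nick name?",
    "What is your full name?",
    "Shall we meet?",
    "How are you doing?",
]


def similarity_matrix(texts):
    """Return a square DataFrame of SequenceMatcher ratios (hypothetical helper)."""
    rows = [
        [SequenceMatcher(None, a, b).ratio() for b in texts]
        for a in texts
    ]
    # Rows and columns are both labelled by the question text.
    return pd.DataFrame(rows, index=texts, columns=texts)


sim = similarity_matrix(questions)
# Boolean frame marking every pair whose score exceeds the threshold.
above = sim > 0.9
print(sim.round(3))
```

With floats in place, `above` can be used directly to pick out the pairs that pass the 0.9 threshold.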

Next, I want to categorize the questions in the first column. If the similarity score > 0.9, then those rows should be assigned to the same group. How can I achieve this?

Answer 1

Score: 1

A solution is to iterate row-wise over your similarity scores, create a binary mask based on some threshold, and then use the binary mask to extract only those questions that meet the threshold.

Note that this solution presumes that the "groups" you desire are the questions themselves (i.e. for each question, you want a list of similar questions associated with it). I made up similarity scores for the rest of the array to create this minimal example.

Solution

import pandas as pd

orig_data = {
    "Questions": [
        "What are you doing?",
        "What are you doing tonight?",
        "What are you doing now?",
        "What is your name?",
        "What is your nick name?",
        "What is your full name?",
        "Shall we meet?",
        "How are you doing?",
    ],
    # Similarity scores are partly made up for this minimal example
    "similarity": [
        [1.0, 0.826, 0.905, 0.234, 0.544, 0.673, 0.411, 0.45],
        [0.826, 1.0, 0.84, 0.533, 0.444, 0.525, 0.641, 0.62],
        [0.905, 0.84, 1.0, 0.585, 0.861, 0.685, 0.455, 0.65],
        [0.649, 0.533, 0.585, 1.0, 0.901, 0.902, 0.642, 0.234],
        [0.571, 0.52, 0.522, 0.901, 1.0, 0.905, 0.753, 0.786],
        [0.571, 0.52, 0.565, 0.902, 0.903, 1.0, 0.123, 0.586],
        [0.364, 0.341, 0.3, 0.674, 0.584, 0.421, 1.0, 0.544],
        [0.811, 0.667, 0.731, 0.345, 0.764, 0.242, 0.55, 1.0],
    ],
}

df = pd.DataFrame(orig_data)

results = []
for idx, sim_row in enumerate(df["similarity"]):
    # True where the score exceeds the threshold
    bin_mask = [score > 0.9 for score in sim_row]
    curr_q = df["Questions"][idx]
    # Keep the questions that pass the mask, excluding the current question itself
    sim_quests = [q for q, b in zip(df["Questions"], bin_mask) if b and q != curr_q]
    results.append(sim_quests)

df["similar-questions"] = results
print(df)

Output

                     Questions  ...                                  similar-questions
0          What are you doing?  ...                          [What are you doing now?]
1  What are you doing tonight?  ...                                                 []
2      What are you doing now?  ...                              [What are you doing?]
3           What is your name?  ...  [What is your nick name?, What is your full na...
4      What is your nick name?  ...      [What is your name?, What is your full name?]
5      What is your full name?  ...      [What is your name?, What is your nick name?]
6               Shall we meet?  ...                                                 []
7           How are you doing?  ...                                                 []
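
The per-question lists above do not yet give each row a single group label. One possible way to get there, sketched here under the assumption that grouping should be transitive (if A matches B and B matches C, all three share a group), is to treat every score above 0.9 as a link and label the connected components with a small union-find. The function name `assign_groups` and the `group` column are illustrative only, and the similarity values are the same made-up ones used in the example above.

```python
import pandas as pd

questions = [
    "What are you doing?",
    "What are you doing tonight?",
    "What are you doing now?",
    "What is your name?",
    "What is your nick name?",
    "What is your full name?",
    "Shall we meet?",
    "How are you doing?",
]

# Made-up scores, copied from the minimal example above.
similarity = [
    [1.0, 0.826, 0.905, 0.234, 0.544, 0.673, 0.411, 0.45],
    [0.826, 1.0, 0.84, 0.533, 0.444, 0.525, 0.641, 0.62],
    [0.905, 0.84, 1.0, 0.585, 0.861, 0.685, 0.455, 0.65],
    [0.649, 0.533, 0.585, 1.0, 0.901, 0.902, 0.642, 0.234],
    [0.571, 0.52, 0.522, 0.901, 1.0, 0.905, 0.753, 0.786],
    [0.571, 0.52, 0.565, 0.902, 0.903, 1.0, 0.123, 0.586],
    [0.364, 0.341, 0.3, 0.674, 0.584, 0.421, 1.0, 0.544],
    [0.811, 0.667, 0.731, 0.345, 0.764, 0.242, 0.55, 1.0],
]


def assign_groups(sim, threshold=0.9):
    """Label rows so that rows linked by a score > threshold share a group id."""
    n = len(sim)
    parent = list(range(n))  # union-find forest: each row starts as its own root

    def find(i):
        # Follow parent links to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(n):
            if i != j and sim[i][j] > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Merge the two components, keeping the smaller index as root.
                    parent[max(ri, rj)] = min(ri, rj)

    # Renumber roots as consecutive ids in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


df = pd.DataFrame({"Questions": questions})
df["group"] = assign_groups(similarity)
print(df)
# group column: [0, 1, 0, 2, 2, 2, 3, 4]
```

Transitive grouping can chain together rows that are not directly similar to each other, so whether it is the right notion of "group" depends on the data; a clustering method with an explicit linkage rule would be an alternative.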

huangapple
  • Posted on June 9, 2023, 03:35:45
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76435173.html