June 9, 2023, 03:35:45

Categorize rows per their similarity in Python

Question

I am here to look for input for a data manipulation problem related to natural language processing.

To make life easier, I am using a mock dataset posted several years ago from <https://stackoverflow.com/questions/47159996/how-to-group-text-data-based-on-document-similarity>.

import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({'Questions': ['What are you doing?', 'What are you doing tonight?',
                                 'What are you doing now?', 'What is your name?',
                                 'What is your nick name?', 'What is your full name?',
                                 'Shall we meet?', 'How are you doing?']})

def similarity_score(s1, s2):
    # Ratio in [0, 1]; 1.0 means the two strings are identical
    return SequenceMatcher(None, s1, s2).ratio()

def similarity(x, df):
    # Score x against every question in the frame, itself included
    sim_score = []
    for i in df['Questions']:
        sim_score.append(similarity_score(x, i))
    return sim_score

# astype(str) stores each list of scores as its string representation
df['similarity'] = df['Questions'].apply(lambda x: similarity(x, df)).astype(str)
print(df)

The output is as follows:

Questions  \
0          What are you doing?   
1  What are you doing tonight?   
2      What are you doing now?   
3           What is your name?   
4      What is your nick name?   
5      What is your full name?   
6               Shall we meet?   
7           How are you doing?   
similarity  
0  [1.0, 0.8260869565217391, 0.9047619047619048, ...  
1  [0.8260869565217391, 1.0, 0.84, 0.533333333333...  
2  [0.9047619047619048, 0.84, 1.0, 0.585365853658...  
3  [0.6486486486486487, 0.5333333333333333, 0.585...  
4  [0.5714285714285714, 0.52, 0.5217391304347826,...  
5  [0.5714285714285714, 0.52, 0.5652173913043478,...  
6  [0.36363636363636365, 0.34146341463414637, 0.3...  
7  [0.8108108108108109, 0.6666666666666666, 0.731...

The logic is that I go through each row in the data frame and compare it to all other rows (including itself) to compute their similarity. I then store the similarity scores as a list in another column called "similarity".

Next, I want to categorize the questions in the first column. If the similarity score between two questions is greater than 0.9, those rows should be assigned to the same group. How can I achieve this?

Answer 1

Score: 1

A solution is to iterate row-wise over your similarity scores, create a binary mask based on some threshold, and then use the binary mask to extract only those questions that meet the threshold.

Note that this solution presumes that the "groups" you desire are the questions themselves (i.e. for each question, you want a list of similar questions associated with it). I made up similarity scores for the rest of the array to create this minimal example.

Solution

import pandas as pd

orig_data = {
    "Questions": [
        "What are you doing?",
        "What are you doing tonight?",
        "What are you doing now?",
        "What is your name?",
        "What is your nick name?",
        "What is your full name?",
        "Shall we meet?",
        "How are you doing?",
    ],
    "similarity": [
        [1.0, 0.826, 0.905, 0.234, 0.544, 0.673, 0.411, 0.45],
        [0.826, 1.0, 0.84, 0.533, 0.444, 0.525, 0.641, 0.62],
        [0.905, 0.84, 1.0, 0.585, 0.861, 0.685, 0.455, 0.65],
        [0.649, 0.533, 0.585, 1.0, 0.901, 0.902, 0.642, 0.234],
        [0.571, 0.52, 0.522, 0.901, 1.0, 0.905, 0.753, 0.786],
        [0.571, 0.52, 0.565, 0.902, 0.903, 1.0, 0.123, 0.586],
        [0.364, 0.341, 0.3, 0.674, 0.584, 0.421, 1.0, 0.544],
        [0.811, 0.667, 0.731, 0.345, 0.764, 0.242, 0.55, 1.0],
    ],
}
df = pd.DataFrame(orig_data)

results = []
for idx, sim_row in enumerate(df["similarity"]):
    # True where the score clears the threshold
    bin_mask = [score > 0.9 for score in sim_row]
    curr_q = df["Questions"].iloc[idx]
    # Keep the questions selected by the mask, excluding the current one
    sim_quests = [q for q, b in zip(df["Questions"], bin_mask) if b and q != curr_q]
    results.append(sim_quests)

df["similar-questions"] = results
print(df)

Output

                     Questions  ...                                  similar-questions
0          What are you doing?  ...                          [What are you doing now?]
1  What are you doing tonight?  ...                                                 []
2      What are you doing now?  ...                              [What are you doing?]
3           What is your name?  ...  [What is your nick name?, What is your full na...
4      What is your nick name?  ...      [What is your name?, What is your full name?]
5      What is your full name?  ...      [What is your name?, What is your nick name?]
6               Shall we meet?  ...                                                 []
7           How are you doing?  ...                                                 []
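
If a single group label per row is wanted instead of a list of similar questions, one possible approach is to treat every score above the threshold as a link between two rows and label the connected components. The sketch below is not part of the original answer. It reuses the made-up scores from the example above; `find` and `group_labels` are hypothetical helper names. It also assumes grouping is transitive: if A links to B and B links to C, all three share a label even when A and C score below the threshold.

```python
# Made-up similarity scores copied from the example above.
similarity = [
    [1.0, 0.826, 0.905, 0.234, 0.544, 0.673, 0.411, 0.45],
    [0.826, 1.0, 0.84, 0.533, 0.444, 0.525, 0.641, 0.62],
    [0.905, 0.84, 1.0, 0.585, 0.861, 0.685, 0.455, 0.65],
    [0.649, 0.533, 0.585, 1.0, 0.901, 0.902, 0.642, 0.234],
    [0.571, 0.52, 0.522, 0.901, 1.0, 0.905, 0.753, 0.786],
    [0.571, 0.52, 0.565, 0.902, 0.903, 1.0, 0.123, 0.586],
    [0.364, 0.341, 0.3, 0.674, 0.584, 0.421, 1.0, 0.544],
    [0.811, 0.667, 0.731, 0.345, 0.764, 0.242, 0.55, 1.0],
]


def find(parent, i):
    """Return the root of row i, shortening the path as we walk up."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def group_labels(sim_matrix, threshold=0.9):
    """Give rows linked by a score above the threshold the same integer label."""
    n = len(sim_matrix)
    parent = list(range(n))  # every row starts in its own group
    for i, row in enumerate(sim_matrix):
        for j, score in enumerate(row):
            if i != j and score > threshold:
                # Merge the groups that contain rows i and j
                parent[find(parent, j)] = find(parent, i)
    roots = [find(parent, i) for i in range(n)]
    # Renumber the roots as 0, 1, 2, ... in order of first appearance
    seen = {}
    return [seen.setdefault(root, len(seen)) for root in roots]


labels = group_labels(similarity)
print(labels)
# With a DataFrame like the one above, the labels could be stored as a column:
# df["group"] = labels
```

With these scores, rows 0 and 2 end up with one label and rows 3, 4 and 5 with another, while the remaining rows each get a label of their own.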
