March 7, 2023, 10:40:57

Print Rows that are "Near Duplicates" in Pandas DataFrame

Question

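I'm working on a personal project that scrapes several research-article databases (so far PubMed and Scopus) and extracts the article titles. That part works. I then combined the titles into a Pandas DataFrame with two columns, `Article` and `Database`. To remove duplicates across the two databases, I used `df = df.drop_duplicates(subset='Article')`, which drops exact matches.

But what if the articles are "near matches"? For example, a word in the title might be misspelled, or there might be an extra space somewhere in the title (not at the ends; I've already handled that with `lstrip()` and `rstrip()`).

I have used `SequenceMatcher` from `difflib` for string matching before, but never in a DataFrame. So my question is: how would I code the following condition so that I can review the near-identical values?

"If a row in `df['Article']` is at least 95% similar to another row in `df['Article']`, print both rows."

I started testing with **separate columns**, like this:

    import pandas as pd
    from difflib import SequenceMatcher

    letters1 = ['a','b','c','a','b']
    letters2 = ['c','b','a','a','c']
    numbers = [1,2,3,4,5]

    data = {'Letters1': letters1,
            'Letters2': letters2,
            'Numbers': numbers}

    test = pd.DataFrame(data)
    test['result'] = ''

    for i in test['Letters1'].index:
        if SequenceMatcher(None, test['Letters1'], test['Letters2']).ratio() > 0:
            test['result'] = 'True'
        else:
            test['result'] = 'False'

    test.head()

However, I'm already not getting the desired results, so I thought I'd ask for help here. Any suggestions? To reiterate, **I don't ultimately want to use two columns**; the example above is only a starting point for testing how to do this.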

Answer 1

Score: 1

The unexpected result comes from passing **whole columns** to `SequenceMatcher` instead of individual items (and from assigning to the entire `result` column on every iteration). You can fix that, for example, by using the `.at` accessor:

    for i in test.index:
        # Compare the two strings in row i, not the whole columns
        if SequenceMatcher(None, test.at[i, 'Letters1'], test.at[i, 'Letters2']).ratio() > 0:
            test.at[i, 'result'] = True
        else:
            test.at[i, 'result'] = False

or, more compactly, with `apply`:

    test["result"] = test.apply(
        lambda r: SequenceMatcher(None, r.at['Letters1'], r.at['Letters2']).ratio() > 0,
        axis=1
    )

Result for the sample:

      Letters1 Letters2  Numbers result
    0        a        c        1  False
    1        b        b        2   True
    2        c        a        3  False
    3        a        a        4   True
    4        b        c        5  False

As an alternative that compares every pair of rows within a single column, you could do something like:

    from difflib import SequenceMatcher
    from itertools import combinations

    import pandas as pd

    # Sample dataframe
    df = pd.DataFrame({'Letters': ['a', 'b', 'c', 'a', 'b']})

    # Visit each unordered pair of rows exactly once
    for i, j in combinations(df.index, r=2):
        txt1, txt2 = df.at[i, "Letters"], df.at[j, "Letters"]
        if SequenceMatcher(None, txt1, txt2).ratio() > 0:
            print((i, txt1), (j, txt2))

Output:

    (0, 'a') (3, 'a')
    (1, 'b') (4, 'b')
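
The pairwise approach above can be carried over to the situation in the question, where the goal is to print rows of a single `Article` column that are at least 95% similar. What follows is only a minimal sketch under some assumptions: every value in the column is a plain string (no missing values), the helper name `near_duplicates` is hypothetical, and the sample titles are made up for illustration. Because it compares every pair of rows, the cost grows quadratically with the number of titles, so it may be slow on very large DataFrames.

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd


def near_duplicates(df, column="Article", threshold=0.95):
    """Return (index_a, index_b, ratio) for each pair of rows in `column`
    whose SequenceMatcher ratio is at least `threshold`."""
    pairs = []
    # combinations() yields each unordered pair of index labels exactly once
    for i, j in combinations(df.index, r=2):
        ratio = SequenceMatcher(None, df.at[i, column], df.at[j, column]).ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs


# Made-up titles, for illustration only
df = pd.DataFrame({
    "Article": [
        "Deep learning for medical image analysis",
        "Deep learning for medical  image analysis",  # extra space
        "Deep learning for medcal image analysis",    # misspelled word
        "A survey of graph neural networks",
    ],
    "Database": ["PubMed", "Scopus", "Scopus", "PubMed"],
})

# Print both rows of every near-duplicate pair, with their similarity
for i, j, ratio in near_duplicates(df):
    print(df.loc[[i, j]])
    print(f"similarity: {ratio:.3f}\n")
```

With these sample titles, the first three rows are reported as near duplicates of one another, and the fourth is not paired with anything. Raising `threshold` makes the check stricter; `threshold=1.0` only reports identical strings, which `drop_duplicates` already handles.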
