在Pandas DataFrame中打印“近似重复”的行。

huangapple go评论71阅读模式
英文:

Print Rows that are "Near Duplicates" in Pandas DataFrame

问题

我正在进行一个个人项目该项目在多个研究文章数据库上执行Web Scraping到目前为止我已经完成了PubMed和Scopus),并提取文章的标题实际上我已经成功地独立完成了这个任务接下来我将文章列表合并到一个名为`Pandas DataFrame`的数据框中其中包含两列:`Article``Database`。我想要在这两个文章数据库中去除重复项并使用`df = df.drop_duplicates(subset='Article')`来去除完全匹配的文章

但是如果文章是近似匹配也就是说也许标题中的某个单词拼错了或者标题中的某个地方多了一个空格不是在末尾我已经使用`lstrip()``rstrip()`进行了校对)。

我之前曾经使用过`difflib`库中的`SequenceMatcher`执行字符串匹配但是从未在DataFrame中使用过所以我的问题是我应该如何编写以下条件语句以便我可以查看近似相似的值

如果`df['Article']`中的某一行与`df['Article']`中的另一行相似度达到95%以上就打印这两行。”

我已经开始做一些**使用单独的列**进行测试代码如下

    letters1 = ['a','b','c','a','b']
    letters2 = ['c','b','a','a','c']
    numbers = [1,2,3,4,5]
    
    data = {'Letters1': letters1,
            'Letters2': letters2,
            'Numbers': numbers}
    
    test = pd.DataFrame(data)
    test['result'] = ''
    
    for i in test['Letters1'].index:
        if SequenceMatcher(None, test['Letters1'], test['Letters2']).ratio() > 0:
            test['result'] = 'True'
        else:
            test['result'] = 'False'
    
    test.head()

然而我并没有得到期望的结果所以想在这里寻求帮助有什么建议吗重申一下**最终我不想使用两列**我只是使用上面的示例代码块来开始测试如何做到这一点
英文:

I'm working on a personal project that performs Web Scraping on multiple databases of research articles (thus far I have done PubMed and Scopus) and extracts the titles of the articles. I've actually managed to pull this off on my own without problem. Next, I've combined the list of articles into a Pandas DataFrame with two columns: Article and Database. I wanted to remove duplicates across the two article databases and used df = df.drop_duplicates(subset='Article') to remove exact matches.

BUT, what if the articles are "near matches", that is, perhaps a word in the title was misspelled, or there is an extra blank space somewhere in the title (not at the end, I've proofed using lstrip() and rstrip()).

I have explored string matching in the past using SequenceMatcher from difflib, but never in a DataFrame. So, my question is, how would I code the following conditional so that I can review the near similar values:

"if row in df['Article'] is 95% similar to another row in df['Article'], print both rows."

I started doing some testing using separate columns like such:

letters1 = ['a','b','c','a','b']
letters2 = ['c','b','a','a','c']
numbers = [1,2,3,4,5]
data = {'Letters1':letters,
'Letters2':letters2,
'Numbers':numbers}
test = pd.DataFrame(data)
test['result'] = ''
for i in test['Letters1'].index:
if SequenceMatcher(None, test['Letters1'], test['Letters2']).ratio() > 0:
test['result'] = 'True'
else:
test['result'] = 'False'
test.head()

However, I'm already not getting the desired results and thought to seek help here first. Any suggestions? To reiterate, I don't want to use two columns ultimately, I am just using the example code block above to start testing how to do this.

答案1

得分: 1

以下是您要翻译的代码部分:

The unexpected result in your code is due to using **whole columns** instead of items. You can fix that for example by using the `.at` accessor

for i in test.index:
    if SequenceMatcher(None, test.at[i, 'Letters1'], test.at[i, 'Letters2']).ratio() > 0:
        test.at[i, 'result'] = True
    else:
        test.at[i, 'result'] = False

or more compact by

test["result"] = test.apply(
    lambda r: SequenceMatcher(None, r.at['Letters1'], r.at['Letters2']).ratio() > 0,
    axis=1
)

Result for the sample:

  Letters1 Letters2  Numbers result
0        a        c        1  False
1        b        b        2   True
2        c        a        3  False
3        a        a        4   True
4        b        c        5  False

As an alternative you could do something like:

from itertools import combinations

# Sample dataframe
df = pd.DataFrame({'Letters': ['a', 'b', 'c', 'a', 'b']})

for i, j in combinations(df.index, r=2):
    txt1, txt2 = df.at[i, "Letters"], df.at[j, "Letters"]
    if SequenceMatcher(None, txt1, txt2).ratio() > 0:
        print((i, txt1), (j, txt2))

Output:

(0, 'a') (3, 'a')
(1, 'b') (4, 'b')

英文:

The unexpected result in your code is due to using whole columns instead of items. You can fix that for example by using the .at accessor

for i in test.index:
    if SequenceMatcher(None, test.at[i, 'Letters1'], test.at[i, 'Letters2']).ratio() > 0:
        test.at[i, 'result'] = True
    else:
        test.at[i, 'result'] = False

or more compact by

test["result"] = test.apply(
    lambda r: SequenceMatcher(None, r.at['Letters1'], r.at['Letters2']).ratio() > 0,
    axis=1
)

Result for the sample:

  Letters1 Letters2  Numbers result
0        a        c        1  False
1        b        b        2   True
2        c        a        3  False
3        a        a        4   True
4        b        c        5  False

As an alternative you could do something like:

from itertools import combinations

# Sample dataframe
df = pd.DataFrame({'Letters': ['a', 'b', 'c', 'a', 'b']})

for i, j in combinations(df.index, r=2):
    txt1, txt2 = df.at[i, "Letters"], df.at[j, "Letters"]
    if SequenceMatcher(None, txt1, txt2).ratio() > 0:
        print((i, txt1), (j, txt2))

Output:

(0, 'a') (3, 'a')
(1, 'b') (4, 'b')

huangapple
  • 本文由 发表于 2023年3月7日 10:40:57
  • 转载请务必保留本文链接:https://go.coder-hub.com/75657596.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定