英文:
Print Rows that are "Near Duplicates" in Pandas DataFrame
问题
我正在进行一个个人项目,该项目在多个研究文章数据库上执行Web Scraping(到目前为止,我已经完成了PubMed和Scopus),并提取文章的标题。实际上,我已经成功地独立完成了这个任务。接下来,我将文章列表合并到一个名为`Pandas DataFrame`的数据框中,其中包含两列:`Article`和`Database`。我想要在这两个文章数据库中去除重复项,并使用`df = df.drop_duplicates(subset='Article')`来去除完全匹配的文章。
但是,如果文章是“近似匹配”的,也就是说,也许标题中的某个单词拼错了,或者标题中的某个地方多了一个空格(不是在末尾,我已经使用`lstrip()`和`rstrip()`进行了校对)。
我之前曾经使用过`difflib`库中的`SequenceMatcher`执行字符串匹配,但是从未在DataFrame中使用过。所以,我的问题是,我应该如何编写以下条件语句,以便我可以查看近似相似的值:
“如果`df['Article']`中的某一行与`df['Article']`中的另一行相似度达到95%以上,就打印这两行。”
我已经开始做一些**使用单独的列**进行测试,代码如下:
letters1 = ['a','b','c','a','b']
letters2 = ['c','b','a','a','c']
numbers = [1,2,3,4,5]
data = {'Letters1': letters1,
'Letters2': letters2,
'Numbers': numbers}
test = pd.DataFrame(data)
test['result'] = ''
for i in test['Letters1'].index:
if SequenceMatcher(None, test['Letters1'], test['Letters2']).ratio() > 0:
test['result'] = 'True'
else:
test['result'] = 'False'
test.head()
然而,我并没有得到期望的结果,所以想在这里寻求帮助。有什么建议吗?重申一下,**最终我不想使用两列**,我只是使用上面的示例代码块来开始测试如何做到这一点。
英文:
I'm working on a personal project that performs Web Scraping on multiple databases of research articles (thus far I have done PubMed and Scopus) and extracts the titles of the articles. I've actually managed to pull this off on my own without problem. Next, I've combined the list of articles into a Pandas DataFrame
with two columns: Article
and Database
. I wanted to remove duplicates across the two article databases and used df = df.drop_duplicates(subset='Article')
to remove exact matches.
BUT, what if the articles are "near matches", that is, perhaps a word in the title was misspelled, or there is an extra blank space somewhere in the title (not at the end, I've proofed using lstrip()
and rstrip()
).
I have explored string matching in the past using SequenceMatcher
from difflib
, but never in a DataFrame. So, my question is, how would I code the following conditional so that I can review the near similar values:
"if row in df['Article']
is 95% similar to another row in df['Article']
, print both rows."
I started doing some testing using separate columns like such:
letters1 = ['a','b','c','a','b']
letters2 = ['c','b','a','a','c']
numbers = [1,2,3,4,5]
data = {'Letters1':letters,
'Letters2':letters2,
'Numbers':numbers}
test = pd.DataFrame(data)
test['result'] = ''
for i in test['Letters1'].index:
if SequenceMatcher(None, test['Letters1'], test['Letters2']).ratio() > 0:
test['result'] = 'True'
else:
test['result'] = 'False'
test.head()
However, I'm already not getting the desired results and thought to seek help here first. Any suggestions? To reiterate, I don't want to use two columns ultimately, I am just using the example code block above to start testing how to do this.
答案1
得分: 1
以下是您要翻译的代码部分:
The unexpected result in your code is due to using **whole columns** instead of items. You can fix that for example by using the `.at` accessor
for i in test.index:
if SequenceMatcher(None, test.at[i, 'Letters1'], test.at[i, 'Letters2']).ratio() > 0:
test.at[i, 'result'] = True
else:
test.at[i, 'result'] = False
or more compact by
test["result"] = test.apply(
lambda r: SequenceMatcher(None, r.at['Letters1'], r.at['Letters2']).ratio() > 0,
axis=1
)
Result for the sample:
Letters1 Letters2 Numbers result
0 a c 1 False
1 b b 2 True
2 c a 3 False
3 a a 4 True
4 b c 5 False
As an alternative you could do something like:
from itertools import combinations
# Sample dataframe
df = pd.DataFrame({'Letters': ['a', 'b', 'c', 'a', 'b']})
for i, j in combinations(df.index, r=2):
txt1, txt2 = df.at[i, "Letters"], df.at[j, "Letters"]
if SequenceMatcher(None, txt1, txt2).ratio() > 0:
print((i, txt1), (j, txt2))
Output:
(0, 'a') (3, 'a')
(1, 'b') (4, 'b')
英文:
The unexpected result in your code is due to using whole columns instead of items. You can fix that for example by using the .at
accessor
for i in test.index:
if SequenceMatcher(None, test.at[i, 'Letters1'], test.at[i, 'Letters2']).ratio() > 0:
test.at[i, 'result'] = True
else:
test.at[i, 'result'] = False
or more compact by
test["result"] = test.apply(
lambda r: SequenceMatcher(None, r.at['Letters1'], r.at['Letters2']).ratio() > 0,
axis=1
)
Result for the sample:
Letters1 Letters2 Numbers result
0 a c 1 False
1 b b 2 True
2 c a 3 False
3 a a 4 True
4 b c 5 False
As an alternative you could do something like:
from itertools import combinations
# Sample dataframe
df = pd.DataFrame({'Letters': ['a', 'b', 'c', 'a', 'b']})
for i, j in combinations(df.index, r=2):
txt1, txt2 = df.at[i, "Letters"], df.at[j, "Letters"]
if SequenceMatcher(None, txt1, txt2).ratio() > 0:
print((i, txt1), (j, txt2))
Output:
(0, 'a') (3, 'a')
(1, 'b') (4, 'b')
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论