Correct typos inside a column using word distance
Question
I have a column in a pandas DataFrame containing a bunch of names:
NAME
-------
robert
robert
robrt
marie
ann
I'd like to merge similar names in order to correct and unify typos, resulting in:
NAME
-------
robert
robert
robert
marie
ann
I would like to use Levenshtein distance to search for similar records.
Solutions using other metrics are also welcome.
Thanks a lot in advance.
All the examples on Stack Overflow seem to compare multiple columns, so I have not been able to find a good solution to my problem.
Answer 1
Score: 1
One possible approach is the following:
import numpy as np
import pandas as pd
from Levenshtein import distance
from sklearn.cluster import AgglomerativeClustering

df = pd.DataFrame({'NAME': ['robert', 'robert', 'robrt', 'marie', 'ann']})


def merge_similar_names(df, column):
    # Normalize case and whitespace so the merge keys match the clustered names
    df = df.assign(**{column: df[column].str.lower().str.strip()})
    unique_names = df[column].unique()

    # Pairwise Levenshtein distance matrix (symmetric, zero diagonal)
    distances = np.zeros((len(unique_names), len(unique_names)))
    for i in range(len(unique_names)):
        for j in range(i + 1, len(unique_names)):
            d = distance(unique_names[i], unique_names[j])
            distances[i, j] = d
            distances[j, i] = d

    # Clusters are merged only while their complete-linkage distance is below 2.
    # scikit-learn >= 1.2 takes metric='precomputed'; older releases call it affinity.
    clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=2,
                                        linkage='complete', metric='precomputed')
    clusters = clusterer.fit_predict(distances)

    # Attach a cluster label to every row
    name_clusters = pd.DataFrame({column: unique_names, 'CLUSTER': clusters})
    df = pd.merge(df, name_clusters, on=column)

    # Use the most frequent spelling in each cluster as the corrected name
    most_common_names = (df.groupby('CLUSTER')[column]
                           .apply(lambda x: x.value_counts().index[0])
                           .reset_index())
    df = pd.merge(df, most_common_names, on='CLUSTER')
    df.rename(columns={column + '_y': column}, inplace=True)
    return df


df = merge_similar_names(df, 'NAME')
print(df)
which will give you:
   NAME_x  CLUSTER    NAME
0  robert        0  robert
1  robert        0  robert
2   robrt        0  robert
3   marie        2   marie
4     ann        1     ann
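To see why a threshold of 2 groups `robrt` with `robert` but leaves `marie` and `ann` alone, it may help to look at the distances themselves. The following is only an illustrative sketch: `levenshtein` is a hypothetical pure-Python helper written for this example (it is not the function from the `Levenshtein` package used above, which should be preferred for speed), and the `names` list is just the distinct values from the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] = distance between the already processed prefix of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


# Distinct names from the question (illustrative input)
names = ['robert', 'robrt', 'marie', 'ann']
matrix = [[levenshtein(x, y) for y in names] for x in names]

for name, row in zip(names, matrix):
    print(f"{name:>6}", row)
```

`robert` and `robrt` are one edit apart (a single missing letter), while every other pair differs by at least two edits, so a distance threshold of 2 merges only that one pair. Raising the threshold makes the grouping more aggressive and increases the risk of merging names that are genuinely different.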