Correct typos inside a column using word distance

Question

If I have a column inside a pandas df containing a bunch of names:

  NAME
  -------
  robert
  robert
  robrt
  marie
  ann

I'd like to merge similar ones in order to correct/uniform typos, resulting in:

  NAME
  -------
  robert
  robert
  robert
  marie
  ann

I would like to use Levenshtein distance in order to search for similar records.
Also, solutions using other metrics are much appreciated.

Thanks a lot in advance

All examples on Stackoverflow seem to compare multiple columns, so I have not been able to find a nice solution to my problem.
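As an aside on the metric itself: Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, which is why `robrt` (one deleted `e`) sits at distance 1 from `robert`. A minimal pure-Python sketch, using a hypothetical helper `levenshtein` so no third-party package is needed:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion of ca
                curr[j - 1] + 1,           # insertion of cb
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("robert", "robrt"))  # 1: one deleted 'e'
print(levenshtein("ann", "marie"))     # 4: the names are far apart
```

In practice a C-backed package such as `python-Levenshtein` or `rapidfuzz` is much faster for large columns, but the logic is the same.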

Answer 1

Score: 1

One possible approach is the following:

  import numpy as np
  import pandas as pd
  from sklearn.cluster import AgglomerativeClustering
  from Levenshtein import distance

  df = pd.DataFrame({'NAME': ['robert', 'robert', 'robrt', 'marie', 'ann']})

  def merge_similar_names(df, column):
      # Pairwise Levenshtein distances between the unique (normalized) names
      unique_names = df[column].str.lower().str.strip().unique()
      distances = np.zeros((len(unique_names), len(unique_names)))
      for i in range(len(unique_names)):
          for j in range(i, len(unique_names)):
              d = distance(unique_names[i], unique_names[j])
              distances[i, j] = d
              distances[j, i] = d
      # Merge names whose edit distance stays below the threshold
      # (on scikit-learn < 1.2 the parameter is affinity= instead of metric=)
      clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=2,
                                          linkage='complete', metric='precomputed')
      clusters = clusterer.fit_predict(distances)
      name_clusters = pd.DataFrame({column: unique_names, 'CLUSTER': clusters})
      df = pd.merge(df, name_clusters, on=column)
      # Replace each name with the most frequent spelling in its cluster
      most_common_names = df.groupby('CLUSTER')[column].apply(
          lambda x: x.value_counts().index[0]).reset_index()
      df = pd.merge(df, most_common_names, on='CLUSTER')
      df.rename(columns={column + '_y': column}, inplace=True)
      return df

  df = merge_similar_names(df, 'NAME')
  print(df)

which will give you

     NAME_x  CLUSTER    NAME
  0  robert        0  robert
  1  robert        0  robert
  2   robrt        0  robert
  3   marie        2   marie
  4     ann        1     ann
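Since the question also welcomes other metrics, here is a stdlib-only alternative sketch using `difflib`'s similarity ratio instead of Levenshtein distance. The 0.8 cutoff and the assumption that the correct spelling is the one occurring more than once are my own choices, not part of the original answer:

```python
import difflib
import pandas as pd

df = pd.DataFrame({'NAME': ['robert', 'robert', 'robrt', 'marie', 'ann']})

# Treat spellings that occur more than once as trusted canonical forms;
# a rare spelling close enough to a trusted form is assumed to be a typo.
counts = df['NAME'].value_counts()
canon = counts[counts > 1].index.tolist()

def normalize(name: str) -> str:
    # cutoff is a similarity ratio in [0, 1]; 0.8 is an assumed threshold
    matches = difflib.get_close_matches(name, canon, n=1, cutoff=0.8)
    return matches[0] if matches else name

df['NAME'] = df['NAME'].map(normalize)
print(df['NAME'].tolist())  # ['robert', 'robert', 'robert', 'marie', 'ann']
```

This avoids the O(n²) distance matrix and the scikit-learn dependency, at the cost of only fixing variants of names that already have a majority spelling.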

huangapple
  • Published 2023-02-23 23:52:49
  • Please retain this link when reposting: https://go.coder-hub.com/75547219.html