比较Pandas数据帧列中相似拼写但另一列中的不同值。

huangapple go评论100阅读模式
英文:

Compare similar spelling in Pandas dataframe column but different value in another column

问题

以下是您要翻译的部分:

让我们假设我有一个在Python中的Pandas数据帧,看起来像这样:

  1. df_test = pd.DataFrame(data=None, columns=['file', 'number'])
  2. df_test.file = ['washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345', 'seattle_976', 'seattle_977', 'boston_367', 'boston 098']
  3. df_test.number = [20, 21, 33, 34, 33, 45, 45, 52, 52]

我想从这个数据集中找出那些在'file'列中以完全相同的字母开头(可能至少50%的字符串),但在'number'列中没有相同对应值的字符串。在这个示例中,这意味着我想创建一个新的数据帧,找到以下内容:

'washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345'

但不包括其他的,因为它们在拼写开头相同的情况下具有相同的'number'值。我知道有一个名为'difflib.get_close_matches'的函数,但我不确定如何将其实现以与数据帧中的其他列进行比较。任何建议或帮助都将不胜感激!

英文:

Let's say I have a Pandas dataframe in Python that looks something like this:

  1. df_test = pd.DataFrame(data=None, columns=['file', 'number'])
  2. df_test.file = ['washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345', 'seattle_976', 'seattle_977', 'boston_367', 'boston 098']
  3. df_test.number = [20, 21, 33, 34, 33, 45, 45, 52, 52]

What I want to find out from this dataset are those strings in 'file' that start with the same exact letters (maybe 50% of the string at least), but that do not have the same corresponding value in the 'number' column. In this example, it means I would want to create a new dataframe that finds:

  1. 'washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345'

But none of the others since they have the same 'number' when the spelling starts with the same string. I know there is a function 'difflib.get_close_matches' but I am not sure how to implement it to check with the other column in the dataframe. Any advice or help is really appreciated!

答案1

得分: 2

你需要澄清你的规则(多少个字母?或者多大比例?)

假设你想要完全匹配:

  1. df['match'] = df['file'].str.extract('^([a-zA-Z]+)', expand=False)
  2. df = df.groupby('match').filter(lambda _df: _df.number.nunique() > 1)
  3. print(df['file'].unique())
英文:

You need to clarify your rule (how many letters? or how much fraction?)

Assuming you want a full match:

  1. df['match'] = df['file'].str.extract('^([a-zA-Z]+)', expand=False)
  2. df = df.groupby('match').filter( lambda _df : _df.number.nunique() > 1)
  3. print(df['file'].unique())

答案2

得分: 1

以下是您要翻译的内容:

"The answer by Learning is a mess is much more efficient if the letters in the file strings will fully match. If there are other differences in file other than numbers and _/ , then you might want to use fuzzywuzzy to match the similarity of the files:

  1. from fuzzywuzzy import fuzz
  2. # get all permutations
  3. compare = pd.MultiIndex.from_product([df_test.file,
  4. df_test.file]).to_series()
  5. # fuzzy match - see https://stackoverflow.com/a/54866372/18571565
  6. def metrics(tup):
  7. return pd.Series([fuzz.ratio(*tup)],
  8. ['ratio'])
  9. compare = compare.apply(metrics)
  10. compare = compare.loc[compare.ratio.ge(60)] # chosen 60% minimum match here
  11. # get list of non-matching numbers for 'matched' files
  12. non_matching_files = compare.loc[
  13. # convert 'compare' index to pd.DataFrame
  14. pd.DataFrame(compare.index.to_list()).replace(
  15. # replace all values in df with matching 'number'
  16. df_test.set_index("file")["number"].to_dict())\
  17. # calculat differences between two columns and find those not equal
  18. .diff(axis=1)[1].ne(0).to_list()]\
  19. # return the first column of the index (the 'grouped' column) as a list
  20. .index.get_level_values(0).to_list()
  21. # filter df_test for 'file' in list
  22. df_test = df_test[df_test.file.isin(non_matching_files)]
  23. ```"
  24. <details>
  25. <summary>英文:</summary>
  26. The answer by [Learning is a mess](https://stackoverflow.com/a/76306636/18571565) is much more efficient if the letters in the `file` strings will fully match. If there are other differences in `file` other than numbers and `_`/` `, then you might want to use fuzzywuzzy to match the similarity of the files:

from fuzzywuzzy import fuzz

get all permutations

compare = pd.MultiIndex.from_product([df_test.file,
df_test.file]).to_series()

fuzzy match - see https://stackoverflow.com/a/54866372/18571565

def metrics(tup):
return pd.Series([fuzz.ratio(*tup)],
['ratio'])
compare = compare.apply(metrics)
compare = compare.loc[compare.ratio.ge(60)] # chosen 60% minimum match here

get list of non-matching numbers for "matched" files

non_matching_files = compare.loc[
# convert 'compare' index to pd.DataFrame
pd.DataFrame(compare.index.to_list()).replace(
# replace all values in df with matching 'number'
df_test.set_index("file")["number"].to_dict())
# calculat differences between two columns and find those not equal
.diff(axis=1)[1].ne(0).to_list()]
# return the first column of the index (the 'grouped' column) as a list
.index.get_level_values(0).to_list()

filter df_test for 'file' in list

df_test = df_test[df_test.file.isin(non_matching_files)]

  1. </details>

huangapple
  • 本文由 发表于 2023年5月22日 20:49:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/76306376.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定