比较Pandas数据帧列中相似拼写但另一列中的不同值。

huangapple go评论64阅读模式
英文:

Compare similar spelling in Pandas dataframe column but different value in another column

问题

以下是您要翻译的部分:

让我们假设我有一个在Python中的Pandas数据帧,看起来像这样:

df_test = pd.DataFrame(data=None, columns=['file', 'number'])
df_test.file = ['washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345', 'seattle_976', 'seattle_977', 'boston_367', 'boston 098']
df_test.number = [20, 21, 33, 34, 33, 45, 45, 52, 52]

我想从这个数据集中找出那些在'file'列中以完全相同的字母开头(可能至少50%的字符串),但在'number'列中没有相同对应值的字符串。在这个示例中,这意味着我想创建一个新的数据帧,找到以下内容:

'washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345'

但不包括其他的,因为它们在拼写开头相同的情况下具有相同的'number'值。我知道有一个名为'difflib.get_close_matches'的函数,但我不确定如何将其实现以与数据帧中的其他列进行比较。任何建议或帮助都将不胜感激!

英文:

Let's say I have a Pandas dataframe in Python that looks something like this:

df_test = pd.DataFrame(data=None, columns=['file', 'number'])
df_test.file = ['washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345', 'seattle_976', 'seattle_977', 'boston_367', 'boston 098']
df_test.number = [20, 21, 33, 34, 33, 45, 45, 52, 52]

What I want to find out from this dataset are those strings in 'file' that start with the same exact letters (maybe 50% of the string at least), but that do not have the same corresponding value in the 'number' column. In this example, it means I would want to create a new dataframe that finds:

'washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345'

But none of the others since they have the same 'number' when the spelling starts with the same string. I know there is a function 'difflib.get_close_matches' but I am not sure how to implement it to check with the other column in the dataframe. Any advice or help is really appreciated!

答案1

得分: 2

你需要澄清你的规则(多少个字母?或者多大比例?)

假设你想要完全匹配:

df['match'] = df['file'].str.extract('^([a-zA-Z]+)', expand=False)
df = df.groupby('match').filter(lambda _df: _df.number.nunique() > 1)
print(df['file'].unique())
英文:

You need to clarify your rule (how many letters? or how much fraction?)

Assuming you want a full match:

df['match'] = df['file'].str.extract('^([a-zA-Z]+)', expand=False)
df = df.groupby('match').filter( lambda _df : _df.number.nunique() > 1)
print(df['file'].unique())

答案2

得分: 1

以下是您要翻译的内容:

"The answer by Learning is a mess is much more efficient if the letters in the file strings will fully match. If there are other differences in file other than numbers and _/ , then you might want to use fuzzywuzzy to match the similarity of the files:

from fuzzywuzzy import fuzz

# get all permutations
compare = pd.MultiIndex.from_product([df_test.file,
                                      df_test.file]).to_series()
# fuzzy match - see https://stackoverflow.com/a/54866372/18571565
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup)],
                     ['ratio'])
compare = compare.apply(metrics)
compare = compare.loc[compare.ratio.ge(60)]  # chosen 60% minimum match here
# get list of non-matching numbers for 'matched' files
non_matching_files = compare.loc[
    # convert 'compare' index to pd.DataFrame
    pd.DataFrame(compare.index.to_list()).replace(
        # replace all values in df with matching 'number'
        df_test.set_index("file")["number"].to_dict())\
        # calculat differences between two columns and find those not equal
        .diff(axis=1)[1].ne(0).to_list()]\
    # return the first column of the index (the 'grouped' column) as a list
    .index.get_level_values(0).to_list()
# filter df_test for 'file' in list
df_test = df_test[df_test.file.isin(non_matching_files)]
```"

<details>
<summary>英文:</summary>

The answer by [Learning is a mess](https://stackoverflow.com/a/76306636/18571565) is much more efficient if the letters in the `file` strings will fully match.  If there are other differences in `file` other than numbers and `_`/` `, then you might want to use fuzzywuzzy to match the similarity of the files:

from fuzzywuzzy import fuzz

get all permutations

compare = pd.MultiIndex.from_product([df_test.file,
df_test.file]).to_series()

fuzzy match - see https://stackoverflow.com/a/54866372/18571565

def metrics(tup):
return pd.Series([fuzz.ratio(*tup)],
['ratio'])
compare = compare.apply(metrics)
compare = compare.loc[compare.ratio.ge(60)] # chosen 60% minimum match here

get list of non-matching numbers for "matched" files

non_matching_files = compare.loc[
# convert 'compare' index to pd.DataFrame
pd.DataFrame(compare.index.to_list()).replace(
# replace all values in df with matching 'number'
df_test.set_index("file")["number"].to_dict())
# calculat differences between two columns and find those not equal
.diff(axis=1)[1].ne(0).to_list()]
# return the first column of the index (the 'grouped' column) as a list
.index.get_level_values(0).to_list()

filter df_test for 'file' in list

df_test = df_test[df_test.file.isin(non_matching_files)]



</details>



huangapple
  • 本文由 发表于 2023年5月22日 20:49:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/76306376.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定