May 22, 2023, 20:49:44

Compare similar spelling in Pandas dataframe column but different value in another column

Question

Let's say I have a Pandas dataframe in Python that looks something like this:

import pandas as pd

df_test = pd.DataFrame(data=None, columns=['file', 'number'])
df_test.file = ['washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345', 'seattle_976', 'seattle_977', 'boston_367', 'boston 098']
df_test.number = [20, 21, 33, 34, 33, 45, 45, 52, 52]

What I want to find out from this dataset are those strings in 'file' that start with the same exact letters (maybe 50% of the string at least), but that do not have the same corresponding value in the 'number' column. In this example, it means I would want to create a new dataframe that finds:

'washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345'

But none of the others since they have the same 'number' when the spelling starts with the same string. I know there is a function 'difflib.get_close_matches' but I am not sure how to implement it to check with the other column in the dataframe. Any advice or help is really appreciated!

Answer 1

Score: 2

You need to clarify your rule (how many letters must match, or what fraction of the string?).

Assuming you want a full match on the leading letters:

# df is the question's dataframe (df_test)
df['match'] = df['file'].str.extract('^([a-zA-Z]+)', expand=False)
df = df.groupby('match').filter(lambda _df: _df.number.nunique() > 1)
print(df['file'].unique())
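For reference, here is a self-contained sketch of the same idea run against the sample data from the question. It is only an illustration, not the answerer's exact code: it swaps `filter` for a `transform("nunique")` mask so that no helper column is added, and the names `prefix`, `mask`, and `result` are made up for this example.

```python
import pandas as pd

# Sample data copied from the question.
df_test = pd.DataFrame({
    "file": ["washington_142", "washington_287", "chicago_453", "chicago_221",
             "chicago_345", "seattle_976", "seattle_977", "boston_367", "boston 098"],
    "number": [20, 21, 33, 34, 33, 45, 45, 52, 52],
})

# Group key: the leading run of ASCII letters, e.g. "washington" or "boston".
prefix = df_test["file"].str.extract(r"^([a-zA-Z]+)", expand=False)

# For every row, count the distinct 'number' values within its prefix group,
# then keep the rows whose group holds more than one distinct value.
mask = df_test.groupby(prefix)["number"].transform("nunique") > 1
result = df_test[mask]

print(result["file"].tolist())
# ['washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345']
```

Because the key is the whole leading letter run, 'boston_367' and 'boston 098' land in the same group even though their separators differ.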

Answer 2

Score: 1

The answer by Learning is a mess (https://stackoverflow.com/a/76306636/18571565) is much more efficient if the letters in the 'file' strings match exactly. If the 'file' values differ in ways other than the digits and the '_'/space separator, you might want to use fuzzywuzzy to score the similarity of the file names:

import pandas as pd
from fuzzywuzzy import fuzz
# get all ordered pairs of file names (Cartesian product, including self-pairs)
compare = pd.MultiIndex.from_product([df_test.file,
                                      df_test.file]).to_series()
# fuzzy match - see https://stackoverflow.com/a/54866372/18571565
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup)],
                     ['ratio'])
compare = compare.apply(metrics)
compare = compare.loc[compare.ratio.ge(60)]  # chosen 60% minimum match here
# get list of files whose 'matched' partner has a different number
non_matching_files = (
    compare.loc[
        # convert 'compare' index to pd.DataFrame
        pd.DataFrame(compare.index.to_list()).replace(
            # replace all values in df with matching 'number'
            df_test.set_index("file")["number"].to_dict())
        # calculate differences between two columns and find those not equal
        .diff(axis=1)[1].ne(0).to_list()]
    # return the first column of the index (the 'grouped' column) as a list
    .index.get_level_values(0).to_list()
)
# filter df_test for 'file' in list
df_test = df_test[df_test.file.isin(non_matching_files)]
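Since the question mentions difflib, the same pairwise idea can also be sketched with only the standard library. This is an assumption-laden illustration rather than a drop-in replacement: it scores each pair with `difflib.SequenceMatcher(...).ratio()` (a 0 to 1 scale, which may not match `fuzz.ratio` exactly), and the names `THRESHOLD`, `flagged`, and `result` are hypothetical.

```python
import difflib
from itertools import combinations

import pandas as pd

# Sample data copied from the question.
df_test = pd.DataFrame({
    "file": ["washington_142", "washington_287", "chicago_453", "chicago_221",
             "chicago_345", "seattle_976", "seattle_977", "boston_367", "boston 098"],
    "number": [20, 21, 33, 34, 33, 45, 45, 52, 52],
})

THRESHOLD = 0.6  # assumed cut-off, mirroring the 60% minimum used above

# Visit each unordered pair of rows once; flag both files when their names
# are similar enough but their numbers differ.
flagged = set()
rows = list(zip(df_test["file"], df_test["number"]))
for (file_a, num_a), (file_b, num_b) in combinations(rows, 2):
    similarity = difflib.SequenceMatcher(None, file_a, file_b).ratio()
    if similarity >= THRESHOLD and num_a != num_b:
        flagged.update((file_a, file_b))

result = df_test[df_test["file"].isin(flagged)]
print(result["file"].tolist())
# ['washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345']
```

Both pairwise approaches compare every file with every other file, so the work grows quadratically with the number of rows; for large frames the prefix-based grouping from Answer 1 is likely the better starting point.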
