英文:
Compare similar spelling in Pandas dataframe column but different value in another column
问题
以下是您要翻译的部分:
让我们假设我有一个在Python中的Pandas数据帧,看起来像这样:
df_test = pd.DataFrame(data=None, columns=['file', 'number'])
df_test.file = ['washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345', 'seattle_976', 'seattle_977', 'boston_367', 'boston 098']
df_test.number = [20, 21, 33, 34, 33, 45, 45, 52, 52]
我想从这个数据集中找出那些在'file'列中以完全相同的字母开头(可能至少50%的字符串),但在'number'列中没有相同对应值的字符串。在这个示例中,这意味着我想创建一个新的数据帧,找到以下内容:
'washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345'
但不包括其他的,因为它们在拼写开头相同的情况下具有相同的'number'值。我知道有一个名为'difflib.get_close_matches'的函数,但我不确定如何将其实现以与数据帧中的其他列进行比较。任何建议或帮助都将不胜感激!
英文:
Let's say I have a Pandas dataframe in Python that looks something like this:
df_test = pd.DataFrame(data=None, columns=['file', 'number'])
df_test.file = ['washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345', 'seattle_976', 'seattle_977', 'boston_367', 'boston 098']
df_test.number = [20, 21, 33, 34, 33, 45, 45, 52, 52]
What I want to find out from this dataset are those strings in 'file' that start with the same exact letters (maybe 50% of the string at least), but that do not have the same corresponding value in the 'number' column. In this example, it means I would want to create a new dataframe that finds:
'washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345'
But none of the others since they have the same 'number' when the spelling starts with the same string. I know there is a function 'difflib.get_close_matches' but I am not sure how to implement it to check with the other column in the dataframe. Any advice or help is really appreciated!
答案1
得分: 2
你需要澄清你的规则(多少个字母?或者多大比例?)
假设你想要完全匹配:
df['match'] = df['file'].str.extract('^([a-zA-Z]+)', expand=False)
df = df.groupby('match').filter(lambda _df: _df.number.nunique() > 1)
print(df['file'].unique())
英文:
You need to clarify your rule (how many letters? or how much fraction?)
Assuming you want a full match:
df['match'] = df['file'].str.extract('^([a-zA-Z]+)', expand=False)
df = df.groupby('match').filter( lambda _df : _df.number.nunique() > 1)
print(df['file'].unique())
答案2
得分: 1
以下是您要翻译的内容:
"The answer by Learning is a mess is much more efficient if the letters in the file
strings will fully match. If there are other differences in file
other than numbers and _
/
, then you might want to use fuzzywuzzy to match the similarity of the files:
from fuzzywuzzy import fuzz
# get all permutations
compare = pd.MultiIndex.from_product([df_test.file,
df_test.file]).to_series()
# fuzzy match - see https://stackoverflow.com/a/54866372/18571565
def metrics(tup):
return pd.Series([fuzz.ratio(*tup)],
['ratio'])
compare = compare.apply(metrics)
compare = compare.loc[compare.ratio.ge(60)] # chosen 60% minimum match here
# get list of non-matching numbers for 'matched' files
non_matching_files = compare.loc[
# convert 'compare' index to pd.DataFrame
pd.DataFrame(compare.index.to_list()).replace(
# replace all values in df with matching 'number'
df_test.set_index("file")["number"].to_dict())\
# calculat differences between two columns and find those not equal
.diff(axis=1)[1].ne(0).to_list()]\
# return the first column of the index (the 'grouped' column) as a list
.index.get_level_values(0).to_list()
# filter df_test for 'file' in list
df_test = df_test[df_test.file.isin(non_matching_files)]
```"
<details>
<summary>英文:</summary>
The answer by [Learning is a mess](https://stackoverflow.com/a/76306636/18571565) is much more efficient if the letters in the `file` strings will fully match. If there are other differences in `file` other than numbers and `_`/` `, then you might want to use fuzzywuzzy to match the similarity of the files:
from fuzzywuzzy import fuzz
get all permutations
compare = pd.MultiIndex.from_product([df_test.file,
df_test.file]).to_series()
fuzzy match - see https://stackoverflow.com/a/54866372/18571565
def metrics(tup):
return pd.Series([fuzz.ratio(*tup)],
['ratio'])
compare = compare.apply(metrics)
compare = compare.loc[compare.ratio.ge(60)] # chosen 60% minimum match here
get list of non-matching numbers for "matched" files
non_matching_files = compare.loc[
# convert 'compare' index to pd.DataFrame
pd.DataFrame(compare.index.to_list()).replace(
# replace all values in df with matching 'number'
df_test.set_index("file")["number"].to_dict())
# calculat differences between two columns and find those not equal
.diff(axis=1)[1].ne(0).to_list()]
# return the first column of the index (the 'grouped' column) as a list
.index.get_level_values(0).to_list()
filter df_test for 'file' in list
df_test = df_test[df_test.file.isin(non_matching_files)]
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论