英文:
Trying to rearrange multiple columns in a dataframe based on ranking row values
问题
我正在处理匹配公司名称的工作,我有一个返回以下格式输出的数据框。
表格中有一个原始名称,对于每个原始名称,可能会有N个匹配项。对于每个匹配项,有3列,match_name_0,score_0,match_index_0等等,直到match_name_N。
我正在尝试找到一种方法,返回一个新的数据框,按照最高匹配分数对原始名称后面的列进行排序。基本上,如果match_score_2是最高的,那么匹配分数_0排在匹配分数_1之后,列的顺序将是
original_score, match_name_2, match_score_2, match_index_2, match_name_0, match_score_0, match_index_0, match_name_2, match_score_2, match_index_2,
如果出现平局,最左边的匹配项应该排名更高。我应该注意有时它们会按照正确的顺序排列,但30-40%的情况下,它们不会。
我一直盯着屏幕两个小时,完全被难住了,所以非常感激任何帮助。
index | original_name | match_name_0 | score_0 | match_index_0 | match_name_1 | score_1 | match_index_1 | match_name_2 | score_2 | match_index_2 | match_name_3 | score_3 | match_index_3 | match_name_4 | score_4 | match_index_4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | aberdeen asset management plc | aberdeen asset management sa | 100 | 2114 | aberdeen asset management plc esop | 100 | 2128 | aberdeen asset management inc | 100 | 2123 | aberdeen asset management spain | 71.18779356 | 2132 | aberdeen asset management ireland | 69.50514818 | 2125 |
2 | agi partners llc | agi partners llc | 100 | 5274 | agi partners llc | 100 | 5273 | agr partners llc | 57.51100704 | 5378 | aci partners llc | 53.45090217 | 3097 | avi partners llc | 53.45090217 | 17630 |
3 | alberta investment management corporation | alberta investment management corporation | 100 | 6754 | alberta investment management corporation pension arm | 100 | 6755 | anchor investment management corporation | 17.50748486 | 10682 | cbc investment management corporation | 11.79760839 | 36951 | harvest investment management corporation | 31.70316571 | 85547 |
英文:
I'm working on a matching company names and I have a dataframe that returns output in the format below.
The table has an original name and for each original name, there could be N number of matches. For each match, there are 3 columns, match_name_0, score_0, match_index_0 and so on up to match_name_N.
I'm trying to figure out a way to return a new dataframe that sorts the columns after the original_name by the highest match scores. Essentially, if match_score_2 was the highest then match_score_0 followed by match_score_1 the columns would be
original_score, match_name_2, match_score_2, match_index_2, match_name_0, match_score_0, match_index_0, match_name_2, match_score_2, match_index_2,
In the event of a tie, the leftmost match should be ranked higher. I should note that sometimes they will be in the correct order but 30-40% of the times, they are not.
I've been staring at my screen for 2 hours and totally stumped so any help is greatly appreciated
index | original_name | match_name_0 | score_0 | match_index_0 | match_name_1 | score_1 | match_index_1 | match_name_2 | score_2 | match_index_2 | match_name_3 | score_3 | match_index_3 | match_name_4 | score_4 | match_index_4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | aberdeen asset management plc | aberdeen asset management sa | 100 | 2114 | aberdeen asset management plc esop | 100 | 2128 | aberdeen asset management inc | 100 | 2123 | aberdeen asset management spain | 71.18779356 | 2132 | aberdeen asset management ireland | 69.50514818 | 2125 |
2 | agi partners llc | agi partners llc | 100 | 5274 | agi partners llc | 100 | 5273 | agr partners llc | 57.51100704 | 5378 | aci partners llc | 53.45090217 | 3097 | avi partners llc | 53.45090217 | 17630 |
3 | alberta investment management corporation | alberta investment management corporation | 100 | 6754 | alberta investment management corporation pension arm | 100 | 6755 | anchor investment management corporation | 17.50748486 | 10682 | cbc investment management corporation | 11.79760839 | 36951 | harvest investment management corporation | 31.70316571 | 85547 |
答案1
得分: 0
我理解您想要按照score
,然后按照match_number
来对每个original_name
的匹配进行排序。
宽格式的数据通常比较难处理,包括这种情况。我建议将数据转换为长格式,然后可以使用以下代码轻松地按照您需要的顺序进行排序:
sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
最后,您可以将数据重新转换为宽格式。
import pandas as pd
from io import StringIO
# 示例数据
df = """
original_name,match_name_0,score_0,match_index_0,match_name_1,score_1,match_index_1,match_name_2,score_2,match_index_2,match_name_3,score_3,match_index_3,match_name_4,score_4,match_index_4
aberdeen asset management plc,aberdeen asset management sa,100,2114,aberdeen asset management plc esop,100,2128,aberdeen asset management inc,100,2123,aberdeen asset management spain,71.18779356,2132,aberdeen asset management ireland,69.50514818,2125
agi partners llc,agi partners llc,100,5274,agi partners llc,100,5273,agr partners llc,57.51100704,5378,aci partners llc,53.45090217,3097,avi partners llc,53.45090217,17630
alberta investment management corporation,alberta investment management corporation,100,6754,alberta investment management corporation pension arm,100,6755,anchor investment management corporation,17.50748486,10682,cbc investment management corporation,11.79760839,36951,harvest investment management corporation,31.70316571,85547
"""
df = pd.read_csv(StringIO(df.strip()), sep=',', engine='python')
# 宽格式转为长格式
result = pd.wide_to_long(df, ['match_name','score','match_index'], i='original_name', j='match_number', sep='_').reset_index()
# 按需求排序匹配项
result = result.sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
# 覆盖之前排序所施加的排名
# 这确保了一旦数据重新转换为宽格式,顺序保持不变
result['match_number'] = result.groupby('original_name').cumcount()
# 长格式转为宽格式
result = result.set_index(['original_name','match_number']).unstack()
# 整理数据以匹配预期结果
result = result.swaplevel(axis=1).sort_index(axis=1)
result = result.reindex(['match_name','score','match_index'], axis=1, level=1)
result.columns = [f'{col[1]}_{col[0]}' for col in result.columns]
因此,例如,alberta investment management corporation
的先前匹配项4现在是匹配项2(基于分数)。对于 agi partners llc
,匹配项3和4的顺序保持不变,因为它们具有相同的分数。
英文:
I am assuming you want to impose the ordering of matches first by score
and then by match_number
individually for each original_name
.
Wide datasets are usually difficult to deal with, including this case. I suggest to reshape to a long dataset, where you can easily impose your required ordering by
sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
Finally, you can reshape it back to a wide dataset.
import pandas as pd
from io import StringIO
# sample data
df = """
original_name,match_name_0,score_0,match_index_0,match_name_1,score_1,match_index_1,match_name_2,score_2,match_index_2,match_name_3,score_3,match_index_3,match_name_4,score_4,match_index_4
aberdeen asset management plc,aberdeen asset management sa,100,2114,aberdeen asset management plc esop,100,2128,aberdeen asset management inc,100,2123,aberdeen asset management spain,71.18779356,2132,aberdeen asset management ireland,69.50514818,2125
agi partners llc,agi partners llc,100,5274,agi partners llc,100,5273,agr partners llc,57.51100704,5378,aci partners llc,53.45090217,3097,avi partners llc,53.45090217,17630
alberta investment management corporation,alberta investment management corporation,100,6754,alberta investment management corporation pension arm,100,6755,anchor investment management corporation,17.50748486,10682,cbc investment management corporation,11.79760839,36951,harvest investment management corporation,31.70316571,85547
"""
df= pd.read_csv(StringIO(df.strip()), sep=',', engine='python')
# wide to long
result = pd.wide_to_long(df, ['match_name','score','match_index'], i='original_name', j='match_number', sep='_').reset_index()
# sort matches as per requirement
result = result.sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
# overwrite ranking imposed by previous sort
# this ensures that the order is maintained once it is
# reshaped back to a wide dataset
result['match_number'] = result.groupby('original_name').cumcount()
# reshape long to wide
result = result.set_index(['original_name','match_number']).unstack()
# tidy up to match expected result
result = result.swaplevel(axis=1).sort_index(axis=1)
result = result.reindex(['match_name','score','match_index'], axis=1, level=1)
result.columns = [f'{col[1]}_{col[0]}' for col in result.columns]
As a result, for example, previous match 4 of alberta investment management corporation
is now match 2 (based on score). The order of matches 3 and 4 for agi partners llc
remain the same because they have the same score.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论