尝试根据排名行的值重新排列数据框中的多个列

huangapple go评论69阅读模式
英文:

Trying to rearrange multiple columns in a dataframe based on ranking row values

问题

我正在处理匹配公司名称的工作,我有一个返回以下格式输出的数据框。

表格中有一个原始名称,对于每个原始名称,可能会有N个匹配项。对于每个匹配项,有3列,match_name_0,score_0,match_index_0等等,直到match_name_N。

我正在尝试找到一种方法,返回一个新的数据框,按照最高匹配分数对原始名称后面的列进行排序。基本上,如果match_score_2是最高的,那么匹配分数_0排在匹配分数_1之后,列的顺序将是

original_score, match_name_2, match_score_2, match_index_2, match_name_0, match_score_0, match_index_0, match_name_2, match_score_2, match_index_2,

如果出现平局,最左边的匹配项应该排名更高。我应该注意有时它们会按照正确的顺序排列,但30-40%的情况下,它们不会。

我一直盯着屏幕两个小时,完全被难住了,所以非常感激任何帮助。

index original_name match_name_0 score_0 match_index_0 match_name_1 score_1 match_index_1 match_name_2 score_2 match_index_2 match_name_3 score_3 match_index_3 match_name_4 score_4 match_index_4
0 aberdeen asset management plc aberdeen asset management sa 100 2114 aberdeen asset management plc esop 100 2128 aberdeen asset management inc 100 2123 aberdeen asset management spain 71.18779356 2132 aberdeen asset management ireland 69.50514818 2125
2 agi partners llc agi partners llc 100 5274 agi partners llc 100 5273 agr partners llc 57.51100704 5378 aci partners llc 53.45090217 3097 avi partners llc 53.45090217 17630
3 alberta investment management corporation alberta investment management corporation 100 6754 alberta investment management corporation pension arm 100 6755 anchor investment management corporation 17.50748486 10682 cbc investment management corporation 11.79760839 36951 harvest investment management corporation 31.70316571 85547
英文:

I'm working on a matching company names and I have a dataframe that returns output in the format below.

The table has an original name and for each original name, there could be N number of matches. For each match, there are 3 columns, match_name_0, score_0, match_index_0 and so on up to match_name_N.

I'm trying to figure out a way to return a new dataframe that sorts the columns after the original_name by the highest match scores. Essentially, if match_score_2 was the highest then match_score_0 followed by match_score_1 the columns would be

original_score, match_name_2, match_score_2, match_index_2, match_name_0, match_score_0, match_index_0, match_name_2, match_score_2, match_index_2,

In the event of a tie, the leftmost match should be ranked higher. I should note that sometimes they will be in the correct order but 30-40% of the times, they are not.

I've been staring at my screen for 2 hours and totally stumped so any help is greatly appreciated

index original_name match_name_0 score_0 match_index_0 match_name_1 score_1 match_index_1 match_name_2 score_2 match_index_2 match_name_3 score_3 match_index_3 match_name_4 score_4 match_index_4
0 aberdeen asset management plc aberdeen asset management sa 100 2114 aberdeen asset management plc esop 100 2128 aberdeen asset management inc 100 2123 aberdeen asset management spain 71.18779356 2132 aberdeen asset management ireland 69.50514818 2125
2 agi partners llc agi partners llc 100 5274 agi partners llc 100 5273 agr partners llc 57.51100704 5378 aci partners llc 53.45090217 3097 avi partners llc 53.45090217 17630
3 alberta investment management corporation alberta investment management corporation 100 6754 alberta investment management corporation pension arm 100 6755 anchor investment management corporation 17.50748486 10682 cbc investment management corporation 11.79760839 36951 harvest investment management corporation 31.70316571 85547

答案1

得分: 0

我理解您想要按照score,然后按照match_number来对每个original_name的匹配进行排序。

宽格式的数据通常比较难处理,包括这种情况。我建议将数据转换为长格式,然后可以使用以下代码轻松地按照您需要的顺序进行排序:

sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])

最后,您可以将数据重新转换为宽格式。

import pandas as pd
from io import StringIO

# 示例数据
df = """
original_name,match_name_0,score_0,match_index_0,match_name_1,score_1,match_index_1,match_name_2,score_2,match_index_2,match_name_3,score_3,match_index_3,match_name_4,score_4,match_index_4
aberdeen asset management plc,aberdeen asset management sa,100,2114,aberdeen asset management plc esop,100,2128,aberdeen asset management inc,100,2123,aberdeen asset management spain,71.18779356,2132,aberdeen asset management ireland,69.50514818,2125
agi partners llc,agi partners llc,100,5274,agi partners llc,100,5273,agr partners llc,57.51100704,5378,aci partners llc,53.45090217,3097,avi partners llc,53.45090217,17630
alberta investment management corporation,alberta investment management corporation,100,6754,alberta investment management corporation pension arm,100,6755,anchor investment management corporation,17.50748486,10682,cbc investment management corporation,11.79760839,36951,harvest investment management corporation,31.70316571,85547
"""
df = pd.read_csv(StringIO(df.strip()), sep=',', engine='python')

# 宽格式转为长格式
result = pd.wide_to_long(df, ['match_name','score','match_index'], i='original_name', j='match_number', sep='_').reset_index()

# 按需求排序匹配项
result = result.sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])

# 覆盖之前排序所施加的排名
# 这确保了一旦数据重新转换为宽格式,顺序保持不变
result['match_number'] = result.groupby('original_name').cumcount()

# 长格式转为宽格式
result = result.set_index(['original_name','match_number']).unstack()

# 整理数据以匹配预期结果
result = result.swaplevel(axis=1).sort_index(axis=1)
result = result.reindex(['match_name','score','match_index'], axis=1, level=1)
result.columns = [f'{col[1]}_{col[0]}' for col in result.columns]

因此,例如,alberta investment management corporation 的先前匹配项4现在是匹配项2(基于分数)。对于 agi partners llc,匹配项3和4的顺序保持不变,因为它们具有相同的分数。

英文:

I am assuming you want to impose the ordering of matches first by score and then by match_number individually for each original_name.

Wide datasets are usually difficult to deal with, including this case. I suggest to reshape to a long dataset, where you can easily impose your required ordering by

sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])

Finally, you can reshape it back to a wide dataset.

import pandas as pd
from io import StringIO
    
# sample data
df = """  
original_name,match_name_0,score_0,match_index_0,match_name_1,score_1,match_index_1,match_name_2,score_2,match_index_2,match_name_3,score_3,match_index_3,match_name_4,score_4,match_index_4
aberdeen asset management plc,aberdeen asset management sa,100,2114,aberdeen asset management plc esop,100,2128,aberdeen asset management inc,100,2123,aberdeen asset management spain,71.18779356,2132,aberdeen asset management ireland,69.50514818,2125
agi partners llc,agi partners llc,100,5274,agi partners llc,100,5273,agr partners llc,57.51100704,5378,aci partners llc,53.45090217,3097,avi partners llc,53.45090217,17630
alberta investment management corporation,alberta investment management corporation,100,6754,alberta investment management corporation pension arm,100,6755,anchor investment management corporation,17.50748486,10682,cbc investment management corporation,11.79760839,36951,harvest investment management corporation,31.70316571,85547
"""
df= pd.read_csv(StringIO(df.strip()), sep=',', engine='python')

# wide to long
result = pd.wide_to_long(df, ['match_name','score','match_index'], i='original_name', j='match_number', sep='_').reset_index()

# sort matches as per requirement
result = result.sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])

# overwrite ranking imposed by previous sort
# this ensures that the order is maintained once it is
# reshaped back to a wide dataset
result['match_number'] = result.groupby('original_name').cumcount()

# reshape long to wide
result = result.set_index(['original_name','match_number']).unstack()

# tidy up to match expected result
result = result.swaplevel(axis=1).sort_index(axis=1)
result = result.reindex(['match_name','score','match_index'], axis=1, level=1)
result.columns = [f'{col[1]}_{col[0]}' for col in result.columns]

As a result, for example, previous match 4 of alberta investment management corporation is now match 2 (based on score). The order of matches 3 and 4 for agi partners llc remain the same because they have the same score.

huangapple
  • 本文由 发表于 2023年2月8日 23:09:42
  • 转载请务必保留本文链接:https://go.coder-hub.com/75387765.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定