Sort/reindex a pandas MultiIndex dataframe when level 1 indices are different for each level 0 index. Only one level 0 group needs to be sorted
Question
I am struggling to find a way to sort the rows of a grouped pandas dataframe with a two-level index. Here is an example of the dataframe df I have.
I could not paste the table in a nice format, but this code creates it:
import pandas as pd
data = {'Count': [207, 105, 28, 37, 182, 194]}
index = pd.MultiIndex.from_tuples([
    ('age_group', '20-39'),
    ('age_group', '40-59'),
    ('age_group', '<19'),
    ('age_group', '>60'),
    ('gender', 'female'),
    ('gender', 'male')
], names=['Parameter', 'Values'])
df = pd.DataFrame(data, index=index)
I would like to sort the level 1 values of the age_group group according to this list: age_groups = ["<19", "20-39", "40-59", ">60"], so that the row with "<19" is placed before "20-39".
I have many more parameters and values than shown here, but even with this simple example I cannot find a good solution.
df.reindex(age_groups, level=1) removes the other Parameters.
pd.MultiIndex.from_product() seems to require me to hard-code all the other index values, which I do not want to do. All the examples and documentation I find use the same level 1 indices for every level 0 value.
Answer 1
Score: 2
I'd probably go with @Andrej's approach if the categories are fixed.
If you need a programmatic solution, here is one with numpy.lexsort. It sorts on the last number in each range:
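import numpy as np

# np.lexsort treats the LAST key in the list as the primary sort key
order = np.lexsort([
    # secondary key: trailing number of each level 1 label (NaN if there is none)
    df.index.get_level_values(1)
      .str.extract(r'(\d+)$', expand=False)
      .astype(float),
    # primary key: level 0 labels
    df.index.get_level_values(0),
])
out = df.iloc[order]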
Output:
                  Count
Parameter Values
age_group <19        28
          20-39     207
          40-59     105
          >60        37
gender    female    182
          male      194
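To make the lexsort answer easier to follow, here is a self-contained sketch that prints the intermediate key. The names upper_bound, order and out are only illustrative, and the approach assumes every label that should be ordered numerically ends in digits; labels without trailing digits become NaN and keep their original relative order within their group.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('age_group', '20-39'), ('age_group', '40-59'),
    ('age_group', '<19'), ('age_group', '>60'),
    ('gender', 'female'), ('gender', 'male'),
], names=['Parameter', 'Values'])
df = pd.DataFrame({'Count': [207, 105, 28, 37, 182, 194]}, index=index)

# Trailing digits of each level 1 label; labels with no trailing digits give NaN.
upper_bound = (df.index.get_level_values('Values')
                 .str.extract(r'(\d+)$', expand=False)
                 .astype(float))
print(upper_bound.tolist())

# The last key passed to lexsort is the primary one, so rows are ordered by
# Parameter first and by the extracted number within each Parameter.
order = np.lexsort([upper_bound, df.index.get_level_values('Parameter')])
out = df.iloc[order]
print(out)
```

Because lexsort is stable and NaN compares as the largest value, the gender rows stay in the order they already had.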
Answer 2
Score: 1
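Kind of clunky, but I used a custom key with df.sort_index():
def sorter(vals):
    # sort_index calls the key once per index level; vals.name identifies the level
    ages = ['<19', '20-39', '40-59', '>60']
    genders = ['female', 'male']
    if vals.name == 'Parameter':
        # return the labels unchanged so level 0 sorts alphabetically
        return vals
    else:
        # replace each level 1 label with its position in the matching list
        return [ages.index(val) if val in ages else genders.index(val) for val in vals]

df.sort_index(key=sorter)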
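The question mentions many more parameters than shown, so listing every group inside the key function may not scale. One possible variation, sketched below, ranks only the labels from age_groups explicitly and falls back to alphabetical rank for everything else. The names custom_rank and level_key are hypothetical, and the sketch assumes each level 0 group is either fully listed or not listed at all, since the two kinds of rank share one number range.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('age_group', '20-39'), ('age_group', '40-59'),
    ('age_group', '<19'), ('age_group', '>60'),
    ('gender', 'female'), ('gender', 'male'),
], names=['Parameter', 'Values'])
df = pd.DataFrame({'Count': [207, 105, 28, 37, 182, 194]}, index=index)

age_groups = ['<19', '20-39', '40-59', '>60']
custom_rank = {label: pos for pos, label in enumerate(age_groups)}

def level_key(vals):
    # sort_index applies the key to each level separately.
    if vals.name != 'Values':
        return vals  # leave level 0 as it is
    labels = pd.Series(vals)
    # Fallback for unlisted labels: their alphabetical (dense) rank.
    alphabetical = labels.rank(method='dense')
    # Listed labels get their position in age_groups instead.
    return pd.Index(labels.map(custom_rank).fillna(alphabetical))

result = df.sort_index(key=level_key)
print(result)
```

Only the relative order inside each level 0 group matters here, because level 0 is compared first.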
Answer 3
Score: 1
Another option is to simply select the rows in the correct order:
weights = ["<19", "20-39", "40-59", ">60"]
df = pd.concat([df.loc[("age_group", weights), :], df.loc[("gender",):]])
print(df)
Prints:
                  Count
Parameter Values
age_group <19        28
          20-39     207
          40-59     105
          >60        37
gender    female    182
          male      194
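With more than two level 0 groups, the same select-and-concatenate idea could be written as a loop, so that only the groups needing a custom order are named. This is a rough sketch rather than part of the answer above: desired_order and pieces are made-up names, and it assumes every label listed in desired_order actually exists in its group (otherwise .loc raises a KeyError).

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('age_group', '20-39'), ('age_group', '40-59'),
    ('age_group', '<19'), ('age_group', '>60'),
    ('gender', 'female'), ('gender', 'male'),
], names=['Parameter', 'Values'])
df = pd.DataFrame({'Count': [207, 105, 28, 37, 182, 194]}, index=index)

# Only groups that need a custom order are listed here.
desired_order = {'age_group': ['<19', '20-39', '40-59', '>60']}

pieces = []
for param in df.index.get_level_values('Parameter').unique():
    block = df.loc[[param]]  # list selection keeps both index levels
    if param in desired_order:
        # Select the rows of this group in the requested order.
        block = block.loc[[(param, value) for value in desired_order[param]]]
    pieces.append(block)

out = pd.concat(pieces)
print(out)
```

Groups that are not in desired_order pass through unchanged, so nothing else has to be hard-coded.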