2023-07-18 00:07:35

Sort/reindex a pandas MultiIndex DataFrame when the level 1 indices are different for each level 0 index. Only one level 0 group needs to be sorted

Question

I am struggling to find a way to sort the rows of a grouped pandas DataFrame with two index levels. Here is an example of the DataFrame df I have:

I could not paste the table in a nice format, but here is code for creating the same:

import pandas as pd
data = {'Count': [207, 105, 28, 37, 182, 194]}
index = pd.MultiIndex.from_tuples([
    ('age_group', '20-39'),
    ('age_group', '40-59'),
    ('age_group', '<19'),
    ('age_group', '>60'),
    ('gender', 'female'),
    ('gender', 'male')
], names=['Parameter', 'Values'])
df = pd.DataFrame(data, index=index)

I would like to sort the level 1 age group based on this list: age_groups = ["<19", "20-39", "40-59", ">60"] in order to place the row with "<19" before "20-39".

I have a lot more parameters and values than shown here, but even with this simple example I cannot find a good solution.

df.reindex(age_groups, level=1) removes the other Parameters.
pd.MultiIndex.from_product() seems to require me to hard-code all the other indices, which I do not want to do. All the examples and documentation I have found use the same level 1 indices for every level 0 value.

Answer 1

Score: 2

I'd probably go with @Andrej's approach if the categories are fixed.

If you need a programmatic solution, here is one with numpy.lexsort. It uses the last number of each range as the sort key:

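import numpy as np

# Secondary key: trailing number of each level 1 label (NaN for labels without one).
# Primary key: level 0, because np.lexsort treats the last key as the primary one.
order = np.lexsort([df.index.get_level_values(1)
                      .str.extract(r'(\d+)$', expand=False)
                      .astype(float),
                    df.index.get_level_values(0)
                    ])
out = df.iloc[order]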

Output:

                  Count
Parameter Values       
age_group <19        28
          20-39     207
          40-59     105
          >60        37
gender    female    182
          male      194
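
When the desired order is given as an explicit list (as with age_groups in the question) rather than being derivable from the numbers in the labels, the same lexsort idea can be sketched with a rank lookup. This is only a sketch under assumptions: the names rank, primary and secondary are illustrative and not part of the answer above, and labels that are not in the list all share one rank, so the stable sort leaves them in their existing order.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
index = pd.MultiIndex.from_tuples(
    [("age_group", "20-39"), ("age_group", "40-59"), ("age_group", "<19"),
     ("age_group", ">60"), ("gender", "female"), ("gender", "male")],
    names=["Parameter", "Values"],
)
df = pd.DataFrame({"Count": [207, 105, 28, 37, 182, 194]}, index=index)

age_groups = ["<19", "20-39", "40-59", ">60"]
# Illustrative helper: label -> desired position.
rank = {label: pos for pos, label in enumerate(age_groups)}

# Secondary key: position in age_groups; unlisted labels share one trailing rank.
secondary = np.array([rank.get(v, len(rank)) for v in df.index.get_level_values(1)])
# Primary key: integer codes for level 0, numbered in sorted label order.
primary, _ = pd.factorize(df.index.get_level_values(0), sort=True)

# np.lexsort is stable and uses the LAST key in the list as the primary one.
order = np.lexsort([secondary, primary])
out = df.iloc[order]
print(out)
```

Because only the listed labels get distinct ranks, none of the other level 0 groups has to be spelled out.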

Answer 2

Score: 1

Kind of clunky, but I used a custom key with df.sort_index().

def sorter(vals):
    # sort_index applies the key to each index level separately
    ages = ['<19', '20-39', '40-59', '>60']
    genders = ['female', 'male']
    if vals.name == 'Parameter':
        # the key must map each label in place, so return the labels unchanged
        return vals
    else:
        return [ages.index(val) if val in ages else genders.index(val) for val in vals]
out = df.sort_index(key=sorter)
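
If listing every other group (such as genders) by hand is not practical, the key function could build its ranks from one explicit list and fall back to plain sorted order for everything else. The following is a rough sketch, assuming a pandas version where sort_index accepts key (1.1 or later); make_key is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

# Rebuild the example frame from the question.
index = pd.MultiIndex.from_tuples(
    [("age_group", "20-39"), ("age_group", "40-59"), ("age_group", "<19"),
     ("age_group", ">60"), ("gender", "female"), ("gender", "male")],
    names=["Parameter", "Values"],
)
df = pd.DataFrame({"Count": [207, 105, 28, 37, 182, 194]}, index=index)


def make_key(order, level_name):
    """Build a sort_index key ranking one level by an explicit list (hypothetical helper)."""
    rank = {label: pos for pos, label in enumerate(order)}

    def key(vals):
        # sort_index calls the key once per index level.
        if vals.name != level_name:
            return vals  # other levels keep sorting by their own labels
        # Labels outside the list are ranked after it, in plain sorted order.
        others = sorted(set(vals) - set(rank))
        full = dict(rank)
        full.update({label: len(rank) + i for i, label in enumerate(others)})
        return vals.map(full)

    return key


out = df.sort_index(key=make_key(["<19", "20-39", "40-59", ">60"], "Values"))
print(out)
```

Since level 0 is still the primary sort level, the ranks only matter within each Parameter group.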

Answer 3

Score: 1

Another option is to simply select the rows in the correct order:

weights = ["<19", "20-39", "40-59", ">60"]
df = pd.concat([df.loc[("age_group", weights), :], df.loc[("gender",):]])
print(df)

Prints:

                  Count
Parameter Values       
age_group <19        28
          20-39     207
          40-59     105
          >60        37
gender    female    182
          male      194
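
With many parameters, the same selection idea could be wrapped in a loop so that only the groups needing a custom order are listed. This is a sketch under assumptions: custom_order is a hypothetical dictionary introduced here for illustration, and every label listed in it is assumed to exist in the corresponding group.

```python
import pandas as pd

# Rebuild the example frame from the question.
index = pd.MultiIndex.from_tuples(
    [("age_group", "20-39"), ("age_group", "40-59"), ("age_group", "<19"),
     ("age_group", ">60"), ("gender", "female"), ("gender", "male")],
    names=["Parameter", "Values"],
)
df = pd.DataFrame({"Count": [207, 105, 28, 37, 182, 194]}, index=index)

# Hypothetical mapping: level 0 label -> desired order of its level 1 labels.
custom_order = {"age_group": ["<19", "20-39", "40-59", ">60"]}

pieces = []
for param in df.index.get_level_values(0).unique():
    if param in custom_order:
        # Select the listed rows in the requested order.
        block = df.loc[(param, custom_order[param]), :]
    else:
        # A list selector keeps both index levels and the existing row order.
        block = df.loc[[param]]
    pieces.append(block)

out = pd.concat(pieces)
print(out)
```

Groups that are not in custom_order pass through unchanged, so they never need to be named.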
