Sort/reindex a pandas MultiIndex DataFrame when the level 1 indices are different for each level 0 index. Only one level 0 group needs to be sorted


Question


I am struggling to find a way to sort the rows of a grouped pandas DataFrame that has two index levels. Below is an example of the DataFrame df I have.

I could not paste the table in a nice format, but here is code that creates it:

import pandas as pd
data = {'Count': [207, 105, 28, 37, 182, 194]}
index = pd.MultiIndex.from_tuples([
    ('age_group', '20-39'),
    ('age_group', '40-59'),
    ('age_group', '<19'),
    ('age_group', '>60'),
    ('gender', 'female'),
    ('gender', 'male')
], names=['Parameter', 'Values'])
df = pd.DataFrame(data, index=index)

I would like to sort the level 1 age group based on this list: age_groups = ["<19", "20-39", "40-59", ">60"] in order to place the row with "<19" before "20-39".

I have a lot more parameters and values than shown here, but even with this simple example I cannot find a good solution.

df.reindex(age_groups, level=1) removes the other Parameters.
pd.MultiIndex.from_product() seems to require me to hard-code all the other indices, which I do not want to do. All the examples and documentation I can find have the same level 1 indices for every level 0 value.

Answer 1

Score: 2


I'd probably go with @Andrej's approach if the categories are fixed.

If you need a programmatic solution, here is one with numpy.lexsort. It uses the last number of each range as the sort key:

import numpy as np

# lexsort treats the last key as the primary one: level 0 first, then the number
order = np.lexsort([df.index.get_level_values(1)
                      .str.extract(r'(\d+)$', expand=False)
                      .astype(float),
                    df.index.get_level_values(0)
                    ])

out = df.iloc[order]

Output:

                  Count
Parameter Values       
age_group <19        28
          20-39     207
          40-59     105
          >60        37
gender    female    182
          male      194
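
To see why this ordering comes out right, it can help to look at the numeric key that the regular expression produces. The following is only a small sketch that rebuilds the frame from the question; the names `last_number` and `keys` are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("age_group", "20-39"), ("age_group", "40-59"), ("age_group", "<19"),
     ("age_group", ">60"), ("gender", "female"), ("gender", "male")],
    names=["Parameter", "Values"],
)
df = pd.DataFrame({"Count": [207, 105, 28, 37, 182, 194]}, index=index)

# Trailing digits of each level 1 label; labels without digits become NaN.
last_number = (
    df.index.get_level_values(1)
    .str.extract(r"(\d+)$", expand=False)
    .astype(float)
)
keys = last_number.tolist()
print(keys)
```

The four age labels map to 39, 59, 19 and 60, while the gender labels have no trailing digits and map to NaN. As far as I can tell, the approach therefore only reorders groups whose labels end in a number, and it depends on that number increasing with the desired order.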

Answer 2

Score: 1


Kind of clunky, but I used a custom key with df.sort_index().

def sorter(vals):
    ages = ['<19', '20-39', '40-59', '>60']
    genders = ['female', 'male']
    if vals.name == 'Parameter':
        # Return the labels unchanged: a key must stay aligned with its rows,
        # which sorted(vals) would not guarantee.
        return vals
    else:
        # Rank each label by its position in the matching list.
        return [ages.index(val) if val in ages else genders.index(val) for val in vals]

df.sort_index(key=sorter)
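
Since the question mentions many more parameters than shown, a variant of this key that hard-codes only the age groups may be more practical. The sketch below is one possible way to do that, assuming it is acceptable for all other labels to fall back to alphabetical order; `values_key` is a hypothetical helper name.

```python
import pandas as pd

age_groups = ["<19", "20-39", "40-59", ">60"]

index = pd.MultiIndex.from_tuples(
    [("age_group", "20-39"), ("age_group", "40-59"), ("age_group", "<19"),
     ("age_group", ">60"), ("gender", "female"), ("gender", "male")],
    names=["Parameter", "Values"],
)
df = pd.DataFrame({"Count": [207, 105, 28, 37, 182, 194]}, index=index)


def values_key(vals):
    """Rank 'Values' labels: listed age groups first, the rest alphabetically."""
    if vals.name != "Values":
        return vals  # leave the 'Parameter' level to sort normally
    rank = {label: i for i, label in enumerate(age_groups)}
    # Any label not in age_groups gets a rank after them, in alphabetical order.
    others = sorted(set(vals) - set(rank))
    rank.update({label: len(age_groups) + i for i, label in enumerate(others)})
    return vals.map(rank)


out = df.sort_index(key=values_key)
print(out)
```

Because the key returns a numeric rank for every label, it avoids comparing integers with strings, and new parameters do not require their own hard-coded list.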

Answer 3

Score: 1


Another option is to simply select the rows in the correct order:

weights = ["<19", "20-39", "40-59", ">60"]

df = pd.concat([df.loc[("age_group", weights), :], df.loc[("gender",):]])
print(df)

Prints:

                  Count
Parameter Values       
age_group <19        28
          20-39     207
          40-59     105
          >60        37
gender    female    182
          male      194
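
With many parameters, listing every group by hand in pd.concat gets tedious. The same idea of selecting rows in the desired order can be sketched more generally by building the full list of index tuples and reindexing once. This is only an illustration under the assumption that a single group needs a custom order; `custom_order` and `new_order` are hypothetical names.

```python
import pandas as pd

age_groups = ["<19", "20-39", "40-59", ">60"]

index = pd.MultiIndex.from_tuples(
    [("age_group", "20-39"), ("age_group", "40-59"), ("age_group", "<19"),
     ("age_group", ">60"), ("gender", "female"), ("gender", "male")],
    names=["Parameter", "Values"],
)
df = pd.DataFrame({"Count": [207, 105, 28, 37, 182, 194]}, index=index)

# Only the groups that need a custom order are listed here.
custom_order = {"age_group": age_groups}

# For every level 0 label, use the custom order if one exists,
# otherwise keep the level 1 labels in their current order.
new_order = [
    (param, value)
    for param in df.index.get_level_values("Parameter").unique()
    for value in custom_order.get(param, df.loc[param].index)
]

out = df.reindex(pd.MultiIndex.from_tuples(new_order, names=df.index.names))
print(out)
```

One caveat: reindex inserts NaN rows for any label in `custom_order` that does not exist in the frame, so the list should match the labels actually present.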

huangapple
  • Posted on July 18, 2023, 00:07:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76706307.html