Sort/reindex a pandas MultiIndex DataFrame when the level 1 indices differ for each level 0 index and only one level 0 group needs to be sorted


Question

I am struggling to find a way to sort the rows of a grouped pandas DataFrame that has a two-level index. I could not paste the table in a nice format, but the following code creates an example of the DataFrame df:

    import pandas as pd

    data = {'Count': [207, 105, 28, 37, 182, 194]}
    index = pd.MultiIndex.from_tuples([
        ('age_group', '20-39'),
        ('age_group', '40-59'),
        ('age_group', '<19'),
        ('age_group', '>60'),
        ('gender', 'female'),
        ('gender', 'male')
    ], names=['Parameter', 'Values'])
    df = pd.DataFrame(data, index=index)

I would like to sort the level 1 values of the age_group parameter according to this list: age_groups = ["<19", "20-39", "40-59", ">60"], so that the row with "<19" comes before "20-39".

I have a lot more parameters and values than shown here, but even with this simple example I cannot find a good solution.

df.reindex(age_groups, level=1) removes the other parameters.
pd.MultiIndex.from_product() seems to require hard-coding all the other index values, which I want to avoid. All the examples and documentation I have found use the same level 1 indices for every level 0 value.
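
For readers who want to reproduce the reindex problem described above, here is a minimal sketch. It rebuilds the example DataFrame and checks which level 0 labels survive the call; the variable names reindexed and remaining are only illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ("age_group", "20-39"),
    ("age_group", "40-59"),
    ("age_group", "<19"),
    ("age_group", ">60"),
    ("gender", "female"),
    ("gender", "male"),
], names=["Parameter", "Values"])
df = pd.DataFrame({"Count": [207, 105, 28, 37, 182, 194]}, index=index)

age_groups = ["<19", "20-39", "40-59", ">60"]

# Reindexing on level 1 keeps only rows whose level 1 label is in age_groups,
# so every row of the "gender" group is dropped.
reindexed = df.reindex(age_groups, level=1)
remaining = set(reindexed.index.get_level_values(0))
print(remaining)
```

This is why the question asks for a way to reorder one group without having to list the labels of all the others.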

Answer 1

Score: 2

I'd probably go with @Andrej's approach if the categories are fixed.

If you need a programmatic solution, here is one using numpy.lexsort. It sorts by level 0 first and then by the last number in each level 1 label:

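    import numpy as np

    # np.lexsort uses the LAST key as the primary key: level 0 first,
    # then the trailing number of each level 1 label (NaN if there is none).
    order = np.lexsort([df.index.get_level_values(1)
                          .str.extract(r'(\d+)$', expand=False)
                          .astype(float),
                        df.index.get_level_values(0)
                       ])
    out = df.iloc[order]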

Output:

                      Count
    Parameter Values
    age_group <19        28
              20-39     207
              40-59     105
              >60        37
    gender    female    182
              male      194
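
To see why the regular expression in this answer produces the desired order, it can help to look at the secondary sort key on its own. The sketch below rebuilds the example index and extracts the trailing number of each level 1 label; the names level1 and keys are only illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ("age_group", "20-39"),
    ("age_group", "40-59"),
    ("age_group", "<19"),
    ("age_group", ">60"),
    ("gender", "female"),
    ("gender", "male"),
], names=["Parameter", "Values"])

level1 = index.get_level_values(1)

# r"(\d+)$" captures the digits at the very end of each label.
# Labels without trailing digits ("female", "male") become NaN.
keys = level1.str.extract(r"(\d+)$", expand=False).astype(float)

for label, key in zip(level1, keys):
    print(label, key)
```

The age labels map to 39, 59, 19 and 60, so "<19" sorts first and ">60" last. Note that this relies on every age label ending in a number that reflects its position; a label such as "60+" would not match the pattern and would get NaN instead.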

Answer 2

Score: 1

Kind of clunky, but I used a custom key with df.sort_index().

    def sorter(vals):
        ages = ['<19', '20-39', '40-59', '>60']
        genders = ['female', 'male']
        if vals.name == 'Parameter':
            # The key must stay aligned with the rows, so return the
            # level 0 labels unchanged instead of a sorted copy.
            return vals
        else:
            # Level 1: rank each label by its position in its own list.
            return [ages.index(val) if val in ages else genders.index(val) for val in vals]

    df.sort_index(key=sorter)
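
The key function above still hard-codes the gender labels. A possible variant that only needs the custom list is sketched below: labels in the list are ranked by their position, and all other labels are ranked alphabetically among themselves. The helper make_level_key is a hypothetical name, not part of pandas. The sketch assumes that each level 0 group contains either listed labels or unlisted labels, not a mix, because both kinds of rank share the same integer range.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ("age_group", "20-39"),
    ("age_group", "40-59"),
    ("age_group", "<19"),
    ("age_group", ">60"),
    ("gender", "female"),
    ("gender", "male"),
], names=["Parameter", "Values"])
df = pd.DataFrame({"Count": [207, 105, 28, 37, 182, 194]}, index=index)

age_groups = ["<19", "20-39", "40-59", ">60"]


def make_level_key(level_name, custom_order):
    """Build a sort_index key that applies custom_order to one index level."""
    custom_rank = {label: pos for pos, label in enumerate(custom_order)}

    def key(values):
        # sort_index calls the key once per level; leave other levels as-is.
        if values.name != level_name:
            return values
        # Labels outside custom_order are ranked alphabetically among themselves.
        others = sorted(set(values) - set(custom_rank))
        other_rank = {label: pos for pos, label in enumerate(others)}
        return values.map(lambda label: custom_rank.get(label, other_rank.get(label)))

    return key


out = df.sort_index(key=make_level_key("Values", age_groups))
print(out)
```

Because level 0 is compared first, the ranks only decide the order inside each Parameter group, so the gender rows keep their alphabetical order without being listed anywhere.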

Answer 3

Score: 1

Another option is to simply select the rows in the correct order:

    weights = ["<19", "20-39", "40-59", ">60"]
    df = pd.concat([df.loc[("age_group", weights), :], df.loc[("gender",):]])
    print(df)

Prints:

                      Count
    Parameter Values
    age_group <19        28
              20-39     207
              40-59     105
              >60        37
    gender    female    182
              male      194
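
With many parameters, naming every other group in the concat call becomes tedious. One way to generalize the selection idea is sketched below: loop over the level 0 labels, reorder only the requested group, and pass all other groups through unchanged. The function reorder_one_group is a hypothetical helper, and the sketch assumes that every label in the order list exists in that group.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ("age_group", "20-39"),
    ("age_group", "40-59"),
    ("age_group", "<19"),
    ("age_group", ">60"),
    ("gender", "female"),
    ("gender", "male"),
], names=["Parameter", "Values"])
df = pd.DataFrame({"Count": [207, 105, 28, 37, 182, 194]}, index=index)


def reorder_one_group(frame, group, order):
    """Reorder the level 1 labels of one level 0 group; keep the rest as-is."""
    level0 = frame.index.get_level_values(0)
    pieces = []
    for name in level0.unique():
        if name == group:
            # Selecting with a list of full index tuples returns the rows
            # in the order of that list.
            pieces.append(frame.loc[[(name, label) for label in order]])
        else:
            # Boolean mask keeps the other groups in their existing order.
            pieces.append(frame[level0 == name])
    return pd.concat(pieces)


out = reorder_one_group(df, "age_group", ["<19", "20-39", "40-59", ">60"])
print(out)
```

Only the group that needs a custom order is named, so adding more parameters to the DataFrame does not require changing the call.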

huangapple
  • Published on July 18, 2023, 00:07:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76706307.html