June 5, 2023, 21:17:56

How to set broken bar order after grouping the dataframe

Question

The code example below draws a broken horizontal bar chart (broken_barh) showing when a list of people joined and left a music band over a period of time:
```python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
result = pd.DataFrame([['Bill', 1972, 1974],
                       ['Bill', 1976, 1978],
                       ['Bill', 1967, 1971],
                       ['Danny', 1969, 1975],
                       ['Danny', 1976, 1977],
                       ['James', 1971, 1972],
                       ['Marshall', 1967, 1975]],
                      columns=['Person', 'Year_start', 'Year_left'])
fig, ax = plt.subplots()
names = sorted(result['Person'].unique())
colormap = plt.get_cmap('plasma')
slicedColorMap = colormap(np.linspace(0, 1, len(names)))
height = 0.5
for y, (name, g) in enumerate(result.groupby('Person')):
    ax.broken_barh(list(zip(g['Year_start'],
                            g['Year_left'] - g['Year_start'])),
                   (y - height / 2, height),
                   facecolors=slicedColorMap[y]
                   )
ax.set_ylim(0 - height, len(names) - 1 + height)
ax.set_xlim(result['Year_start'].min() - 1, result['Year_left'].max() + 1)
ax.set_yticks(range(len(names)), names)
ax.grid(True)
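plt.show()
```

The output is this chart:

[![enter image description here](https://i.stack.imgur.com/zDabL.png)](https://i.stack.imgur.com/zDabL.png)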

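I need to order the bars (along with the persons on the y axis) by 'Year_start' and 'Year_left', both in ascending order.

I know how to aggregate and sort the values in the dataframe after the data is grouped, and that I should reset the index afterwards: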

```python
sorted_result = result.groupby('Person').agg({'Year_start': min, 'Year_left': max})
sorted_result = sorted_result.sort_values(['Year_start', 'Year_left'], ascending=[True, True]).reset_index()
print(sorted_result)
```

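But I am having a hard time embedding this sorting into the existing for loop that draws the ax.broken_barh bars (also because, as I understand it, it is not possible to perform sort_values together with groupby and agg in a single iteration).
Is this sorting possible in this script at all, or should I completely reconsider the script structure?
Many thanks!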


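# Answer 1
**Score**: 1
You are almost there :-) You already have the names sorted by earliest start and latest end. You just need to change the Person column to an ordered categorical whose categories follow that order, and then add sort_values('Person') before grouping for the broken_barh plotting. The updated code is below, with comments added to make it easier to follow. Hope this is what you are looking for...
Please note: passing the labels as the second argument of set_yticks() only works in matplotlib 3.5 and later, so I have split that call into set_yticks() and set_yticklabels(), which works in older versions as well. I also changed names to sorted_result.Person.to_list() so that the labels are correctly aligned with the bars.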
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

result = pd.DataFrame([['Bill', 1972, 1974],
                       ['Bill', 1976, 1978],
                       ['Bill', 1967, 1971],
                       ['Danny', 1969, 1975],
                       ['Danny', 1976, 1977],
                       ['James', 1971, 1972],
                       ['Marshall', 1967, 1975]],
                      columns=['Person', 'Year_start', 'Year_left'])
fig, ax = plt.subplots()
names = sorted(result['Person'].unique())
colormap = plt.get_cmap('plasma')
slicedColorMap = colormap(np.linspace(0, 1, len(names)))
height = 0.5
## NEW ADDED CODE ##
## This is your code: get the sorted_result
## (string names 'min'/'max' avoid the warning newer pandas raises for the builtins)
sorted_result = result.groupby('Person').agg({'Year_start': 'min', 'Year_left': 'max'})
sorted_result = sorted_result.sort_values(['Year_start', 'Year_left'], ascending=[True, True]).reset_index()
## Change Person to an ordered categorical so that sorting follows the order you need
## Note that the categories come from sorted_result.Person.to_list(), i.e. the required sort order
result['Person'] = pd.Categorical(
    result['Person'],
    categories=sorted_result.Person.to_list(),
    ordered=True
)
## Here, added sort_values('Person') before grouping...
## (observed=True only silences a pandas warning; every category is present here)
for y, (name, g) in enumerate(result.sort_values('Person').groupby('Person', observed=True)):
    ax.broken_barh(list(zip(g['Year_start'],
                            g['Year_left']-g['Year_start'])),
                   (y-height/2, height),
                   facecolors=slicedColorMap[y]
                   )
ax.set_ylim(0-height, len(names)-1+height)
ax.set_xlim(result['Year_start'].min()-1, result['Year_left'].max()+1)
ax.set_yticks(range(len(sorted_result.Person.to_list())))  ## Changed names
ax.set_yticklabels(sorted_result.Person.to_list())  ## Changed names
ax.grid(True)
plt.show()
```

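To see why the categorical conversion changes the bar order, here is a minimal sketch without any plotting. It assumes the same sample data as the question; the variable names `order`, `alphabetical` and `by_category` are only illustrative and are not part of the answer's code.

```python
import pandas as pd

# Same sample data as in the question.
result = pd.DataFrame([['Bill', 1972, 1974],
                       ['Bill', 1976, 1978],
                       ['Bill', 1967, 1971],
                       ['Danny', 1969, 1975],
                       ['Danny', 1976, 1977],
                       ['James', 1971, 1972],
                       ['Marshall', 1967, 1975]],
                      columns=['Person', 'Year_start', 'Year_left'])

# Per-person span, sorted by earliest start and then latest end.
order = (result.groupby('Person')
               .agg({'Year_start': 'min', 'Year_left': 'max'})
               .sort_values(['Year_start', 'Year_left'])
               .index.to_list())

# Groups of a plain string column come back in alphabetical order ...
alphabetical = [name for name, _ in result.groupby('Person')]

# ... while groups of a categorical column follow the category order.
result['Person'] = pd.Categorical(result['Person'], categories=order, ordered=True)
by_category = [name for name, _ in result.groupby('Person', observed=True)]

print(order)
print(alphabetical)
print(by_category)
```

With this data the first and third lists should match (Marshall, Bill, Danny, James), while the second one is alphabetical, which is the order the original plot used.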

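# Answer 2

**Score**: 1

IIUC, all you need to do is pass sort=False to groupby() and sort the dataframe beforehand in the order you want. The rest of the code can remain the same: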

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

result = pd.DataFrame([['Bill', 1972, 1974],
                       ['Bill', 1976, 1978],
                       ['Bill', 1967, 1971],
                       ['Danny', 1969, 1975],
                       ['Danny', 1976, 1977],
                       ['James', 1971, 1972],
                       ['Marshall', 1967, 1975]],
                      columns=['Person', 'Year_start', 'Year_left'])
# Map each person to their position in the desired order
sorter = (result.groupby('Person')
                .agg({'Year_start': 'min', 'Year_left': 'max'})
                .sort_values(['Year_start', 'Year_left'], ascending=[True, True])
                .index.to_frame()
                .assign(sorter=range(result['Person'].nunique()))
                .set_index('Person')
                .to_dict()['sorter'])

result['sorter'] = result['Person'].map(sorter)
result = result.sort_values('sorter', ascending=True)
fig, ax = plt.subplots()
colormap = plt.get_cmap('plasma')
slicedColorMap = colormap(np.linspace(0, 1, result['Person'].nunique()))
height = 0.5
names = []
# Here I'm using sort=False to keep groupby from sorting it differently
for y, (name, g) in enumerate(result.groupby('Person', sort=False)):
    print(name)
    ax.broken_barh(list(zip(g['Year_start'],
                            g['Year_left']-g['Year_start'])),
                   (y-height/2, height),
                   facecolors=slicedColorMap[y]
                   )
    names.append(name)
```

The rest of the code remains the same. This outputs:

[![enter image description here](https://i.stack.imgur.com/6epWQ.png)](https://i.stack.imgur.com/6epWQ.png)

Because this sort order is quite specific and not easily expressed with sort_values() on the original dataframe alone, I suggest computing it in a separate dataframe and then merging it back into the original dataframe (here via map()) to sort it.

I'm also making a small improvement: instead of defining names statically up front and passing it later, the list is built as the loop runs, so each name always matches its bar. That's also why I'm using result['Person'].nunique() rather than len(names).

EDIT: Code edited based on discussion with the OP.
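As a rough check of the sort=False approach, the sketch below builds only the arguments that would be handed to ax.broken_barh, without drawing anything. The names `position`, `ranked` and `bars` are hypothetical helpers introduced for this illustration, and the stable sort (kind='mergesort') is my own assumption to keep each person's rows in their original relative order; the answer itself uses the default sort.

```python
import pandas as pd

# Same sample data as in the question.
result = pd.DataFrame([['Bill', 1972, 1974],
                       ['Bill', 1976, 1978],
                       ['Bill', 1967, 1971],
                       ['Danny', 1969, 1975],
                       ['Danny', 1976, 1977],
                       ['James', 1971, 1972],
                       ['Marshall', 1967, 1975]],
                      columns=['Person', 'Year_start', 'Year_left'])

# Desired order: earliest start first, ties broken by latest end.
span = result.groupby('Person').agg({'Year_start': 'min', 'Year_left': 'max'})
order = span.sort_values(['Year_start', 'Year_left']).index
position = {name: i for i, name in enumerate(order)}

# Attach the position to every row and sort by it (mergesort is stable).
ranked = (result.assign(sorter=result['Person'].map(position))
                .sort_values('sorter', kind='mergesort'))

# sort=False makes groupby yield the groups in order of first appearance,
# so the enumerate index y matches the desired position.
bars = []
for y, (name, g) in enumerate(ranked.groupby('Person', sort=False)):
    xranges = [(int(start), int(left - start))
               for start, left in zip(g['Year_start'], g['Year_left'])]
    bars.append((y, name, xranges))

for y, name, xranges in bars:
    print(y, name, xranges)
```

Each `xranges` list is a sequence of (start, width) pairs, which is the form ax.broken_barh expects as its first argument, and `y` gives the row on which that person's bars would be drawn.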
