2023-05-22 23:40:28

How can I plot mean and standard deviation error bars on a stripplot or swarmplot?

Question

I created the following plot with the code and data posted at the end of this question:

The black dot represents the mean of the R2 Score over all retailers, and the black lines represent the corresponding standard deviation.

I want to display the mean and standard deviation in the typical way, as seen below:

I guess this must be possible with matplotlib errorbar or seaborn pointplot, but I have been working on this for ages and cannot find a solution.

This answer with pointplot does not fulfill my needs, as I want one error bar over multiple categories, not one error bar per category.
I have a similar problem with this answer, working with swarmplot and pointplot.

The following is the corresponding code:

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

test = pd.read_csv('test.csv')

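# Calculate mean and standard deviation.
# sort=False keeps the groups in order of first appearance (c, fc, f, s, sc, w),
# so the values line up with the featuresets list below.
mean_data = test.groupby('featureset', sort=False)['r2_score'].mean().values
std_data = test.groupby('featureset', sort=False)['r2_score'].std().values
featuresets = ["c", "fc", "f", "s", "sc", "w"]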

p = sns.stripplot(x="featureset",
                  y="r2_score",
                  hue="retailer",
                  data=test,
                  marker="^",
                  size=8)

# Plot stripplot with mean and standard deviation
sns.pointplot(x=featuresets,
              y=mean_data,
              join=False,
              color='black',
              markers='o',
              scale=2)
sns.pointplot(x=featuresets,
              y=mean_data - std_data,
              join=False,
              color='black',
              markers='_',
              scale=4)
sns.pointplot(x=featuresets,
              y=mean_data + std_data,
              join=False,
              color='black',
              markers='_',
              scale=4)

plt.legend(title='Retailer')
sns.move_legend(p, loc="upper left", bbox_to_anchor=(1, 1))

p.set(xlabel='Featureset', ylabel='R2 Score')

plt.savefig("plot.png", format="png", bbox_inches='tight')

For complete reproducibility, here is the dataset I used, saved as test.csv:

r2_score,featureset,retailer
0.7055950484,c,S
0.942584686,c,K
0.8651950609,c,B
0.9051873402,c,H
0.5877088336,c,P
0.7944303127,c,O
0.6370605237,fc,S
0.9755270173,fc,K
0.9065356558,fc,B
0.921142567,fc,H
0.5798048892,fc,P
0.6580349995,fc,O
0.7217345443,f,S
0.9755270173,f,K
0.8839177116,f,B
0.921142567,f,H
0.5070612616,f,P
0.6580349995,f,O
0.5678318495,s,S
0.9637899061,s,K
0.9369641498,s,B
0.9297479733,s,H
0.5029283363,s,P
0.6580349995,s,O
0.5678318495,sc,S
0.9729308458,sc,K
0.8471079755,sc,B
0.9297479733,sc,H
0.497615548,sc,P
0.6580349995,sc,O
0.6624239947,w,S
0.889206858,w,K
0.7810312601,w,B
0.8562172874,w,H
0.4446346851,w,P
0.6580349995,w,O

EDIT: With the help of the answers, I updated my code so that it fulfills my needs better than before, producing plots like the attached example.

Please find the corresponding code below:

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

def plot(data, x_axis, hue, target, ordered_list=['S', 'K', 'B', 'H', 'P', 'O']):
    
    data = pd.read_csv(data)

    data = data[["r2_score", x_axis, hue]]

    # Calculate mean and standard deviation
    mean_data = data.groupby(x_axis, sort=False)['r2_score'].mean()
    std_data = data.groupby(x_axis, sort=False)['r2_score'].std()
    x = std_data.index.tolist()

    data_sorted = data.sort_values(hue, key=lambda x: x.map({v:k for k, v in enumerate(ordered_list)}))

    colorlist = ['yellowgreen', 'seagreen', 'lightseagreen', 'steelblue', 'royalblue', 'slateblue']

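    # One error bar (mean +/- std) per x category, drawn at its integer position.
    # .iloc is used because the Series are indexed by category label, not by integer.
    for i in range(len(x)):
        plt.errorbar(x=i,
                     y=mean_data.iloc[i],
                     yerr=std_data.iloc[i],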
                     color='grey',
                     fmt='_',
                     capsize=5,
                     elinewidth=1,
                     capthick=1)

    for i in range(len(ordered_list)):    
        p = sns.stripplot(x=x_axis,
                          y="r2_score",
                          hue=hue,
                          data=data.loc[data[hue] == ordered_list[i]],
                          marker='$' + ordered_list[i] + '$',
                          size=10,
                          palette=[colorlist[i]])

    plt.xlabel(x_axis.title(), size='xx-large')
    plt.ylabel("R2 Score", size='xx-large')

    p.get_legend().remove()

plot("test.csv", "featureset", "retailer", "focusproduct")

I still want to change one thing: I want to increase readability by preventing elements of the plot from overlapping (e.g., the markers and the error bar, or the markers among themselves). I cannot find a way to do so.
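One possible way to avoid the overlap, sketched here under assumptions rather than taken from the answers: skip stripplot's random jitter and give every retailer its own fixed horizontal slot inside each category, with one extra slot reserved for the error bar. The sketch uses plain matplotlib scatter instead of seaborn, inlines only the first two featuresets of test.csv, and the names `width`, `offsets`, and `categories` are illustrative.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# First two featuresets of test.csv, inlined to keep the sketch self-contained.
CSV = """r2_score,featureset,retailer
0.7055950484,c,S
0.942584686,c,K
0.8651950609,c,B
0.9051873402,c,H
0.5877088336,c,P
0.7944303127,c,O
0.6370605237,fc,S
0.9755270173,fc,K
0.9065356558,fc,B
0.921142567,fc,H
0.5798048892,fc,P
0.6580349995,fc,O
"""
data = pd.read_csv(io.StringIO(CSV))

retailers = ["S", "K", "B", "H", "P", "O"]
categories = list(dict.fromkeys(data["featureset"]))  # order of first appearance

# Assumption: each category occupies a band of this width on the x axis.
width = 0.6
n_slots = len(retailers) + 1          # slot 0 is reserved for the error bar
step = width / (n_slots - 1)
offsets = [-width / 2 + k * step for k in range(n_slots)]

stats = data.groupby("featureset", sort=False)["r2_score"].agg(["mean", "std"])

fig, ax = plt.subplots()

# Error bar (mean +/- std) in the leftmost slot of each category.
for pos, cat in enumerate(categories):
    ax.errorbar(pos + offsets[0],
                stats.loc[cat, "mean"],
                yerr=stats.loc[cat, "std"],
                color="grey", fmt="_", capsize=5, elinewidth=1, capthick=1)

# Each retailer gets its own slot, so letter markers never share an x position.
for k, retailer in enumerate(retailers, start=1):
    sub = data[data["retailer"] == retailer]
    xs = [categories.index(c) + offsets[k] for c in sub["featureset"]]
    ax.scatter(xs, sub["r2_score"], marker=f"${retailer}$", s=100, label=retailer)

ax.set_xticks(list(range(len(categories))))
ax.set_xticklabels(categories)
ax.set_xlabel("Featureset")
ax.set_ylabel("R2 Score")
fig.savefig("plot_no_overlap.png", bbox_inches="tight")
```

Because the offsets are deterministic, the layout is identical on every run; a smaller `width` pulls the markers closer to the category centre if the bands start to run into each other.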

Answer 1

Score: 1

You had the right idea: errorbar works. You also need the yerr and capsize arguments.

for i, feature in enumerate(featuresets):
    plt.errorbar(x=feature, y=mean_data[i], yerr=std_data[i], color='black', fmt='_', capsize=3)
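For reference, a self-contained sketch of the same idea follows. It is not the asker's exact script: it inlines only the first two featuresets of test.csv, computes the statistics with `sort=False` so that their order matches the order of the categories on the x axis, and passes that order to stripplot explicitly. The names `stats` and `eb` are illustrative.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# First two featuresets of test.csv, inlined to keep the sketch self-contained.
CSV = """r2_score,featureset,retailer
0.7055950484,c,S
0.942584686,c,K
0.8651950609,c,B
0.9051873402,c,H
0.5877088336,c,P
0.7944303127,c,O
0.6370605237,fc,S
0.9755270173,fc,K
0.9065356558,fc,B
0.921142567,fc,H
0.5798048892,fc,P
0.6580349995,fc,O
"""
test = pd.read_csv(io.StringIO(CSV))

# Mean and std per featureset, kept in order of first appearance.
# With the default sort=True the groups would be ordered alphabetically,
# which can differ from the order in which the categories are plotted.
stats = test.groupby("featureset", sort=False)["r2_score"].agg(["mean", "std"])

fig, ax = plt.subplots()
sns.stripplot(data=test, x="featureset", y="r2_score", hue="retailer",
              order=list(stats.index), marker="^", size=8, ax=ax)

# One error bar per category: a short horizontal marker at the mean,
# with caps at mean - std and mean + std.
eb = ax.errorbar(x=list(stats.index), y=stats["mean"], yerr=stats["std"],
                 color="black", fmt="_", capsize=3, zorder=3)

sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1), title="Retailer")
ax.set(xlabel="Featureset", ylabel="R2 Score")
fig.savefig("plot.png", format="png", bbox_inches="tight")
```

Passing the category labels as `x` lets matplotlib place each bar on the matching category, so the bars stay aligned as long as `stats` and `order` use the same sequence.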
