R - update boxplot axis range after adding points
Question
I have a boxplot which summarizes ~60000 turbidity data points into quartiles, median, whiskers and sometimes outliers. Often a few outliers are so high that the whole plot is compressed at the bottom, and I therefore choose to omit the outliers. However, I have also added averages to the plots as points, and I want these to always be plotted. The problem is that the y-axis of the boxplot does not adjust to the added average points, so when averages are far above the box they are simply plotted outside the chart window (see X-point for 2020, but none for 2021 or 2022). Normally with this parameter, the average will be between the whisker end and the most extreme outliers. This is normal, and expected in the data.
I have tried to capture the boxplot y-axis range to compare it with the average and then set ylim if needed, but I don't know how to retrieve the axis range.
My code is just
boxplot(...)
points(...)
and works as far as plotting the points. Just not adjusting the y-axis.
Question 1: is it not possible to get the boxplot to redraw with the new points data? I thought this was standard in R plots.
Question 2: if not, how can I dynamically adjust the y-axis range?
Answer 1
Score: 0
Let's try to show a concrete example of the problem with some simulated data:
set.seed(1)
df <- data.frame(y = c(rexp(99), 150), x = rep(c("A", "B"), each = 50))
Here, group "B" has a single outlier at 150, even though most values are a couple of orders of magnitude lower. That means that if we try to draw a boxplot, the boxes get squished at the bottom of the plot:
boxplot(y ~ x, data = df, col = "lightblue")
If we remove outliers, the boxes plot nicely:
boxplot(y ~ x, data = df, col = "lightblue", outline = FALSE)
The problem comes when we want to add a point indicating the mean value for each boxplot, since the mean of "B" lies outside the plot limits. Let's calculate and plot the means:
mean_vals <- sapply(split(df$y, df$x), mean)
mean_vals
#> A B
#> 0.9840417 4.0703334
boxplot(y ~ x, data = df, col = "lightblue", outline = FALSE)
points(1:2, mean_vals, cex = 2, pch = 16, col = "red")
The mean for "B" is missing because it lies above the upper range of the plot.
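Base graphics do not redraw: boxplot() fixes the coordinate system, and points() only draws into it. As a minimal sketch of how to check this (the variable names usr, y_range and outside are just illustrative), par("usr") returns the limits of the current plot region, which can be compared with the means:

```r
set.seed(1)
df <- data.frame(y = c(rexp(99), 150), x = rep(c("A", "B"), each = 50))
mean_vals <- sapply(split(df$y, df$x), mean)

boxplot(y ~ x, data = df, col = "lightblue", outline = FALSE)

# par("usr") gives c(x1, x2, y1, y2): the user coordinates of the plot region
usr <- par("usr")
y_range <- usr[3:4]

# TRUE for any mean that falls outside the visible y-range
outside <- mean_vals < y_range[1] | mean_vals > y_range[2]
outside
```

This only diagnoses the problem after the plot has been drawn; the limits still have to be passed to boxplot() via ylim before plotting.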
The secret here is to use boxplot.stats to get the limits of the whiskers. By concatenating our vector of means to this vector of stats and getting its range, we can set our plot limits exactly where they need to be:
y_limits <- range(c(boxplot.stats(df$y)$stats, mean_vals))
Now we apply these limits to a new boxplot and draw it with the points:
boxplot(y ~ x, data = df, outline = FALSE, ylim = y_limits, col = "lightblue")
points(1:2, mean_vals, cex = 2, pch = 16, col = "red")
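One caveat: boxplot.stats(df$y) computes the whiskers from the pooled data, not per group, so a group with a wider spread than the pooled data could in principle have its whisker clipped. A possible variant, sketched here under the same simulated data (bp is an illustrative name), takes the per-group statistics from boxplot() itself with plot = FALSE:

```r
set.seed(1)
df <- data.frame(y = c(rexp(99), 150), x = rep(c("A", "B"), each = 50))
mean_vals <- sapply(split(df$y, df$x), mean)

# plot = FALSE returns the summary without drawing anything.
# bp$stats is a 5 x n_groups matrix: lower whisker, lower hinge,
# median, upper hinge and upper whisker for each group.
bp <- boxplot(y ~ x, data = df, plot = FALSE)

# Limits that cover every group's whiskers and every mean
y_limits <- range(bp$stats, mean_vals)

boxplot(y ~ x, data = df, outline = FALSE, ylim = y_limits, col = "lightblue")
points(seq_along(mean_vals), mean_vals, cex = 2, pch = 16, col = "red")
```

For the example data both approaches should show the two means; the per-group version is just the safer choice when group spreads differ a lot.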
For comparison, you could do the whole thing in ggplot like this:
library(ggplot2)
ggplot(df, aes(x, y)) +
geom_boxplot(fill = "lightblue", outlier.shape = NA) +
geom_point(size = 3, color = "red", stat = "summary", fun = mean) +
coord_cartesian(ylim = y_limits) +
theme_classic(base_size = 16)
<sup>Created on 2023-02-05 with reprex v2.0.2</sup>