geom_density (ggplot2): one density plot with different groups

Question

[Edit] I have realized that geom_density may not be the best way to convey the message I want. The plot is meant to show that 1) 50m has fewer event occurrences and 2) for 100m and 200m, event occurrences peak at 12. Is there a better way to represent the data? The first plot @Alan kindly provided is almost perfect for this, but it fails when the 50m data points are too close to each other, because of the kernel smoothing @r2evans mentioned.

I am trying to draw a density plot with geom_density from ggplot2 that shows how often (as a proportion) an event occurs at different times and at different distances. Below is the mock data.

data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))
)

This data frame has 200 data points: 4 at 50m, 98 at 100m, and 98 at 200m. When I use geom_density with it:

library(ggplot2)

# Pipe the mock data straight into ggplot(); |> needs R 4.1 or later
data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))
) |>
  ggplot(aes(x = time, fill = distance, color = distance)) +
  geom_density(alpha = 0.1)

[Plot: per-group density curves for 50m, 100m, and 200m]

However, this plot gives the impression that 50m is far more prevalent. Why does this happen? I suspect the function calculates the density separately for each distance, but a peak close to 0.8 for 50m does not make sense to me.

Is it possible to scale the densities so that the overall area is 1, i.e. so that the 50m curve has an area of 0.02?

Answer 1

Score: 0

You can use aes(y = after_stat(density * n / nrow(df))) if you want the areas under all three curves to sum to 1. Because you are piping the data frame straight into ggplot() rather than storing it as df, you will need to enter the denominator manually: after_stat(density * n / 200).

library(ggplot2)

set.seed(1)

data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))) |>
  ggplot(aes(x = time, fill = distance, color = distance)) +
  # Scale each group's density by its share of all 200 rows
  geom_density(alpha = 0.1, aes(y = after_stat(density * n / 200)), bw = 0.5)

[Plot: the three density curves scaled by group size]
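As a short aside on why the n/200 factor gives the areas asked for in the question: the notation below is introduced here for illustration and does not come from the answer. Write \hat f_g for the kernel density estimate of group g, n_g for its number of points, and N for the total number of rows. Each unscaled curve integrates to 1 on its own, which is also why four tightly clustered 50m points produce such a tall peak. After scaling:

$$
\int \sum_g \frac{n_g}{N}\,\hat f_g(x)\,dx
= \sum_g \frac{n_g}{N} \int \hat f_g(x)\,dx
= \sum_g \frac{n_g}{N}
= 1
$$

With n_g = 4 and N = 200, the 50m curve has area 4/200 = 0.02, and the 100m and 200m curves each have area 98/200 = 0.49.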

We can check that this is correct by stacking the three densities on top of each other. With the same fixed bandwidth (bw = 0.5) in both layers, the stack should match the density of the whole column without splitting it into color groups (the overall density is shown here as a dashed black line):

library(ggplot2)

set.seed(1)

data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))) |>
  ggplot(aes(x = time, fill = distance, color = distance)) +
  # Stack the scaled per-group densities
  geom_density(alpha = 0.1, aes(y = after_stat(density * n / 200)),
               position = 'stack', bw = 0.5) +
  # Overall density of the whole column, drawn as a dashed black line
  geom_density(color = 'black', fill = NA, linetype = 2, bw = 0.5)

[Plot: stacked scaled densities with the overall density as a dashed black line]
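The same check can be done numerically rather than by eye. The sketch below is not part of the original answer: it uses base R's density() instead of geom_density, and the shared grid, its 3-unit padding, and the helper name kde are assumptions made for illustration. It compares the weighted sum of the per-group estimates with the estimate for the whole column, using the same fixed bandwidth of 0.5.

```r
# Numeric check with base R's density(); no plotting packages needed.
set.seed(1)

df <- data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))
)

# Evaluate every estimate on one shared grid with the same fixed bandwidth.
# The 3-unit padding is an arbitrary choice so the tails are not cut off.
grid_from <- min(df$time) - 3
grid_to <- max(df$time) + 3
kde <- function(x) density(x, bw = 0.5, from = grid_from, to = grid_to, n = 512)

groups <- split(df$time, df$distance)   # one numeric vector per distance
weights <- lengths(groups) / nrow(df)   # group shares: 0.49, 0.49, 0.02

overall <- kde(df$time)
parts <- lapply(groups, kde)

# Weighted sum of the per-group curves, i.e. density * n / N
mixture <- Reduce(`+`, Map(function(d, w) d$y * w, parts, weights))

dx <- diff(overall$x)[1]
area_total <- sum(mixture) * dx                          # area under the stack
area_50 <- sum(parts[["50"]]$y) * dx * weights[["50"]]   # area of the 50m curve
max_gap <- max(abs(mixture - overall$y))                 # stack vs. overall curve
```

Under these assumptions, area_total should come out very close to 1, area_50 very close to 0.02, and max_gap close to zero. The agreement depends on using one bandwidth everywhere; with the default per-group bandwidths the stacked curve and the overall curve would generally differ.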

Answer 2

Score: 0

You could compute the density values ahead of time with density() (or another estimator if you prefer), then scale the y-values by each distance group's share of the dataset:

library(dplyr)   # reframe() needs dplyr 1.1.0 or later
library(ggplot2)
library(tidyr)
# broom must also be installed for broom::tidy()
data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))) |>
  group_by(distance) |>
  mutate(n = n()) |>            # rows per distance group
  ungroup() |>
  mutate(scale = n / n()) |>    # group share of all rows
  group_by(distance) |>
  reframe(scale = first(scale),
          d = list(broom::tidy(density(time)))) |>
  unnest(d) |>
  mutate(y_scaled = y * scale) |>
  ggplot(aes(x = x, y = y_scaled, colour = distance)) +
  geom_area(aes(fill = distance), position = "identity", alpha = 0.2) +
  labs(x = "Time", y = "Density (scaled)")

[Plot: scaled density areas by distance]

Created on 2023-03-09 with reprex v2.0.2
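For readers who would rather not depend on dplyr, tidyr, and broom, the same precompute-and-scale idea can be sketched with base R plus ggplot2. This is an illustrative variant, not the answerer's code: the helper name scaled_density and the set.seed(1) call are assumptions added here. Like the answer above, it lets density() pick its default bandwidth separately for each group.

```r
library(ggplot2)

set.seed(1)

df <- data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))
)

# Hypothetical helper: one group's density curve, scaled by the group's
# share of all rows so that the curve's area equals that share.
scaled_density <- function(times, label, total_n) {
  d <- density(times)  # default bandwidth, chosen per group
  data.frame(
    distance = label,
    x = d$x,
    y_scaled = d$y * length(times) / total_n
  )
}

groups <- split(df$time, df$distance)

# Build one long data frame holding the scaled curve of every group
curves <- do.call(
  rbind,
  lapply(names(groups), function(g) scaled_density(groups[[g]], g, nrow(df)))
)

p <- ggplot(curves, aes(x = x, y = y_scaled, colour = distance, fill = distance)) +
  geom_area(position = "identity", alpha = 0.2) +
  labs(x = "Time", y = "Density (scaled)")
p
```

Because each group gets its own bandwidth here, the curves will not stack up exactly to the overall density the way the fixed-bandwidth version in Answer 1 does, but each curve's area still matches its group's share of the data.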

huangapple
  • Published on 2023-03-10 00:18:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75687302.html