Posted on 2023-03-10 00:18:02

geom_density (ggplot2): one density plot with different groups

Question

[Edit] I have realized that geom_density may not be the best way to convey the message I wanted. The plot was meant to show that 1) 50m has fewer event occurrences and 2) for 100m and 200m, event occurrences peak at 12. Would there be a better way to represent the data? The first plot @Alan kindly provided is almost perfect for conveying this, but it fails when the 50m data points are too close to each other, because of the kernel smoothing @r2evans mentioned.

I am trying to draw a density plot with geom_density in ggplot2 that shows the proportion of times an event occurs at different times and distances. Below is the mock data.

data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))
)

This data frame has 200 data points: 4 at 50m, 98 at 100m, and 98 at 200m. When I use geom_density with it:

library(ggplot2)

data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))) |>
  ggplot(aes(x = time, fill = distance, color = distance)) +
  geom_density(alpha = 0.1)

However, this plot gives the impression that 50m is much more prevalent. Why is this happening? I suspect the function calculates the density separately for each distance, but a peak close to 0.8 for 50m does not really make sense to me.

Is it possible to calculate the density such that the overall area is 1 i.e., the 50m curve will have an area of 0.02?

Answer 1

Score: 0

You can use aes(y = after_stat(density * n / nrow(df))) if you want the areas under all three curves to sum to 1. Since you are piping the data frame straight into ggplot() rather than storing it as df, you will need after_stat(density * n / 200) instead (i.e. enter the denominator manually).

library(ggplot2)

set.seed(1)

data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))) |>
  ggplot(aes(x = time, fill = distance, color = distance)) +
  geom_density(alpha = 0.1, aes(y = after_stat(density * n / 200)), bw = 0.5)
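To check the scaling numerically outside ggplot2, here is a minimal base R sketch. It assumes the same mock data and the same fixed bandwidth (bw = 0.5); the names df, trapezoid, and areas are mine, not from the question. It integrates each group's kernel density with the trapezoidal rule, before and after multiplying by n / nrow(df).

```r
set.seed(1)

# Same mock data as in the question, stored under the (assumed) name df.
df <- data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))
)

# Trapezoidal rule for the area under a curve given as (x, y) points.
trapezoid <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# For each distance group: area of the raw density and of the scaled density.
areas <- sapply(split(df$time, df$distance), function(x) {
  d <- density(x, bw = 0.5)
  raw <- trapezoid(d$x, d$y)
  c(unscaled = raw, scaled = raw * length(x) / nrow(df))
})

print(round(areas, 3))

# Sum of the scaled areas across the three groups.
total_scaled_area <- sum(areas["scaled", ])
```

Each unscaled area should come out close to 1, which is why the four 50m points produce such a tall, narrow curve in the original plot. After scaling, the 50m area should be close to 4/200 = 0.02 and the three areas together close to 1. As far as I know, stat_density also exposes a computed variable count (density times the number of points in the group), so after_stat(count / 200) should be an equivalent way to write the mapping.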

We can see this is correct by stacking the three densities on top of each other. Because every layer uses the same fixed bandwidth (bw = 0.5), the stacked curves should give an identical result to plotting the density of the whole column without splitting it into color groups (here we show the overall density as a dashed black line):

library(ggplot2)

set.seed(1)

data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))) |>
  ggplot(aes(x = time, fill = distance, color = distance)) +
  geom_density(alpha = 0.1, aes(y = after_stat(density * n / 200)),
               position = 'stack', bw = 0.5) +
  geom_density(color = 'black', fill = NA, linetype = 2, bw = 0.5)
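The same identity can be checked with numbers rather than by eye. This is a rough sketch in base R, assuming a shared evaluation grid and the same fixed bandwidth for every estimate; the helper kde and the grid limits are my own choices for illustration. With a common bandwidth, the overall kernel density is the sum of the group densities weighted by n_group / n_total.

```r
set.seed(1)

df <- data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))
)

# Evaluate every density on the same grid so the curves can be added pointwise.
grid_from <- min(df$time) - 2
grid_to <- max(df$time) + 2
kde <- function(x) {
  density(x, bw = 0.5, from = grid_from, to = grid_to, n = 512)
}

# Density of the whole column, ignoring the groups.
overall <- kde(df$time)

# Group densities, each weighted by its share of the data.
parts <- lapply(split(df$time, df$distance), function(x) {
  kde(x)$y * length(x) / nrow(df)
})
stacked <- Reduce(`+`, parts)

# Largest pointwise gap between the stacked curve and the overall curve.
max_abs_diff <- max(abs(stacked - overall$y))
print(max_abs_diff)
```

The gap should be at the level of floating-point noise. If each group were allowed to pick its own default bandwidth, the stacked curve would no longer match the overall density exactly, which is why bw is fixed in the plots above.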

Answer 2

Score: 0

You could compute the density values ahead of time with density() (or something else if you like), then scale the y-values by each distance's share of the rows in the dataset:

library(dplyr)
library(ggplot2)
library(tidyr)
data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))) |>
  # count the rows in each distance group
  group_by(distance) |>
  mutate(n = n()) |>
  ungroup() |>
  # share of all rows that belongs to each group
  mutate(scale = n / n()) |>
  group_by(distance) |>
  # one density estimate per group, stored as a tidy x/y table
  reframe(scale = first(scale),
          d = list(broom::tidy(density(time)))) |>
  unnest(d) |>
  # shrink each curve so its area equals the group's share
  mutate(y_scaled = y * scale) |>
  ggplot(aes(x = x, y = y_scaled, colour = distance)) +
  geom_area(aes(fill = distance), position = "identity", alpha = 0.2) +
  labs(x = "Time", y = "Density (scaled)")

Created on 2023-03-09 with reprex v2.0.2
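For anyone who would rather not depend on broom and tidyr, the same idea can be sketched with base R alone. This is only an illustration of the approach above, not the answerer's code: the names df and scaled_density are assumptions, and the plot is drawn only if ggplot2 happens to be installed.

```r
set.seed(1)

df <- data.frame(
  distance = c(rep("50", 4), rep("100", 98), rep("200", 98)),
  time = c(7, 8, 8, 9, rnorm(196, 12, 2))
)

# One density estimate per distance group, scaled by the group's share of rows.
scaled_density <- do.call(rbind, lapply(split(df, df$distance), function(g) {
  d <- density(g$time)  # default bandwidth, chosen separately for each group
  data.frame(
    distance = g$distance[1],
    x = d$x,
    y_scaled = d$y * nrow(g) / nrow(df)
  )
}))

# Plot only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(scaled_density,
              aes(x = x, y = y_scaled, colour = distance, fill = distance)) +
    geom_area(position = "identity", alpha = 0.2) +
    labs(x = "Time", y = "Density (scaled)")
  print(p)
}
```

Note that density() picks a bandwidth for each group here, as in the code above, so the 50m curve is smoothed differently from the other two. Passing a common bw to density() would make the curves comparable with the fixed-bandwidth plots in Answer 1.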
