Should a density plot with ggplot return the ordered frequencies of a variable with multiple levels?

Question

I have a question about density plots created with ggplot2. When a variable with multiple levels is plotted on an axis, should that axis show the levels ordered from the highest to the lowest frequency?

I am not sure about the representation I got here:
[Image: the density plot produced by the code below]

This is the dataset:

data = data.frame(cos = c(rep('5', 308), rep('3', 199), rep('0', 184), rep('2', 9)), 
           mag = c('Yes', 'No'))

This is how I tried to sort the variable plotted on the x axis (cos):

library(data.table)
data = setDT(data)[, freq := .N, by = .(cos)][order(-freq)]

And here is the code for the plot:

library(ggplot2)

ggplot(data) +
  geom_density(aes(x= cos, fill = mag), alpha=0.4) +
  labs(title="Density curve",x="cos", y = "mag") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        panel.background = element_blank()) +
  theme(axis.ticks.y=element_blank(), axis.ticks.x=element_blank())

Shouldn't 5 be the first point on the x axis, since it has the highest frequency?

Answer 1

Score: 1

What you are describing is not a density plot at all, since the x axis is discrete. A density plot shows the estimated probability density function of a continuous variable, so on a discrete axis the value of a "density" curve anywhere other than at the peak over each discrete x axis tick is meaningless (and potentially misleading).

From the comments, what you are looking for is effectively just a bar plot, but using bell curves rather than rectangles for the "bars".
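
For comparison, here is a minimal sketch of the plain bar plot with the levels ordered by frequency, assuming the data frame from the question. The names `freq` and `p` are introduced here only for illustration. The point it shows is that a discrete axis follows the factor level order, so it is the levels, not the rows, that need reordering.

```r
library(ggplot2)

# Same data as in the question (mag is recycled along the 700 rows)
data <- data.frame(
  cos = c(rep("5", 308), rep("3", 199), rep("0", 184), rep("2", 9)),
  mag = c("Yes", "No")
)

# Order the factor levels by descending frequency; sorting the rows alone
# does not change the order in which ggplot2 draws a discrete axis
freq <- sort(table(data$cos), decreasing = TRUE)
data$cos <- factor(data$cos, levels = names(freq))

# Bar plot of counts, with "Yes" and "No" side by side for each level
p <- ggplot(data, aes(x = cos, fill = mag)) +
  geom_bar(position = "dodge", alpha = 0.6) +
  labs(x = "cos", y = "Count")

levels(data$cos)
```

Printing `p` draws the chart with 5 as the first level on the x axis.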

I think the best way to do this sort of niche visualization is to work out exactly what you want to plot, wrangle the data into the correct format, then draw it with simple geoms.

It will be easier to use a continuous axis and fake-label it with the discrete levels afterwards. The wrangling might look something like this:

library(tidyverse)

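df <- data %>%
  count(cos, mag) %>%
  # order the levels of cos from most to least frequent
  mutate(cos = reorder(cos, -n)) %>%
  group_by(cos, mag) %>%
  # reframe() needs dplyr >= 1.1.0; older versions used summarise() here
  reframe(x = seq(0, 5, 0.01),
          # bell curve centred on the level's integer code, scaled to peak at n
          y = n * dnorm(x, as.numeric(cos), sd = 0.2) /
            dnorm(as.numeric(cos), as.numeric(cos), sd = 0.2))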
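
The division by `dnorm(mean, mean, sd)` in the wrangling step rescales each bell curve so that its peak equals the count `n`. A small base R check of that scaling is sketched below; the values of `n`, `mu` and `s` are hypothetical and chosen only for illustration.

```r
# Hypothetical values chosen for illustration
n  <- 154   # count for one cos/mag combination
mu <- 1     # numeric position of the level on the continuous axis
s  <- 0.2   # standard deviation controlling the width of the bell

x <- seq(0, 5, 0.01)

# dnorm(mu, mu, sd = s) is the maximum of the normal density, so dividing
# by it makes the curve peak at 1; multiplying by n makes it peak at n
y <- n * dnorm(x, mu, sd = s) / dnorm(mu, mu, sd = s)

max(y)           # peak height, equal to n up to floating point error
x[which.max(y)]  # location of the peak, at mu
```

Because only the peak height carries information, the choice of `sd` affects how wide the "bars" look but not the counts they represent.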

And the plotting code, if you want the "Yes" and "No" values overlaid, would be:

ggplot(df, aes(x = x, y = y, fill = mag, group = interaction(cos, mag))) +
  geom_area(position = "identity", color = "black", alpha = 0.5) +
  scale_x_continuous("cos", breaks = 1:4, labels = levels(df$cos)) +
  labs(y = "Count") +
  scale_fill_manual(values = c("orange", "deepskyblue4")) +
  theme_minimal(base_size = 20)
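
The pairing `breaks = 1:4, labels = levels(df$cos)` relies on `as.numeric()` returning a factor's integer codes, which follow the level order set by `reorder()`. A small sketch of that mapping, using the per-level totals from the question (the names `counts` and `cos_fct` are introduced here for illustration):

```r
# Total rows per level of cos, as in the question's data
counts <- c("5" = 308, "3" = 199, "0" = 184, "2" = 9)

# A fresh factor has alphabetical levels: "0", "2", "3", "5"
cos_fct <- factor(names(counts))

# reorder() sorts the levels by ascending score, so negating the counts
# puts the most frequent level first
cos_fct <- reorder(cos_fct, -counts)

levels(cos_fct)      # level order after reordering
as.numeric(cos_fct)  # integer code of each element, used as the x position
```

The most frequent level gets code 1 and therefore sits at the first break, which is why the labels line up with the curves.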

If you instead want them stacked, you could do:

ggplot(df, aes(x = x, y = y, fill = mag, group = interaction(cos, mag))) +
  lapply(split(df, df$cos), function(x) {
    geom_area(position = "stack", color = "black", alpha = 0.5, data = x)
  }) +
  scale_x_continuous("cos", breaks = 1:4, labels = levels(df$cos)) +
  labs(y = "Count") +
  scale_fill_manual(values = c("orange", "deepskyblue4")) +
  theme_minimal(base_size = 20)

If you want them dodged, you would need to wrangle the data a little differently:

df <- data %>%
  count(cos, mag) %>%
  mutate(cos = reorder(cos, -n)) %>%
  group_by(cos, mag) %>%
  # reframe() needs dplyr >= 1.1.0; older versions used summarise() here
  reframe(x = seq(0, 5, 0.01),
          # shift "Yes" curves left and "No" curves right of the tick
          y = n * dnorm(x, as.numeric(cos) + ifelse(mag == "Yes", -0.1, 0.1),
                        sd = 0.2) /
            dnorm(as.numeric(cos) + ifelse(mag == "Yes", -0.1, 0.1),
                  as.numeric(cos) + ifelse(mag == "Yes", -0.1, 0.1), sd = 0.2))


ggplot(df, aes(x = x, y = y, fill = mag, group = interaction(cos, mag))) +
  geom_area(position = "identity", color = "black", alpha = 0.5) +
  scale_x_continuous("cos", breaks = 1:4, labels = levels(df$cos)) +
  labs(y = "Count") +
  scale_fill_manual(values = c("orange", "deepskyblue4")) +
  theme_minimal(base_size = 20)

huangapple
  • Posted on June 18, 2023, 18:05:53
  • Please keep this link when reposting: https://go.coder-hub.com/76499984.html