R: Counting Number of Observations in Each Percentile

Question

I am working with the R programming language.

I have the following dataset:

library(dplyr)

set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)

df <- data.frame(country, gender, age, height, owns_bicycle)

My Problem:

  • First, I want to break height into 3 equal-sized groups by height value (e.g. 0%-33%, 33%-66%, 66%-99%).
  • Next, I want to break age into 5 equal-sized groups by age value (e.g. 0%-20%, 20%-40%, etc.).
  • Then, for each unique combination of country, gender, age_group and height_group, I want to find the percentage of people who own a bicycle.

As a result, this type of analysis would let me know things like - "if you are a man between ages 30-35, between 150-155 cm and from USA, there is a 43% chance you own a bicycle".

Here is my current attempt to do this:

library(dplyr)

df %>%
  mutate(height_group = ntile(height, 3),
         age_group = ntile(age, 5)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            min_height = min(height),
            max_height = max(height),
            min_age = min(age),
            max_age = max(age),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) %>%
  mutate(height_range = paste0(round(min_height, 1), "-", round(max_height, 1)),
         age_range = paste0(min_age, "-", max_age)) %>%
  select(-min_height, -max_height, -min_age, -max_age)

When I look at the results:

# A tibble: 62 x 8
# Groups:   country, gender, height_group [18]
   country gender height_group age_group count percent_own_bicycle height_range age_range
   <chr>   <chr>         <int>     <int> <int>               <dbl> <chr>        <chr>    
 1 Canada  F                 1         1     2                 0   157.2-158.5  23-30    
 2 Canada  F                 1         2     4                25   150.8-154.1  37-43    
 3 Canada  F                 1         4     2                 0   154.4-156.9  66-72    
 4 Canada  F                 1         5     1                 0   154.6-154.6  80-80    
 5 Canada  F                 2         1     1                 0   169.3-169.3  23-23  

I see height_group = 1 having multiple ranges, e.g. 157.2-158.5 and 150.8-154.1. How is this possible? I would have thought that height_group = 1 could only have a single range.

Can someone please show me what I am doing wrong?

Thanks!

Answer 1

Score: 1

Yes, your attempt looks correct. You have divided the height variable into three equal-sized groups using the ntile() function, and the age variable into five equal-sized groups. Then, you grouped the data by country, gender, height_group, and age_group and calculated the count, minimum height, maximum height, minimum age, maximum age, and the percentage of individuals who own a bicycle within each group.
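
As for the multiple ranges reported for height_group = 1, they are most likely a side effect of where min() and max() are evaluated rather than a fault in ntile(). Both are computed inside summarise(), after grouping by all four variables, so each row shows the heights observed in that one country/gender/age cell, not the bounds of the whole height group. A minimal sketch of one possible adjustment, assuming the df from the question, is to attach the range labels per bucket before the final grouping. The helper name range_label is hypothetical and not part of dplyr.

```r
library(dplyr)

# Data from the question
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

# Hypothetical helper: format the min and max of a vector as "min-max"
range_label <- function(x, digits = 1) {
  paste0(round(min(x), digits), "-", round(max(x), digits))
}

result <- df %>%
  mutate(height_group = ntile(height, 3),
         age_group    = ntile(age, 5)) %>%
  # Label each bucket using all rows in that bucket, not a single cell
  group_by(height_group) %>%
  mutate(height_range = range_label(height)) %>%
  group_by(age_group) %>%
  mutate(age_range = range_label(age, 0)) %>%
  # Carry the labels through the final grouping
  group_by(country, gender, height_group, height_range,
           age_group, age_range) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100,
            .groups = "drop")

# Each height_group should now map to exactly one height_range
result %>% distinct(height_group, height_range)
```

Note that ntile() splits by rank and ignores ties, so identical values (for example, two people of the same age) can land in adjacent buckets and the printed ranges of neighbouring groups may touch or overlap at the boundary.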

Answer 2

Score: -3

OP here - this is my second attempt:

final = df %>%
  mutate(height_group = cut(height, breaks = 3),
         age_group = cut(age, breaks = 5)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100)

I think the results look more "consistent" now? (e.g. no multiple ranges)

`summarise()` has grouped output by 'country', 'gender', 'height_group'. You can override using the `.groups` argument.
# A tibble: 60 x 6
# Groups:   country, gender, height_group [18]
   country gender height_group age_group   count percent_own_bicycle
   <chr>   <chr>  <fct>        <fct>       <int>               <dbl>
 1 Canada  F      (151,161]    (17.9,34.2]     2                   0
 2 Canada  F      (151,161]    (34.2,50.4]     5                  40
 3 Canada  F      (151,161]    (50.4,66.6]     1                   0
 4 Canada  F      (151,161]    (66.6,82.8]     2                   0
 5 Canada  F      (151,161]    (82.8,99.1]     1                   0
 6 Canada  F      (161,170]    (17.9,34.2]     1                   0
 7 Canada  F      (161,170]    (34.2,50.4]     1                 100
 8 Canada  F      (161,170]    (50.4,66.6]     1                   0
 9 Canada  F      (161,170]    (82.8,99.1]     2                  50
10 Canada  F      (170,180]    (17.9,34.2]     3                   0

Is this correct?
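
One caveat on this second attempt: cut(height, breaks = 3) splits the range of values into intervals of equal width, so the groups need not contain equal numbers of observations, which is what the question originally asked for. If equal-sized groups with a single labelled interval per group are wanted, one possible sketch is to pass quantile() break points to cut(). This is an assumption about the intent, not a confirmed solution from the thread, and the helper name quantile_cut is hypothetical.

```r
library(dplyr)

# Data from the question
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

# Hypothetical helper: bin x into k groups using quantile break points,
# so every group gets one fixed interval label
quantile_cut <- function(x, k) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = k + 1))
  # include.lowest = TRUE keeps the minimum value inside the first bin
  cut(x, breaks = breaks, include.lowest = TRUE)
}

final <- df %>%
  mutate(height_group = quantile_cut(height, 3),
         age_group    = quantile_cut(age, 5)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100,
            .groups = "drop")

# Group sizes before splitting by country and gender
table(quantile_cut(df$height, 3))
table(quantile_cut(df$age, 5))
```

With this approach the group sizes come out as close to equal as the data allow, and tied values always fall in the same bin. If heavy ties make two quantile break points identical, cut() will stop with an error about non-unique breaks, in which case fewer groups or manually chosen breaks would be needed. Setting .groups = "drop" also silences the summarise() grouping message.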

huangapple
  • Posted on June 15, 2023 at 06:14:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76477911.html