R: Counting Number of Observations in Each Percentile

Question

I am working with the R programming language.

I have the following dataset:

    library(dplyr)

    set.seed(123)
    n <- 100
    country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
    gender <- sample(c("M", "F"), n, replace = TRUE)
    age <- sample(18:100, n, replace = TRUE)
    height <- runif(n, min = 150, max = 180)
    owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
    df <- data.frame(country, gender, age, height, owns_bicycle)

My Problem:

  • First, I want to break height into 3 equal-sized groups by height value (e.g. 0%-33%, 33%-66%, 66%-100%).
  • Next, I want to break age into 5 equal-sized groups by age value (e.g. 0%-20%, 20%-40%, etc.).
  • Then, for each unique combination of country, gender, age_group and height_group, I want to find the percentage of people who own a bicycle.

As a result, this type of analysis would let me know things like - "if you are a man between ages 30-35, between 150-155 cm and from USA, there is a 43% chance you own a bicycle".

Here is my current attempt to do this:

    library(dplyr)

    df %>%
      mutate(height_group = ntile(height, 3),
             age_group = ntile(age, 5)) %>%
      group_by(country, gender, height_group, age_group) %>%
      summarise(count = n(),
                min_height = min(height),
                max_height = max(height),
                min_age = min(age),
                max_age = max(age),
                percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) %>%
      mutate(height_range = paste0(round(min_height, 1), "-", round(max_height, 1)),
             age_range = paste0(min_age, "-", max_age)) %>%
      select(-min_height, -max_height, -min_age, -max_age)

When I look at the results:

    # A tibble: 62 x 8
    # Groups:   country, gender, height_group [18]
      country gender height_group age_group count percent_own_bicycle height_range age_range
      <chr>   <chr>         <int>     <int> <int>               <dbl> <chr>        <chr>
    1 Canada  F                 1         1     2                   0 157.2-158.5  23-30
    2 Canada  F                 1         2     4                  25 150.8-154.1  37-43
    3 Canada  F                 1         4     2                   0 154.4-156.9  66-72
    4 Canada  F                 1         5     1                   0 154.6-154.6  80-80
    5 Canada  F                 2         1     1                   0 169.3-169.3  23-23

I see height_group = 1 having multiple ranges, e.g. 157.2-158.5 and 150.8-154.1. How is this possible? I would have thought that height_group = 1 can only have a single range.

Can someone please show me what I am doing wrong?

Thanks!

Answer 1

Score: 1

Yes, your attempt looks correct. You have divided the height variable into three equal-sized groups using the ntile() function, and the age variable into five equal-sized groups. Then, you grouped the data by country, gender, height_group, and age_group and calculated the count, minimum height, maximum height, minimum age, maximum age, and the percentage of individuals who own a bicycle within each group.
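The multiple ranges seen in the question are likely explained by where `min()` and `max()` are evaluated: inside each country/gender/age subgroup, not over the whole height group. Each row therefore reports the range of the few people in that cell, which is only a slice of the full range of `height_group = 1`. The sketch below rebuilds the question's data and summarises by `height_group` alone, so each group gets exactly one range. The name `group_ranges` is introduced here for illustration and is not from the original post.

```r
library(dplyr)

# Rebuild the example data from the question so the block is self-contained
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

# One row per ntile() height group, with the range taken over the whole group
# (group_ranges is a hypothetical name used only in this sketch)
group_ranges <- df %>%
  mutate(height_group = ntile(height, 3)) %>%
  group_by(height_group) %>%
  summarise(count = n(),
            min_height = min(height),
            max_height = max(height),
            .groups = "drop")

print(group_ranges)
```

Every subgroup range in the question's output should fall inside the corresponding row of `group_ranges`, so the grouping itself is consistent; only the reported labels are subgroup-specific.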

Answer 2

Score: -3

OP here - this is my second attempt:

    final <- df %>%
      mutate(height_group = cut(height, breaks = 3),
             age_group = cut(age, breaks = 5)) %>%
      group_by(country, gender, height_group, age_group) %>%
      summarise(count = n(),
                percent_own_bicycle = mean(owns_bicycle == "Yes") * 100)

I think the results look more "consistent" now? (e.g. no multiple ranges)

    `summarise()` has grouped output by 'country', 'gender', 'height_group'. You can override using the `.groups` argument.
    # A tibble: 60 x 6
    # Groups:   country, gender, height_group [18]
       country gender height_group age_group   count percent_own_bicycle
       <chr>   <chr>  <fct>        <fct>       <int>               <dbl>
     1 Canada  F      (151,161]    (17.9,34.2]     2                   0
     2 Canada  F      (151,161]    (34.2,50.4]     5                  40
     3 Canada  F      (151,161]    (50.4,66.6]     1                   0
     4 Canada  F      (151,161]    (66.6,82.8]     2                   0
     5 Canada  F      (151,161]    (82.8,99.1]     1                   0
     6 Canada  F      (161,170]    (17.9,34.2]     1                   0
     7 Canada  F      (161,170]    (34.2,50.4]     1                 100
     8 Canada  F      (161,170]    (50.4,66.6]     1                   0
     9 Canada  F      (161,170]    (82.8,99.1]     2                  50
    10 Canada  F      (170,180]    (17.9,34.2]     3                   0

Is this correct?
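One caveat with this second attempt: `cut(x, breaks = 3)` splits the range of `x` into intervals of equal width, which is not the same as groups with an equal number of observations. If equal-sized groups with a single label each are the goal, one option is to pass quantile break points to `cut()`. The following is a minimal sketch under that assumption, not a definitive fix; the names `height_breaks`, `age_breaks`, `equal_width_counts` and `quantile_counts` are introduced here for illustration.

```r
library(dplyr)

# Rebuild the example data from the question so the block is self-contained
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

# Break points at the terciles of height and the quintiles of age
height_breaks <- quantile(df$height, probs = seq(0, 1, length.out = 4), names = FALSE)
age_breaks <- quantile(df$age, probs = seq(0, 1, length.out = 6), names = FALSE)

final <- df %>%
  mutate(
    # include.lowest = TRUE keeps the minimum value inside the first interval
    height_group = cut(height, breaks = height_breaks, include.lowest = TRUE),
    age_group = cut(age, breaks = age_breaks, include.lowest = TRUE)
  ) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100,
            .groups = "drop")  # return an ungrouped tibble, no grouping message

# Compare bin sizes: equal-width intervals versus quantile-based intervals
equal_width_counts <- table(cut(df$height, breaks = 3))
quantile_counts <- table(cut(df$height, breaks = height_breaks, include.lowest = TRUE))
print(equal_width_counts)
print(quantile_counts)
```

The quantile-based bins should hold roughly a third of the rows each, while the equal-width bins depend on how the heights happen to be distributed. With ties in `age`, the quintile groups can be slightly uneven, and `cut()` stops with an error if two break points coincide.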

huangapple
  • Posted on June 15, 2023 at 06:14:25
  • If you repost this article, please keep its link: https://go.coder-hub.com/76477911.html