R: Counting Number of Observations in Each Percentile
Question
I am working with the R programming language.
I have the following dataset:
library(dplyr)
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)
My Problem:
- First, I want to break height into 3 equal-sized groups by height value (e.g. 0%-33%, 33%-66%, 66%-99%).
 - Next, I want to break age into 5 equal-sized groups by age value (e.g. 0%-20%, 20%-40%, etc.).
 - Then, for each unique combination of country, gender, age_group and height_group, I want to find the percentage of people who own a bicycle.
 
As a result, this type of analysis would let me know things like - "if you are a man between ages 30-35, between 150-155 cm and from USA, there is a 43% chance you own a bicycle".
Here is my current attempt to do this:
library(dplyr)
df %>%
  mutate(height_group = ntile(height, 3),
         age_group = ntile(age, 5)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            min_height = min(height),
            max_height = max(height),
            min_age = min(age),
            max_age = max(age),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) %>%
  mutate(height_range = paste0(round(min_height, 1), "-", round(max_height, 1)),
         age_range = paste0(min_age, "-", max_age)) %>%
  select(-min_height, -max_height, -min_age, -max_age)
When I look at the results:
# A tibble: 62 x 8
# Groups:   country, gender, height_group [18]
   country gender height_group age_group count percent_own_bicycle height_range age_range
   <chr>   <chr>         <int>     <int> <int>               <dbl> <chr>        <chr>    
 1 Canada  F                 1         1     2                 0   157.2-158.5  23-30    
 2 Canada  F                 1         2     4                25   150.8-154.1  37-43    
 3 Canada  F                 1         4     2                 0   154.4-156.9  66-72    
 4 Canada  F                 1         5     1                 0   154.6-154.6  80-80    
 5 Canada  F                 2         1     1                 0   169.3-169.3  23-23  
I see height_group = 1 having multiple ranges, e.g. 157.2-158.5 and 150.8-154.1. How is this possible? I would have thought that height_group = 1 could only have a single range.
Can someone please show me what I am doing wrong?
Thanks!
Answer 1
Score: 1
Yes, your attempt looks correct. You have divided the height variable into three equal-sized groups using the ntile() function, and the age variable into five equal-sized groups. Then, you grouped the data by country, gender, height_group, and age_group and calculated the count, minimum height, maximum height, minimum age, maximum age, and the percentage of individuals who own a bicycle within each group.
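On the multiple ranges: ntile() assigns the bins over the whole data set, but min() and max() inside summarise() are evaluated within each country/gender/height_group/age_group cell. Each row therefore shows the heights observed in that cell, not the boundaries of the bin, so several sub-ranges can appear for height_group = 1. A minimal sketch of one way to attach a single range per bin, assuming the df from the question (the names binned and result are only illustrative):

```r
library(dplyr)

# Rebuild the data set from the question so the sketch is self-contained
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

binned <- df %>%
  mutate(height_group = ntile(height, 3),
         age_group = ntile(age, 5)) %>%
  # Compute each bin's range over the whole data set, before splitting
  # by country and gender, so every bin carries exactly one label
  group_by(height_group) %>%
  mutate(height_range = paste0(round(min(height), 1), "-", round(max(height), 1))) %>%
  group_by(age_group) %>%
  mutate(age_range = paste0(min(age), "-", max(age))) %>%
  ungroup()

result <- binned %>%
  group_by(country, gender, height_group, height_range, age_group, age_range) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100,
            .groups = "drop")
```

Because ntile() splits tied values across neighbouring bins, adjacent age ranges may share an endpoint; the labels describe observed values rather than strict cut points.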
Answer 2
Score: -3
OP here - this is my second attempt:
final = df %>%
  mutate(height_group = cut(height, breaks = 3),
         age_group = cut(age, breaks = 5)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) 
I think the results look more "consistent" now? (e.g. no multiple ranges)
  `summarise()` has grouped output by 'country', 'gender', 'height_group'. You can override using the `.groups` argument.
# A tibble: 60 x 6
# Groups:   country, gender, height_group [18]
   country gender height_group age_group   count percent_own_bicycle
   <chr>   <chr>  <fct>        <fct>       <int>               <dbl>
 1 Canada  F      (151,161]    (17.9,34.2]     2                   0
 2 Canada  F      (151,161]    (34.2,50.4]     5                  40
 3 Canada  F      (151,161]    (50.4,66.6]     1                   0
 4 Canada  F      (151,161]    (66.6,82.8]     2                   0
 5 Canada  F      (151,161]    (82.8,99.1]     1                   0
 6 Canada  F      (161,170]    (17.9,34.2]     1                   0
 7 Canada  F      (161,170]    (34.2,50.4]     1                 100
 8 Canada  F      (161,170]    (50.4,66.6]     1                   0
 9 Canada  F      (161,170]    (82.8,99.1]     2                  50
10 Canada  F      (170,180]    (17.9,34.2]     3                   0
Is this correct?
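One caveat about this attempt: cut(x, breaks = 3) divides the range of x into three intervals of equal width, so the groups are not guaranteed to hold equal numbers of observations, which is what the question asked for. A possible sketch that keeps the readable interval labels while using quantile-based cut points is shown below. It assumes the df from the question; the names height_breaks, age_breaks and final_q are only illustrative, and it assumes the quantile break points are all distinct.

```r
library(dplyr)

# Rebuild the data set from the question so the sketch is self-contained
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

# Quantile cut points: 4 breaks give 3 bins, 6 breaks give 5 bins.
# cut() stops with an error if two breaks coincide (possible with heavy ties).
height_breaks <- quantile(df$height, probs = seq(0, 1, length.out = 4))
age_breaks <- quantile(df$age, probs = seq(0, 1, length.out = 6))

final_q <- df %>%
  mutate(height_group = cut(height, breaks = height_breaks, include.lowest = TRUE),
         age_group = cut(age, breaks = age_breaks, include.lowest = TRUE)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100,
            .groups = "drop")
```

include.lowest = TRUE keeps the smallest value in the first interval instead of turning it into NA. With ties in age, the five age bins may still differ slightly in size.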