R: Counting Number of Observations in Each Percentile
Question
I am working with the R programming language.
I have the following dataset:
library(dplyr)
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)
My Problem:
- First, I want to break height into 3 equal-sized groups by value (e.g. 0%-33%, 33%-66%, 66%-100%).
- Next, I want to break age into 5 equal-sized groups by value (e.g. 0%-20%, 20%-40%, etc.).
- Then, for each unique combination of country, gender, age_group and height_group, I want to find the percentage of people who own a bicycle.
This type of analysis would let me make statements like: "if you are a man between ages 30-35, between 150-155 cm tall and from the USA, there is a 43% chance you own a bicycle".
Here is my current attempt to do this:
library(dplyr)
df %>%
  mutate(height_group = ntile(height, 3),
         age_group = ntile(age, 5)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            min_height = min(height),
            max_height = max(height),
            min_age = min(age),
            max_age = max(age),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100) %>%
  mutate(height_range = paste0(round(min_height, 1), "-", round(max_height, 1)),
         age_range = paste0(min_age, "-", max_age)) %>%
  select(-min_height, -max_height, -min_age, -max_age)
When I look at the results:
# A tibble: 62 x 8
# Groups:   country, gender, height_group [18]
  country gender height_group age_group count percent_own_bicycle height_range age_range
  <chr>   <chr>         <int>     <int> <int>               <dbl> <chr>        <chr>
1 Canada  F                 1         1     2                   0 157.2-158.5  23-30
2 Canada  F                 1         2     4                  25 150.8-154.1  37-43
3 Canada  F                 1         4     2                   0 154.4-156.9  66-72
4 Canada  F                 1         5     1                   0 154.6-154.6  80-80
5 Canada  F                 2         1     1                   0 169.3-169.3  23-23
I see height_group = 1 having multiple ranges, e.g. 157.2-158.5 and 150.8-154.1. How is this possible? I would have thought that height_group = 1 could only have a single range.
Can someone please show me what I am doing wrong?
Thanks!
Answer 1
Score: 1
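Yes, your attempt looks correct. You divided the height variable into three equal-sized groups using the ntile() function, and the age variable into five equal-sized groups. Then you grouped the data by country, gender, height_group, and age_group and calculated the count, minimum height, maximum height, minimum age, maximum age, and the percentage of individuals who own a bicycle within each group. Because min() and max() are evaluated inside summarise(), after group_by(), each height_range describes only the rows in that particular country/gender/height_group/age_group cell, not all of height_group 1. That is why several different sub-ranges show up for the same height_group.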
Answer 2
Score: -3
OP here - this is my second attempt:
final = df %>%
  mutate(height_group = cut(height, breaks = 3),
         age_group = cut(age, breaks = 5)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100)
I think the results look more "consistent" now? (e.g. no multiple ranges)
`summarise()` has grouped output by 'country', 'gender', 'height_group'. You can override using the `.groups` argument.
# A tibble: 60 x 6
# Groups:   country, gender, height_group [18]
   country gender height_group age_group   count percent_own_bicycle
   <chr>   <chr>  <fct>        <fct>       <int>               <dbl>
 1 Canada  F      (151,161]    (17.9,34.2]     2                   0
 2 Canada  F      (151,161]    (34.2,50.4]     5                  40
 3 Canada  F      (151,161]    (50.4,66.6]     1                   0
 4 Canada  F      (151,161]    (66.6,82.8]     2                   0
 5 Canada  F      (151,161]    (82.8,99.1]     1                   0
 6 Canada  F      (161,170]    (17.9,34.2]     1                   0
 7 Canada  F      (161,170]    (34.2,50.4]     1                 100
 8 Canada  F      (161,170]    (50.4,66.6]     1                   0
 9 Canada  F      (161,170]    (82.8,99.1]     2                  50
10 Canada  F      (170,180]    (17.9,34.2]     3                   0
Is this correct?
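One caveat worth checking: cut(x, breaks = 3) divides the range of x into three equal-width intervals, so the bins are consistent but need not hold equal numbers of observations. If equal-count groups are the goal, as the question states, a possible variant is to pass quantile-based break points to cut(). This is a sketch under that assumption; height_breaks and age_breaks are illustrative names.

```r
library(dplyr)

# Same simulated data as in the question
set.seed(123)
n <- 100
country <- sample(c("USA", "Canada", "UK"), n, replace = TRUE)
gender <- sample(c("M", "F"), n, replace = TRUE)
age <- sample(18:100, n, replace = TRUE)
height <- runif(n, min = 150, max = 180)
owns_bicycle <- sample(c("Yes", "No"), n, replace = TRUE)
df <- data.frame(country, gender, age, height, owns_bicycle)

# Break points at the tertiles of height and the quintiles of age.
# cut() stops with an error if the break points are not unique,
# which can happen when a variable has many tied values.
height_breaks <- quantile(df$height, probs = seq(0, 1, length.out = 4))
age_breaks <- quantile(df$age, probs = seq(0, 1, length.out = 6))

final <- df %>%
  mutate(height_group = cut(height, breaks = height_breaks, include.lowest = TRUE),
         age_group = cut(age, breaks = age_breaks, include.lowest = TRUE)) %>%
  group_by(country, gender, height_group, age_group) %>%
  summarise(count = n(),
            percent_own_bicycle = mean(owns_bicycle == "Yes") * 100,
            .groups = "drop")

# Compare bin sizes: equal-width intervals vs. quantile-based intervals
table(cut(df$height, breaks = 3))
height_counts <- table(cut(df$height, breaks = height_breaks, include.lowest = TRUE))
height_counts
```

include.lowest = TRUE keeps the minimum value in the first interval; without it, that observation would fall outside every interval and become NA. Because the factor levels are the intervals themselves, every row in a bin shares one label.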