R: Calculating Proportions Based on Nested Groups
Question
<details>
<summary>English:</summary>
I am working with the R programming language.
I have the following dataset:
```R
set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender <- as.factor(gender)

status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status <- as.factor(status)

Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20, 5000, replace = TRUE)

################

disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)

###################

my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)
```

```
  Patient_ID Gender    Status   Height    Weight Hospital_Visits Disease
1          1 Female   Citizen 145.0583 113.70725               1      No
2          2   Male Immigrant 161.2759  88.33188              18      No
3          3 Female Immigrant 138.5305  99.26961               6     Yes
4          4   Male   Citizen 164.8102  84.31848              12      No
5          5   Male   Citizen 159.1619  92.25090              12     Yes
6          6 Female   Citizen 153.3513 101.31986              11     Yes
```
**I am trying to accomplish the following task:**
- Select all rows where Gender = Male AND Status = Citizen
- Based on this condition, split the "Height" variable into 3 "ntiles"
- Now, based on these (nested) cumulative conditions (i.e. Gender = Male & Status = Citizen & Height_ntile = 1, Gender = Male & Status = Citizen & Height_ntile = 2, Gender = Male & Status = Citizen & Height_ntile = 3), split the "Weight" variable into 3 "ntiles"
- Now, based on these added cumulative conditions (i.e. Gender = Male & Status = Citizen & Height_ntile = 1 & Weight_ntile = 1, etc.), split the "Hospital_Visits" variable into 3 "ntiles"
- Finally, based on these added cumulative conditions (Gender = Male & Status = Citizen & Height_ntile = 1 & Weight_ntile = 1 & Hospital_Visits_ntile = 1, etc.), calculate the percentage of patients who have the disease
- Repeat for Gender = Female AND Status = Citizen, etc.
Based on a previous question (https://stackoverflow.com/questions/75014703/r-interchanging-quantile-and-ntile-functions), I tried to adapt the code for this question:
```R
results <- my_data %>%
  group_by(Gender, Status) %>%
  mutate(group = ntile(Height, 3)) %>%
  group_by(Gender, Status, group) %>%
  mutate(min = min(Height),
         max = max(Height)) %>%
  mutate(range_Height = paste(min, max, sep = "-")) %>%
  ungroup() %>%
  group_by(Gender, Status, range_Height) %>%
  mutate(group = ntile(Weight, 3)) %>%
  group_by(Gender, Status, range_Height, group) %>%
  mutate(min = min(Weight),
         max = max(Weight)) %>%
  mutate(range_Weight = paste(min, max, sep = "-")) %>%
  ungroup() %>%
  group_by(Gender, Status, range_Height, range_Weight) %>%
  mutate(group = ntile(Hospital_Visits, 3)) %>%
  group_by(Gender, Status, range_Height, range_Weight, group) %>%
  mutate(min = min(Hospital_Visits),
         max = max(Hospital_Visits)) %>%
  mutate(range_Hospital_Visits = paste(min, max, sep = "-")) %>%
  ungroup() %>%
  group_by(Gender, Status, range_Height, range_Weight, range_Hospital_Visits) %>%
  summarise(disease_rate = mean(Disease), count = n())
```
**The code seems to have run - but with several warnings and the "disease_rate" column being filled with NA's:**
```
There were 50 or more warnings (use warnings() to see the first 50)
> results
# A tibble: 108 x 7
# Groups: Gender, Status, range_Height, range_Weight [36]
   Gender Status  range_Height                      range_Weight                      range_Hospita~1 disea~2 count
   <fct>  <fct>   <chr>                             <chr>                             <chr>             <dbl> <int>
 1 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 1-7                  NA    72
 2 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 14-20                NA    71
 3 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 8-14                 NA    72
 4 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 1-8                  NA    72
 5 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 14-20                NA    71
 6 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 8-14                 NA    72
 7 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  1-6                  NA    72
 8 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  14-20                NA    71
 9 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  7-14                 NA    71
10 Female Citizen 145.52366174493-153.896976138595  60.0595633761862-86.5042722953509 1-8                  NA    72
# ... with 98 more rows, and abbreviated variable names 1: range_Hospital_Visits, 2: disease_rate
# i Use `print(n = ...)` to see more rows
```
Can someone please show me how to fix this problem?
Thanks!
Note: This should NOT be the same as grouping by a variable and calculating ntiles, then interrupting the process, grouping by the previous variable and previous ntiles, and then calculating new ntiles. I would like the grouping and ntile calculations to be uninterrupted.
</details>
# Answer 1
**Score**: 1
The OP's attempt is close, but not quite right.
`group_by()` needs `.add = TRUE` when you want to add to the existing grouping instead of overriding it.
```R
my_data %>%
  group_by(Gender, Status) %>%
  mutate(Height_ntile = ntile(Height, 3)) %>%
  group_by(Height_ntile, .add = TRUE) %>%
  mutate(Weight_ntile = ntile(Weight, 3)) %>%
  group_by(Weight_ntile, .add = TRUE) %>%
  mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3)) %>%
  group_by(Hospital_Visits_ntile, .add = TRUE) %>%
  summarize(percent_disease = mean(Disease == "Yes"),
            count = n(),
            .groups = "drop")
```
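Note that the summary uses `mean(Disease == "Yes")` rather than `mean(Disease)`. `Disease` is a factor, and `mean()` of a factor returns `NA` with a warning, which would explain the `NA` column in the question. The comparison produces a logical vector, whose mean is the share of `TRUE` values. A minimal base-R sketch on a made-up vector (`disease_demo` is an illustrative name, not part of the original data):

```R
# Made-up factor standing in for the Disease column (illustrative only)
disease_demo <- factor(c("Yes", "No", "No", "Yes", "No"))

# mean() is not defined for factors: R returns NA and emits a warning
rate_from_factor <- suppressWarnings(mean(disease_demo))

# Comparing to "Yes" gives TRUE/FALSE; the mean of a logical vector
# is the proportion of TRUE values
rate_from_logical <- mean(disease_demo == "Yes")

print(rate_from_factor)   # NA
print(rate_from_logical)  # 0.4
```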
[Addition]
The default value of `.add` is `FALSE`.
This means that when you call `group_by()` twice, the grouping created by the first call is dropped and a new grouping is created (i.e. it is overridden).
Below is an example:
```R
my_data %>%
  group_by(Gender, Status) %>%
  mutate(Height_ntile = ntile(Height, 3)) %>%
  group_by(Height_ntile) # the default value of .add is FALSE

# A tibble: 5,000 × 8
# Groups: Height_ntile [3]
# ...
```

```R
my_data %>%
  group_by(Gender, Status) %>%
  mutate(Height_ntile = ntile(Height, 3)) %>%
  group_by(Height_ntile, .add = TRUE)

# A tibble: 5,000 × 8
# Groups: Gender, Status, Height_ntile [12]
```
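The same difference can be checked programmatically with dplyr's `group_vars()` and `n_groups()`. The sketch below uses a tiny made-up data frame (`toy`, with columns `g`, `h`, `x`, all illustrative and unrelated to `my_data`):

```R
library(dplyr)

# Tiny made-up data frame (illustrative only)
toy <- data.frame(
  g = c("a", "a", "b", "b"),
  h = c(1, 2, 1, 2),
  x = c(10, 20, 30, 40)
)

# Default (.add = FALSE): the second group_by() replaces the grouping
replaced <- toy %>% group_by(g) %>% group_by(h)

# .add = TRUE: the second group_by() extends the existing grouping
extended <- toy %>% group_by(g) %>% group_by(h, .add = TRUE)

group_vars(replaced)  # "h"
group_vars(extended)  # "g" "h"
n_groups(replaced)    # 2
n_groups(extended)    # 4
```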
I couldn't come up with a better idea for the ranges, so I did it with simple but long code:

```R
my_data %>%
  group_by(Gender, Status) %>%
  mutate(Height_ntile = ntile(Height, 3)) %>%
  group_by(Height_ntile, .add = TRUE) %>%
  # Each range is computed only after its ntile is part of the grouping,
  # so min/max are taken within that ntile rather than the parent group
  mutate(Height_range = paste(min(Height), max(Height), sep = "-"),
         Weight_ntile = ntile(Weight, 3)) %>%
  group_by(Height_range, Weight_ntile, .add = TRUE) %>%
  mutate(Weight_range = paste(min(Weight), max(Weight), sep = "-"),
         Hospital_Visits_ntile = ntile(Hospital_Visits, 3)) %>%
  group_by(Weight_range, Hospital_Visits_ntile, .add = TRUE) %>%
  mutate(Hospital_range = paste(min(Hospital_Visits), max(Hospital_Visits), sep = "-")) %>%
  group_by(Hospital_range, .add = TRUE) %>%
  summarize(percent_disease = mean(Disease == "Yes"),
            count = n(),
            .groups = "drop")
```
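Pasting raw `min()` and `max()` values gives labels with full floating-point precision (for example `115.862119946011-145.511095001117` in the question's output). One option is to round before pasting. In the sketch below, `range_label()` is a hypothetical helper, not a dplyr function, and `toy` is made-up data. The label is computed only after the ntile has been added to the grouping, so it describes that ntile.

```R
library(dplyr)

# Hypothetical helper (not part of dplyr): "min-max" label rounded to `digits`
range_label <- function(x, digits = 1) {
  paste(round(min(x), digits), round(max(x), digits), sep = "-")
}

# Tiny made-up data (illustrative only)
toy <- data.frame(
  g = rep(c("a", "b"), each = 6),
  x = c(1.12, 2.23, 3.34, 4.41, 5.52, 6.63,
        10.1, 20.2, 30.3, 40.4, 50.5, 60.6)
)

labelled <- toy %>%
  group_by(g) %>%
  mutate(x_ntile = ntile(x, 3)) %>%
  # Group by the ntile first, so min/max are taken within each ntile
  group_by(x_ntile, .add = TRUE) %>%
  mutate(x_range = range_label(x)) %>%
  ungroup()

# One row per (g, x_ntile) combination with its rounded range label
print(distinct(labelled, g, x_ntile, x_range))
```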
# Answer 2
**Score**: 0
I am thinking that perhaps I might have overcomplicated the approach I was using to solve this problem - is this answer correct?
```R
library(dplyr)

my_data %>%
  group_by(Gender, Status) %>%
  mutate(Height_ntile = ntile(Height, 3)) %>%
  group_by(Height_ntile) %>%
  mutate(Weight_ntile = ntile(Weight, 3)) %>%
  group_by(Height_ntile, Weight_ntile) %>%
  mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3)) %>%
  group_by(Gender, Status, Height_ntile, Weight_ntile, Hospital_Visits_ntile) %>%
  summarize(percent_disease = mean(Disease == "Yes"), count = n())
```
I am still trying to figure out how to "paste the ranges" for each ntile?