R: Merging a Lookup Table with a Data Frame
Question
I am working with the R programming language.
I have the following dataset on medical characteristics of patients and disease prevalence:
set.seed(123)
library(dplyr)
Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender <- as.factor(gender)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status <- as.factor(status )
Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20, 5000, replace = TRUE)
################
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)
###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)
Patient_ID Gender Status Height Weight Hospital_Visits Disease
1 1 Female Citizen 145.0583 113.70725 1 No
2 2 Male Immigrant 161.2759 88.33188 18 No
3 3 Female Immigrant 138.5305 99.26961 6 Yes
4 4 Male Citizen 164.8102 84.31848 12 No
5 5 Male Citizen 159.1619 92.25090 12 Yes
6 6 Female Citizen 153.3513 101.31986 11 Yes
Based on this dataset, I am trying to calculate the disease proportions within "nested groups", i.e.
- First, select all males
- Then, select all male citizens
- Then, out of the set of all male citizens - identify a group of 20% of this set with the smallest heights
- Then, out of the set of all male citizens within the shortest 20% height - further isolate a group of 20% with the smallest weights
- Finally, out of the set of all male citizens within the shortest 20% by height and, within those, the lightest 20% by weight - further isolate the group with the 20% fewest hospital visits: this will now be the first group
- Repeat this process for all possible group combinations
Part 1: Using the ".add = TRUE" argument of group_by() in dplyr, I think I can accomplish this as follows:
nested_combinations <- my_data %>%
  group_by(Gender) %>%
  group_by(Status, .add = TRUE) %>%
  # quintiles of height within each Gender x Status cell
  mutate(height_group = ntile(Height, 5)) %>%
  group_by(height_group, .add = TRUE) %>%
  # quintiles of weight within each height group
  mutate(weight_group = ntile(Weight, 5)) %>%
  group_by(weight_group, .add = TRUE) %>%
  # quintiles of hospital visits within each weight group
  mutate(visits_group = ntile(Hospital_Visits, 5)) %>%
  group_by(visits_group, .add = TRUE) %>%
  summarize(total_count = n(),
            disease_count = sum(Disease == "Yes"),
            disease_proportion = mean(Disease == "Yes"))
# results
Gender Status height_group weight_group visits_group total_count disease_count disease_proportion
<fct> <fct> <int> <int> <int> <int> <int> <dbl>
1 Female Citizen 1 1 1 16 5 0.312
2 Female Citizen 1 1 2 16 4 0.25
3 Female Citizen 1 1 3 16 7 0.438
4 Female Citizen 1 1 4 15 4 0.267
5 Female Citizen 1 1 5 15 8 0.533
6 Female Citizen 1 2 1 16 5 0.312
7 Female Citizen 1 2 2 16 4 0.25
8 Female Citizen 1 2 3 16 8 0.5
9 Female Citizen 1 2 4 15 6 0.4
10 Female Citizen 1 2 5 15 6 0.4
Part 2: Next, I created a "Lookup Table" by calculating the "ranges" (i.e. min and max) for each of these groups:
table_data <- data.frame(
Groups = paste0("Group ", 1:5),
Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), min),
Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), max)
)
# results
Groups Min_Height Max_Height Min_Weight Max_Weight Min_Visits Max_Visits
1 Group 1 111.5468 141.4839 56.53098 81.83402 1 4
2 Group 2 141.4965 147.4422 81.85064 87.45406 4 8
3 Group 3 147.4487 152.3924 87.45935 92.72041 8 12
4 Group 4 152.4016 158.5178 92.72941 98.54624 12 17
5 Group 5 158.5187 188.4777 98.55533 121.02420 17 20
My Question: Is there a way I can take these min/max ranges for the different variables from Part 2 and insert them into Part 1 as new columns (e.g. min_height, max_height, min_weight, max_weight, min_visits, max_visits)?
Currently I am doing this with a series of "ifelse" statements but this does not seem very efficient. Can someone please show me a better way?
Thanks!
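As a minimal sketch of the join-based alternative to a chain of "ifelse" statements: if the lookup table carries the integer group index as a key column, a left join attaches the ranges in one step. The data frames below are toy stand-ins, not the question's actual objects; summary_tbl and height_lookup are hypothetical names, the disease proportions are made up, and the height ranges are rounded from the Part 2 output. Note that Part 2 computes its ranges over the whole dataset while the groups in Part 1 are nested, so ranges joined this way would only approximate the true ranges of each nested group.

```r
library(dplyr)

# Toy stand-in for the Part 1 summary (illustrative values only)
summary_tbl <- data.frame(
  height_group       = c(1L, 1L, 2L, 3L),
  disease_proportion = c(0.30, 0.25, 0.45, 0.40)
)

# Toy stand-in for the Part 2 lookup table, keyed by the integer group index
# (ranges rounded from the printed Part 2 output)
height_lookup <- data.frame(
  height_group = 1:3,
  min_height   = c(111.55, 141.50, 147.45),
  max_height   = c(141.48, 147.44, 152.39)
)

# One join replaces a chain of ifelse() calls
joined <- left_join(summary_tbl, height_lookup, by = "height_group")
print(joined)
```

The same pattern would repeat for weight_group and visits_group, with one small lookup table per variable.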
Answer 1
Score: 1
Something along these lines?
## example data
library(dplyr)
d <- data.frame(gender = gl(2, 10),
                height = 160 + sample(1:40, 40),
                weight = 50 + sample(1:50, 40),
                disease = sample(c(TRUE, FALSE), 40, replace = TRUE)
                )
d |>
  group_by(gender) |>
  ## flag the shortest 20% within each gender
  mutate(low_height = height < quantile(height, .2)) |>
  group_by(gender, low_height) |>
  ## flag the lightest 20% within each gender x height group
  mutate(low_weight = weight < quantile(weight, .2)) |>
  group_by(gender, low_height, low_weight) |>
  summarise(across(c(height, weight),
                   ## list custom stats here:
                   list(min = \(xs) min(xs, na.rm = TRUE),
                        max = \(xs) max(xs, na.rm = TRUE)
                        ),
                   .names = "{.col}_{.fn}"
                   ),
            prop_disease = sum(disease)/n(),
            ## etc.
            )
# A tibble: 8 x 8
# Groups: gender, low_height [4]
gender low_height low_weight height_min height_max weight_min weight_max
<fct> <lgl> <lgl> <dbl> <dbl> <dbl> <dbl>
1 1 FALSE FALSE 172 199 67 100
2 1 FALSE TRUE 173 190 52 65
3 1 TRUE FALSE 161 169 74 94
4 1 TRUE TRUE 165 165 61 61
5 2 FALSE FALSE 168 200 56 96
6 2 FALSE TRUE 170 192 51 54
7 2 TRUE FALSE 164 167 68 93
8 2 TRUE TRUE 163 163 55 55
# i 1 more variable: prop_disease <dbl>
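The same idea can be sketched on the question's own columns: rather than building a separate lookup table and merging it back, the min/max of each variable can be computed inside the final summarise(), so every row carries the actual range of its nested group. This is a sketch under assumptions, not the asker's final code: it assumes dplyr 1.0.0 or later (for across() and .add), regenerates data along the lines of the question's setup, and nested_ranges and the "{.fn}_{.col}" naming pattern are choices made here for illustration.

```r
library(dplyr)

# Data regenerated along the lines of the question's setup
set.seed(123)
n <- 5000
my_data <- data.frame(
  Gender = factor(sample(c("Male", "Female"), n, replace = TRUE, prob = c(0.45, 0.55))),
  Status = factor(sample(c("Immigrant", "Citizen"), n, replace = TRUE, prob = c(0.3, 0.7))),
  Height = rnorm(n, 150, 10),
  Weight = rnorm(n, 90, 10),
  Hospital_Visits = sample.int(20, n, replace = TRUE),
  Disease = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.4, 0.6)))
)

nested_ranges <- my_data %>%
  group_by(Gender, Status) %>%
  mutate(height_group = ntile(Height, 5)) %>%
  group_by(height_group, .add = TRUE) %>%
  mutate(weight_group = ntile(Weight, 5)) %>%
  group_by(weight_group, .add = TRUE) %>%
  mutate(visits_group = ntile(Hospital_Visits, 5)) %>%
  group_by(visits_group, .add = TRUE) %>%
  summarise(
    total_count = n(),
    disease_count = sum(Disease == "Yes"),
    disease_proportion = mean(Disease == "Yes"),
    # min/max of each variable within the innermost nested group;
    # yields columns such as min_Height, max_Height, min_Weight, ...
    across(c(Height, Weight, Hospital_Visits),
           list(min = min, max = max),
           .names = "{.fn}_{.col}"),
    .groups = "drop"
  )

print(head(nested_ranges))
```

Because the ranges are computed after all grouping steps, they describe the rows in each final group, which is not the same thing as the dataset-wide quintile boundaries in the Part 2 table.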