英文:
R: How to Prevent Overlapping Ranges Within (Nested) Summary Groups
问题
我正在使用R编程语言进行工作。
我有关于患者的医疗特征和疾病患病率的以下数据集:
set.seed(123)
library(dplyr)
Patient_ID = 1:5000
gender <- c("Male", "Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)
status <- c("Immigrant", "Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status)
height = rnorm(5000, 150, 10)
weight = rnorm(5000, 90, 10)
hospital_visits = sample.int(20, 5000, replace = TRUE)
################
disease = sample(c(TRUE, FALSE), 5000, replace = TRUE)
###################
my_data = data.frame(Patient_ID, gender, status, height, weight, hospital_visits, disease)
我的问题: 基于这个数据集,我尝试计算“嵌套组内”的疾病比例,即:
- 首先,选择所有男性
- 然后,选择所有男性公民
- 然后,在所有男性公民中,找出身高最小的20%的人
- 然后,在身高最小的20%男性公民中,进一步筛选出体重最小的20%的人
- 最后,在身高最小的20%、体重最小的20%、且医院就诊次数最少的20%的男性公民中,进一步筛选出医院就诊次数最少的20%的人:这将是第一组
- 对所有可能的组合重复此过程
这意味着一个患者只能属于一个单一组 - 换句话说,这些组不能有任何重叠范围。
在之前的问题中(https://stackoverflow.com/questions/76551409/r-merging-a-lookup-table-with-a-data-frame),我学到了如何计算嵌套组内的疾病比例:
final = my_data %>%
group_by(gender, status) %>%
mutate(low_height = height < quantile(height, .2)) %>%
group_by(gender, status, low_height) %>%
mutate(low_weight = weight < quantile(weight, .2)) %>%
group_by(gender, status, low_height, low_weight) %>%
mutate(low_visit = hospital_visits < quantile(hospital_visits , .2)) %>%
group_by(gender, status, low_height, low_weight, low_visit) %>%
summarise(across(c(height, weight, hospital_visits),
## list custom stats here:
list(min = \(xs) min(xs, na.rm = TRUE),
max = \(xs) max(xs, na.rm = TRUE)
),
.names = "{.col}_{.fn}"
),
prop_disease = sum(disease)/n(),
## etc.
)
final$low_height = final$low_weight = final$low_visit = NULL
我的问题 当我查看此代码的结果时:
> final
# A tibble: 32 x 9
# Groups: gender, status [4]
gender status height_min height_max weight_min weight_max hospital_visits_min hospital_visits_max prop_disease
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl>
1 Female Citizen 142. 188. 82.3 119. 4 20 0.495
2 Female Citizen 142. 175. 82.4 118. 1 3 0.495
3 Female Citizen 142. 177. 57.4 82.3 5 20 0.482
- 在第1行和第2行:我可以看到已经创建了重叠的身高范围,如(142,188)和(142,175)
- 这意味着如果一个医疗患者身高为150厘米,那么该患者可以分配到这两个组中:因此,这违反了不重叠组的原始条件。
有人能否请示我如何解决这个问题?
英文:
I am working with the R programming language.
I have the following dataset on medical characteristics of patients and disease prevalance:
set.seed(123)
library(dplyr)
Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status )
height = rnorm(5000, 150, 10)
weight = rnorm(5000, 90, 10)
hospital_visits = sample.int(20, 5000, replace = TRUE)
################
disease = sample(c(TRUE, FALSE), 5000, replace = TRUE)
###################
my_data = data.frame(Patient_ID, gender, status, height, weight, hospital_visits, disease)
Patient_ID gender status height weight hospital_visits disease
1 1 Female Citizen 145.0583 113.70725 1 TRUE
2 2 Male Immigrant 161.2759 88.33188 18 FALSE
3 3 Female Immigrant 138.5305 99.26961 6 FALSE
4 4 Male Citizen 164.8102 84.31848 12 TRUE
5 5 Male Citizen 159.1619 92.25090 12 TRUE
6 6 Female Citizen 153.3513 101.31986 11 TRUE
My Problem: Based on this dataset, I am trying to calculate the disease proportions within "nested groups", i.e.
- First, select all males
- Then, select all male citizens
- Then, out of the set of all male citizens - identify a group of 20% of this set with the smallest heights
- Then, out of the set of all male citizens within the shortest 20% height - further isolate a group of 20% with the smallest weights
- Finally, out of the set of all male citizens within the shortest 20% height and within the shortest 20% height having the 20% smallest weight - further isolate them into a group with the 20% fewest number of hospital visits : This will now be the first group
- Repeat this process for all possible group combinations
This means an individual patient can only belong to a single group - in other words, the groups can not have any overlapping ranges.
In a previous question (https://stackoverflow.com/questions/76551409/r-merging-a-lookup-table-with-a-data-frame), I learned how to calculate the disease proportions within nested groups:
final = my_data |>
group_by(gender, status) |>
mutate(low_height = height < quantile(height, .2)) |>
group_by(gender, status, low_height) |>
mutate(low_weight = weight < quantile(weight, .2)) |>
group_by(gender, status, low_height, low_weight) |>
mutate(low_visit = hospital_visits < quantile(hospital_visits , .2)) |>
group_by(gender, status, low_height, low_weight, low_visit) |>
summarise(across(c(height, weight, hospital_visits),
## list custom stats here:
list(min = \(xs) min(xs, na.rm = TRUE),
max = \(xs) max(xs, na.rm = TRUE)
),
.names = "{.col}_{.fn}"
),
prop_disease = sum(disease)/n(),
## etc.
)
final$low_height = final$low_weight = final$low_visit = NULL
My Question When I look at the results from this code:
> final
# A tibble: 32 x 9
# Groups: gender, status [4]
gender status height_min height_max weight_min weight_max hospital_visits_min hospital_visits_max prop_disease
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl>
1 Female Citizen 142. 188. 82.3 119. 4 20 0.495
2 Female Citizen 142. 175. 82.4 118. 1 3 0.495
3 Female Citizen 142. 177. 57.4 82.3 5 20 0.482
- In row 1 and row 2: I can see that overlapping height ranges have been created such as (142,188) and (142, 175)
- This means that if a medical patient was 150 cm tall - the patient could be assigned to both of these groups: as such, this violates the original condition of non-overlapping groups.
Can someone please show me if there is a way to fix this problem?
Thanks!
答案1
得分: 3
首先,如@I_O的评论中所说,我同意身高和体重可能不能完全分开以避免重叠。
如果您想要对所有感兴趣特征的组合进行总结,并且可能尝试以下代码:
df %>%
group_by(gender, status) %>%
group_map(~ {
.x %>%
filter(height < quantile(height, .2)) %>%
filter(weight < quantile(weight, .2)) %>%
filter(hospital_visits < quantile(hospital_visits, .2)) %>%
summarise(
across(c(height, weight, hospital_visits),
list(
min = \(xs) min(xs, na.rm = TRUE),
max = \(xs) max(xs, na.rm = TRUE)
),
.names = "{.col}_{.fn}"
),
prop_disease = mean(disease),
.by = c(gender, status)
)
}, .keep = TRUE) %>%
bind_rows()
该代码会生成如下结果:
# A tibble: 4 × 9
gender status height_min height_max weight_min weight_max hospital_visits_min
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 Female Citizen 123. 141. 70.7 80.6 1
2 Female Immigr… 124. 140. 62.0 83.5 2
3 Male Citizen 125. 140. 66.7 79.8 1
4 Male Immigr… 130. 140. 68.4 81.1 1
# ℹ 2 more variables: hospital_visits_max <int>, prop_disease <dbl>
英文:
First of all, as said in the comment by @I_O, I agree that height and weight may not be absolutely separated to avoid the overlap.
If you want want to have a summary of all combinations with quantile(... < 0.2)
for the features of interest, probably you can try
df %>%
group_by(gender, status) %>%
group_map(~ {
.x %>%
filter(height < quantile(height, .2)) %>%
filter(weight < quantile(weight, .2)) %>%
filter(hospital_visits < quantile(hospital_visits, .2)) %>%
summarise(
across(c(height, weight, hospital_visits),
list(
min = \(xs) min(xs, na.rm = TRUE),
max = \(xs) max(xs, na.rm = TRUE)
),
.names = "{.col}_{.fn}"
),
prop_disease = mean(disease),
.by = c(gender, status)
)
}, .keep = TRUE) %>%
bind_rows()
which gives
# A tibble: 4 × 9
gender status height_min height_max weight_min weight_max hospital_visits_min
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 Female Citizen 123. 141. 70.7 80.6 1
2 Female Immigr… 124. 140. 62.0 83.5 2
3 Male Citizen 125. 140. 66.7 79.8 1
4 Male Immigr… 130. 140. 68.4 81.1 1
# ℹ 2 more variables: hospital_visits_max <int>, prop_disease <dbl>
答案2
得分: 2
也许这样做过于简单,但为了避免重叠,我们可以将(身高、体重、医院访问次数)作为一组使用分位数切割,而不是单独对每个水平进行切割:
对于男性公民,我们可以这样做:
my_data %>%
filter(gender == "男性", status == "公民") %>%
mutate(rank_height = rank(身高),
rank_weight = rank(体重),
rank_hospital_visits = rank(医院访问次数)) %>%
mutate(total_score = rank_height + rank_weight + rank_hospital_visits,
group_number = ntile(total_score, 5)) %>%
mutate(prop_disease = mean(疾病), .by = group_number)
这确保了根据(身高、体重、医院访问次数)组的等级而不是每个单独水平来分配每个患者到一个组中。
患者_ID 性别 状态 身高 体重 医院访问次数 疾病 等级_身高 等级_体重 等级_医院访问次数 总分 组号 疾病比例
1 4 男性 公民 164.8102 84.31848 12 TRUE 1502 434 930.0 2866.0 4 0.4672897
2 5 男性 公民 159.1619 92.25090 12 TRUE 1316 936 930.0 3182.0 5 0.5125000
3 8 男性 公民 152.0362 87.67197 4 FALSE 933 667 286.5 1886.5 2 0.4579439
4 11 男性 公民 143.9616 87.24376 7 TRUE 438 637 519.0 1594.0 1 0.4704050
5 13 男性 公民 155.9949 100.89346 5 FALSE 1157 1381 362.5 2900.5 4 0.4672897
6 20 男性 公民 136.0402 82.95463 2 TRUE 133 368 123.5 624.5 1 0.4704050
7 24 男性 公民 138.7447 90.35814 3 FALSE 222 822 206.0 1250.0 1 0.4704050
8 25 男性 公民 148.1227 100.64930 6 FALSE 682 1373 438.0 2493.0 3 0.4517134
9 26 男性 公民 139.6196 84.32218 1 FALSE 260 435 41.0 736.0 1 0.4704050
10 31 男性 公民 149.8554 92.32891 15 TRUE 812 941 1159.5 2912.5 4 0.4672897
11 32 男性 公民 155.4558 78.81120 18 TRUE 1131 204 1401.0 2736.0 4 0.4672897
12 33 男性 公民 150.9811 75.29820 18 TRUE 877 113 1401.0 2391.0 3 0.4517134
13 34 男性 公民 150.1760 120.35185 8 TRUE 829 1604 599.0 3032.0 4 0.4672897
14 55 男性 公民 151.4056 92.70188 16 TRUE 894 969 1233.5 3096.5 4 0.4672897
15 59 男性 公民 141.6127 86.62372 11 FALSE 342 597 846.0 1785.0 2 0.4579439
...
英文:
Maybe this is too simple but to avoid overlapping we could use the quantile cuts for (height, weight, hospital visits) as a group, and not on each level separately:
For the male citizens we could do:
my_data %>%
filter(gender == "Male", status == "Citizen") %>%
mutate(rank_height = rank(height),
rank_weight = rank(weight),
rank_hospital_visits = rank(hospital_visits)) %>%
mutate(total_score = rank_height + rank_weight + rank_hospital_visits,
group_number = ntile(total_score, 5)) %>%
mutate(prop_disease = mean(disease), .by = group_number)
This ensure the assignment of each patient to one group based on the rank of the groups (height, weight, hospital visits) instead of each single level.
Patient_ID gender status height weight hospital_visits disease rank_height rank_weight rank_hospital_visits total_score group_number prop_disease
1 4 Male Citizen 164.8102 84.31848 12 TRUE 1502 434 930.0 2866.0 4 0.4672897
2 5 Male Citizen 159.1619 92.25090 12 TRUE 1316 936 930.0 3182.0 5 0.5125000
3 8 Male Citizen 152.0362 87.67197 4 FALSE 933 667 286.5 1886.5 2 0.4579439
4 11 Male Citizen 143.9616 87.24376 7 TRUE 438 637 519.0 1594.0 1 0.4704050
5 13 Male Citizen 155.9949 100.89346 5 FALSE 1157 1381 362.5 2900.5 4 0.4672897
6 20 Male Citizen 136.0402 82.95463 2 TRUE 133 368 123.5 624.5 1 0.4704050
7 24 Male Citizen 138.7447 90.35814 3 FALSE 222 822 206.0 1250.0 1 0.4704050
8 25 Male Citizen 148.1227 100.64930 6 FALSE 682 1373 438.0 2493.0 3 0.4517134
9 26 Male Citizen 139.6196 84.32218 1 FALSE 260 435 41.0 736.0 1 0.4704050
10 31 Male Citizen 149.8554 92.32891 15 TRUE 812 941 1159.5 2912.5 4 0.4672897
11 32 Male Citizen 155.4558 78.81120 18 TRUE 1131 204 1401.0 2736.0 4 0.4672897
12 33 Male Citizen 150.9811 75.29820 18 TRUE 877 113 1401.0 2391.0 3 0.4517134
13 34 Male Citizen 150.1760 120.35185 8 TRUE 829 1604 599.0 3032.0 4 0.4672897
14 55 Male Citizen 151.4056 92.70188 16 TRUE 894 969 1233.5 3096.5 4 0.4672897
15 59 Male Citizen 141.6127 86.62372 11 FALSE 342 597 846.0 1785.0 2 0.4579439
...
答案3
得分: 0
看一下这个示例,它应该是你的第一个例子的解决方案,使用slice_min
。
my_data |> filter(gender=="Male",
status == "Citizen") |>
slice_min(height, prop= 0.2) |>
slice_min(weight, prop = 0.2) |>
slice_min(hospital_visits, prop = 0.2)
如果这对于例子有用的话,你可以创建一个映射函数或者只是一个函数来遍历所有可能的结果。
通过组合,你是指性别 X 状态吗?
或者你需要有前20%,然后20-40%,然后40-60%的体重等,这会创建一个5 X 5 X 5(身高 X 体重 X 医院)乘以性别 X 状态?
英文:
Take a look at this sample, it is supposed to be the solution for your first example, using slice_min
.
my_data |> filter(gender=="Male",
status == "Citizen") |>
slice_min(height, prop= 0.2) |>
slice_min(weight, prop = 0.2) |>
slice_min(hospital_visits, prop = 0.2)
If that works out for the example, you could create either a mapping function or just a function to iterate over all possible outcomes
By combinations you mean the gender X status?
or you need to have top 20, then 20-40, then 40-60% of weight and such, which creates a 5 X 5 X 5 [height X weight X hospital) and i guess times gender X status ?
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论