2023年6月26日 03:14:42go评论122阅读模式

英文:

R: How to Prevent Overlapping Ranges Within (Nested) Summary Groups

问题

我正在使用R编程语言进行工作。

我有关于患者的医疗特征和疾病患病率的以下数据集：

set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender <- c("Male", "Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)

status <- c("Immigrant", "Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status)

height = rnorm(5000, 150, 10)
weight = rnorm(5000, 90, 10)
hospital_visits = sample.int(20,  5000, replace = TRUE)

################

disease = sample(c(TRUE, FALSE), 5000, replace = TRUE)

###################
my_data = data.frame(Patient_ID, gender, status, height, weight, hospital_visits, disease)

我的问题： 基于这个数据集，我尝试计算“嵌套组内”的疾病比例，即：

首先，选择所有男性
然后，选择所有男性公民
然后，在所有男性公民中，找出身高最小的20%的人
然后，在身高最小的20%男性公民中，进一步筛选出体重最小的20%的人
最后，在身高最小的20%、体重最小的20%、且医院就诊次数最少的20%的男性公民中，进一步筛选出医院就诊次数最少的20%的人：这将是第一组
对所有可能的组合重复此过程

这意味着一个患者只能属于一个单一组 - 换句话说，这些组不能有任何重叠范围。

在之前的问题中（https://stackoverflow.com/questions/76551409/r-merging-a-lookup-table-with-a-data-frame），我学到了如何计算嵌套组内的疾病比例：

final = my_data %>%
    group_by(gender, status) %>%
    mutate(low_height = height < quantile(height, .2)) %>%
    group_by(gender, status, low_height) %>%
    mutate(low_weight = weight < quantile(weight, .2)) %>%
    group_by(gender,  status, low_height, low_weight) %>%
    mutate(low_visit = hospital_visits  < quantile(hospital_visits , .2)) %>%
    group_by(gender, status, low_height, low_weight, low_visit) %>%
    summarise(across(c(height, weight, hospital_visits),
                     ## list custom stats here:
                     list(min = \(xs) min(xs, na.rm = TRUE),
                          max = \(xs) max(xs, na.rm = TRUE)
                     ),
                     .names = "{.col}_{.fn}"
    ),
    prop_disease = sum(disease)/n(),
    ## etc.
)
final$low_height = final$low_weight = final$low_visit = NULL

我的问题 当我查看此代码的结果时：

> final
# A tibble: 32 x 9
# Groups:   gender, status [4]
   gender status    height_min height_max weight_min weight_max hospital_visits_min hospital_visits_max prop_disease
   <fct>  <fct>          <dbl>      <dbl>      <dbl>      <dbl>               <int>               <int>        <dbl>
 1 Female Citizen         142.       188.       82.3      119.                    4                  20        0.495
 2 Female Citizen         142.       175.       82.4      118.                    1                   3        0.495
 3 Female Citizen         142.       177.       57.4       82.3                   5                  20        0.482

在第1行和第2行：我可以看到已经创建了重叠的身高范围，如（142，188）和（142，175）
这意味着如果一个医疗患者身高为150厘米，那么该患者可以分配到这两个组中：因此，这违反了不重叠组的原始条件。

有人能否请示我如何解决这个问题？

英文:

I am working with the R programming language.

I have the following dataset on medical characteristics of patients and disease prevalance:

set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender &lt;- c(&quot;Male&quot;,&quot;Female&quot;)
gender &lt;- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender &lt;- as.factor(gender)


status &lt;- c(&quot;Immigrant&quot;,&quot;Citizen&quot;)
status &lt;- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status  &lt;- as.factor(status )

height = rnorm(5000, 150, 10)
weight = rnorm(5000, 90, 10)
hospital_visits = sample.int(20,  5000, replace = TRUE)

################

disease = sample(c(TRUE, FALSE), 5000, replace = TRUE)

###################
my_data = data.frame(Patient_ID, gender, status, height, weight, hospital_visits, disease)

  Patient_ID gender    status   height    weight hospital_visits disease
1          1 Female   Citizen 145.0583 113.70725               1    TRUE
2          2   Male Immigrant 161.2759  88.33188              18   FALSE
3          3 Female Immigrant 138.5305  99.26961               6   FALSE
4          4   Male   Citizen 164.8102  84.31848              12    TRUE
5          5   Male   Citizen 159.1619  92.25090              12    TRUE
6          6 Female   Citizen 153.3513 101.31986              11    TRUE

My Problem: Based on this dataset, I am trying to calculate the disease proportions within "nested groups", i.e.

First, select all males
Then, select all male citizens
Then, out of the set of all male citizens - identify a group of 20% of this set with the smallest heights
Then, out of the set of all male citizens within the shortest 20% height - further isolate a group of 20% with the smallest weights
Finally, out of the set of all male citizens within the shortest 20% height and within the shortest 20% height having the 20% smallest weight - further isolate them into a group with the 20% fewest number of hospital visits : This will now be the first group
Repeat this process for all possible group combinations

This means an individual patient can only belong to a single group - in other words, the groups can not have any overlapping ranges.

In a previous question (https://stackoverflow.com/questions/76551409/r-merging-a-lookup-table-with-a-data-frame), I learned how to calculate the disease proportions within nested groups:

final = my_data |&gt;
    group_by(gender, status) |&gt;
    mutate(low_height = height &lt; quantile(height, .2)) |&gt;
    group_by(gender, status, low_height) |&gt;
    mutate(low_weight = weight &lt; quantile(weight, .2)) |&gt;
    group_by(gender,  status, low_height, low_weight) |&gt;
    mutate(low_visit = hospital_visits  &lt; quantile(hospital_visits , .2)) |&gt;
    group_by(gender, status, low_height, low_weight, low_visit) |&gt;
    summarise(across(c(height, weight, hospital_visits),
                     ## list custom stats here:
                     list(min = \(xs) min(xs, na.rm = TRUE),
                          max = \(xs) max(xs, na.rm = TRUE)
                     ),
                     .names = &quot;{.col}_{.fn}&quot;
    ),
    prop_disease = sum(disease)/n(),
    ## etc.
    )

final$low_height = final$low_weight = final$low_visit = NULL

My Question When I look at the results from this code:

&gt; final
# A tibble: 32 x 9
# Groups:   gender, status [4]
   gender status    height_min height_max weight_min weight_max hospital_visits_min hospital_visits_max prop_disease
   &lt;fct&gt;  &lt;fct&gt;          &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;               &lt;int&gt;               &lt;int&gt;        &lt;dbl&gt;
 1 Female Citizen         142.       188.       82.3      119.                    4                  20        0.495
 2 Female Citizen         142.       175.       82.4      118.                    1                   3        0.495
 3 Female Citizen         142.       177.       57.4       82.3                   5                  20        0.482

In row 1 and row 2: I can see that overlapping height ranges have been created such as (142,188) and (142, 175)
This means that if a medical patient was 150 cm tall - the patient could be assigned to both of these groups: as such, this violates the original condition of non-overlapping groups.

Can someone please show me if there is a way to fix this problem?

Thanks!

答案1

得分: 3

首先，如@I_O的评论中所说，我同意身高和体重可能不能完全分开以避免重叠。

如果您想要对所有感兴趣特征的组合进行总结，并且可能尝试以下代码：

df %>%
    group_by(gender, status) %>%
    group_map(~ {
        .x %>%
            filter(height < quantile(height, .2)) %>%
            filter(weight < quantile(weight, .2)) %>%
            filter(hospital_visits < quantile(hospital_visits, .2)) %>%
            summarise(
                across(c(height, weight, hospital_visits),
                    list(
                        min = \(xs) min(xs, na.rm = TRUE),
                        max = \(xs) max(xs, na.rm = TRUE)
                    ),
                    .names = "{.col}_{.fn}"
                ),
                prop_disease = mean(disease),
                .by = c(gender, status)
            )
    }, .keep = TRUE) %>%
    bind_rows()

该代码会生成如下结果：

# A tibble: 4 × 9
  gender status  height_min height_max weight_min weight_max hospital_visits_min
  <fct>  <fct>        <dbl>      <dbl>      <dbl>      <dbl>               <int>
1 Female Citizen       123.       141.       70.7       80.6                   1
2 Female Immigr…       124.       140.       62.0       83.5                   2
3 Male   Citizen       125.       140.       66.7       79.8                   1
4 Male   Immigr…       130.       140.       68.4       81.1                   1
# ℹ 2 more variables: hospital_visits_max <int>, prop_disease <dbl>

英文:

First of all, as said in the comment by @I_O, I agree that height and weight may not be absolutely separated to avoid the overlap.

If you want want to have a summary of all combinations with quantile(... < 0.2) for the features of interest, probably you can try

df %&gt;%
    group_by(gender, status) %&gt;%
    group_map(~ {
        .x %&gt;%
            filter(height &lt; quantile(height, .2)) %&gt;%
            filter(weight &lt; quantile(weight, .2)) %&gt;%
            filter(hospital_visits &lt; quantile(hospital_visits, .2)) %&gt;%
            summarise(
                across(c(height, weight, hospital_visits),
                    list(
                        min = \(xs) min(xs, na.rm = TRUE),
                        max = \(xs) max(xs, na.rm = TRUE)
                    ),
                    .names = &quot;{.col}_{.fn}&quot;
                ),
                prop_disease = mean(disease),
                .by = c(gender, status)
            )
    }, .keep = TRUE) %&gt;%
    bind_rows()

which gives

# A tibble: 4 &#215; 9
  gender status  height_min height_max weight_min weight_max hospital_visits_min
  &lt;fct&gt;  &lt;fct&gt;        &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;               &lt;int&gt;
1 Female Citizen       123.       141.       70.7       80.6                   1
2 Female Immigr…       124.       140.       62.0       83.5                   2
3 Male   Citizen       125.       140.       66.7       79.8                   1
4 Male   Immigr…       130.       140.       68.4       81.1                   1
# ℹ 2 more variables: hospital_visits_max &lt;int&gt;, prop_disease &lt;dbl&gt;

答案2

得分: 2

也许这样做过于简单，但为了避免重叠，我们可以将（身高、体重、医院访问次数）作为一组使用分位数切割，而不是单独对每个水平进行切割：

对于男性公民，我们可以这样做：

my_data %>%
  filter(gender == "男性", status == "公民") %>%
  mutate(rank_height = rank(身高),
         rank_weight = rank(体重),
         rank_hospital_visits = rank(医院访问次数)) %>%
  mutate(total_score = rank_height + rank_weight + rank_hospital_visits,
         group_number = ntile(total_score, 5)) %>%
  mutate(prop_disease = mean(疾病), .by = group_number)

这确保了根据（身高、体重、医院访问次数）组的等级而不是每个单独水平来分配每个患者到一个组中。

  患者_ID   性别  状态      身高       体重      医院访问次数 疾病   等级_身高   等级_体重   等级_医院访问次数 总分  组号  疾病比例
1           4    男性    公民   164.8102   84.31848             12          TRUE      1502           434                   930.0        2866.0        4           0.4672897
2           5    男性    公民   159.1619   92.25090             12          TRUE      1316           936                   930.0        3182.0        5           0.5125000
3           8    男性    公民   152.0362   87.67197              4            FALSE        933           667                   286.5        1886.5        2           0.4579439
4          11   男性    公民   143.9616   87.24376              7          TRUE        438           637                   519.0        1594.0        1           0.4704050
5          13   男性    公民   155.9949  100.89346              5            FALSE      1157         1381                   362.5        2900.5        4           0.4672897
6          20   男性    公民   136.0402   82.95463              2          TRUE          133           368                   123.5          624.5        1           0.4704050
7          24   男性    公民   138.7447   90.35814              3            FALSE        222           822                   206.0        1250.0        1           0.4704050
8          25   男性    公民   148.1227  100.64930              6            FALSE        682         1373                   438.0        2493.0        3           0.4517134
9          26   男性    公民   139.6196   84.32218              1            FALSE        260           435                     41.0          736.0        1           0.4704050
10         31   男性    公民   149.8554   92.32891            15          TRUE          812           941                   1159.5      2912.5        4           0.4672897
11         32   男性    公民   155.4558   78.81120            18          TRUE        1131           204                   1401.0      2736.0        4           0.4672897
12         33   男性    公民   150.9811   75.29820            18          TRUE          877           113                   1401.0      2391.0        3           0.4517134
13         34   男性    公民   150.1760  120.35185              8          TRUE          829         1604                     599.0        3032.0        4           0.4672897
14         55   男性    公民   151.4056   92.70188            16          TRUE          894           969                   1233.5      3096.5        4           0.4672897
15         59   男性    公民   141.6127   86.62372            11            FALSE        342           597                     846.0        1785.0        2           0.4579439
...

英文:

Maybe this is too simple but to avoid overlapping we could use the quantile cuts for (height, weight, hospital visits) as a group, and not on each level separately:

For the male citizens we could do:

my_data %&gt;% 
filter(gender == &quot;Male&quot;, status == &quot;Citizen&quot;) %&gt;% 
mutate(rank_height = rank(height),
rank_weight = rank(weight),
rank_hospital_visits = rank(hospital_visits)) %&gt;% 
mutate(total_score = rank_height + rank_weight + rank_hospital_visits,
group_number = ntile(total_score, 5)) %&gt;% 
mutate(prop_disease = mean(disease), .by = group_number)

This ensure the assignment of each patient to one group based on the rank of the groups (height, weight, hospital visits) instead of each single level.

  Patient_ID gender  status   height    weight hospital_visits disease rank_height rank_weight rank_hospital_visits total_score group_number prop_disease
1           4   Male Citizen 164.8102  84.31848              12    TRUE        1502         434                930.0      2866.0            4    0.4672897
2           5   Male Citizen 159.1619  92.25090              12    TRUE        1316         936                930.0      3182.0            5    0.5125000
3           8   Male Citizen 152.0362  87.67197               4   FALSE         933         667                286.5      1886.5            2    0.4579439
4          11   Male Citizen 143.9616  87.24376               7    TRUE         438         637                519.0      1594.0            1    0.4704050
5          13   Male Citizen 155.9949 100.89346               5   FALSE        1157        1381                362.5      2900.5            4    0.4672897
6          20   Male Citizen 136.0402  82.95463               2    TRUE         133         368                123.5       624.5            1    0.4704050
7          24   Male Citizen 138.7447  90.35814               3   FALSE         222         822                206.0      1250.0            1    0.4704050
8          25   Male Citizen 148.1227 100.64930               6   FALSE         682        1373                438.0      2493.0            3    0.4517134
9          26   Male Citizen 139.6196  84.32218               1   FALSE         260         435                 41.0       736.0            1    0.4704050
10         31   Male Citizen 149.8554  92.32891              15    TRUE         812         941               1159.5      2912.5            4    0.4672897
11         32   Male Citizen 155.4558  78.81120              18    TRUE        1131         204               1401.0      2736.0            4    0.4672897
12         33   Male Citizen 150.9811  75.29820              18    TRUE         877         113               1401.0      2391.0            3    0.4517134
13         34   Male Citizen 150.1760 120.35185               8    TRUE         829        1604                599.0      3032.0            4    0.4672897
14         55   Male Citizen 151.4056  92.70188              16    TRUE         894         969               1233.5      3096.5            4    0.4672897
15         59   Male Citizen 141.6127  86.62372              11   FALSE         342         597                846.0      1785.0            2    0.4579439
...

答案3

得分: 0

看一下这个示例，它应该是你的第一个例子的解决方案，使用slice_min。

my_data |&gt; filter(gender==&quot;Male&quot;,
                  status == &quot;Citizen&quot;) |&gt; 
  slice_min(height, prop= 0.2) |&gt; 
  slice_min(weight, prop = 0.2) |&gt; 
  slice_min(hospital_visits, prop = 0.2)

如果这对于例子有用的话，你可以创建一个映射函数或者只是一个函数来遍历所有可能的结果。

通过组合，你是指性别 X 状态吗？
或者你需要有前20%，然后20-40%，然后40-60%的体重等，这会创建一个5 X 5 X 5（身高 X 体重 X 医院）乘以性别 X 状态？

英文:

Take a look at this sample, it is supposed to be the solution for your first example, using slice_min.

my_data |&gt; filter(gender==&quot;Male&quot;,
status == &quot;Citizen&quot;) |&gt; 
slice_min(height, prop= 0.2) |&gt; 
slice_min(weight, prop = 0.2) |&gt; 
slice_min(hospital_visits, prop = 0.2)

If that works out for the example, you could create either a mapping function or just a function to iterate over all possible outcomes

By combinations you mean the gender X status?
or you need to have top 20, then 20-40, then 40-60% of weight and such, which creates a 5 X 5 X 5 [height X weight X hospital) and i guess times gender X status ?

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

R: 如何防止（嵌套）摘要组内的重叠范围

问题

答案1

答案2

答案3

如何在R中使用scale_fill_discrete时更改颜色？

使quarto HTML文档使用完整的窗口宽度

如何在R tidyverse中将结果就地放回原始数据框？

file.choose() 在 Windows 上打开对话框时没有文件名。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论