2023年6月26日 00:28:30go评论76阅读模式

英文:

R: Merging a Lookup Table with a Data Frame

问题

以下是要翻译的部分：

基于这个数据集，我试图计算“嵌套组”内的疾病比例，即：

首先，选择所有男性
然后，选择所有男性公民
然后，在所有男性公民中，识别最矮的20%的一组
然后，在所有男性公民中，在最矮的20%高度内，进一步分离出体重最轻的20%的一组
最后，在所有男性公民中，在最矮的20%高度和体重最轻的20%中，再将其分成医院访问次数最少的20%的一组：这将是第一组
对所有可能的组合重复此过程

第一部分： 使用DPLYR中的“.add = TRUE”参数，我认为可以按以下方式完成：

nested_combinations <- my_data %>%
  group_by(Gender) %>%
  group_by(Status, add = TRUE) %>%
  mutate(height_group = ntile(Height, 5)) %>%
  group_by(height_group, add = TRUE) %>%
  mutate(weight_group = ntile(Weight, 5)) %>%
  group_by(weight_group, add = TRUE) %>%
  mutate(visits_group = ntile(Hospital_Visits, 5)) %>%
  group_by(visits_group, add = TRUE) %>%
  summarize(total_count = n(),
            disease_count = sum(Disease == "Yes"),
            disease_proportion = mean(Disease == "Yes"))

第二部分： 接下来，我通过计算这些组的“范围”（即最小值和最大值）创建了一个“查找表”：

table_data <- data.frame(
 Groups = paste0("Group ", 1:5),
  Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
  Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
  Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
  Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
 Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), min),
Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), max)
)

我的问题是： 是否有一种方法可以将第二部分中不同变量的最小/最大范围插入到第一部分作为新列（例如，min_height，max_height，min_weight，max_weight，min_visits，max_visits）？

目前，我正在使用一系列“ifelse”语句来完成这个操作，但这似乎不太高效。有人能否请教我更好的方法？

谢谢！

英文:

I am working with the R programming language.

I have the following dataset on medical characteristics of patients and disease prevalance:

set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender &lt;- c(&quot;Male&quot;,&quot;Female&quot;)
gender &lt;- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender &lt;- as.factor(gender)


status &lt;- c(&quot;Immigrant&quot;,&quot;Citizen&quot;)
status &lt;- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status  &lt;- as.factor(status )

Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20,  5000, replace = TRUE)

################

disease &lt;- c(&quot;Yes&quot;,&quot;No&quot;)
disease &lt;- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease &lt;- as.factor(disease)

###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)

  Patient_ID Gender    Status   Height    Weight Hospital_Visits Disease
1          1 Female   Citizen 145.0583 113.70725               1      No
2          2   Male Immigrant 161.2759  88.33188              18      No
3          3 Female Immigrant 138.5305  99.26961               6     Yes
4          4   Male   Citizen 164.8102  84.31848              12      No
5          5   Male   Citizen 159.1619  92.25090              12     Yes
6          6 Female   Citizen 153.3513 101.31986              11     Yes

Based on this dataset, I am trying to calculate the disease proportions within "nested groups", i.e.

First, select all males
Then, select all male citizens
Then, out of the set of all male citizens - identify a group of 20% of this set with the smallest heights
Then, out of the set of all male citizens within the shortest 20% height - further isolate a group of 20% with the smallest weights
Finally, out of the set of all male citizens within the shortest 20% height and within the shortest 20% height having the 20% smallest weight - further isolate them into a group with the 20% fewest number of hospital visits : This will now be the first group
Repeat this process for all possible group combinations

Part 1: Using the ".add = TRUE" argument within DPLYR, I think I can accomplish this as follows:

nested_combinations &lt;- my_data %&gt;%
  group_by(Gender) %&gt;%
  group_by(Status, add = TRUE) %&gt;%
  mutate(height_group = ntile(Height, 5)) %&gt;%
  group_by(height_group, add = TRUE) %&gt;%
  mutate(weight_group = ntile(Weight, 5)) %&gt;%
  group_by(weight_group, add = TRUE) %&gt;%
  mutate(visits_group = ntile(Hospital_Visits, 5)) %&gt;%
  group_by(visits_group, add = TRUE) %&gt;%
  summarize(total_count = n(),
            disease_count = sum(Disease == &quot;Yes&quot;),
            disease_proportion = mean(Disease == &quot;Yes&quot;))

# results 

  Gender Status  height_group weight_group visits_group total_count disease_count disease_proportion
   &lt;fct&gt;  &lt;fct&gt;          &lt;int&gt;        &lt;int&gt;        &lt;int&gt;       &lt;int&gt;         &lt;int&gt;              &lt;dbl&gt;
 1 Female Citizen            1            1            1          16             5              0.312
 2 Female Citizen            1            1            2          16             4              0.25 
 3 Female Citizen            1            1            3          16             7              0.438
 4 Female Citizen            1            1            4          15             4              0.267
 5 Female Citizen            1            1            5          15             8              0.533
 6 Female Citizen            1            2            1          16             5              0.312
 7 Female Citizen            1            2            2          16             4              0.25 
 8 Female Citizen            1            2            3          16             8              0.5  
 9 Female Citizen            1            2            4          15             6              0.4  
10 Female Citizen            1            2            5          15             6              0.4

Part 2: Next, I created a "Lookup Table" by calculating the "ranges" (i.e. min and max) for each of these groups:

table_data &lt;- data.frame(
 Groups = paste0(&quot;Group &quot;, 1:5),
  Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
  Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
  Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
  Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
 Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), min),
Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), max)
)

# results

   Groups Min_Height Max_Height Min_Weight Max_Weight Min_Visits Max_Visits
1 Group 1   111.5468   141.4839   56.53098   81.83402          1          4
2 Group 2   141.4965   147.4422   81.85064   87.45406          4          8
3 Group 3   147.4487   152.3924   87.45935   92.72041          8         12
4 Group 4   152.4016   158.5178   92.72941   98.54624         12         17
5 Group 5   158.5187   188.4777   98.55533  121.02420         17         20

My Question: Is there a way I can take these min/max ranges for the different variables from Part 2 and insert them into Part 1 as new columns (e.g. min_height, max_height, min_weight, max_weight, min_visits, max_visits)?

Currently I am doing this with a series of "ifelse" statements but this does not seem very efficient. Can someone please show me a better way?

Thanks!

答案1

得分: 1

## 示例数据
d &lt;-
d &lt;- data.frame(gender = gl(2, 10),
           height = 160 + sample(1:40, 40),
           weight = 50 + sample(1:50, 40),
           disease = sample(c(TRUE, FALSE), 40, replace = TRUE)
           )

d |&gt;
  group_by(gender) |&gt;
  mutate(low_height = height &lt; quantile(height, .2)) |&gt;
  group_by(gender, low_height) |&gt;
  mutate(low_weight = weight &lt; quantile(weight, .2)) |&gt;
  group_by(gender, low_height, low_weight) |&gt;
  summarise(across(c(height, weight),
                   ## 在这里列出自定义统计数据:
                   list(min = \(xs) min(xs, na.rm = TRUE),
                        max = \(xs) max(xs, na.rm = TRUE)
                        ),
                   .names = &quot;{.col}_{.fn}&quot;
                   ),
            prop_disease = sum(disease)/n(),
            ## 等等。
)

# 一个数据框：8 x 8
# 分组：gender, low_height [4]
  gender low_height low_weight height_min height_max weight_min weight_max
  &lt;fct&gt;  &lt;lgl&gt;      &lt;lgl&gt;           &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;
1 1      FALSE      FALSE             172        199         67        100
2 1      FALSE      TRUE              173        190         52         65
3 1      TRUE       FALSE             161        169         74         94
4 1      TRUE       TRUE              165        165         61         61
5 2      FALSE      FALSE             168        200         56         96
6 2      FALSE      TRUE              170        192         51         54
7 2      TRUE       FALSE             164        167         68         93
8 2      TRUE       TRUE              163        163         55         55
# 另有 1 个变量：prop_disease &lt;dbl&gt;

英文:

something along these lines?

## example data
d &lt;-
d &lt;- data.frame(gender = gl(2, 10),
           height = 160 + sample(1:40, 40),
           weight = 50 + sample(1:50, 40),
           disease = sample(c(TRUE, FALSE), 40, replace = TRUE)
           )

d |&gt;
  group_by(gender) |&gt;
  mutate(low_height = height &lt; quantile(height, .2)) |&gt;
  group_by(gender, low_height) |&gt;
  mutate(low_weight = weight &lt; quantile(weight, .2)) |&gt;
  group_by(gender, low_height, low_weight) |&gt;
  summarise(across(c(height, weight),
                   ## list custom stats here:
                   list(min = \(xs) min(xs, na.rm = TRUE),
                        max = \(xs) max(xs, na.rm = TRUE)
                        ),
                   .names = &quot;{.col}_{.fn}&quot;
                   ),
            prop_disease = sum(disease)/n(),
            ## etc.
)

# A tibble: 8 x 8
# Groups:   gender, low_height [4]
  gender low_height low_weight height_min height_max weight_min weight_max
  &lt;fct&gt;  &lt;lgl&gt;      &lt;lgl&gt;           &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;
1 1      FALSE      FALSE             172        199         67        100
2 1      FALSE      TRUE              173        190         52         65
3 1      TRUE       FALSE             161        169         74         94
4 1      TRUE       TRUE              165        165         61         61
5 2      FALSE      FALSE             168        200         56         96
6 2      FALSE      TRUE              170        192         51         54
7 2      TRUE       FALSE             164        167         68         93
8 2      TRUE       TRUE              163        163         55         55
# i 1 more variable: prop_disease &lt;dbl&gt;

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

合并查找表与数据框。

问题

答案1

如何在ggsurvplot中插入一张图片？

将复杂的长宽数据集转换为R中的长数据集

如何使用dplyr和purrr计算总和？

你怎样从一个矩阵（2D数组）中获取一个数据框？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论