合并查找表与数据框。

huangapple go评论67阅读模式
英文:

R: Merging a Lookup Table with a Data Frame

问题

以下是要翻译的部分:

基于这个数据集,我试图计算“嵌套组”内的疾病比例,即:

  • 首先,选择所有男性
  • 然后,选择所有男性公民
  • 然后,在所有男性公民中,识别最矮的20%的一组
  • 然后,在所有男性公民中,在最矮的20%高度内,进一步分离出体重最轻的20%的一组
  • 最后,在所有男性公民中,在最矮的20%高度和体重最轻的20%中,再将其分成医院访问次数最少的20%的一组:这将是第一组
  • 对所有可能的组合重复此过程

第一部分: 使用DPLYR中的“.add = TRUE”参数,我认为可以按以下方式完成:

nested_combinations <- my_data %>%
  group_by(Gender) %>%
  group_by(Status, add = TRUE) %>%
  mutate(height_group = ntile(Height, 5)) %>%
  group_by(height_group, add = TRUE) %>%
  mutate(weight_group = ntile(Weight, 5)) %>%
  group_by(weight_group, add = TRUE) %>%
  mutate(visits_group = ntile(Hospital_Visits, 5)) %>%
  group_by(visits_group, add = TRUE) %>%
  summarize(total_count = n(),
            disease_count = sum(Disease == "Yes"),
            disease_proportion = mean(Disease == "Yes"))

第二部分: 接下来,我通过计算这些组的“范围”(即最小值和最大值)创建了一个“查找表”:

table_data <- data.frame(
 Groups = paste0("Group ", 1:5),
  Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
  Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
  Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
  Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
 Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), min),
Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), max)
)

我的问题是: 是否有一种方法可以将第二部分中不同变量的最小/最大范围插入到第一部分作为新列(例如,min_height,max_height,min_weight,max_weight,min_visits,max_visits)?

目前,我正在使用一系列“ifelse”语句来完成这个操作,但这似乎不太高效。有人能否请教我更好的方法?

谢谢!

英文:

I am working with the R programming language.

I have the following dataset on medical characteristics of patients and disease prevalance:

set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender &lt;- c(&quot;Male&quot;,&quot;Female&quot;)
gender &lt;- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender &lt;- as.factor(gender)


status &lt;- c(&quot;Immigrant&quot;,&quot;Citizen&quot;)
status &lt;- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status  &lt;- as.factor(status )

Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20,  5000, replace = TRUE)

################

disease &lt;- c(&quot;Yes&quot;,&quot;No&quot;)
disease &lt;- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease &lt;- as.factor(disease)

###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)

  Patient_ID Gender    Status   Height    Weight Hospital_Visits Disease
1          1 Female   Citizen 145.0583 113.70725               1      No
2          2   Male Immigrant 161.2759  88.33188              18      No
3          3 Female Immigrant 138.5305  99.26961               6     Yes
4          4   Male   Citizen 164.8102  84.31848              12      No
5          5   Male   Citizen 159.1619  92.25090              12     Yes
6          6 Female   Citizen 153.3513 101.31986              11     Yes

Based on this dataset, I am trying to calculate the disease proportions within "nested groups", i.e.

  • First, select all males
  • Then, select all male citizens
  • Then, out of the set of all male citizens - identify a group of 20% of this set with the smallest heights
  • Then, out of the set of all male citizens within the shortest 20% height - further isolate a group of 20% with the smallest weights
  • Finally, out of the set of all male citizens within the shortest 20% height and within the shortest 20% height having the 20% smallest weight - further isolate them into a group with the 20% fewest number of hospital visits : This will now be the first group
  • Repeat this process for all possible group combinations

Part 1: Using the ".add = TRUE" argument within DPLYR, I think I can accomplish this as follows:

nested_combinations &lt;- my_data %&gt;%
  group_by(Gender) %&gt;%
  group_by(Status, add = TRUE) %&gt;%
  mutate(height_group = ntile(Height, 5)) %&gt;%
  group_by(height_group, add = TRUE) %&gt;%
  mutate(weight_group = ntile(Weight, 5)) %&gt;%
  group_by(weight_group, add = TRUE) %&gt;%
  mutate(visits_group = ntile(Hospital_Visits, 5)) %&gt;%
  group_by(visits_group, add = TRUE) %&gt;%
  summarize(total_count = n(),
            disease_count = sum(Disease == &quot;Yes&quot;),
            disease_proportion = mean(Disease == &quot;Yes&quot;))

# results 

  Gender Status  height_group weight_group visits_group total_count disease_count disease_proportion
   &lt;fct&gt;  &lt;fct&gt;          &lt;int&gt;        &lt;int&gt;        &lt;int&gt;       &lt;int&gt;         &lt;int&gt;              &lt;dbl&gt;
 1 Female Citizen            1            1            1          16             5              0.312
 2 Female Citizen            1            1            2          16             4              0.25 
 3 Female Citizen            1            1            3          16             7              0.438
 4 Female Citizen            1            1            4          15             4              0.267
 5 Female Citizen            1            1            5          15             8              0.533
 6 Female Citizen            1            2            1          16             5              0.312
 7 Female Citizen            1            2            2          16             4              0.25 
 8 Female Citizen            1            2            3          16             8              0.5  
 9 Female Citizen            1            2            4          15             6              0.4  
10 Female Citizen            1            2            5          15             6              0.4  

Part 2: Next, I created a "Lookup Table" by calculating the "ranges" (i.e. min and max) for each of these groups:

table_data &lt;- data.frame(
 Groups = paste0(&quot;Group &quot;, 1:5),
  Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
  Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
  Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
  Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
 Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), min),
Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), max)
)

# results

   Groups Min_Height Max_Height Min_Weight Max_Weight Min_Visits Max_Visits
1 Group 1   111.5468   141.4839   56.53098   81.83402          1          4
2 Group 2   141.4965   147.4422   81.85064   87.45406          4          8
3 Group 3   147.4487   152.3924   87.45935   92.72041          8         12
4 Group 4   152.4016   158.5178   92.72941   98.54624         12         17
5 Group 5   158.5187   188.4777   98.55533  121.02420         17         20

My Question: Is there a way I can take these min/max ranges for the different variables from Part 2 and insert them into Part 1 as new columns (e.g. min_height, max_height, min_weight, max_weight, min_visits, max_visits)?

Currently I am doing this with a series of "ifelse" statements but this does not seem very efficient. Can someone please show me a better way?

Thanks!

答案1

得分: 1

## 示例数据
d &lt;-
d &lt;- data.frame(gender = gl(2, 10),
           height = 160 + sample(1:40, 40),
           weight = 50 + sample(1:50, 40),
           disease = sample(c(TRUE, FALSE), 40, replace = TRUE)
           )
d |&gt;
  group_by(gender) |&gt;
  mutate(low_height = height &lt; quantile(height, .2)) |&gt;
  group_by(gender, low_height) |&gt;
  mutate(low_weight = weight &lt; quantile(weight, .2)) |&gt;
  group_by(gender, low_height, low_weight) |&gt;
  summarise(across(c(height, weight),
                   ## 在这里列出自定义统计数据:
                   list(min = \(xs) min(xs, na.rm = TRUE),
                        max = \(xs) max(xs, na.rm = TRUE)
                        ),
                   .names = &quot;{.col}_{.fn}&quot;
                   ),
            prop_disease = sum(disease)/n(),
            ## 等等。
)
# 一个数据框:8 x 8
# 分组:gender, low_height [4]
  gender low_height low_weight height_min height_max weight_min weight_max
  &lt;fct&gt;  &lt;lgl&gt;      &lt;lgl&gt;           &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;
1 1      FALSE      FALSE             172        199         67        100
2 1      FALSE      TRUE              173        190         52         65
3 1      TRUE       FALSE             161        169         74         94
4 1      TRUE       TRUE              165        165         61         61
5 2      FALSE      FALSE             168        200         56         96
6 2      FALSE      TRUE              170        192         51         54
7 2      TRUE       FALSE             164        167         68         93
8 2      TRUE       TRUE              163        163         55         55
# 另有 1 个变量:prop_disease &lt;dbl&gt;
英文:

something along these lines?

## example data
d &lt;-
d &lt;- data.frame(gender = gl(2, 10),
           height = 160 + sample(1:40, 40),
           weight = 50 + sample(1:50, 40),
           disease = sample(c(TRUE, FALSE), 40, replace = TRUE)
           )
d |&gt;
  group_by(gender) |&gt;
  mutate(low_height = height &lt; quantile(height, .2)) |&gt;
  group_by(gender, low_height) |&gt;
  mutate(low_weight = weight &lt; quantile(weight, .2)) |&gt;
  group_by(gender, low_height, low_weight) |&gt;
  summarise(across(c(height, weight),
                   ## list custom stats here:
                   list(min = \(xs) min(xs, na.rm = TRUE),
                        max = \(xs) max(xs, na.rm = TRUE)
                        ),
                   .names = &quot;{.col}_{.fn}&quot;
                   ),
            prop_disease = sum(disease)/n(),
            ## etc.
)
# A tibble: 8 x 8
# Groups:   gender, low_height [4]
  gender low_height low_weight height_min height_max weight_min weight_max
  &lt;fct&gt;  &lt;lgl&gt;      &lt;lgl&gt;           &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;
1 1      FALSE      FALSE             172        199         67        100
2 1      FALSE      TRUE              173        190         52         65
3 1      TRUE       FALSE             161        169         74         94
4 1      TRUE       TRUE              165        165         61         61
5 2      FALSE      FALSE             168        200         56         96
6 2      FALSE      TRUE              170        192         51         54
7 2      TRUE       FALSE             164        167         68         93
8 2      TRUE       TRUE              163        163         55         55
# i 1 more variable: prop_disease &lt;dbl&gt;

huangapple
  • 本文由 发表于 2023年6月26日 00:28:30
  • 转载请务必保留本文链接:https://go.coder-hub.com/76551409.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定