合并查找表与数据框。

huangapple go评论91阅读模式
英文:

R: Merging a Lookup Table with a Data Frame

问题

以下是要翻译的部分:

基于这个数据集,我试图计算“嵌套组”内的疾病比例,即:

  • 首先,选择所有男性
  • 然后,选择所有男性公民
  • 然后,在所有男性公民中,识别最矮的20%的一组
  • 然后,在所有男性公民中,在最矮的20%高度内,进一步分离出体重最轻的20%的一组
  • 最后,在所有男性公民中,在最矮的20%高度和体重最轻的20%中,再将其分成医院访问次数最少的20%的一组:这将是第一组
  • 对所有可能的组合重复此过程

第一部分: 使用DPLYR中的“.add = TRUE”参数,我认为可以按以下方式完成:

  1. nested_combinations <- my_data %>%
  2. group_by(Gender) %>%
  3. group_by(Status, add = TRUE) %>%
  4. mutate(height_group = ntile(Height, 5)) %>%
  5. group_by(height_group, add = TRUE) %>%
  6. mutate(weight_group = ntile(Weight, 5)) %>%
  7. group_by(weight_group, add = TRUE) %>%
  8. mutate(visits_group = ntile(Hospital_Visits, 5)) %>%
  9. group_by(visits_group, add = TRUE) %>%
  10. summarize(total_count = n(),
  11. disease_count = sum(Disease == "Yes"),
  12. disease_proportion = mean(Disease == "Yes"))

第二部分: 接下来,我通过计算这些组的“范围”(即最小值和最大值)创建了一个“查找表”:

  1. table_data <- data.frame(
  2. Groups = paste0("Group ", 1:5),
  3. Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
  4. Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
  5. Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
  6. Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
  7. Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), min),
  8. Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), max)
  9. )

我的问题是: 是否有一种方法可以将第二部分中不同变量的最小/最大范围插入到第一部分作为新列(例如,min_height,max_height,min_weight,max_weight,min_visits,max_visits)?

目前,我正在使用一系列“ifelse”语句来完成这个操作,但这似乎不太高效。有人能否请教我更好的方法?

谢谢!

英文:

I am working with the R programming language.

I have the following dataset on medical characteristics of patients and disease prevalance:

  1. set.seed(123)
  2. library(dplyr)
  3. Patient_ID = 1:5000
  4. gender &lt;- c(&quot;Male&quot;,&quot;Female&quot;)
  5. gender &lt;- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
  6. Gender &lt;- as.factor(gender)
  7. status &lt;- c(&quot;Immigrant&quot;,&quot;Citizen&quot;)
  8. status &lt;- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
  9. Status &lt;- as.factor(status )
  10. Height = rnorm(5000, 150, 10)
  11. Weight = rnorm(5000, 90, 10)
  12. Hospital_Visits = sample.int(20, 5000, replace = TRUE)
  13. ################
  14. disease &lt;- c(&quot;Yes&quot;,&quot;No&quot;)
  15. disease &lt;- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
  16. Disease &lt;- as.factor(disease)
  17. ###################
  18. my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)
  19. Patient_ID Gender Status Height Weight Hospital_Visits Disease
  20. 1 1 Female Citizen 145.0583 113.70725 1 No
  21. 2 2 Male Immigrant 161.2759 88.33188 18 No
  22. 3 3 Female Immigrant 138.5305 99.26961 6 Yes
  23. 4 4 Male Citizen 164.8102 84.31848 12 No
  24. 5 5 Male Citizen 159.1619 92.25090 12 Yes
  25. 6 6 Female Citizen 153.3513 101.31986 11 Yes

Based on this dataset, I am trying to calculate the disease proportions within "nested groups", i.e.

  • First, select all males
  • Then, select all male citizens
  • Then, out of the set of all male citizens - identify a group of 20% of this set with the smallest heights
  • Then, out of the set of all male citizens within the shortest 20% height - further isolate a group of 20% with the smallest weights
  • Finally, out of the set of all male citizens within the shortest 20% height and within the shortest 20% height having the 20% smallest weight - further isolate them into a group with the 20% fewest number of hospital visits : This will now be the first group
  • Repeat this process for all possible group combinations

Part 1: Using the ".add = TRUE" argument within DPLYR, I think I can accomplish this as follows:

  1. nested_combinations &lt;- my_data %&gt;%
  2. group_by(Gender) %&gt;%
  3. group_by(Status, add = TRUE) %&gt;%
  4. mutate(height_group = ntile(Height, 5)) %&gt;%
  5. group_by(height_group, add = TRUE) %&gt;%
  6. mutate(weight_group = ntile(Weight, 5)) %&gt;%
  7. group_by(weight_group, add = TRUE) %&gt;%
  8. mutate(visits_group = ntile(Hospital_Visits, 5)) %&gt;%
  9. group_by(visits_group, add = TRUE) %&gt;%
  10. summarize(total_count = n(),
  11. disease_count = sum(Disease == &quot;Yes&quot;),
  12. disease_proportion = mean(Disease == &quot;Yes&quot;))
  13. # results
  14. Gender Status height_group weight_group visits_group total_count disease_count disease_proportion
  15. &lt;fct&gt; &lt;fct&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
  16. 1 Female Citizen 1 1 1 16 5 0.312
  17. 2 Female Citizen 1 1 2 16 4 0.25
  18. 3 Female Citizen 1 1 3 16 7 0.438
  19. 4 Female Citizen 1 1 4 15 4 0.267
  20. 5 Female Citizen 1 1 5 15 8 0.533
  21. 6 Female Citizen 1 2 1 16 5 0.312
  22. 7 Female Citizen 1 2 2 16 4 0.25
  23. 8 Female Citizen 1 2 3 16 8 0.5
  24. 9 Female Citizen 1 2 4 15 6 0.4
  25. 10 Female Citizen 1 2 5 15 6 0.4

Part 2: Next, I created a "Lookup Table" by calculating the "ranges" (i.e. min and max) for each of these groups:

  1. table_data &lt;- data.frame(
  2. Groups = paste0(&quot;Group &quot;, 1:5),
  3. Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
  4. Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
  5. Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
  6. Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
  7. Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), min),
  8. Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), max)
  9. )
  10. # results
  11. Groups Min_Height Max_Height Min_Weight Max_Weight Min_Visits Max_Visits
  12. 1 Group 1 111.5468 141.4839 56.53098 81.83402 1 4
  13. 2 Group 2 141.4965 147.4422 81.85064 87.45406 4 8
  14. 3 Group 3 147.4487 152.3924 87.45935 92.72041 8 12
  15. 4 Group 4 152.4016 158.5178 92.72941 98.54624 12 17
  16. 5 Group 5 158.5187 188.4777 98.55533 121.02420 17 20

My Question: Is there a way I can take these min/max ranges for the different variables from Part 2 and insert them into Part 1 as new columns (e.g. min_height, max_height, min_weight, max_weight, min_visits, max_visits)?

Currently I am doing this with a series of "ifelse" statements but this does not seem very efficient. Can someone please show me a better way?

Thanks!

答案1

得分: 1

  1. ## 示例数据
  2. d &lt;-
  3. d &lt;- data.frame(gender = gl(2, 10),
  4. height = 160 + sample(1:40, 40),
  5. weight = 50 + sample(1:50, 40),
  6. disease = sample(c(TRUE, FALSE), 40, replace = TRUE)
  7. )
  1. d |&gt;
  2. group_by(gender) |&gt;
  3. mutate(low_height = height &lt; quantile(height, .2)) |&gt;
  4. group_by(gender, low_height) |&gt;
  5. mutate(low_weight = weight &lt; quantile(weight, .2)) |&gt;
  6. group_by(gender, low_height, low_weight) |&gt;
  7. summarise(across(c(height, weight),
  8. ## 在这里列出自定义统计数据:
  9. list(min = \(xs) min(xs, na.rm = TRUE),
  10. max = \(xs) max(xs, na.rm = TRUE)
  11. ),
  12. .names = &quot;{.col}_{.fn}&quot;
  13. ),
  14. prop_disease = sum(disease)/n(),
  15. ## 等等。
  16. )
  1. # 一个数据框:8 x 8
  2. # 分组:gender, low_height [4]
  3. gender low_height low_weight height_min height_max weight_min weight_max
  4. &lt;fct&gt; &lt;lgl&gt; &lt;lgl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
  5. 1 1 FALSE FALSE 172 199 67 100
  6. 2 1 FALSE TRUE 173 190 52 65
  7. 3 1 TRUE FALSE 161 169 74 94
  8. 4 1 TRUE TRUE 165 165 61 61
  9. 5 2 FALSE FALSE 168 200 56 96
  10. 6 2 FALSE TRUE 170 192 51 54
  11. 7 2 TRUE FALSE 164 167 68 93
  12. 8 2 TRUE TRUE 163 163 55 55
  13. # 另有 1 个变量:prop_disease &lt;dbl&gt;
英文:

something along these lines?

  1. ## example data
  2. d &lt;-
  3. d &lt;- data.frame(gender = gl(2, 10),
  4. height = 160 + sample(1:40, 40),
  5. weight = 50 + sample(1:50, 40),
  6. disease = sample(c(TRUE, FALSE), 40, replace = TRUE)
  7. )
  1. d |&gt;
  2. group_by(gender) |&gt;
  3. mutate(low_height = height &lt; quantile(height, .2)) |&gt;
  4. group_by(gender, low_height) |&gt;
  5. mutate(low_weight = weight &lt; quantile(weight, .2)) |&gt;
  6. group_by(gender, low_height, low_weight) |&gt;
  7. summarise(across(c(height, weight),
  8. ## list custom stats here:
  9. list(min = \(xs) min(xs, na.rm = TRUE),
  10. max = \(xs) max(xs, na.rm = TRUE)
  11. ),
  12. .names = &quot;{.col}_{.fn}&quot;
  13. ),
  14. prop_disease = sum(disease)/n(),
  15. ## etc.
  16. )
  1. # A tibble: 8 x 8
  2. # Groups: gender, low_height [4]
  3. gender low_height low_weight height_min height_max weight_min weight_max
  4. &lt;fct&gt; &lt;lgl&gt; &lt;lgl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
  5. 1 1 FALSE FALSE 172 199 67 100
  6. 2 1 FALSE TRUE 173 190 52 65
  7. 3 1 TRUE FALSE 161 169 74 94
  8. 4 1 TRUE TRUE 165 165 61 61
  9. 5 2 FALSE FALSE 168 200 56 96
  10. 6 2 FALSE TRUE 170 192 51 54
  11. 7 2 TRUE FALSE 164 167 68 93
  12. 8 2 TRUE TRUE 163 163 55 55
  13. # i 1 more variable: prop_disease &lt;dbl&gt;

huangapple
  • 本文由 发表于 2023年6月26日 00:28:30
  • 转载请务必保留本文链接:https://go.coder-hub.com/76551409.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定