R: Calculating Proportions Based on Nested Groups

# Question

I am working with the R programming language.

I have the following dataset:

    set.seed(123)
    library(dplyr)

    Patient_ID = 1:5000
    gender <- c("Male","Female")
    gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
    Gender <- as.factor(gender)
    status <- c("Immigrant","Citizen")
    status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
    Status <- as.factor(status)
    Height = rnorm(5000, 150, 10)
    Weight = rnorm(5000, 90, 10)
    Hospital_Visits = sample.int(20, 5000, replace = TRUE)

    ################

    disease <- c("Yes","No")
    disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
    Disease <- as.factor(disease)

    ###################

    my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)

      Patient_ID Gender    Status   Height    Weight Hospital_Visits Disease
    1          1 Female   Citizen 145.0583 113.70725               1      No
    2          2   Male Immigrant 161.2759  88.33188              18      No
    3          3 Female Immigrant 138.5305  99.26961               6     Yes
    4          4   Male   Citizen 164.8102  84.31848              12      No
    5          5   Male   Citizen 159.1619  92.25090              12     Yes
    6          6 Female   Citizen 153.3513 101.31986              11     Yes

**I am trying to accomplish the following task:**

- Select all rows where Gender = Male AND Status = Citizen
- Based on this condition, split the "Height" variable into 3 "ntiles"
- Now, based on these (nested) cumulative conditions (i.e. Gender = Male & Status = Citizen & Height_ntile = 1, Gender = Male & Status = Citizen & Height_ntile = 2, Gender = Male & Status = Citizen & Height_ntile = 3), split the "Weight" variable into 3 "ntiles"
- Now, based on these added cumulative conditions (i.e. Gender = Male & Status = Citizen & Height_ntile = 1, Weight_ntile = 1, etc.), split "Hospital_Visits" into 3 "ntiles"
- Finally, based on these added cumulative conditions (Gender = Male & Status = Citizen & Height_ntile = 1, Weight_ntile = 1, Hospital_Visits_ntile = 1, etc.), calculate the percentage of patients who have the disease
- Repeat for Gender = Female AND Status = Citizen, etc.

Based on a previous question (https://stackoverflow.com/questions/75014703/r-interchanging-quantile-and-ntile-functions), I tried to adapt the code for this question:

    results <- my_data %>%
      group_by(Gender, Status) %>%
      mutate(group = ntile(Height, 3)) %>%
      group_by(Gender, Status, group) %>%
      mutate(min = min(Height),
             max = max(Height)) %>%
      mutate(range_Height = paste(min, max, sep = "-")) %>%
      ungroup() %>%
      group_by(Gender, Status, range_Height) %>%
      mutate(group = ntile(Weight, 3)) %>%
      group_by(Gender, Status, range_Height, group) %>%
      mutate(min = min(Weight),
             max = max(Weight)) %>%
      mutate(range_Weight = paste(min, max, sep = "-")) %>%
      ungroup() %>%
      group_by(Gender, Status, range_Height, range_Weight) %>%
      mutate(group = ntile(Hospital_Visits, 3)) %>%
      group_by(Gender, Status, range_Height, range_Weight, group) %>%
      mutate(min = min(Hospital_Visits),
             max = max(Hospital_Visits)) %>%
      mutate(range_Hospital_Visits = paste(min, max, sep = "-")) %>%
      ungroup() %>%
      group_by(Gender, Status, range_Height, range_Weight, range_Hospital_Visits) %>%
      summarise(disease_rate = mean(Disease), count = n())

**The code seems to have run, but with several warnings, and the "disease_rate" column is filled with NAs:**

    There were 50 or more warnings (use warnings() to see the first 50)

    > results
    # A tibble: 108 x 7
    # Groups:   Gender, Status, range_Height, range_Weight [36]
       Gender Status  range_Height                      range_Weight                      range_Hospita~1 disea~2 count
       <fct>  <fct>   <chr>                             <chr>                             <chr>             <dbl> <int>
     1 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 1-7                  NA    72
     2 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 14-20                NA    71
     3 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 8-14                 NA    72
     4 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 1-8                  NA    72
     5 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 14-20                NA    71
     6 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 8-14                 NA    72
     7 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  1-6                  NA    72
     8 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  14-20                NA    71
     9 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  7-14                 NA    71
    10 Female Citizen 145.52366174493-153.896976138595  60.0595633761862-86.5042722953509 1-8                  NA    72
    # ... with 98 more rows, and abbreviated variable names 1: range_Hospital_Visits, 2: disease_rate
    # i Use `print(n = ...)` to see more rows

Can someone please show me how to fix this problem?

Thanks!

Note: This should NOT be the same as grouping by a variable and calculating ntiles, then interrupting the process, grouping by the previous variable and the previous ntiles, and then calculating new ntiles. I would like the grouping and ntile calculations to be uninterrupted.
# Answer 1

**Score**: 1

The OP's answer is close, but not quite right: `group_by()` needs `.add = TRUE` when you want to add to the existing grouping rather than override it.

    library(dplyr)

    my_data %>%
      group_by(Gender, Status) %>%
      mutate(Height_ntile = ntile(Height, 3)) %>%
      # .add = TRUE keeps Gender and Status in the grouping
      group_by(Height_ntile, .add = TRUE) %>%
      mutate(Weight_ntile = ntile(Weight, 3)) %>%
      group_by(Weight_ntile, .add = TRUE) %>%
      mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3)) %>%
      group_by(Hospital_Visits_ntile, .add = TRUE) %>%
      # Disease is a factor, so compare it to "Yes" before averaging
      summarize(percent_disease = mean(Disease == "Yes"),
                count = n(),
                .groups = "drop")
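
Besides the grouping, the `summarize()` step above uses `mean(Disease == "Yes")` where the question's code has `mean(Disease)`. Here is a minimal base-R sketch of why that matters, assuming `Disease` is a factor as in the question's data (the `disease` vector below is made up for illustration). Calling `mean()` on a factor returns `NA` with a warning, which would explain the `NA` column, whereas the logical comparison averages to a proportion.

```r
# Made-up factor standing in for my_data$Disease (illustration only)
disease <- factor(c("Yes", "No", "No", "Yes", "No"))

# mean() is not defined for factors: it returns NA and emits a warning
rate_factor <- suppressWarnings(mean(disease))

# Comparing to "Yes" gives a logical vector; its mean is the share of TRUE
rate_logical <- mean(disease == "Yes")

print(rate_factor)   # NA
print(rate_logical)  # 0.4
```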

[addition]

The default value of `.add` is `FALSE`. This means that when you call `group_by()` a second time, the grouping created by the first call is dropped and a new one is made (i.e. it is overridden). Below is an example:

    my_data %>%
      group_by(Gender, Status) %>%
      mutate(Height_ntile = ntile(Height, 3)) %>%
      group_by(Height_ntile)  # the default value of .add is FALSE
    # A tibble: 5,000 × 8
    # Groups:   Height_ntile [3]
    # ...

    my_data %>%
      group_by(Gender, Status) %>%
      mutate(Height_ntile = ntile(Height, 3)) %>%
      group_by(Height_ntile, .add = TRUE)
    # A tibble: 5,000 × 8
    # Groups:   Gender, Status, Height_ntile [12]
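
To see what the dropped grouping does to the next level of ntiles, here is a small sketch on made-up data. The data frame `toy` and its columns `g`, `h` and `w` are hypothetical stand-ins for the grouping, Height and Weight columns, and a dplyr version that has the `.add` argument is assumed. With the grouping overridden, `w` is ranked across both values of `g`. With `.add = TRUE`, it is ranked inside each `g` by `h_ntile` cell.

```r
library(dplyr)

# Made-up toy data (illustration only): two groups, four rows each
toy <- data.frame(
  g = rep(c("a", "b"), each = 4),
  h = c(1, 2, 3, 4, 1, 2, 3, 4),
  w = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# First level: split h into 2 ntiles within each g
step1 <- toy %>%
  group_by(g) %>%
  mutate(h_ntile = ntile(h, 2))

# Grouping overridden: w is ranked within h_ntile only, mixing "a" and "b"
overridden <- step1 %>%
  group_by(h_ntile) %>%
  mutate(w_ntile = ntile(w, 2)) %>%
  ungroup()

# Grouping extended: w is ranked within each g x h_ntile cell
nested <- step1 %>%
  group_by(h_ntile, .add = TRUE) %>%
  mutate(w_ntile = ntile(w, 2)) %>%
  ungroup()

print(overridden$w_ntile)  # 1 1 1 1 2 2 2 2
print(nested$w_ntile)      # 1 2 1 2 1 2 1 2
```

In the overridden version every row of group "b" lands in the upper ntile simply because its `w` values are larger than those of group "a", which is not the nested split the question asks for.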

I couldn't come up with a better idea for the ranges, so I did it with simple but long code:

    my_data %>%
      group_by(Gender, Status) %>%
      mutate(Height_ntile = ntile(Height, 3),
             Height_range = paste(min(Height), max(Height), sep = "-")) %>%
      group_by(Height_ntile, Height_range, .add = TRUE) %>%
      mutate(Weight_ntile = ntile(Weight, 3),
             Weight_range = paste(min(Weight), max(Weight), sep = "-")) %>%
      group_by(Weight_ntile, Weight_range, .add = TRUE) %>%
      mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3),
             Hospital_range = paste(min(Hospital_Visits), max(Hospital_Visits), sep = "-")) %>%
      group_by(Hospital_Visits_ntile, Hospital_range, .add = TRUE) %>%
      summarize(percent_disease = mean(Disease == "Yes"),
                count = n(),
                .groups = "drop")
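
One thing to check in the code above: each `*_range` column is created in the same `mutate()` as its ntile, before that ntile becomes a grouping variable. As far as I can tell, it therefore holds the min-max of the enclosing group rather than of the individual ntile. If the range of each ntile itself is wanted, one possible variation is to add the ntile to the grouping first and compute the label afterwards. The sketch below shows this on made-up data. The helper `range_label()` and the data frame `toy` are hypothetical and not part of the original answer, and the rounding is only there to keep the labels short.

```r
library(dplyr)

# Hypothetical helper (not from the original answer): "min-max" label, rounded
range_label <- function(x, digits = 1) {
  paste(round(min(x), digits), round(max(x), digits), sep = "-")
}

# Made-up toy data (illustration only)
toy <- data.frame(
  g = rep(c("a", "b"), each = 6),
  x = c(1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60)
)

labelled <- toy %>%
  group_by(g) %>%
  mutate(x_ntile = ntile(x, 3)) %>%
  group_by(x_ntile, .add = TRUE) %>%     # regroup first ...
  mutate(x_range = range_label(x)) %>%   # ... so min/max are taken per ntile
  ungroup()

print(unique(labelled$x_range))
# "1-2" "3-4" "5-6" "10-20" "30-40" "50-60"
```

The same pattern could be repeated for each nested level (Weight, then Hospital_Visits), at the cost of one extra `mutate()` per level.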

# Answer 2

**Score**: 0

I am thinking that perhaps I overcomplicated my approach to this problem. Is this answer correct?

    library(dplyr)

    my_data %>% group_by(Gender, Status) %>%
      mutate(Height_ntile = ntile(Height, 3)) %>%
      group_by(Height_ntile) %>%
      mutate(Weight_ntile = ntile(Weight, 3)) %>%
      group_by(Height_ntile, Weight_ntile) %>%
      mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3)) %>%
      group_by(Gender, Status, Height_ntile, Weight_ntile, Hospital_Visits_ntile) %>%
      summarize(percent_disease = mean(Disease == "Yes"), count = n())

I am still trying to figure out how to "paste the ranges" for each ntile.

huangapple
  • Posted on 2023-01-06 13:00:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75027090.html