2023-01-06 13:00:50

R: Calculating Proportions Based on Nested Groups

# Question

I am working with the R programming language.

I have the following dataset:

    set.seed(123)
    library(dplyr)

    Patient_ID = 1:5000
    gender <- c("Male","Female")
    gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
    Gender <- as.factor(gender)

    status <- c("Immigrant","Citizen")
    status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
    Status  <- as.factor(status )

    Height = rnorm(5000, 150, 10)
    Weight = rnorm(5000, 90, 10)
    Hospital_Visits = sample.int(20,  5000, replace = TRUE)

    ################

    disease <- c("Yes","No")
    disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
    Disease <- as.factor(disease)

    ###################
    my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)

The first rows look like this:

      Patient_ID Gender    Status   Height    Weight Hospital_Visits Disease
    1          1 Female   Citizen 145.0583 113.70725               1      No
    2          2   Male Immigrant 161.2759  88.33188              18      No
    3          3 Female Immigrant 138.5305  99.26961               6     Yes
    4          4   Male   Citizen 164.8102  84.31848              12      No
    5          5   Male   Citizen 159.1619  92.25090              12     Yes
    6          6 Female   Citizen 153.3513 101.31986              11     Yes

**I am trying to accomplish the following task:**

- Select all rows where Gender = Male AND Status = Citizen
- Based on this condition, split the "Height" variable into 3 "ntiles"
- Now, based on these (nested) cumulative conditions (i.e. Gender = Male & Status = Citizen & Height_ntile = 1, Gender = Male & Status = Citizen & Height_ntile = 2, Gender = Male & Status = Citizen & Height_ntile = 3), split the "Weight" variable into 3 "ntiles"
- Now, based on these added cumulative conditions (i.e. Gender = Male & Status = Citizen & Height_ntile = 1, Weight_ntile = 1, etc.), split "Hospital_Visits" into 3 "ntiles"
- Finally, based on these added cumulative conditions (Gender = Male & Status = Citizen & Height_ntile = 1, Weight_ntile = 1, Hospital_Visits_ntile = 1, etc.), calculate the percentage of patients who have the disease
- Repeat for Gender = Female AND Status = Citizen, etc.

Based on a previous question (https://stackoverflow.com/questions/75014703/r-interchanging-quantile-and-ntile-functions), I tried to adapt the code for this question:

    results <- my_data %>%
      group_by(Gender, Status) %>%
      mutate(group = ntile(Height, 3)) %>%
      group_by(Gender, Status, group) %>%
      mutate(min = min(Height),
             max = max(Height)) %>%
      mutate(range_Height = paste(min, max, sep = "-")) %>%
      ungroup() %>%
      group_by(Gender, Status, range_Height) %>%
      mutate(group = ntile(Weight, 3)) %>%
      group_by(Gender, Status, range_Height, group) %>%
      mutate(min = min(Weight),
             max = max(Weight)) %>%
      mutate(range_Weight = paste(min, max, sep = "-")) %>%
      ungroup() %>%
      group_by(Gender, Status, range_Height, range_Weight) %>%
      mutate(group = ntile(Hospital_Visits, 3)) %>%
      group_by(Gender, Status, range_Height, range_Weight, group) %>%
      mutate(min = min(Hospital_Visits),
             max = max(Hospital_Visits)) %>%
      mutate(range_Hospital_Visits = paste(min, max, sep = "-")) %>%
      ungroup() %>%
      group_by(Gender, Status, range_Height, range_Weight, range_Hospital_Visits) %>%
      summarise(disease_rate = mean(Disease), count = n())

**The code seems to have run, but with several warnings, and the "disease_rate" column is filled with NA's:**

    There were 50 or more warnings (use warnings() to see the first 50)
    > results
    # A tibble: 108 x 7
    # Groups:   Gender, Status, range_Height, range_Weight [36]
       Gender Status  range_Height                      range_Weight                      range_Hospita~1 disea~2 count
       <fct>  <fct>   <chr>                             <chr>                             <chr>             <dbl> <int>
     1 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 1-7                  NA    72
     2 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 14-20                NA    71
     3 Female Citizen 115.862119946011-145.511095001117 58.9918455266656-85.4984683588085 8-14                 NA    72
     4 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 1-8                  NA    72
     5 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 14-20                NA    71
     6 Female Citizen 115.862119946011-145.511095001117 85.5651200062572-94.2555060657397 8-14                 NA    72
     7 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  1-6                  NA    72
     8 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  14-20                NA    71
     9 Female Citizen 115.862119946011-145.511095001117 94.342739710942-121.024202204596  7-14                 NA    71
    10 Female Citizen 145.52366174493-153.896976138595  60.0595633761862-86.5042722953509 1-8                  NA    72
    # ... with 98 more rows, and abbreviated variable names 1: range_Hospital_Visits, 2: disease_rate
    # i Use `print(n = ...)` to see more rows

Can someone please show me how to fix this problem?

Thanks!

Note: This should NOT be the same as grouping by a variable and calculating ntiles, then interrupting the process, grouping by the previous variable and previous ntiles, and then calculating new ntiles. I would like the grouping and ntile calculations to be uninterrupted.
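
The NA values and warnings are consistent with `Disease` being a factor: in base R, `mean()` applied to a factor returns `NA` with a warning, whereas comparing the factor to one of its levels gives a logical vector whose mean is a proportion. A minimal sketch, assuming a made-up five-element vector in place of the real column:

```R
# Made-up stand-in for the Disease column (not the asker's data)
disease <- factor(c("Yes", "No", "No", "Yes", "No"))

# mean() on a factor warns ("argument is not numeric or logical") and returns NA
rate_factor <- suppressWarnings(mean(disease))

# Comparing to a level first yields TRUE/FALSE, so the mean is the share of "Yes"
rate_yes <- mean(disease == "Yes")

print(rate_factor)  # NA
print(rate_yes)     # 0.4
```

Inside `summarise()`, the same idea would be written as `mean(Disease == "Yes")`.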
# Answer 1
**Score**: 1

OP's answer is close, but not quite right: `group_by()` needs `.add = TRUE` when you want to keep the existing groups instead of overriding them.

```R
my_data %>%
  group_by(Gender, Status) %>%
  mutate(Height_ntile = ntile(Height, 3)) %>%
  group_by(Height_ntile, .add = TRUE) %>%
  mutate(Weight_ntile = ntile(Weight, 3)) %>%
  group_by(Weight_ntile, .add = TRUE) %>%
  mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3)) %>%
  group_by(Hospital_Visits_ntile, .add = TRUE) %>%
  summarize(percent_disease = mean(Disease == "Yes"), 
            count = n(),
            .groups = "drop")
```

**Addition**

The default value of `.add` is `FALSE`. This means that when you call `group_by()` twice, the groups made by the first call are dropped and new groups are created (i.e. the grouping is overridden). Below is an example:

```R
my_data %>%
  group_by(Gender, Status) %>%
  mutate(Height_ntile = ntile(Height, 3)) %>%
  group_by(Height_ntile)  # the default value of .add is FALSE
# A tibble: 5,000 × 8
# Groups:   Height_ntile [3]
# ...

my_data %>%
  group_by(Gender, Status) %>%
  mutate(Height_ntile = ntile(Height, 3)) %>%
  group_by(Height_ntile, .add = TRUE)
# A tibble: 5,000 × 8
# Groups:   Gender, Status, Height_ntile [12]
```

I couldn't come up with a better idea for the ranges, so I did it with simple but long code:

```R
my_data %>%
  group_by(Gender, Status) %>%
  mutate(Height_ntile = ntile(Height, 3),
         Height_range = paste(min(Height), max(Height), sep = "-")) %>%
  group_by(Height_ntile, Height_range, .add = TRUE) %>%
  mutate(Weight_ntile = ntile(Weight, 3),
         Weight_range = paste(min(Weight), max(Weight), sep = "-")) %>%
  group_by(Weight_ntile, Weight_range, .add = TRUE) %>%
  mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3),
         Hospital_range = paste(min(Hospital_Visits), max(Hospital_Visits), sep = "-")) %>%
  group_by(Hospital_Visits_ntile, Hospital_range, .add = TRUE) %>%
  summarize(percent_disease = mean(Disease == "Yes"), 
            count = n(),
            .groups = "drop")
```
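
One thing to watch in the range pipeline above: each `*_range` is computed in the same `mutate()` as its ntile, so at that point the ntile is not yet part of the grouping and the range covers the whole parent group rather than a single ntile. If per-ntile ranges are what is wanted, one possible variant is to add the ntile to the grouping first and compute the range afterwards. The sketch below is not the answerer's code; it assumes a small simulated data frame `toy` with only two nesting levels, and `fmt_range()` is a hypothetical helper introduced here for readability.

```R
library(dplyr)

# Small simulated stand-in for my_data (hypothetical values, two levels only)
set.seed(1)
toy <- data.frame(
  Gender  = sample(c("Male", "Female"), 600, replace = TRUE),
  Height  = rnorm(600, 150, 10),
  Weight  = rnorm(600, 90, 10),
  Disease = sample(c("Yes", "No"), 600, replace = TRUE)
)

# Hypothetical helper: "min-max" label, rounded to one decimal
fmt_range <- function(x) paste(round(min(x), 1), round(max(x), 1), sep = "-")

res <- toy %>%
  group_by(Gender) %>%
  mutate(Height_ntile = ntile(Height, 3)) %>%
  # Add the ntile to the grouping BEFORE computing its range
  group_by(Height_ntile, .add = TRUE) %>%
  mutate(Height_range = fmt_range(Height),
         Weight_ntile = ntile(Weight, 3)) %>%
  group_by(Height_range, Weight_ntile, .add = TRUE) %>%
  mutate(Weight_range = fmt_range(Weight)) %>%
  group_by(Weight_range, .add = TRUE) %>%
  summarize(percent_disease = mean(Disease == "Yes"),
            count = n(),
            .groups = "drop")

print(res)
```

With this ordering, every Gender has three distinct `Height_range` labels, one per Height ntile. The same pattern could be extended with a third level for `Hospital_Visits`.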

# Answer 2
**Score**: 0

I am thinking that perhaps I overcomplicated the approach I was using to solve this problem. Is this answer correct?

    library(dplyr)
    my_data %>%
      group_by(Gender, Status) %>%
      mutate(Height_ntile = ntile(Height, 3)) %>%
      group_by(Height_ntile) %>%
      mutate(Weight_ntile = ntile(Weight, 3)) %>%
      group_by(Height_ntile, Weight_ntile) %>%
      mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3)) %>%
      group_by(Gender, Status, Height_ntile, Weight_ntile, Hospital_Visits_ntile) %>%
      summarize(percent_disease = mean(Disease == "Yes"), count = n())

I am still trying to figure out how to "paste the ranges" for each ntile.
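
One way to check a pipeline like the one above is to see what a second `group_by()` does to the ntiles on a tiny example. The sketch below uses a made-up four-row frame `toy` (an assumption, not the asker's data) in which everyone already shares the same `Height_ntile`; it compares a plain `group_by(Height_ntile)` with `group_by(Height_ntile, .add = TRUE)`.

```R
library(dplyr)

# Made-up four-row frame; all rows share Height_ntile = 1
toy <- data.frame(
  Gender       = c("F", "F", "M", "M"),
  Height_ntile = c(1L, 1L, 1L, 1L),
  Weight       = c(50, 60, 80, 90)
)

# Second group_by() without .add: Gender is dropped, so weights are pooled
pooled <- toy %>%
  group_by(Gender) %>%
  group_by(Height_ntile) %>%
  mutate(Weight_ntile = ntile(Weight, 2)) %>%
  ungroup()

# With .add = TRUE: Gender is kept, so ntiles are computed within each Gender
nested <- toy %>%
  group_by(Gender) %>%
  group_by(Height_ntile, .add = TRUE) %>%
  mutate(Weight_ntile = ntile(Weight, 2)) %>%
  ungroup()

print(pooled$Weight_ntile)  # 1 1 2 2
print(nested$Weight_ntile)  # 1 2 1 2
```

In the pooled version both women land in the lower half simply because they are lighter than both men, which is not the nested split the question describes.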
