R: "Protecting" Code Against "arguments imply differing number of rows"

Question

I am working with the R programming language.

Suppose I have the following dataset:

library(dplyr)

set.seed(123)

Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender <- as.factor(gender)

status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status <- as.factor(status)

Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20,  5000, replace = TRUE)

################

disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)

###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)

I am trying to calculate the min/max ranges for the Height, Weight, and Hospital_Visits variables based on 5 ntiles. I did this with the following code:

table_data <- data.frame(
 Groups = paste0("Group ", 1:5),
 Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
 Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
 Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
 Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
 Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), min),
 Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), max)
)

My question: suppose I now want to repeat the above code, but with 5 ntiles for some of the variables and 4 ntiles for others:

table_data <- data.frame(
 Groups = paste0("Group ", 1:5),
 Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
 Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
 Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
 Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
 Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 4), min),
 Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 4), max)
)

I then get the following error:

Error in data.frame(Groups = paste0("Group ", 1:5), Min_Height = tapply(my_data$Height,  : 
 arguments imply differing number of rows: 5, 4

In general, is there something I can do to "protect" my R code from such errors? That is, in situations where a differing number of ntiles is being calculated, can something be done to automatically assign NA values to groups which are not "relevant" for a specific variable? The desired result would look something like this:

   Groups Min_Height Max_Height Min_Weight Max_Weight Min_Visits Max_Visits
1 Group 1   111.5468   141.4839   56.53098   81.83402          1          5
2 Group 2   141.4965   147.4422   81.85064   87.45406          5         10
3 Group 3   147.4487   152.3924   87.45935   92.72041         10         15
4 Group 4   152.4016   158.5178   92.72941   98.54624         16         20
5 Group 5   158.5187   188.4777   98.55533  121.02420         NA         NA

Thanks!

Note: Currently I am doing this manually (i.e. create a separate table for ntile = 4 and ntile = 5) and then merging the results - but ideally I would like to perform all ntile calculations within the same code.

I would have thought that a value of NA would have been inserted on the "Group 5" row for variables where ntile < 5 ... but instead, the entire code does not run now.
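
For context, data.frame() only recycles shorter vectors a whole number of times and never pads them with NA, which is why lengths 5 and 4 fail. As a minimal base-R sketch of the padding idea (the helper pad_range and the toy data are hypothetical, not part of the original post), each tapply() result can be matched against the full set of group numbers so that missing groups become NA:

```r
library(dplyr)  # only needed for ntile()

# Hypothetical helper: apply `fun` (e.g. min or max) within each ntile of `x`,
# then pad the result with NA so it always has `n_rows` entries.
pad_range <- function(x, n_tiles, fun, n_rows) {
  res <- tapply(x, ntile(x, n_tiles), fun)
  # match() returns NA for group numbers that do not exist in `res`,
  # and indexing with NA yields NA.
  as.vector(res)[match(seq_len(n_rows), as.integer(names(res)))]
}

set.seed(1)
toy <- data.frame(
  a = rnorm(100),
  b = sample.int(20, 100, replace = TRUE)
)

n_rows <- 5
padded <- data.frame(
  Groups = paste0("Group ", seq_len(n_rows)),
  Min_a = pad_range(toy$a, 5, min, n_rows),
  Max_a = pad_range(toy$a, 5, max, n_rows),
  Min_b = pad_range(toy$b, 4, min, n_rows),  # only 4 groups, so row 5 is NA
  Max_b = pad_range(toy$b, 4, max, n_rows)
)
padded
```

This keeps the wide layout from the question, at the cost of having to pass the target number of rows to every call.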


Answer 1

Score: 2

I think you're asking yourself "How can I get the answer I want from the data I have?". I think a better question is "How do I construct my data to get the answer I want easily and robustly?".

The answer to the second question is by pivoting your input data. For example:

library(tidyr)  # provides pivot_longer() and pivot_wider()

my_data %>%
  pivot_longer(
    c(Height, Weight, Hospital_Visits), 
    names_to = "Column", 
    values_to = "Value"
  ) 
# A tibble: 15,000 × 6
   Patient_ID Gender Status    Disease Column          Value
        <int> <fct>  <fct>     <fct>   <chr>           <dbl>
 1          1 Female Citizen   No      Height          145. 
 2          1 Female Citizen   No      Weight          114. 
 3          1 Female Citizen   No      Hospital_Visits   1  
 4          2 Male   Immigrant No      Height          161. 
 5          2 Male   Immigrant No      Weight           88.3
 6          2 Male   Immigrant No      Hospital_Visits  18  
 7          3 Female Immigrant Yes     Height          139. 
 8          3 Female Immigrant Yes     Weight           99.3
 9          3 Female Immigrant Yes     Hospital_Visits   6  
10          4 Male   Citizen   No      Height          165. 
# … with 14,990 more rows
# ℹ Use `print(n = ...)` to see more rows

Now we can easily calculate a "by-column" ntile by using group_map. (group_map applies the function passed as its argument to each group of a grouped data frame.)

Conventionally, the function takes two arguments, .x, which contains the data in the current group, and .y which is a single-row tibble that defines the current group.

Setting .keep to TRUE ensures that the group columns remain in .x. By default, they don't. group_map returns a list, so I use bind_rows to combine the results into a single data frame.
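
To make the roles of .x, .y and .keep concrete, here is a small sketch on toy data (the tibble toy and the helper describe are made up for illustration). It reports, for each group, the group key taken from .y, the number of rows in .x, and whether the grouping column is still present in .x:

```r
library(dplyr)

toy <- tibble(
  Column = c("a", "a", "b", "b", "b"),
  Value  = c(1, 2, 10, 20, 30)
)

# Hypothetical helper: .x holds the rows of the current group,
# .y is a one-row tibble holding the group key.
describe <- function(.x, .y) {
  tibble(
    key      = .y$Column,
    n_rows   = nrow(.x),
    key_in_x = "Column" %in% names(.x)
  )
}

# Default: the grouping column is dropped from .x
default <- toy %>% group_by(Column) %>% group_map(describe) %>% bind_rows()

# With .keep = TRUE the grouping column stays in .x
kept <- toy %>% group_by(Column) %>% group_map(describe, .keep = TRUE) %>% bind_rows()

default
kept
```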

Note that I define the desired number of groups for each column in the original data frame in a vector.

nGroups <- c("Height" = 5, "Weight" = 5, "Hospital_Visits" = 4)

my_data %>%
  pivot_longer(
    c(Height, Weight, Hospital_Visits), 
    names_to = "Column", 
    values_to = "Value"
  ) %>%
  group_by(Column) %>%
  group_map(
    function(.x, .y) {
      .x %>% mutate(Group = paste0("Group_", ntile(Value, nGroups[.y$Column])))
    },
    .keep = TRUE
  ) %>%
  bind_rows() 
# A tibble: 15,000 × 7
   Patient_ID Gender Status    Disease Column Value Group  
        <int> <fct>  <fct>     <fct>   <chr>  <dbl> <chr>  
 1          1 Female Citizen   No      Height  145. Group_2
 2          2 Male   Immigrant No      Height  161. Group_5
 3          3 Female Immigrant Yes     Height  139. Group_1
 4          4 Male   Citizen   No      Height  165. Group_5
 5          5 Male   Citizen   Yes     Height  159. Group_5
 6          6 Female Citizen   Yes     Height  153. Group_4
 7          7 Female Citizen   No      Height  156. Group_4
 8          8 Male   Citizen   Yes     Height  152. Group_3
 9          9 Male   Immigrant Yes     Height  146. Group_2
10         10 Female Citizen   No      Height  147. Group_2
# … with 14,990 more rows
# ℹ Use `print(n = ...)` to see more rows

Now I can calculate the summaries you want.

my_data %>%
  pivot_longer(
    c(Height, Weight, Hospital_Visits), 
    names_to = "Column", 
    values_to = "Value"
  ) %>%
  group_by(Column) %>%
  group_map(
    function(.x, .y) {
      .x %>% mutate(Group = paste0("Group_", ntile(Value, nGroups[.y$Column])))
    },
    .keep = TRUE
  ) %>%
  bind_rows() %>%
  group_by(Column, Group) %>%
  summarise(
    Min = min(Value),
    Max = max(Value),
    .groups = "drop"
  )
# A tibble: 14 × 4
   Column          Group     Min   Max
   <chr>           <chr>   <dbl> <dbl>
 1 Height          Group_1 112.  141. 
 2 Height          Group_2 141.  147. 
 3 Height          Group_3 147.  152. 
 4 Height          Group_4 152.  159. 
 5 Height          Group_5 159.  188. 
 6 Hospital_Visits Group_1   1     5  
 7 Hospital_Visits Group_2   5    10  
 8 Hospital_Visits Group_3  10    15  
 9 Hospital_Visits Group_4  16    20  
10 Weight          Group_1  56.5  81.8
11 Weight          Group_2  81.9  87.5
12 Weight          Group_3  87.5  92.7
13 Weight          Group_4  92.7  98.5
14 Weight          Group_5  98.6 121.  
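
Note that the result has 14 rows rather than 15: the Hospital_Visits / Group_5 combination is simply absent. If explicit NA rows are wanted while still in long format, one option is tidyr::complete(). The sketch below uses a tiny made-up summary (long_summary and its values are illustrative only) rather than the real data:

```r
library(tibble)
library(tidyr)

# Toy stand-in for the long summary: "Hospital_Visits" has no Group_2 row
long_summary <- tribble(
  ~Column,           ~Group,    ~Min, ~Max,
  "Height",          "Group_1",  112,  141,
  "Height",          "Group_2",  141,  147,
  "Hospital_Visits", "Group_1",    1,   10
)

# complete() adds every missing Column/Group combination, filling Min and Max with NA
explicit <- complete(long_summary, Column, Group)
explicit
```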

Personally, I'd keep the results in this format for further processing - because it's tidy. But often presentation works better in wider rather than long format. So when ready to present, you can:

my_data %>%
  pivot_longer(
    c(Height, Weight, Hospital_Visits), 
    names_to = "Column", 
    values_to = "Value"
  ) %>%
  group_by(Column) %>%
  group_map(
    function(.x, .y) {
      .x %>% mutate(Group = paste0("Group_", ntile(Value, nGroups[.y$Column])))
    },
    .keep = TRUE
  ) %>%
  bind_rows() %>%
  group_by(Column, Group) %>%
  summarise(
    Min = min(Value),
    Max = max(Value),
    .groups = "drop"
  ) %>%
  pivot_wider(
    id_cols = Group, 
    values_from = c(Min, Max),
    names_from = Column,
    names_glue = '{.value}_{Column}'
  )
# A tibble: 5 × 7
  Group   Min_Height Min_Hospital_Visits Min_Weight Max_Height Max_Hospital_Visits Max_Weight
  <chr>        <dbl>               <dbl>      <dbl>      <dbl>               <dbl>      <dbl>
1 Group_1       112.                   1       56.5       141.                   5       81.8
2 Group_2       141.                   5       81.9       147.                  10       87.5
3 Group_3       147.                  10       87.5       152.                  15       92.7
4 Group_4       152.                  16       92.7       159.                  20       98.5
5 Group_5       159.                  NA       98.6       188.                  NA      121. 

This approach is robust against changes in column names and the desired number of groups. Hence, it "protects" your code as you request. The only thing you need to check if you have different variables in the future is that you change the value of nGroups accordingly. If you have extra columns in nGroups, that's fine. If you have a column that you want to summarise that doesn't have an entry in nGroups, then you'll get an error. But it's easy to protect yourself against that. To really protect yourself, define nGroups at the start of your code and change the initial pivot_longer to

my_data %>%
  pivot_longer(
    all_of(names(nGroups)),  # all_of() makes the use of an external character vector explicit
    names_to = "Column",
    values_to = "Value"
  ) # ... followed by the rest of the pipeline as before

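If the columns to summarise are chosen separately from nGroups, an explicit check up front gives a clearer failure than whatever ntile() reports later. This is only a sketch; the helper name check_n_groups and its error message are assumptions, not part of the original answer:

```r
# Hypothetical helper: stop early, with a readable message, if any column
# to be summarised has no entry in the lookup vector of group counts.
check_n_groups <- function(columns, n_groups) {
  missing_cols <- setdiff(columns, names(n_groups))
  if (length(missing_cols) > 0) {
    stop(
      "No ntile count defined for: ", paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

nGroups <- c("Height" = 5, "Weight" = 5, "Hospital_Visits" = 4)

# Passes silently: both columns have an entry
check_n_groups(c("Height", "Weight"), nGroups)

# "Age" has no entry, so the check fails; capture the message for inspection
result <- tryCatch(
  check_n_groups(c("Height", "Age"), nGroups),
  error = function(e) conditionMessage(e)
)
result
```

Calling the check once before the pipeline keeps the pipeline itself unchanged.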


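# Answer 2
**Score**: 1

It may be easier to pivot the data longer first, calculate the group variable by a call to `ntile()`, and then `group_by()` on the variable name and quantile group. First, we can make the data.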

```r
library(dplyr)
library(tidyr)
set.seed(123)

Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender <- as.factor(gender)

status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status  <- as.factor(status )

Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20,  5000, replace = TRUE)

################

disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)

###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)
```

Below, we actually do the summarizing.

```r
my_data %>%
  pivot_longer(Height:Hospital_Visits, names_to="vbl", values_to="vals") %>%
  group_by(vbl) %>%
  mutate(group = case_when(
    vbl %in% c("Height", "Weight") ~ ntile(vals, 5), 
    vbl %in% c("Hospital_Visits") ~ ntile(vals, 4)), 
    vbl = gsub("Hospital_", "", vbl)) %>%
  group_by(vbl, group) %>%
  reframe(min = min(vals), 
          max=max(vals)) %>%
  pivot_wider(names_from = "vbl", 
              names_glue = "{vbl}_{.value}",
              values_from = c("min", "max"))
#> # A tibble: 5 × 7
#>   group Height_min Visits_min Weight_min Height_max Visits_max Weight_max
#>   <int>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1     1       112.          1       56.5       141.          5       81.8
#> 2     2       141.          5       81.9       147.         10       87.5
#> 3     3       147.         10       87.5       152.         15       92.7
#> 4     4       152.         16       92.7       159.         20       98.5
#> 5     5       159.         NA       98.6       188.         NA      121.
```

Created on 2023-06-25 with reprex v2.0.2

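If the number of ntiles per variable is likely to change, the hard-coded `case_when()` branches could be swapped for a named lookup vector. The following is a sketch of that variation on smaller toy data, not the answer's own code; the names `n_tiles`, `toy` and `ranges` are assumptions made for the example:

```r
library(dplyr)
library(tidyr)

# Hypothetical lookup: variable name -> number of ntiles
n_tiles <- c(Height = 5, Weight = 5, Hospital_Visits = 4)

set.seed(123)
toy <- data.frame(
  Height = rnorm(200, 150, 10),
  Weight = rnorm(200, 90, 10),
  Hospital_Visits = sample.int(20, 200, replace = TRUE)
)

ranges <- toy %>%
  pivot_longer(all_of(names(n_tiles)), names_to = "vbl", values_to = "vals") %>%
  group_by(vbl) %>%
  # vbl is constant within a group, so its first value selects the ntile count
  mutate(group = ntile(vals, n_tiles[[first(vbl)]])) %>%
  group_by(vbl, group) %>%
  summarise(min = min(vals), max = max(vals), .groups = "drop") %>%
  pivot_wider(names_from = "vbl",
              names_glue = "{vbl}_{.value}",
              values_from = c("min", "max"))
ranges
```

Adding a variable or changing its number of groups then only requires editing `n_tiles`.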

huangapple
  • Posted on 2023-06-26 00:11:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76551341.html