R: "Protecting" Code Against "arguments imply differing number of rows"
Question
I am working with the R programming language.
Suppose I have the following dataset:
library(dplyr)
set.seed(123)
Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender <- as.factor(gender)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status <- as.factor(status )
Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20, 5000, replace = TRUE)
################
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)
###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)
I am trying to calculate the min/max ranges for the height, weight and hospital_visit variables based on 5 ntiles. I did this with the following code:
table_data <- data.frame(
  Groups = paste0("Group ", 1:5),
  Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
  Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
  Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
  Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
  Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), min),
  Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), max)
)
My Question: Suppose now I want to repeat the above code, but with 5 ntiles for some of the variables and 4 ntiles for others:
table_data <- data.frame(
  Groups = paste0("Group ", 1:5),
  Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
  Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
  Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
  Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
  Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 4), min),
  Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 4), max)
)
I then get the following error:
Error in data.frame(Groups = paste0("Group ", 1:5), Min_Height = tapply(my_data$Height, :
arguments imply differing number of rows: 5, 4
I would have thought that a value of NA would have been inserted on the "Group 5" row for variables where ntile < 5 ... but instead, the entire code does not run now.
In general, is there something I can do to "protect" my R code from such errors? That is, in situations where I have a differing number of ntiles being calculated - can something be done to automatically assign NA values to groups which are not "relevant" for a specific variable?
   Groups Min_Height Max_Height Min_Weight Max_Weight Min_Visits Max_Visits
1 Group 1   111.5468   141.4839   56.53098   81.83402          1          5
2 Group 2   141.4965   147.4422   81.85064   87.45406          5         10
3 Group 3   147.4487   152.3924   87.45935   92.72041         10         15
4 Group 4   152.4016   158.5178   92.72941   98.54624         16         20
5 Group 5   158.5187   188.4777   98.55533  121.02420         NA         NA
Thanks!
Note: Currently I am doing this manually (i.e. create a separate table for ntile = 4 and ntile = 5) and then merging the results - but ideally I would like to perform all ntile calculations within the same code.
Answer 1
Score: 2
<details>
<summary>English:</summary>
I think you're asking yourself "How can I get the answer I want from the data I have?". I think a better question is "How do I construct my data to get the answer I want easily and robustly?".
The answer to the second question is by pivoting your input data. For example:
library(tidyr)  # provides pivot_longer() and pivot_wider()
my_data %>%
  pivot_longer(
    c(Height, Weight, Hospital_Visits),
    names_to = "Column",
    values_to = "Value"
  )
# A tibble: 15,000 × 6
   Patient_ID Gender Status    Disease Column          Value
        <int> <fct>  <fct>     <fct>   <chr>           <dbl>
 1          1 Female Citizen   No      Height          145.
 2          1 Female Citizen   No      Weight          114.
 3          1 Female Citizen   No      Hospital_Visits   1
 4          2 Male   Immigrant No      Height          161.
 5          2 Male   Immigrant No      Weight           88.3
 6          2 Male   Immigrant No      Hospital_Visits  18
 7          3 Female Immigrant Yes     Height          139.
 8          3 Female Immigrant Yes     Weight           99.3
 9          3 Female Immigrant Yes     Hospital_Visits   6
10          4 Male   Citizen   No      Height          165.
# … with 14,990 more rows
# ℹ Use `print(n = ...)` to see more rows
Now we can easily calculate a "by-column" ntile by using `group_map`. (This function applies the function defined by its argument to each of the current groups of a data frame.)
Conventionally, the function takes two arguments, `.x`, which contains the data in the current group, and `.y` which is a single-row tibble that _defines_ the current group.
Setting `.keep` to `TRUE` ensures that the group columns remain in `.x`. By default, they don't. `group_map` returns a list, so I use `bind_rows` to combine the results into a single data frame.
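To make the roles of `.x`, `.y` and `.keep` concrete, here is a minimal sketch on a toy data frame. The names `toy`, `key`, `val` and `res` are made up for illustration and are not part of the question's data.

```r
library(dplyr)

# Toy data with two groups, "a" and "b" (hypothetical, for illustration only)
toy <- data.frame(key = c("a", "a", "b"), val = c(1, 2, 10))

res <- toy %>%
  group_by(key) %>%
  group_map(
    function(.x, .y) {
      # .x holds the rows of the current group,
      # .y is a one-row tibble holding the group key
      data.frame(
        key = .y$key,
        n_rows = nrow(.x),
        has_key_col = "key" %in% names(.x)  # TRUE only because .keep = TRUE
      )
    },
    .keep = TRUE
  ) %>%
  bind_rows()  # group_map() returns a list, one element per group

res
```

With `.keep = FALSE` (the default), the `has_key_col` column in this sketch would be `FALSE`, because the grouping column is dropped from `.x`.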
Note that I define the desired number of groups for each column in the original data frame in a vector.
nGroups <- c("Height" = 5, "Weight" = 5, "Hospital_Visits" = 4)
my_data %>%
  pivot_longer(
    c(Height, Weight, Hospital_Visits),
    names_to = "Column",
    values_to = "Value"
  ) %>%
  group_by(Column) %>%
  group_map(
    function(.x, .y) {
      .x %>% mutate(Group = paste0("Group_", ntile(Value, nGroups[.y$Column])))
    },
    .keep = TRUE
  ) %>%
  bind_rows()
# A tibble: 15,000 × 7
   Patient_ID Gender Status    Disease Column Value Group
        <int> <fct>  <fct>     <fct>   <chr>  <dbl> <chr>
 1          1 Female Citizen   No      Height  145. Group_2
 2          2 Male   Immigrant No      Height  161. Group_5
 3          3 Female Immigrant Yes     Height  139. Group_1
 4          4 Male   Citizen   No      Height  165. Group_5
 5          5 Male   Citizen   Yes     Height  159. Group_5
 6          6 Female Citizen   Yes     Height  153. Group_4
 7          7 Female Citizen   No      Height  156. Group_4
 8          8 Male   Citizen   Yes     Height  152. Group_3
 9          9 Male   Immigrant Yes     Height  146. Group_2
10         10 Female Citizen   No      Height  147. Group_2
# … with 14,990 more rows
# ℹ Use `print(n = ...)` to see more rows
Now I can calculate the summaries you want.
my_data %>%
  pivot_longer(
    c(Height, Weight, Hospital_Visits),
    names_to = "Column",
    values_to = "Value"
  ) %>%
  group_by(Column) %>%
  group_map(
    function(.x, .y) {
      .x %>% mutate(Group = paste0("Group_", ntile(Value, nGroups[.y$Column])))
    },
    .keep = TRUE
  ) %>%
  bind_rows() %>%
  group_by(Column, Group) %>%
  summarise(
    Min = min(Value),
    Max = max(Value),
    .groups = "drop"
  )
# A tibble: 14 × 4
   Column          Group     Min   Max
   <chr>           <chr>   <dbl> <dbl>
 1 Height          Group_1 112.  141.
 2 Height          Group_2 141.  147.
 3 Height          Group_3 147.  152.
 4 Height          Group_4 152.  159.
 5 Height          Group_5 159.  188.
 6 Hospital_Visits Group_1   1     5
 7 Hospital_Visits Group_2   5    10
 8 Hospital_Visits Group_3  10    15
 9 Hospital_Visits Group_4  16    20
10 Weight          Group_1  56.5  81.8
11 Weight          Group_2  81.9  87.5
12 Weight          Group_3  87.5  92.7
13 Weight          Group_4  92.7  98.5
14 Weight          Group_5  98.6 121.
Personally, I'd keep the results in this format for further processing - because it's tidy. But often _presentation_ works better in wider rather than long format. So when ready to present, you can:
my_data %>%
  pivot_longer(
    c(Height, Weight, Hospital_Visits),
    names_to = "Column",
    values_to = "Value"
  ) %>%
  group_by(Column) %>%
  group_map(
    function(.x, .y) {
      .x %>% mutate(Group = paste0("Group_", ntile(Value, nGroups[.y$Column])))
    },
    .keep = TRUE
  ) %>%
  bind_rows() %>%
  group_by(Column, Group) %>%
  summarise(
    Min = min(Value),
    Max = max(Value),
    .groups = "drop"
  ) %>%
  pivot_wider(
    id_cols = Group,
    values_from = c(Min, Max),
    names_from = Column,
    names_glue = '{.value}_{Column}'
  )
# A tibble: 5 × 7
  Group   Min_Height Min_Hospital_Visits Min_Weight Max_Height Max_Hospital_Visits Max_Weight
  <chr>        <dbl>               <dbl>      <dbl>      <dbl>               <dbl>      <dbl>
1 Group_1       112.                   1       56.5       141.                   5       81.8
2 Group_2       141.                   5       81.9       147.                  10       87.5
3 Group_3       147.                  10       87.5       152.                  15       92.7
4 Group_4       152.                  16       92.7       159.                  20       98.5
5 Group_5       159.                  NA       98.6       188.                  NA      121.
This approach is robust against changes in column names and the desired number of groups, so it "protects" your code as you requested. The only thing you need to check if you have different variables in the future is that you update `nGroups` accordingly. If you have extra columns in `nGroups`, that's fine. If you want to summarise a column that doesn't have an entry in `nGroups`, you'll get an error, but it's easy to protect yourself against that. To *really* protect yourself, define `nGroups` at the start of your code and change the initial `pivot_longer` to
my_data %>%
pivot_longer(
names(nGroups),
names_to = "Column",
values_to = "Value"
) ...
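If you prefer to keep listing the columns explicitly, one way to guard against a missing entry is an up-front check. The sketch below uses only base R. `check_n_groups` is a hypothetical helper name introduced for illustration, not a function from dplyr or tidyr.

```r
# Hypothetical helper (an assumption, not a dplyr/tidyr function):
# stop early if any column to be summarised has no entry in nGroups.
check_n_groups <- function(columns, nGroups) {
  missing_cols <- setdiff(columns, names(nGroups))
  if (length(missing_cols) > 0) {
    stop("No ntile count defined for: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

nGroups <- c("Height" = 5, "Weight" = 5, "Hospital_Visits" = 4)

# Passes silently: both columns have an entry in nGroups
check_n_groups(c("Height", "Hospital_Visits"), nGroups)

# Would stop with an informative message, because "Age" has no entry:
# check_n_groups(c("Height", "Age"), nGroups)
```

Calling the check before the pipeline turns a confusing downstream failure into a clear message naming the offending column.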
</details>
# Answer 2
**Score**: 1
It may be easier to pivot the data longer first, calculate the group variable by a call to `ntile()` and then `group_by()` on the variable name and quantile group. First, we can make the data.
library(dplyr)
library(tidyr)
set.seed(123)
Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender <- as.factor(gender)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status <- as.factor(status )
Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20, 5000, replace = TRUE)
################
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)
###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)
Below, we actually do the summarizing.
my_data %>%
  pivot_longer(Height:Hospital_Visits, names_to = "vbl", values_to = "vals") %>%
  group_by(vbl) %>%
  mutate(group = case_when(
           vbl %in% c("Height", "Weight") ~ ntile(vals, 5),
           vbl %in% c("Hospital_Visits") ~ ntile(vals, 4)),
         vbl = gsub("Hospital_", "", vbl)) %>%
  group_by(vbl, group) %>%
  reframe(min = min(vals),
          max = max(vals)) %>%
  pivot_wider(names_from = "vbl",
              names_glue = "{vbl}_{.value}",
              values_from = c("min", "max"))
#> # A tibble: 5 × 7
#>   group Height_min Visits_min Weight_min Height_max Visits_max Weight_max
#>   <int>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1     1       112.          1       56.5       141.          5       81.8
#> 2     2       141.          5       81.9       147.         10       87.5
#> 3     3       147.         10       87.5       152.         15       92.7
#> 4     4       152.         16       92.7       159.         20       98.5
#> 5     5       159.         NA       98.6       188.         NA      121.
<sup>Created on 2023-06-25 with reprex v2.0.2</sup>
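For comparison, here is a minimal base R sketch that stays close to the original `tapply()` code. `data.frame()` only recycles a column when its length divides the longest column evenly, which is why lengths 5 and 4 fail. Padding each result with `NA` up to a common length avoids that. `pad_na` and `bins` are hypothetical helpers introduced for illustration. `bins` is a rough stand-in for `dplyr::ntile()` and is not guaranteed to assign groups identically. The data here is a small simulated sample, not the question's `my_data`.

```r
# Hypothetical helper: pad a vector with NA up to length n.
pad_na <- function(x, n) {
  x <- as.vector(x)  # drop tapply()'s dim/names so we have a plain vector
  length(x) <- n     # extending a vector fills the new slots with NA
  x
}

# Rough base R stand-in for dplyr::ntile(): k rank-based groups
bins <- function(x, k) {
  cut(rank(x, ties.method = "first"), breaks = k, labels = FALSE)
}

set.seed(1)
height <- rnorm(100, 150, 10)
visits <- sample.int(20, 100, replace = TRUE)

n_max <- 5  # the largest number of groups used for any variable
table_data <- data.frame(
  Groups     = paste0("Group ", seq_len(n_max)),
  Min_Height = pad_na(tapply(height, bins(height, 5), min), n_max),
  Max_Height = pad_na(tapply(height, bins(height, 5), max), n_max),
  Min_Visits = pad_na(tapply(visits, bins(visits, 4), min), n_max),
  Max_Visits = pad_na(tapply(visits, bins(visits, 4), max), n_max)
)

table_data
```

Note that `length(x) <- n` truncates when `n` is smaller than the current length, so `n_max` should be the largest group count in use. The pivoting approaches above avoid this bookkeeping, which is one reason to prefer them.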