R: Splitting a Dataset Into Parts Based on The Current Order
Question
I am working with the R programming language.
I have the following dataset:
set.seed(123)
library(dplyr)
Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status)
height = rnorm(5000, 150, 10)
weight = rnorm(5000, 90, 10)
hospital_visits = sample.int(20, 5000, replace = TRUE)
disease = sample(c(TRUE, FALSE), 5000, replace = TRUE)
my_data = data.frame(Patient_ID, gender, status, height, weight, hospital_visits, disease)
My Question: Suppose I select all rows with male immigrants - I want to split this resulting dataset into 4 parts such that the top 25% tallest are in one dataset, the second 25% tallest are in another dataset, etc.
Currently, I am doing this manually:
part_1 = my_data[my_data$gender == "Male" & my_data$status == "Immigrant",]
part_1 = part_1 %>% arrange(desc(height))
limits = as.integer(seq(1, nrow(part_1), by = 0.25*nrow(part_1)))
limits = c(limits, nrow(part_1))
> limits
[1] 1 163 326 489 651
h_1 = part_1[1:163,]
h_2 = part_1[164:326,]
h_3 = part_1[327:489,]
h_4 = part_1[490:651,]
I can verify these results manually by making sure that there is no overlap in heights:
> range(h_1$height)
[1] 157.3934 182.7591
> range(h_2$height)
[1] 149.8167 157.3084
> range(h_3$height)
[1] 143.8353 149.7927
> range(h_4$height)
[1] 111.5468 143.8141
My Problem: I would now like to do this using a function/loop (i.e. in case I want to make more than 4 splits). But when I try doing this using a function/loop, the splits are not being made properly (i.e. note the overlapping ranges):
create_h <- function(part_1) {
  limits = as.integer(seq(1, nrow(part_1), by = 0.25*nrow(part_1)))
  limits = c(limits, nrow(part_1))
  h_list = list()
  for (i in 1:(length(limits)-1)) {
    h_list[[i]] = part_1[(limits[i]+1):limits[i+1],]
  }
  return(h_list)
}
for (i in seq_along(h_list)) {
  assign(paste0("h_", i), h_list[[i]])
}
> range(h_1$height)
[1] 120.7571 182.7591
> range(h_2$height)
[1] 115.3251 178.7666
> range(h_3$height)
[1] 125.9139 173.9946
> range(h_4$height)
[1] 124.3712 173.7773
Can someone please show me what I am doing wrong and what I can do to fix this?
Thanks!
Answer 1
Score: 1
I can't reproduce your problem. You don't show how you call create_h, but I assume it's something like this. We can then verify the ranges (easier to do in the list than after assigning):
h_list = create_h(part_1)
lapply(h_list, \(x) range(x$height))
# [[1]]
# [1] 157.3934 178.7666
#
# [[2]]
# [1] 149.8167 157.3084
#
# [[3]]
# [1] 143.8353 149.7927
#
# [[4]]
# [1] 111.5468 143.8141
My best guess as to the problem is that you have an old version of h_list in your environment and never updated it after debugging your function.
You do have one bug: the first row is omitted. Even when i = 1, the line h_list[[i]] = part_1[(limits[i]+1):limits[i+1],] starts at limits[i]+1, which is 2. That is why the first range above tops out at 178.7666 instead of 182.7591.
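One way to avoid that off-by-one is to start the break points at 0 instead of 1, so that the `+ 1` lands on row 1 for the first chunk. The following is only a sketch, not the code from the question: the name create_h_fixed and the n_parts argument are made up for illustration, and it assumes the data frame is already sorted the way you want it split.

```r
# Hypothetical variant of create_h(); the name and the n_parts argument are
# assumptions for illustration. Expects sorted_df to be sorted already.
create_h_fixed <- function(sorted_df, n_parts = 4) {
  n <- nrow(sorted_df)
  stopifnot(n_parts >= 1, n_parts <= n)
  # n_parts + 1 break points running from 0 to n;
  # chunk i covers rows (breaks[i] + 1) through breaks[i + 1]
  breaks <- floor(seq(0, n, length.out = n_parts + 1))
  h_list <- vector("list", n_parts)
  for (i in seq_len(n_parts)) {
    h_list[[i]] <- sorted_df[(breaks[i] + 1):breaks[i + 1], , drop = FALSE]
  }
  h_list
}

# Toy check on 10 rows: every row ends up in exactly one chunk
toy <- data.frame(id = 1:10, height = seq(190, 100, by = -10))
parts <- create_h_fixed(toy, n_parts = 4)
sapply(parts, nrow)
# [1] 2 3 2 3
```

Because the break points come from rounding down, the chunk sizes are only roughly equal and will not necessarily match the 163/163/163/162 split from the manual version. Remember to assign the result (for example h_list = create_h_fixed(part_1)) before running the assign() loop.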
Instead of this approach, we can use the convenience function dplyr::ntile to simplify everything:
n_quant = 4
h_list2 = part_1 |>
  group_by(n_tile = ntile(n = n_quant)) |>
  group_split()
## verify
lapply(h_list2, \(x) c(n = nrow(x), range = range(x$height)))
# [[1]]
# n range1 range2
# 163.0000 157.3934 182.7591
#
# [[2]]
# n range1 range2
# 163.0000 149.8167 157.3084
#
# [[3]]
# n range1 range2
# 163.0000 143.8353 149.7927
#
# [[4]]
# n range1 range2
# 162.0000 111.5468 143.8141
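Note that ntile(n = n_quant) falls back on its default x = row_number(), so it buckets rows by their current order and relies on part_1 having been arranged by desc(height) beforehand. If you would rather not depend on the row order, you can hand ntile the ranking variable directly. A small sketch of that idea follows; the toy data frame and the column name height_group are made up for illustration.

```r
library(dplyr)

# Toy data standing in for part_1 (an assumption); deliberately NOT sorted
set.seed(1)
toy <- data.frame(id = 1:20, height = rnorm(20, 150, 10))

n_quant <- 4
toy_parts <- toy |>
  # desc(height) makes bucket 1 the tallest group, whatever the row order is
  mutate(height_group = ntile(desc(height), n_quant)) |>
  group_split(height_group)

# One column per bucket: row 1 is the minimum height, row 2 the maximum
sapply(toy_parts, \(x) range(x$height))
```

Each bucket's minimum should be no smaller than the next bucket's maximum, which is the same no-overlap check used above.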