R: Splitting a Dataset Into Parts Based on The Current Order

Question

I am working with the R programming language.

I have the following dataset:

set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)

status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status)

height = rnorm(5000, 150, 10)
weight = rnorm(5000, 90, 10)
hospital_visits = sample.int(20,  5000, replace = TRUE)

disease = sample(c(TRUE, FALSE), 5000, replace = TRUE)

my_data = data.frame(Patient_ID, gender, status, height, weight, hospital_visits, disease)

My Question: Suppose I select all rows with male immigrants - I want to split this resulting dataset into 4 parts such that the top 25% tallest are in one dataset, the second 25% tallest are in another dataset, etc.

Currently, I am doing this manually:

part_1 = my_data[my_data$gender == "Male" & my_data$status == "Immigrant",]
part_1 = part_1 %>% arrange(desc(height))

limits = as.integer(seq(1, nrow(part_1), by = 0.25*nrow(part_1)))
limits = c(limits, nrow(part_1))

> limits
[1]   1 163 326 489 651

h_1 = part_1[1:163,]
h_2 = part_1[164:326,]
h_3 = part_1[327:489,]
h_4 = part_1[490:651,]

I can verify these results manually by making sure that there is no overlap in heights:

> range(h_1$height)
[1] 157.3934 182.7591
> range(h_2$height)
[1] 149.8167 157.3084
> range(h_3$height)
[1] 143.8353 149.7927
> range(h_4$height)
[1] 111.5468 143.8141

My Problem: I would now like to do this using a function/loop (i.e. in case I want to make more than 4 splits). But when I try doing this using a function/loop, the splits are not being made properly (i.e. note the overlapping ranges):

create_h <- function(part_1) {
    limits = as.integer(seq(1, nrow(part_1), by = 0.25*nrow(part_1)))
    limits = c(limits, nrow(part_1))
    h_list = list()
    for (i in 1:(length(limits)-1)) {
        h_list[[i]] = part_1[(limits[i]+1):limits[i+1],]
    }
    return(h_list)
}

for (i in seq_along(h_list)) {
    assign(paste0("h_", i), h_list[[i]])
}

> range(h_1$height)
[1] 120.7571 182.7591
> range(h_2$height)
[1] 115.3251 178.7666
> range(h_3$height)
[1] 125.9139 173.9946
> range(h_4$height)
[1] 124.3712 173.7773

Can someone please show me what I am doing wrong and what I can do to fix this?

Thanks!

Answer 1

Score: 1

I can't reproduce your problem. You don't show how you call create_h, but I assume it's something like the call below. We can then verify the ranges (easier to do on the list than after assigning each element to its own variable):

h_list = create_h(part_1)
lapply(h_list, \(x) range(x$height))
# [[1]]
# [1] 157.3934 178.7666
# 
# [[2]]
# [1] 149.8167 157.3084
# 
# [[3]]
# [1] 143.8353 149.7927
# 
# [[4]]
# [1] 111.5468 143.8141

My best guess as to the problem is that you have an old version of h_list in your environment and you never updated it after debugging your function.

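You do have one bug: the first row is always omitted. Even when i = 1, the line h_list[[i]] = part_1[(limits[i]+1):limits[i+1],] starts at limits[1] + 1 = 2, so row 1 never makes it into any part. That is why the first range above tops out at 178.7666 rather than 182.7591.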
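A minimal sketch of one way to remove that off-by-one, assuming the data frame is already sorted by height and has at least as many rows as parts. The `n_parts` argument and the toy data are illustrative additions, not part of the original question; only base R is used.

```r
# Sketch of a corrected splitter. `n_parts` is a hypothetical argument added
# so the number of splits is not hard-coded to 4.
create_h <- function(part_1, n_parts = 4) {
    n <- nrow(part_1)
    stopifnot(n >= n_parts)
    # Break points start at 0 and end at n, so part i covers rows
    # (breaks[i] + 1):breaks[i + 1] and row 1 belongs to the first part.
    breaks <- floor(seq(0, n, length.out = n_parts + 1))
    h_list <- vector("list", n_parts)
    for (i in seq_len(n_parts)) {
        h_list[[i]] <- part_1[(breaks[i] + 1):breaks[i + 1], , drop = FALSE]
    }
    h_list
}

# Toy data (assumed for illustration), sorted tallest first
set.seed(1)
toy <- data.frame(id = 1:10, height = rnorm(10, 150, 10))
toy <- toy[order(-toy$height), ]

h_list <- create_h(toy, n_parts = 4)
sapply(h_list, nrow)                         # part sizes; they sum to nrow(toy)
lapply(h_list, function(x) range(x$height))  # ranges to compare across parts
```

Starting the break points at 0 instead of 1 keeps the `+ 1` in the index without skipping the first row. Part sizes may differ by one row when the row count is not divisible by `n_parts`.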

Instead of this approach, we can use the convenience function dplyr::ntile to simplify everything:

n_quant = 4
h_list2 = part_1 |>
  group_by(n_tile = ntile(n = n_quant)) |>
  group_split()

## verify
lapply(h_list2, \(x) c(n = nrow(x), range = range(x$height)))
# [[1]]
#        n   range1   range2 
# 163.0000 157.3934 182.7591 
# 
# [[2]]
#        n   range1   range2 
# 163.0000 149.8167 157.3084 
# 
# [[3]]
#        n   range1   range2 
# 163.0000 143.8353 149.7927 
# 
# [[4]]
#        n   range1   range2 
# 162.0000 111.5468 143.8141
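The non-overlap check can also be automated rather than read off the printed ranges. The sketch below is one possible way to do it in base R; `parts_are_ordered` is a hypothetical helper name and the heights are made-up toy values, not the data from the question.

```r
# Hypothetical helper: TRUE if every part's shortest row is at least as tall
# as the next part's tallest row, i.e. the parts are in descending order and
# their ranges do not overlap (ties at the boundary are allowed).
parts_are_ordered <- function(parts, col = "height") {
    mins <- vapply(parts, function(x) min(x[[col]]), numeric(1))
    maxs <- vapply(parts, function(x) max(x[[col]]), numeric(1))
    k <- length(parts)
    if (k < 2) return(TRUE)
    all(mins[-k] >= maxs[-1])
}

# Toy heights (assumed for illustration)
toy <- data.frame(height = c(150, 172, 148, 181, 160, 139,
                             155, 166, 143, 175, 158, 162))
sorted <- toy[order(-toy$height), , drop = FALSE]

good <- split(sorted, rep(1:4, each = 3))  # split after sorting by height
bad  <- split(toy,    rep(1:4, each = 3))  # split in the original row order

parts_are_ordered(good)  # TRUE: each part sits entirely above the next
parts_are_ordered(bad)   # FALSE: the unsorted parts overlap
```

Overlapping ranges like the ones in the question are what a split of unsorted rows looks like, so a check of this kind would flag a list built from the wrong object straight away.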

huangapple
  • Posted on July 10, 2023, 11:15:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76650486.html