July 10, 2023, 11:15:29


R: Splitting a Dataset Into Parts Based on The Current Order

Question

I am working with the R programming language.

I have the following dataset:

set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)

status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status <- as.factor(status)

height = rnorm(5000, 150, 10)
weight = rnorm(5000, 90, 10)
hospital_visits = sample.int(20,  5000, replace = TRUE)

disease = sample(c(TRUE, FALSE), 5000, replace = TRUE)

my_data = data.frame(Patient_ID, gender, status, height, weight, hospital_visits, disease)

My Question: Suppose I select all rows with male immigrants - I want to split this resulting dataset into 4 parts such that the top 25% tallest are in one dataset, the second 25% tallest are in another dataset, etc.

Currently, I am doing this manually:

part_1 = my_data[my_data$gender == "Male" & my_data$status == "Immigrant",]
part_1 = part_1 %>% arrange(desc(height))

limits = as.integer(seq(1, nrow(part_1), by = 0.25*nrow(part_1)))
limits = c(limits, nrow(part_1))

> limits
[1]   1 163 326 489 651

h_1 = part_1[1:163,]
h_2 = part_1[164:326,]
h_3 = part_1[327:489,]
h_4 = part_1[490:651,]

I can verify these results manually by making sure that there is no overlap in heights:

> range(h_1$height)
[1] 157.3934 182.7591
> range(h_2$height)
[1] 149.8167 157.3084
> range(h_3$height)
[1] 143.8353 149.7927
> range(h_4$height)
[1] 111.5468 143.8141

My Problem: I would now like to do this using a function/loop (e.g. in case I want to make more than 4 splits). But when I try doing this using a function/loop, the splits are not being made properly (note the overlapping ranges):

create_h <- function(part_1) {
    limits = as.integer(seq(1, nrow(part_1), by = 0.25*nrow(part_1)))
    limits = c(limits, nrow(part_1))
    h_list = list()
    for (i in 1:(length(limits)-1)) {
        h_list[[i]] = part_1[(limits[i]+1):limits[i+1],]
    }
    return(h_list)
}

for (i in seq_along(h_list)) {
    assign(paste0("h_", i), h_list[[i]])
}

> range(h_1$height)
[1] 120.7571 182.7591
> range(h_2$height)
[1] 115.3251 178.7666
> range(h_3$height)
[1] 125.9139 173.9946
> range(h_4$height)
[1] 124.3712 173.7773

Can someone please show me what I am doing wrong and what I can do to fix this?

Thanks!

Answer 1

Score: 1

I can't reproduce your problem. You don't show how you call create_h, but I assume it's something like this. We can then verify the ranges (easier to do on the list than after assigning):

h_list = create_h(part_1)
lapply(h_list, \(x) range(x$height))
# [[1]]
# [1] 157.3934 178.7666
# 
# [[2]]
# [1] 149.8167 157.3084
# 
# [[3]]
# [1] 143.8353 149.7927
# 
# [[4]]
# [1] 111.5468 143.8141

My best guess as to the problem is that you have an old version of h_list in your environment and you never updated it after debugging your function.

You do have one bug: the first row is omitted. Even when i = 1, you have h_list[[i]] = part_1[(limits[i]+1):limits[i+1],], and since limits[1] is 1, (limits[i]+1) starts at 2. That is why the first range above tops out at 178.7666 rather than 182.7591.
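If you want to keep the loop-style function, one way to avoid the off-by-one is to build the boundaries from 0 instead of 1, so that part i covers rows (limits[i] + 1) through limits[i + 1]. The following is only a sketch in base R: the n_parts argument and the toy data frame are my own assumptions for illustration, not something from the question, and the function assumes its input is already sorted.

```r
# Hypothetical rewrite of create_h(); n_parts is an assumed extra argument.
# Assumes sorted_df is already ordered (e.g. by descending height).
create_h <- function(sorted_df, n_parts = 4) {
  n <- nrow(sorted_df)
  stopifnot(n_parts >= 1, n >= n_parts)
  # Integer boundaries 0, ..., n; part i covers rows (limits[i] + 1):limits[i + 1]
  limits <- (0:n_parts * n) %/% n_parts
  lapply(seq_len(n_parts), function(i) {
    sorted_df[(limits[i] + 1):limits[i + 1], , drop = FALSE]
  })
}

# Toy stand-in for part_1 (not the data from the question)
set.seed(1)
toy <- data.frame(id = 1:10,
                  height = sort(rnorm(10, 150, 10), decreasing = TRUE))

h_list <- create_h(toy, n_parts = 4)
sapply(h_list, nrow)                           # rows per part
lapply(h_list, function(x) range(x$height))    # ranges should not overlap
```

Keeping the parts in h_list rather than assigning h_1, h_2, ... also makes it harder to end up inspecting stale objects left over from an earlier run. Note that the part sizes can differ slightly from the manual split in the question, because the boundaries are rounded differently.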

Instead of this approach, we can use the convenience function dplyr::ntile to simplify everything:

n_quant = 4
h_list2 = part_1 |>
  group_by(n_tile = ntile(n = n_quant)) |>
  group_split()

## verify
lapply(h_list2, \(x) c(n = nrow(x), range = range(x$height)))
# [[1]]
#        n   range1   range2 
# 163.0000 157.3934 182.7591 
# 
# [[2]]
#        n   range1   range2 
# 163.0000 149.8167 157.3084 
# 
# [[3]]
#        n   range1   range2 
# 163.0000 143.8353 149.7927 
# 
# [[4]]
#        n   range1   range2 
# 162.0000 111.5468 143.8141
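One caveat: as far as I can tell, ntile(n = n_quant) with no first argument buckets by row_number(), so the result depends on part_1 already being sorted by height. If you would rather not rely on the current row order, you can bucket on the rank of the column itself. Below is a base R sketch of that idea; split_by_rank and the toy data are hypothetical names of mine, not part of dplyr or of the question. In dplyr, the analogous change would be to pass desc(height) as the first argument of ntile.

```r
# Hypothetical helper: split df into n_parts by rank of one column, largest first.
split_by_rank <- function(df, column, n_parts = 4) {
  n <- nrow(df)
  # Rank 1 = largest value; ties are broken by order of appearance
  r <- rank(-df[[column]], ties.method = "first")
  # Map ranks 1..n onto group labels 1..n_parts
  group <- ceiling(r * n_parts / n)
  split(df, group)
}

# Toy data, deliberately left unsorted
set.seed(1)
toy <- data.frame(id = 1:10, height = rnorm(10, 150, 10))

parts <- split_by_rank(toy, "height", n_parts = 4)
lapply(parts, function(x) c(n = nrow(x), range = range(x$height)))
```

Rows inside each part keep their original order here; sort each part afterwards if you need it ordered by height.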
