Split list based on rows of list items
Question
I'm trying to split my list of data frames into subgroups, such as a nested list or several separate lists. The split should be based on the number of rows per data frame, so data frames with the same number of rows end up in the same list.
full_list <- list(
df1 = replicate(10, sample(0:1, 10, replace = TRUE)),
df2 = replicate(10, sample(0:1, 15, replace = TRUE)),
df3 = replicate(10, sample(0:1, 20, replace = TRUE)),
df4 = replicate(10, sample(0:1, 10, replace = TRUE))
)
There are now two data frames with nrow() == 10, so they should end up in their own list or sublist.
I tried something like this, but I don't think split is applicable to lists:
sublist <- lapply(full_list, function(x) split(full_list, f = nrow(x)))
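For what it's worth, split does accept a list; the trouble in the call above appears to be the grouping argument. Inside lapply, nrow(x) is a single number, so split recycles it across the whole list and every table lands in one group. A minimal sketch that reproduces this, using the same full_list as above (set.seed is added only for reproducibility):

```r
set.seed(1)  # only so the sketch is reproducible
full_list <- list(
  df1 = replicate(10, sample(0:1, 10, replace = TRUE)),
  df2 = replicate(10, sample(0:1, 15, replace = TRUE)),
  df3 = replicate(10, sample(0:1, 20, replace = TRUE)),
  df4 = replicate(10, sample(0:1, 10, replace = TRUE))
)

# The attempt from the question: f is one number per call, so it is recycled
sublist <- lapply(full_list, function(x) split(full_list, f = nrow(x)))

length(sublist)              # one result per table, not one per row count
names(sublist$df2)           # a single group, named after df2's row count
length(sublist$df2[["15"]])  # that single group still holds every table
```

The grouping argument needs one value per list element rather than one value per call.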
BTW: The greater goal is to split all data frames into a training and a test data set for machine learning with the function below. sample will be used to create the subsets, but I want the same sample_vector for data frames with the same number of rows. Therefore, I want to split the full list into sublists beforehand. Afterwards I will put all data frames together again for further processing (a kind of split - apply - combine). Just mentioning this in case I am overcomplicating things here.
# function to split data frames in each sub list into train and test data frames
counter <- 0
train_test_list <- list()
for (x_table in sublist) {
  counter <- counter + 1
  current_name <- paste(names(sublist)[counter], sep = "_")
  sample_vector <- sample.int(n = nrow(x_table),
                              size = floor(0.8 * nrow(x_table)), replace = FALSE)
  train_set <- x_table[sample_vector, ]
  test_set <- x_table[-sample_vector, ]
  train_test_list[[current_name]] <- list(
    train_set = train_set, test_set = test_set,
    table_name = names(sublist)[counter]
  )
}
# combine all lists with test and train pairs back into one list
full_train_test_list <- c(train_test_list1, train_test_list2, train_test_list3, ...)
Answer 1
Score: 4
We can get the number of rows with sapply and split based on that info:
new_list <- split(full_list, sapply(full_list, nrow))
str(new_list)
#List of 3
# $ 10:List of 2
# ..$ df1: int [1:10, 1:10] 1 0 0 1 1 0 1 0 0 1 ...
# ..$ df4: int [1:10, 1:10] 1 0 1 1 1 0 0 0 1 1 ...
# $ 15:List of 1
# ..$ df2: int [1:15, 1:10] 0 1 1 0 0 0 0 0 0 1 ...
# $ 20:List of 1
# ..$ df3: int [1:20, 1:10] 1 1 0 1 0 1 1 1 0 1 ...
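As a small illustration of how the grouped result can be used afterwards: the group names are the row counts converted to character strings, so a group is looked up with a string key. The sketch below rebuilds full_list so it runs on its own; vapply is swapped in for sapply only because it enforces an integer result, and n_rows is an illustrative name.

```r
set.seed(1)  # only so the sketch is reproducible
full_list <- list(
  df1 = replicate(10, sample(0:1, 10, replace = TRUE)),
  df2 = replicate(10, sample(0:1, 15, replace = TRUE)),
  df3 = replicate(10, sample(0:1, 20, replace = TRUE)),
  df4 = replicate(10, sample(0:1, 10, replace = TRUE))
)

# Row count per element; vapply guarantees an integer vector
n_rows <- vapply(full_list, nrow, integer(1))

# split() on a list groups its elements by the levels of the second argument
new_list <- split(full_list, n_rows)

names(new_list)          # group names are character strings
names(new_list[["10"]])  # tables in the 10-row group
lengths(new_list)        # number of tables per group
```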
As it is a nested list, we can do the processing in the inner list by calling lapply inside the first lapply:
traintestlst <- lapply(new_list, function(sublst) lapply(sublst, function(x_table) {
  sample_vector <- sample.int(n = nrow(x_table),
                              size = floor(0.8 * nrow(x_table)), replace = FALSE)
  train_set <- x_table[sample_vector, ]
  test_set <- x_table[-sample_vector, ]
  list(train_set = train_set, test_set = test_set)
  })
)
Checking the output:
traintestlst[[1]]$df1
#$train_set
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,] 1 1 0 1 0 0 1 1 1 0
#[2,] 1 0 1 1 1 0 0 0 1 0
#[3,] 0 1 0 0 1 1 0 1 1 0
#[4,] 1 1 0 1 0 0 1 0 0 1
#[5,] 0 0 0 1 0 0 1 0 1 0
#[6,] 0 1 1 0 1 0 1 0 1 0
#[7,] 1 0 1 1 0 0 0 0 0 1
#[8,] 0 1 0 0 0 1 0 0 1 0
#$test_set
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,] 0 0 0 0 0 1 0 1 0 1
#[2,] 1 0 0 0 0 0 0 1 1 0
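Note that the nested lapply draws a fresh sample_vector for every table. The question asked for tables with the same number of rows to share one sample_vector, and for everything to end up in a single list again. One possible sketch of that, not part of the original answer: split_group and its frac argument are hypothetical names, and full_list is rebuilt here so the block is self-contained.

```r
set.seed(42)  # only so the sketch is reproducible
full_list <- list(
  df1 = replicate(10, sample(0:1, 10, replace = TRUE)),
  df2 = replicate(10, sample(0:1, 15, replace = TRUE)),
  df3 = replicate(10, sample(0:1, 20, replace = TRUE)),
  df4 = replicate(10, sample(0:1, 10, replace = TRUE))
)
new_list <- split(full_list, sapply(full_list, nrow))

# Hypothetical helper: draw one sample_vector per group and reuse it
# for every table in that group
split_group <- function(sublst, frac = 0.8) {
  n <- nrow(sublst[[1]])  # all tables in a group share this row count
  sample_vector <- sample.int(n = n, size = floor(frac * n), replace = FALSE)
  lapply(sublst, function(x_table) {
    list(
      train_set = x_table[sample_vector, , drop = FALSE],
      test_set = x_table[-sample_vector, , drop = FALSE],
      sample_vector = sample_vector
    )
  })
}

grouped <- lapply(new_list, split_group)

# Combine the groups back into one flat list, in the original table order
full_train_test_list <- do.call(c, unname(grouped))
full_train_test_list <- full_train_test_list[names(full_list)]

# df1 and df4 have the same number of rows, so they share the sample
identical(full_train_test_list$df1$sample_vector,
          full_train_test_list$df4$sample_vector)
```

Dropping the group names with unname before do.call(c, ...) keeps the original table names (df1, df2, ...) instead of prefixed ones such as 10.df1.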