Split list based on rows of list items


Question

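I'm trying to split my list of data frames into subgroups, such as a nested list or several separate lists. The split should be based on the number of rows per data frame, so that data frames with the same number of rows end up in the same list.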

full_list <- list(
  df1 = replicate(10, sample(0:1, 10, replace = TRUE)),
  df2 = replicate(10, sample(0:1, 15, replace = TRUE)),
  df3 = replicate(10, sample(0:1, 20, replace = TRUE)),
  df4 = replicate(10, sample(0:1, 10, replace = TRUE))
)

Two of these data frames have nrow() == 10, so they should end up together in their own list or sublist.

I tried something like this, but I don't think split is applicable to lists:

sublist <- lapply(full_list, function(x) split(full_list, f = nrow(x)))

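By the way, the larger goal is to split every data frame into a training and a test data set for machine learning, using the function below. sample will be used to create the subsets, but I want data frames with the same number of rows to share the same sample_vector. That is why I want to split the full list into sublists beforehand. Afterwards I will put all data frames back together for further processing (a kind of split-apply-combine). I mention this in case I am overcomplicating things.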

# function to split data frames in each sub list into train and test data frames
counter <- 0
train_test_list <- list()
for (x_table in sublist) {
  counter <- counter + 1
  current_name <- paste(names(sublist)[counter], sep = "_")
  sample_vector <- sample.int(n = nrow(x_table), size = floor(0.8 * nrow(x_table)), replace = FALSE)
  train_set <- x_table[sample_vector, ]
  test_set <- x_table[-sample_vector, ]
  train_test_list[[current_name]] <- list(
    train_set = train_set,
    test_set = test_set,
    table_name = names(sublist)[counter]
  )
}

# combine all lists with test and train pairs back into one list
full_train_test_list <- c(train_test_list1, train_test_list2, train_test_list3, ...)

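Answer 1

Score: 4

We can get the number of rows of each list element with sapply and then use that vector as the grouping factor for split.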

new_list <- split(full_list, sapply(full_list, nrow))
str(new_list)
#List of 3
# $ 10:List of 2
#  ..$ df1: int [1:10, 1:10] 1 0 0 1 1 0 1 0 0 1 ...
#  ..$ df4: int [1:10, 1:10] 1 0 1 1 1 0 0 0 1 1 ...
# $ 15:List of 1
#  ..$ df2: int [1:15, 1:10] 0 1 1 0 0 0 0 0 0 1 ...
# $ 20:List of 1
#  ..$ df3: int [1:20, 1:10] 1 1 0 1 0 1 1 1 0 1 ...
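
For anyone who wants to reproduce the grouping step, here is a minimal sketch. The seed value and the use of vapply instead of sapply (to guarantee an integer vector of row counts) are my own assumptions and not part of the original answer.

```r
# Reproducible version of the example data (the seed is an arbitrary assumption)
set.seed(42)
full_list <- list(
  df1 = replicate(10, sample(0:1, 10, replace = TRUE)),
  df2 = replicate(10, sample(0:1, 15, replace = TRUE)),
  df3 = replicate(10, sample(0:1, 20, replace = TRUE)),
  df4 = replicate(10, sample(0:1, 10, replace = TRUE))
)

# One row count per element; vapply enforces an integer result of length 1
row_counts <- vapply(full_list, nrow, integer(1))

# split() treats row_counts as a grouping factor: one sublist per distinct value
new_list <- split(full_list, row_counts)

names(new_list)       # group labels are the row counts as character strings
lengths(new_list)     # number of tables in each group
names(new_list$`10`)  # tables that share 10 rows
```

Because the group labels are character strings that start with a digit, a sublist has to be accessed with backticks (new_list$`10`) or with new_list[["10"]].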

Because the result is a nested list, we can process the inner lists by calling lapply inside the outer lapply:

traintestlst <- lapply(new_list, function(sublst) {
  lapply(sublst, function(x_table) {
    sample_vector <- sample.int(n = nrow(x_table),
                                size = floor(0.8 * nrow(x_table)),
                                replace = FALSE)
    train_set <- x_table[sample_vector, ]
    test_set  <- x_table[-sample_vector, ]
    list(train_set = train_set, test_set = test_set)
  })
})
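
Note that the nested lapply draws a fresh sample_vector for every table. The question asked for tables with the same number of rows to share one sample_vector, so here is a hedged sketch of that variant: the indices are drawn once per group and reused for every table in it. The helper name split_group, its train_fraction argument, the extra train_rows element, and the seed are all hypothetical choices of mine, not part of the original answer.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
full_list <- list(
  df1 = replicate(10, sample(0:1, 10, replace = TRUE)),
  df2 = replicate(10, sample(0:1, 15, replace = TRUE)),
  df4 = replicate(10, sample(0:1, 10, replace = TRUE))
)
new_list <- split(full_list, sapply(full_list, nrow))

# Hypothetical helper: split every table of one group with the same row indices
split_group <- function(sublst, train_fraction = 0.8) {
  n <- nrow(sublst[[1]])  # all tables in a group have the same row count
  sample_vector <- sample.int(n = n, size = floor(train_fraction * n),
                              replace = FALSE)
  lapply(sublst, function(x_table) {
    list(
      train_set  = x_table[sample_vector, , drop = FALSE],
      test_set   = x_table[-sample_vector, , drop = FALSE],
      train_rows = sample_vector  # kept only so the shared indices can be checked
    )
  })
}

shared_split <- lapply(new_list, split_group)

# df1 and df4 were split with the same rows
identical(shared_split$`10`$df1$train_rows, shared_split$`10`$df4$train_rows)
```

Drawing the indices outside the inner lapply is the only change in logic; the train/test subsetting itself is the same as in the answer.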

Checking the output:

traintestlst[[1]]$df1
#$train_set
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,]    1    1    0    1    0    0    1    1    1     0
#[2,]    1    0    1    1    1    0    0    0    1     0
#[3,]    0    1    0    0    1    1    0    1    1     0
#[4,]    1    1    0    1    0    0    1    0    0     1
#[5,]    0    0    0    1    0    0    1    0    1     0
#[6,]    0    1    1    0    1    0    1    0    1     0
#[7,]    1    0    1    1    0    0    0    0    0     1
#[8,]    0    1    0    0    0    1    0    0    1     0

#$test_set
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#[1,]    0    0    0    0    0    1    0    1    0     1
#[2,]    1    0    0    0    0    0    0    1    1     0
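
The question also mentions putting everything back into one list afterwards (the "combine" step). A possible sketch, assuming the nested group -> table -> train/test structure produced above; the small matrices below are stand-ins I made up so the example runs on its own.

```r
# Small stand-in for the nested result: group -> table -> train/test pair
traintestlst <- list(
  `10` = list(
    df1 = list(train_set = matrix(1, 8, 2), test_set = matrix(1, 2, 2)),
    df4 = list(train_set = matrix(4, 8, 2), test_set = matrix(4, 2, 2))
  ),
  `15` = list(
    df2 = list(train_set = matrix(2, 12, 2), test_set = matrix(2, 3, 2))
  )
)

# Drop the group names, then concatenate the sublists into one flat list
full_train_test_list <- do.call(c, unname(traintestlst))

# Optionally restore a chosen order (here: alphabetical by table name)
full_train_test_list <- full_train_test_list[order(names(full_train_test_list))]

names(full_train_test_list)
```

Calling unname first keeps the original table names (df1, df2, ...). Without it, or with unlist(traintestlst, recursive = FALSE), the group label is prefixed to each name (for example "10.df1"), which may or may not be what you want.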
huangapple
  • Posted on January 4, 2020, 01:42:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59583018.html