Computationally efficient way to manipulate the levels of large, deeply nested objects?

Question

I have a list of lists of vectors (not a typo; I am re-confirming that it is in fact a list of lists of vectors) that is 76 million items long. That is, there is a list of 76 million items, where each item is a list of two vectors.

All the vectors are of uniform length (6 items).

For example, the data looks as follows for list_of_list[1:50]:

dput output

list(list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0)), list(c(4, 
4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)), list(c(4, 4, 1, 0, 1, 0
), c(4, 5, 1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(5, 8, 0, 
0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(5, 5, 0, 2, 0, 0)), list(
    c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 0, 0)), list(c(4, 4, 
1, 0, 1, 0), c(4, 5, 1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), 
    c(4, 4, 1, 0, 1, 0)), list(c(4, 4, 1, 0, 1, 0), c(6, 10, 
1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)), 
    list(c(4, 4, 1, 0, 1, 0), c(5, 7, 2, 0, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(40, 10, 0, 15, 8, 0)), list(c(4, 4, 1, 
    0, 1, 0), c(24L, 7L, 6L, 20L, 8L, 1L)), list(c(4, 4, 1, 0, 
    1, 0), c(39L, 22L, 9L, 5L, 8L, 1L)), list(c(4, 4, 1, 0, 1, 
    0), c(34, 36, 17, 15, 0, 2)), list(c(4, 4, 1, 0, 1, 0), c(36L, 
    42L, 18L, 4L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(4, 5, 
    1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(4, 8, 3, 0, 0, 
    0)), list(c(4, 4, 1, 0, 1, 0), c(3, 1, 2, 2, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(6, 9, 0, 1, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(5, 5, 0, 2, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(6, 10, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(6, 
    10, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 15, 0, 0, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 0, 0)), 
    list(c(4, 4, 1, 0, 1, 0), c(4, 2, 1, 2, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(28, 24, 19, 14, 4, 0)), list(c(4, 4, 1, 
    0, 1, 0), c(40, 56, 19, 11, 0, 0)), list(c(4, 4, 1, 0, 1, 
    0), c(32L, 33L, 14L, 17L, 1L, 2L)), list(c(4, 4, 1, 0, 1, 
    0), c(24L, 55L, 11L, 16L, 6L, 1L)), list(c(4, 4, 1, 0, 1, 
    0), c(27, 10, 6, 19, 8, 0)), list(c(4, 4, 1, 0, 1, 0), c(31, 
    21, 11, 19, 4, 0)), list(c(4, 4, 1, 0, 1, 0), c(37L, 60L, 
    12L, 7L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(29L, 8L, 3L, 
    18L, 8L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(21L, 24L, 20L, 
    14L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(6, 10, 1, 0, 0, 
    0)), list(c(4, 4, 1, 0, 1, 0), c(5, 9, 2, 0, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(7, 13, 0, 0, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(6, 12, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(5, 8, 1, 1, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 
    7, 0, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 6, 1, 1, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(4, 3, 0, 3, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(3, 2, 3, 1, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(4, 4, 1, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 
    3, 2, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 7, 0, 2, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 1, 2, 2, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(6, 7, 0, 1, 0, 0)))

Just FYI, the list of lists was made using combn() using this template: combn(focal_list,2,simplify = FALSE)

Is there a computationally efficient way to turn this into a table of two columns where each row is one item from the list of lists? All the first vectors become the first column and all the second vectors become the second column?

I tried the following, and it was still running after 10-12 minutes with no output, which is too expensive for my use case:

dt <- data.table(col1 = lapply(1:length(list_of_list), function(x) list_of_list[[x]][1]),
                 col2 = lapply(1:length(list_of_list), function(x) list_of_list[[x]][2]))

I could use a foreach loop to untangle the deeply nested object and read in the vectors as strings joined by a simple separator character, and then use another foreach loop to create a data.table. But before I do that, is there a simpler way in R that I am missing?

Please note, for clarification, that I want to maintain the vector() nature of the lowest-level items, i.e., when you make a table out of the list of lists, each cell should be a vector and the data.table should have two columns. It seems R likes to flatten vectors and lists when trying to make tables.
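To make the target shape concrete, here is a minimal base R sketch on two items from the sample above. It is only an illustration of a two-column table whose cells are whole vectors (list columns), not a claim about what is fastest at 76 million items. The names sample_list and two_col are made up for the example. Note that `[[` extracts the vector itself, whereas the `[1]` used in the attempt above returns a one-element list.

```r
# Two items copied from the dput() sample in the question
sample_list <- list(
  list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0)),
  list(c(4, 4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0))
)

# Pull out the first and second vector of every item; `[[` returns the
# numeric vector itself rather than a length-1 list
col1 <- lapply(sample_list, `[[`, 1)
col2 <- lapply(sample_list, `[[`, 2)

# I() keeps each list as a single list column instead of expanding it
two_col <- data.frame(col1 = I(col1), col2 = I(col2))

# Each cell is still a full length-6 vector
print(two_col$col2[[2]])
```

Each row corresponds to one item of the original list, and indexing a cell with [[ ]] gives back the untouched vector.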

Answer 1

Score: 1

I think there are several ways to do this, for example:

  • rbindlist + rapply
rbindlist(rapply(list_of_list, list, how = "replace"))
  • as.data.frame + rbind
as.data.frame(do.call(rbind, list_of_list))
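As a small illustration of the base R option, the sketch below runs do.call(rbind, ...) on three items from the question's sample and inspects the result. The name sample_list is made up for the example; the behaviour shown is what I would expect from base R, where rbind() on lists builds a matrix of mode "list" with one vector per cell.

```r
# Three items copied from the dput() sample in the question
sample_list <- list(
  list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0)),
  list(c(4, 4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)),
  list(c(4, 4, 1, 0, 1, 0), c(4, 5, 1, 0, 0, 1))
)

# Each item becomes one row; the result is a 3 x 2 matrix of mode "list"
m <- do.call(rbind, sample_list)
print(dim(m))

# A cell still holds the complete vector
print(m[[2, 2]])

# Converting to a data frame keeps one row per item and two columns
df <- as.data.frame(m)
print(dim(df))
```

Because every cell keeps its vector intact, this matches the requirement in the question that the lowest-level vectors are not flattened.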

However, the second option, i.e., the base R approach as.data.frame + rbind, seems much faster than the first one (see the benchmark below):

library(data.table)
library(microbenchmark)

microbenchmark(
    f1 = rbindlist(rapply(list_of_list, list, how = "replace")),
    f2 = as.data.frame(do.call(rbind, list_of_list)),
    check = "equivalent"
)

which gives

Unit: microseconds
 expr   min    lq    mean median     uq   max neval
   f1 138.7 168.7 177.896 174.10 185.00 392.6   100
   f2  31.7  38.5  45.127  43.55  50.25  88.8   100
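The timings above are in microseconds, so they say little about how either approach scales. A rough way to check on a larger input is sketched below with base R only. Everything here is an assumption for illustration: make_items is a hypothetical helper, the values are random rather than real data, and 1e5 items is an arbitrary size well below the 76 million in the question. No timing result is shown because it depends on the machine.

```r
set.seed(1)

# Hypothetical helper: build n items, each a list of two length-6 numeric vectors
make_items <- function(n) {
  lapply(seq_len(n), function(i) {
    list(as.numeric(sample(0:10, 6, replace = TRUE)),
         as.numeric(sample(0:10, 6, replace = TRUE)))
  })
}

big_list <- make_items(1e5)

# Time the base R approach on the synthetic input
timing <- system.time(
  res <- as.data.frame(do.call(rbind, big_list))
)
print(timing[["elapsed"]])
```

Repeating this for a few sizes (for example 1e4, 1e5, 1e6) gives a sense of how the cost grows before committing to the full list.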

Answer 2

Score: 0

I would suggest you use Rcpp, as in the code below. Since you have 76 million items, I recommend processing the data in batches, e.g., 10 million each. On my computer, it takes 8 seconds to convert 10 million items into a matrix, meaning that if you do this 8 times, it will take approximately 70-80 seconds. Store the matrices from the different batches, then combine them into one, probably by writing them into one file on the hard drive.

Rcpp::cppFunction(
'NumericVector combineList(std::vector< std::vector<std::vector<double>>> x){
	int n = x.size();
	int m = x[0].size();
	int p = x[0][0].size();
	std::vector<double> y(n*p*m);
	for(int i = 0; i < n; i++)
		for(int j = 0; j < m; j++)
			for(int k = 0; k < p; k++)
				// index missing in the source text; reconstructed from Dimension(n*p, m) and the output below
				y[j*n*p + i*p + k] = x[i][j][k];
	NumericVector z = wrap(y);
	z.attr("dim") = Dimension(n*p, m);
	return z;
}'
)

combineList(list_of_lists)
      [,1] [,2]
 [1,]    4    3
 [2,]    4    3
 [3,]    1    2
 [4,]    0    2
 [5,]    1    0
 [6,]    0    0
 [7,]    4    3
 [8,]    4    4
 [9,]    1    3
[10,]    0    1
[11,]    1    0
[12,]    0    0
[13,]    4    4
[14,]    4    5
[15,]    1    1
[16,]    0    0
[17,]    1    0
[18,]    0    1
[19,]    4    5
[20,]    4    8
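For readers without a compiler, the same (n * p) x m layout and the batching idea can be sketched in base R. This is only an illustration under assumptions: combine_list_base and combine_in_batches are hypothetical helpers, every item is assumed to hold the same number of equal-length numeric vectors, and the batch size is arbitrary. Like the Rcpp version, it stacks the vector elements into rows rather than keeping one vector per cell.

```r
# Hypothetical base R counterpart with the same (n * p) x m layout
combine_list_base <- function(x) {
  n <- length(x)            # number of items
  m <- length(x[[1]])       # vectors per item (2 in the question)
  p <- length(x[[1]][[1]])  # length of each vector (6 in the question)
  # unlist() walks elements fastest, then vectors, then items: dims (p, m, n)
  arr <- array(as.numeric(unlist(x, use.names = FALSE)), dim = c(p, m, n))
  # Reorder to (p, n, m) so that column j holds the j-th vector of every item
  matrix(aperm(arr, c(1, 3, 2)), nrow = n * p, ncol = m)
}

# Hypothetical batching helper: convert batch_size items at a time, then stack
combine_in_batches <- function(x, batch_size) {
  starts <- seq(1, length(x), by = batch_size)
  parts <- lapply(starts, function(s) {
    e <- min(s + batch_size - 1, length(x))
    combine_list_base(x[s:e])
  })
  do.call(rbind, parts)
}

# Three items copied from the dput() sample in the question
sample_list <- list(
  list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0)),
  list(c(4, 4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)),
  list(c(4, 4, 1, 0, 1, 0), c(4, 5, 1, 0, 0, 1))
)

whole <- combine_list_base(sample_list)
batched <- combine_in_batches(sample_list, batch_size = 2)
print(whole[1:8, ])
```

Since rows are ordered by item, stacking the per-batch matrices with rbind() should give the same result as converting everything at once. For very large inputs the batches could instead be written to disk one at a time, as the answer suggests.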

Answer 3

Score: 0

I was able to solve this issue fairly easily using this one line of code:

rbindlist(rapply(focal_list, list, how = "replace"))

The fascinating part is that the above code processes all 76 million items in about 2 minutes, with no Rcpp required (I can't say whether the packages use Rcpp under the hood).
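To see why this one-liner keeps each vector intact, it helps to look at what the rapply() step does on its own. The base R sketch below uses two items from the question's sample (sample_list is a made-up name): with how = "replace", every leaf vector is replaced by list(vector), so each item becomes a list of two length-1 lists. As far as I understand, rbindlist() then treats each item as one row and each wrapped vector as a single cell of a list column.

```r
# Two items copied from the dput() sample in the question
sample_list <- list(
  list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0)),
  list(c(4, 4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0))
)

# Apply list() to every leaf and keep the nesting structure
wrapped <- rapply(sample_list, list, how = "replace")

# Each former vector is now a length-1 list holding that vector
str(wrapped[[1]])
print(wrapped[[2]][[2]][[1]])
```

The extra level of wrapping is what stops the vectors from being spread across six separate cells.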

huangapple
  • Posted on July 13, 2023, 20:02:59
  • Link to this article: https://go.coder-hub.com/76679165.html