July 13, 2023 20:02:59

Computationally efficient way to manipulate the levels of large deeply-nested objects?

Question

I have a list of lists of vectors (not a typo; I am re-confirming that it is in fact a list of lists of vectors) that is 76 million in length. So, there is a list of 76 million items where each item is a list of two vectors.

All the vectors are of uniform length (6 items).

For example, the data looks as follows for list_of_list[1:50]:

dput output

list(list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0)), list(c(4, 
4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)), list(c(4, 4, 1, 0, 1, 0
), c(4, 5, 1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(5, 8, 0, 
0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(5, 5, 0, 2, 0, 0)), list(
    c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 0, 0)), list(c(4, 4, 
1, 0, 1, 0), c(4, 5, 1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), 
    c(4, 4, 1, 0, 1, 0)), list(c(4, 4, 1, 0, 1, 0), c(6, 10, 
1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)), 
    list(c(4, 4, 1, 0, 1, 0), c(5, 7, 2, 0, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(40, 10, 0, 15, 8, 0)), list(c(4, 4, 1, 
    0, 1, 0), c(24L, 7L, 6L, 20L, 8L, 1L)), list(c(4, 4, 1, 0, 
    1, 0), c(39L, 22L, 9L, 5L, 8L, 1L)), list(c(4, 4, 1, 0, 1, 
    0), c(34, 36, 17, 15, 0, 2)), list(c(4, 4, 1, 0, 1, 0), c(36L, 
    42L, 18L, 4L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(4, 5, 
    1, 0, 0, 1)), list(c(4, 4, 1, 0, 1, 0), c(4, 8, 3, 0, 0, 
    0)), list(c(4, 4, 1, 0, 1, 0), c(3, 1, 2, 2, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(6, 9, 0, 1, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(5, 5, 0, 2, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(6, 10, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(6, 
    10, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 15, 0, 0, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 0, 0)), 
    list(c(4, 4, 1, 0, 1, 0), c(4, 2, 1, 2, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(28, 24, 19, 14, 4, 0)), list(c(4, 4, 1, 
    0, 1, 0), c(40, 56, 19, 11, 0, 0)), list(c(4, 4, 1, 0, 1, 
    0), c(32L, 33L, 14L, 17L, 1L, 2L)), list(c(4, 4, 1, 0, 1, 
    0), c(24L, 55L, 11L, 16L, 6L, 1L)), list(c(4, 4, 1, 0, 1, 
    0), c(27, 10, 6, 19, 8, 0)), list(c(4, 4, 1, 0, 1, 0), c(31, 
    21, 11, 19, 4, 0)), list(c(4, 4, 1, 0, 1, 0), c(37L, 60L, 
    12L, 7L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(29L, 8L, 3L, 
    18L, 8L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(21L, 24L, 20L, 
    14L, 5L, 1L)), list(c(4, 4, 1, 0, 1, 0), c(6, 10, 1, 0, 0, 
    0)), list(c(4, 4, 1, 0, 1, 0), c(5, 9, 2, 0, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(7, 13, 0, 0, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(6, 12, 1, 0, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(5, 8, 1, 1, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 
    7, 0, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(7, 11, 0, 0, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 6, 1, 1, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(4, 3, 0, 3, 0, 0)), list(c(4, 
    4, 1, 0, 1, 0), c(3, 2, 3, 1, 0, 0)), list(c(4, 4, 1, 0, 
    1, 0), c(4, 4, 1, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 
    3, 2, 2, 0, 0)), list(c(4, 4, 1, 0, 1, 0), c(5, 7, 0, 2, 
    0, 0)), list(c(4, 4, 1, 0, 1, 0), c(3, 1, 2, 2, 0, 0)), list(
        c(4, 4, 1, 0, 1, 0), c(6, 7, 0, 1, 0, 0)))

Just FYI, the list of lists was made using combn() using this template: combn(focal_list,2,simplify = FALSE)

Is there a computationally efficient way to turn this into a table of two columns where each row is one item from the list of lists? All the first vectors become the first column and all the second vectors become the second column?

I tried the following, and it was still running after 10-12 minutes with no output, which is just too expensive for my use case:

dt <- data.table(col1 = lapply(1:length(list_of_list), function(x) list_of_list[[x]][1]),
                 col2 = lapply(1:length(list_of_list), function(x) list_of_list[[x]][2]))

I could use a foreach loop to untangle the deeply nested object and read in the vectors as character strings separated by a simple delimiter, and then use another foreach loop to create a data.table, but before I do that, is there a simpler way in R that I am missing?

Please note, for clarification, that I want to maintain the vector() nature of the lowest-level items, i.e., when you make a table out of the list of lists, each item should be a vector and the data.table should have two columns. It seems R likes to flatten vectors and lists when trying to make tables.
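
A side note on the attempt above, as a minimal base R sketch rather than a tested solution for the full 76 million items: single-bracket indexing (item[1]) returns a length-1 list wrapped around the vector, whereas double-bracket indexing (item[[1]]) returns the vector itself. The two-item sample and the names wrapped, col1, col2 and df are illustrative only; I() is used so that data.frame() keeps each column as a list column instead of flattening it.

```r
# Tiny stand-in for the real data: each item is a list of two length-6 vectors.
list_of_list <- list(
  list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0)),
  list(c(4, 4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0))
)

# `[` keeps the list wrapper, so each cell is a length-1 list around the vector.
wrapped <- lapply(list_of_list, function(item) item[1])

# `[[` extracts the vector itself, so each cell is a plain numeric vector.
col1 <- lapply(list_of_list, `[[`, 1)
col2 <- lapply(list_of_list, `[[`, 2)

# Two list columns in a base R data frame; I() prevents flattening.
df <- data.frame(col1 = I(col1), col2 = I(col2))

is.list(wrapped[[1]])  # the wrapper is still there
df$col2[[2]]           # the second item's second vector, intact
```

This only illustrates the shape of the result; it makes no claim about speed at full scale.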

Answer 1

Score: 1

There are several ways to do this, for example:

rbindlist + rapply

rbindlist(rapply(list_of_list, list, how = "replace"))

as.data.frame + rbind

as.data.frame(do.call(rbind, list_of_list))

However, the second option, i.e., the base R approach as.data.frame + rbind, seems much faster than the first one (see the benchmark below):

microbenchmark(
    f1 = rbindlist(rapply(list_of_list, list, how = "replace")),
    f2 = as.data.frame(do.call(rbind, list_of_list)),
    check = "equivalent"
)

which gives

Unit: microseconds
 expr   min    lq    mean median     uq   max neval
   f1 138.7 168.7 177.896 174.10 185.00 392.6   100
   f2  31.7  38.5  45.127  43.55  50.25  88.8   100
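
To make the shape of the base R result concrete, here is a small sketch on three items taken from the sample data in the question. The names m and df are illustrative; the point is only that rbind on lists produces a list-matrix, so each cell still holds a complete length-6 vector after as.data.frame.

```r
# Small stand-in for the real data: 3 items, each a list of two length-6 vectors.
list_of_list <- list(
  list(c(4, 4, 1, 0, 1, 0), c(3, 3, 2, 2, 0, 0)),
  list(c(4, 4, 1, 0, 1, 0), c(3, 4, 3, 1, 0, 0)),
  list(c(4, 4, 1, 0, 1, 0), c(4, 5, 1, 0, 0, 1))
)

# rbind on lists yields a 3 x 2 list-matrix; nothing is flattened.
m <- do.call(rbind, list_of_list)

# as.data.frame turns each matrix column into a list column (V1, V2).
df <- as.data.frame(m)

dim(m)       # one row per item, one column per vector position
df$V2[[2]]   # the second item's second vector, still a numeric vector
```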

Answer 2

Score: 0

I would suggest using Rcpp, as in the code below. Since you have 76 million items, I recommend processing the data in batches of, say, 10 million each. On my computer, it takes 8 seconds to convert 10 million items into a matrix, so doing this 8 times will take approximately 70-80 seconds. Store the matrix from each batch, then combine them into one, probably by writing them to a single file on disk.

Rcpp::cppFunction(
'NumericVector combineList(std::vector<std::vector<std::vector<double>>> x){
	int n = x.size();        // number of items
	int m = x[0].size();     // vectors per item
	int p = x[0][0].size();  // length of each vector
	std::vector<double> y(n*p*m);
	for(int i = 0; i < n; i++)
		for(int j = 0; j < m; j++)
			for(int k = 0; k < p; k++)
				// column-major position: row i*p + k, column j
				y[j*n*p + i*p + k] = x[i][j][k];
	NumericVector z = wrap(y);
	z.attr("dim") = Dimension(n*p, m);
	return z;
}'
)
combineList(list_of_list)
       [,1] [,2]
  [1,]    4    3
  [2,]    4    3
  [3,]    1    2
  [4,]    0    2
  [5,]    1    0
  [6,]    0    0
  [7,]    4    3
  [8,]    4    4
  [9,]    1    3
 [10,]    0    1
 [11,]    1    0
 [12,]    0    0
 [13,]    4    4
 [14,]    4    5
 [15,]    1    1
 [16,]    0    0
 [17,]    1    0
 [18,]    0    1
 [19,]    4    5
 [20,]    4    8
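
The batching idea can be sketched in plain R as follows. This is only an illustration under assumptions: convert_in_batches is a hypothetical helper (not part of any package), the batch size of 2 is chosen for the toy data, and the per-batch converter here is base R's do.call(rbind, ...) rather than the Rcpp function above, so each cell keeps its vector.

```r
# Hypothetical helper: apply `convert` to consecutive slices of `x`.
convert_in_batches <- function(x, batch_size, convert) {
  if (length(x) == 0L) return(list())
  starts <- seq(1, length(x), by = batch_size)
  lapply(starts, function(s) {
    e <- min(s + batch_size - 1, length(x))
    convert(x[s:e])
  })
}

# Toy data: 5 items, each a list of two length-6 vectors.
items <- lapply(1:5, function(i) list(c(4, 4, 1, 0, 1, 0), c(i, i + 1, 0, 0, 0, 0)))

# Convert each batch to a list-matrix, then stack the batches row-wise.
parts <- convert_in_batches(items, 2, function(chunk) do.call(rbind, chunk))
combined <- do.call(rbind, parts)

length(parts)     # number of batches
dim(combined)     # one row per original item
combined[[4, 2]]  # the fourth item's second vector
```

In a real run each batch result could be written to disk instead of being held in the parts list, which is the memory-saving step suggested above.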

Answer 3

Score: 0

I was able to solve this issue fairly easily using this one line of code:

rbindlist(rapply(focal_list, list, how = "replace"))

The fascinating part is that the above code processes all 76 million items in about two minutes, with no Rcpp required (I can't say whether the packages use Rcpp under the hood).
