mclapply() chokes when elements to be parallelized on are too big - how to get around this?
Question
I am trying to parallelize a big operation in R. I am using mclapply().
The parallelization is done over a relatively small number of operations (50), but each operation is costly, on the order of a tenth of a second. Therefore, the naive overhead problem of parallelizing tasks that are too small or too few should not be an issue here. Also, the objects used in the computations are big, but since I am using fork parallelism there should not be any copying.
However, it turns out that solving the problem in parallel is costlier than solving it sequentially!
Why is that? Any idea how to get around it?
Minimal working example:
M <- Matrix::bdiag(lapply(seq(5000), function(i) matrix(rnorm(9), 3)))
M_list <- list()
for (i in seq(500)) M_list[[i]] <- M
B <- Matrix::sparseMatrix(i = seq(15000), j = ceiling(50 * runif(15000)), x = rnorm(15000))

microbenchmark::microbenchmark(
  lapply(M_list, FUN = function(x, B) { x %*% B }, B = B),
  parallel::mclapply(M_list, FUN = function(x, B) { x %*% B }, B = B, mc.cores = 4),
  times = 5
)
Edit: thanks to @HenrikB's answer, it now seems clear to me that there was overhead after all. I think it is due to the size of the returned object, which each worker has to serialize back to the parent process. If you run the following, where each task returns a 50-by-50 matrix instead of a 15000-by-50 one, parallelization becomes useful again.
M <- Matrix::bdiag(lapply(seq(5000), function(i) matrix(rnorm(9), 3)))
M_list <- list()
for (i in seq(500)) M_list[[i]] <- M
B <- Matrix::sparseMatrix(i = seq(15000), j = ceiling(50 * runif(15000)), x = rnorm(15000))

microbenchmark::microbenchmark(
  lapply(M_list, FUN = function(x, B) { x %*% B }, B = B),
  parallel::mclapply(M_list, FUN = function(x, B) { Matrix::t(B) %*% x %*% B }, B = B, mc.cores = 4),
  times = 5
)
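To see why the size of the returned object matters here: mclapply() serializes each worker's result and ships it back to the parent process over a pipe. A rough check of this intuition (my addition, not part of the original post) is to compare the serialized sizes of the two return values:

## Rough check (my addition, not from the original post): mclapply() workers
## serialize their results back to the parent, so the byte counts below
## approximate the per-task inter-process traffic in the two benchmarks above.
x <- M_list[[1]]
full    <- x %*% B                    # 15000-by-50 result (original example)
reduced <- Matrix::t(B) %*% x %*% B   # 50-by-50 result (the edit above)
length(serialize(full, NULL))         # bytes sent back per task, original
length(serialize(reduced, NULL))      # bytes sent back per task, edit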
Answer 1
Score: 3
I had a look. It turns out this one falls under the classic "the overhead from parallelization is greater than the performance gain" case. We can see this if we profile the code using, for instance, proffer:
library(Matrix)
library(parallel)

## Create a 15000-by-15000 sparse block-diagonal matrix (dgCMatrix; ~ 600 kB)
M <- bdiag(lapply(seq_len(5000L), FUN = function(i) matrix(rnorm(9L), nrow = 3L)))
M_list <- vector("list", length = 500L)
for (ii in seq_along(M_list)) M_list[[ii]] <- M

## Create a 15000-by-50 sparse matrix (dgCMatrix; ~ 180 kB)
n <- nrow(M)
B <- sparseMatrix(i = seq_len(n), j = ceiling(50 * runif(n)), x = rnorm(n))

FUN <- function(x, B) { x %*% B }

## Profile lapply()
proffer::pprof(lapply(M_list, FUN = function(x, B) { x %*% B }, B = B))
#  Flat   Flat%    Sum%  Cum    Cum%  Name                 Inlined?
#    15  83.33%  83.33%   17  94.44%  .Call
#     1   5.56%  88.89%    1   5.56%  isVirtualExt
#     1   5.56%  94.44%    1   5.56%  .classEnv
#     1   5.56% 100.00%   18 100.00%  %*%
#     0   0.00% 100.00%    1   5.56%  vapply
#     0   0.00% 100.00%   18 100.00%  record_rprof
#     0   0.00% 100.00%   18 100.00%  record_pprof
#     0   0.00% 100.00%   18 100.00%  proffer::pprof
#     0   0.00% 100.00%   18 100.00%  lapply
#     0   0.00% 100.00%    1   5.56%  .selectSuperClasses
#     0   0.00% 100.00%   18 100.00%  FUN
## Profile mclapply() with 4 parallel workers
proffer::pprof(mclapply(M_list, FUN = function(x, B) { x %*% B }, B = B, mc.cores = 4L))
#  Flat   Flat%    Sum%  Cum    Cum%  Name
#    12  54.55%  54.55%   12  54.55%  readChild
#     9  40.91%  95.45%    9  40.91%  unserialize
#     1   4.55% 100.00%    1   4.55%  mcfork
#     0   0.00% 100.00%   22 100.00%  record_rprof
#     0   0.00% 100.00%   22 100.00%  record_pprof
#     0   0.00% 100.00%   22 100.00%  proffer::pprof
#     0   0.00% 100.00%   22 100.00%  mclapply
#     0   0.00% 100.00%    1   4.55%  lapply
#     0   0.00% 100.00%    1   4.55%  FUN
In contrast to lapply(), which spends most of its time processing FUN, mclapply() comes with a lot of overhead from parallel orchestration (readChild(), unserialize(), and mcfork()).
In summary, the processing time in FUN is so short that it's not worth parallelizing.
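If the full products are not actually needed downstream, a variation on the OP's own fix is to have each task return a small summary, so that far less data flows through readChild()/unserialize(). A minimal sketch (my addition, not part of the answer, assuming per-column norms are the quantity of interest):

## Sketch (assumption: only the column norms of each product are needed).
## Each worker now serializes a length-50 numeric vector instead of a
## 15000-by-50 sparse matrix, which shrinks the orchestration overhead.
res <- parallel::mclapply(M_list, FUN = function(x, B) {
  sqrt(Matrix::colSums((x %*% B)^2))
}, B = B, mc.cores = 4L)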