
mclapply() chokes when elements to be parallelized on are too big - how to get around this?

I am trying to parallelize a big operation in R, using mclapply().

The parallelization is done over a relatively small number of operations (50), but each operation is costly, on the order of a tenth of a second. So the naive overhead problem of parallelizing tasks that are too small should not arise here. Also, the objects used in the computations are big, but since I am using fork-based parallelism there should not be any copying.

However, it turns out that solving the problem in parallel is costlier than solving it sequentially!
Why is that? Any idea how to get around it?

Minimal working example:

M = Matrix::bdiag(lapply(seq(5000), function(i) matrix(rnorm(9), 3)))
M_list = list(); for (i in seq(500)) M_list[[i]] = M
B = Matrix::sparseMatrix(i = seq(15000), j = ceiling(50 * runif(15000)), x = rnorm(15000))

microbenchmark::microbenchmark(
  lapply(M_list, FUN = function(x, B) { x %*% B }, B = B),
  parallel::mclapply(mc.cores = 4, M_list, FUN = function(x, B) { x %*% B}, B = B)
  , times = 5
)

Edit: thanks to @HenrikB's answer, it now seems clear to me that there was overhead after all. I think it comes from the size of the returned object: each worker has to serialize its result back to the parent. If you run the following, where each task returns a small 50-by-50 matrix instead of a 15000-by-50 one, parallelization becomes useful again.

M = Matrix::bdiag(lapply(seq(5000), function(i) matrix(rnorm(9), 3)))
M_list = list(); for (i in seq(500)) M_list[[i]] = M
B = Matrix::sparseMatrix(i = seq(15000), j = ceiling(50 * runif(15000)), x = rnorm(15000))

microbenchmark::microbenchmark(
  lapply(M_list, FUN = function(x, B) { x %*% B }, B = B),
  parallel::mclapply(mc.cores = 4, M_list, FUN = function(x, B) { Matrix::t(B) %*% x %*% B}, B = B)
  , times = 5
)
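One way to check the returned-object hypothesis (a quick sketch, not from the original post): compare the size of what each worker sends back in the two benchmarks. In the first, each task returns a 15000-by-50 matrix; in the second, only a 50-by-50 one, so there is far less to serialize back to the parent.

```r
library(Matrix)

# Rebuild the objects from the example above
M <- bdiag(lapply(seq_len(5000L), function(i) matrix(rnorm(9L), nrow = 3L)))
B <- sparseMatrix(i = seq_len(15000L), j = ceiling(50 * runif(15000L)),
                  x = rnorm(15000L))

r1 <- M %*% B           # per-task result in the first benchmark (15000 x 50)
r2 <- t(B) %*% M %*% B  # per-task result in the second benchmark (50 x 50)

# mclapply() serializes each result in the child and unserializes it in the
# parent; r1 is much larger than r2, so the first benchmark pays far more
# for this round-trip on every one of the 500 tasks.
print(object.size(r1))
print(object.size(r2))
```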

Answer 1 (score: 3)


I had a look. It turns out this one falls under the classical "the overhead from parallelization is greater than the performance gain" case. We can see this if we profile the code with, for instance, proffer:

library(Matrix)
library(parallel)

## Create a 15000-by-15000 sparse matrix (dgCMatrix; ~ 600 kB)
M <- bdiag(lapply(seq_len(5000L), FUN = function(i) matrix(rnorm(9L), nrow = 3L)))

M_list <- vector("list", length = 500L)
for (ii in seq_along(M_list)) M_list[[ii]] <- M

## Create a 15000-by-50 sparse matrix (dgCMatrix; ~ 180 kB)
n <- nrow(M)
B <- sparseMatrix(i = seq_len(n), j = ceiling(50*runif(n)), x = rnorm(n))

FUN <- function(x, B) { x %*% B }


## Profile lapply()
proffer::pprof(lapply(M_list, FUN = FUN, B = B))
# Flat  Flat%   Sum%     Cum  Cum%     Name                 Inlined?
# 15    83.33%  83.33%   17   94.44%   .Call
# 1     5.56%   88.89%   1    5.56%    isVirtualExt
# 1     5.56%   94.44%   1    5.56%    .classEnv
# 1     5.56%   100.00%  18   100.00%  %*%
# 0     0.00%   100.00%  1    5.56%    vapply
# 0     0.00%   100.00%  18   100.00%  record_rprof
# 0     0.00%   100.00%  18   100.00%  record_pprof
# 0     0.00%   100.00%  18   100.00%  proffer::pprof
# 0     0.00%   100.00%  18   100.00%  lapply
# 0     0.00%   100.00%  1    5.56%    .selectSuperClasses
# 0     0.00%   100.00%  18   100.00%  FUN

## Profile mclapply() with 4 parallel workers
proffer::pprof(mclapply(M_list, FUN = FUN, B = B, mc.cores = 4L))
# Flat  Flat%   Sum%     Cum  Cum%     Name
# 12    54.55%  54.55%   12   54.55%   readChild
# 9     40.91%  95.45%   9    40.91%   unserialize
# 1     4.55%   100.00%  1    4.55%    mcfork
# 0     0.00%   100.00%  22   100.00%  record_rprof
# 0     0.00%   100.00%  22   100.00%  record_pprof
# 0     0.00%   100.00%  22   100.00%  proffer::pprof
# 0     0.00%   100.00%  22   100.00%  mclapply
# 0     0.00%   100.00%  1    4.55%   lapply
# 0     0.00%   100.00%  1    4.55%   FUN

Contrary to lapply(), which spends most of its time processing FUN, mclapply() comes with a lot of overhead from parallel orchestration (readChild(), unserialize(), and mcfork()).

In summary, the processing time in FUN is so short that it's not worth parallelizing.
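To put a rough number on that orchestration cost, one can time the serialize/unserialize round-trip that mclapply() performs for every result, next to the multiply itself (a sketch under the same setup as above; it ignores the additional pipe I/O behind readChild()):

```r
library(Matrix)

M <- bdiag(lapply(seq_len(5000L), function(i) matrix(rnorm(9L), nrow = 3L)))
B <- sparseMatrix(i = seq_len(15000L), j = ceiling(50 * runif(15000L)),
                  x = rnorm(15000L))
res <- M %*% B  # the per-task result in the question's first benchmark

# Average over a few repetitions: the useful work per task ...
t_compute <- system.time(for (i in 1:10) invisible(M %*% B))[["elapsed"]] / 10
# ... versus the shipping cost mclapply() pays per result
t_ship <- system.time(for (i in 1:10) {
  invisible(unserialize(serialize(res, NULL)))
})[["elapsed"]] / 10

cat(sprintf("compute: %.4f s, ship: %.4f s per task\n", t_compute, t_ship))
```

If the shipping time is comparable to the compute time, four workers cannot win: the parent unserializes all 500 results one after another, which is exactly the unserialize()/readChild() time visible in the profile.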

huangapple
  • Posted on 2023-03-31 22:20:49
  • Original link (please keep when reposting): https://go.coder-hub.com/75899623.html