mclapply() chokes when the elements being parallelized over are too big - how to get around this?
I am trying to parallelize a big operation in R using mclapply().
The parallelization covers a relatively small number of operations (50), but each operation is costly, on the order of a tenth of a second. So the naive overhead problem, parallelizing too few or too small tasks, should not be an issue here. The objects used in the computations are big, but since I am using fork parallelism, they should not be copied.
However, it turns out that solving the problem in parallel is costlier than solving it sequentially!
Why is that? Any idea how to get around it?
Minimal working example:
M = Matrix::bdiag(lapply(seq(5000), function(i) matrix(rnorm(9), 3)))
M_list = list(); for (i in seq(500)) M_list[[i]] = M
B = Matrix::sparseMatrix(i = seq(15000), j = ceiling(50 * runif(15000)), x = rnorm(15000))
microbenchmark::microbenchmark(
  lapply(M_list, FUN = function(x, B) { x %*% B }, B = B),
  parallel::mclapply(mc.cores = 4, M_list, FUN = function(x, B) { x %*% B }, B = B),
  times = 5
)
Edit: thanks to @HenrikB's answer, it seems clear to me that there was overhead after all. I think it is due to the size of the returned object. If you run the following, where each parallel task returns a much smaller 50-by-50 result, parallelization pays off again.
M = Matrix::bdiag(lapply(seq(5000), function(i) matrix(rnorm(9), 3)))
M_list = list(); for (i in seq(500)) M_list[[i]] = M
B = Matrix::sparseMatrix(i = seq(15000), j = ceiling(50 * runif(15000)), x = rnorm(15000))
microbenchmark::microbenchmark(
  lapply(M_list, FUN = function(x, B) { x %*% B }, B = B),
  parallel::mclapply(mc.cores = 4, M_list, FUN = function(x, B) { Matrix::t(B) %*% x %*% B }, B = B),
  times = 5
)
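One rough way to check the "size of the returned object" explanation is to compare how many bytes each kind of result takes once serialized, since a forked worker has to send its result back to the parent in serialized form. The sketch below uses smaller matrices than the question so it runs quickly; `n_blocks` and the helper `payload_bytes()` are names made up for this illustration, not part of the original code.

```r
library(Matrix)

set.seed(1)
# Assumption: a scaled-down version of the question's matrices (2000 blocks, not 5000)
n_blocks <- 2000L
M <- bdiag(lapply(seq_len(n_blocks), function(i) matrix(rnorm(9L), nrow = 3L)))
n <- nrow(M)
B <- sparseMatrix(i = seq_len(n), j = sample.int(50L, n, replace = TRUE),
                  x = rnorm(n), dims = c(n, 50L))

# Hypothetical helper: bytes needed to serialize one result object
payload_bytes <- function(obj) length(serialize(obj, connection = NULL))

size_large <- payload_bytes(M %*% B)           # n-by-50 result, as in the first benchmark
size_small <- payload_bytes(t(B) %*% M %*% B)  # 50-by-50 result, as in the second benchmark

c(large = size_large, small = size_small)
```

If the first number is much larger than the second, that is at least consistent with the return payload being what makes the first parallel version slow.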
Score: 3
I had a look. It turns out that this is a classic case of "the overhead from parallelization is greater than the performance gain". We can see this by profiling the code with, for instance, proffer:
library(Matrix)    # bdiag(), sparseMatrix()
library(parallel)  # mclapply()
## Create a 15000-by-15000 sparse matrix (dgCMatrix; ~ 600 kB)
M <- bdiag(lapply(seq_len(5000L), FUN = function(i) matrix(rnorm(9L), nrow = 3L)))
M_list <- vector("list", length = 500L)
for (ii in seq_along(M_list)) M_list[[ii]] <- M
## Create a 15000-by-50 sparse matrix (dgCMatrix; ~ 180 kB)
n <- nrow(M)
B <- sparseMatrix(i = seq_len(n), j = ceiling(50*runif(n)), x = rnorm(n))
FUN <- function(x, B) { x %*% B }
## Profile lapply() (no mc.cores here: lapply() would pass it on to FUN and fail)
proffer::pprof(lapply(M_list, FUN = function(x, B) { x %*% B }, B = B))
# Flat Flat% Sum% Cum Cum% Name Inlined?
# 15 83.33% 83.33% 17 94.44% _Call
# 1 5.56% 88.89% 1 5.56% isVirtualExt
# 1 5.56% 94.44% 1 5.56% _classEnv
# 1 5.56% 100.00% 18 100.00% %*%
# 0 0.00% 100.00% 1 5.56% vapply
# 0 0.00% 100.00% 18 100.00% record_rprof
# 0 0.00% 100.00% 18 100.00% record_pprof
# 0 0.00% 100.00% 18 100.00% proffer::pprof
# 0 0.00% 100.00% 18 100.00% lapply
# 0 0.00% 100.00% 1 5.56% _selectSuperClasses
# 0 0.00% 100.00% 18 100.00% FUN
## Profile mclapply() with 4 parallel workers
proffer::pprof(mclapply(M_list, FUN = function(x, B) { x %*% B }, B = B, mc.cores = 4L))
# Flat Flat% Sum% Cum Cum% Name
# 12 54.55% 54.55% 12 54.55% readChild
# 9 40.91% 95.45% 9 40.91% unserialize
# 1 4.55% 100.00% 1 4.55% mcfork
# 0 0.00% 100.00% 22 100.00% record_rprof
# 0 0.00% 100.00% 22 100.00% record_pprof
# 0 0.00% 100.00% 22 100.00% proffer::pprof
# 0 0.00% 100.00% 22 100.00% mclapply
# 0 0.00% 100.00% 1 4.55% lapply
# 0 0.00% 100.00% 1 4.55% FUN
Contrary to lapply(), which spends most of its time processing FUN, mclapply() comes with a lot of overhead from parallel orchestration (readChild(), unserialize(), and mcfork()).
In summary, the processing time in FUN is so short that it's not worth parallelizing.
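Since most of the time in the mclapply() profile goes to readChild() and unserialize(), one possible workaround, when the full product is not needed in the parent process, is to do any follow-up reduction inside the worker so that only a small object is sent back. The following is a minimal sketch under that assumption; `reduce_in_worker()` and its `sum(abs(...))` reduction are hypothetical stand-ins for whatever the real downstream computation is, and the matrices are scaled down from the question.

```r
library(Matrix)
library(parallel)

set.seed(42)
# Assumption: small stand-ins for the question's M_list and B
M <- bdiag(lapply(seq_len(200L), function(i) matrix(rnorm(9L), nrow = 3L)))
M_list <- rep(list(M), 8L)
n <- nrow(M)
B <- sparseMatrix(i = seq_len(n), j = sample.int(50L, n, replace = TRUE),
                  x = rnorm(n), dims = c(n, 50L))

# Hypothetical worker: compute the product, then reduce it to a single number
# in the child process, so only a scalar is serialized back to the parent.
reduce_in_worker <- function(x, B) sum(abs(x %*% B))

# mclapply() relies on forking, which Windows lacks; fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

seq_res <- lapply(M_list, reduce_in_worker, B = B)
par_res <- mclapply(M_list, reduce_in_worker, B = B, mc.cores = cores)

# Both approaches should give the same values
all.equal(seq_res, par_res)
```

Whether this actually beats lapply() still depends on how expensive FUN is relative to the fork and transfer costs, so it is worth benchmarking on the real data.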