mclapply() chokes when elements to be parallelized on are too big - how to get around this?
Question
I am trying to parallelize a big operation in R. I am using mclapply().
The parallelization is done over a relatively small number of operations (50), but each operation is costly, on the order of a tenth of a second. Therefore, the naive overhead problem of parallelizing tasks that are too small or too few should not be an issue here. Also, the objects used in the computations are big, but since I am using fork parallelism there should not be any copying.
However, it turns out that solving the problem in parallel is costlier than solving it sequentially!
Why is that? Any idea how to get around it?
Minimal working example:
M <- Matrix::bdiag(lapply(seq(5000), function(i) matrix(rnorm(9), 3)))
M_list <- list()
for (i in seq(500)) M_list[[i]] <- M
B <- Matrix::sparseMatrix(i = seq(15000), j = ceiling(50 * runif(15000)), x = rnorm(15000))

microbenchmark::microbenchmark(
  lapply(M_list, FUN = function(x, B) { x %*% B }, B = B),
  parallel::mclapply(M_list, FUN = function(x, B) { x %*% B }, B = B, mc.cores = 4),
  times = 5
)
Edit: thanks to @HenrikB's answer, it now seems clear to me that there was overhead after all. I think it is due to the size of the returned object, which each worker has to serialize back to the parent process. If you run the following, where each task returns a 50-by-50 matrix instead of a 15000-by-50 one, parallelization becomes useful again.
M <- Matrix::bdiag(lapply(seq(5000), function(i) matrix(rnorm(9), 3)))
M_list <- list()
for (i in seq(500)) M_list[[i]] <- M
B <- Matrix::sparseMatrix(i = seq(15000), j = ceiling(50 * runif(15000)), x = rnorm(15000))

microbenchmark::microbenchmark(
  lapply(M_list, FUN = function(x, B) { x %*% B }, B = B),
  parallel::mclapply(M_list, FUN = function(x, B) { Matrix::t(B) %*% x %*% B }, B = B, mc.cores = 4),
  times = 5
)
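To see why the size of the returned object matters here: mclapply() serializes each worker's result and ships it back to the parent process over a pipe. A rough check of this intuition (my addition, not part of the original post) is to compare the serialized sizes of the two return values:

## Rough check (my addition, not from the original post): mclapply() workers
## serialize their results back to the parent, so the byte counts below
## approximate the per-task inter-process traffic in the two benchmarks above.
x <- M_list[[1]]
full    <- x %*% B                    # 15000-by-50 result (original example)
reduced <- Matrix::t(B) %*% x %*% B   # 50-by-50 result (the edit above)
length(serialize(full, NULL))         # bytes sent back per task, original
length(serialize(reduced, NULL))      # bytes sent back per task, edit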
Answer 1
Score: 3
I had a look. It turns out this one falls under the classic "the overhead from parallelization is greater than the performance gain" case. We can see this if we profile the code using, for instance, proffer:
library(Matrix)
library(parallel)

## Create a 15000-by-15000 sparse block-diagonal matrix (dgCMatrix; ~ 600 kB)
M <- bdiag(lapply(seq_len(5000L), FUN = function(i) matrix(rnorm(9L), nrow = 3L)))
M_list <- vector("list", length = 500L)
for (ii in seq_along(M_list)) M_list[[ii]] <- M

## Create a 15000-by-50 sparse matrix (dgCMatrix; ~ 180 kB)
n <- nrow(M)
B <- sparseMatrix(i = seq_len(n), j = ceiling(50 * runif(n)), x = rnorm(n))

FUN <- function(x, B) { x %*% B }

## Profile lapply()
proffer::pprof(lapply(M_list, FUN = function(x, B) { x %*% B }, B = B))
#  Flat   Flat%    Sum%  Cum    Cum%  Name                 Inlined?
#    15  83.33%  83.33%   17  94.44%  .Call
#     1   5.56%  88.89%    1   5.56%  isVirtualExt
#     1   5.56%  94.44%    1   5.56%  .classEnv
#     1   5.56% 100.00%   18 100.00%  %*%
#     0   0.00% 100.00%    1   5.56%  vapply
#     0   0.00% 100.00%   18 100.00%  record_rprof
#     0   0.00% 100.00%   18 100.00%  record_pprof
#     0   0.00% 100.00%   18 100.00%  proffer::pprof
#     0   0.00% 100.00%   18 100.00%  lapply
#     0   0.00% 100.00%    1   5.56%  .selectSuperClasses
#     0   0.00% 100.00%   18 100.00%  FUN
## Profile mclapply() with 4 parallel workers
proffer::pprof(mclapply(M_list, FUN = function(x, B) { x %*% B }, B = B, mc.cores = 4L))
#  Flat   Flat%    Sum%  Cum    Cum%  Name
#    12  54.55%  54.55%   12  54.55%  readChild
#     9  40.91%  95.45%    9  40.91%  unserialize
#     1   4.55% 100.00%    1   4.55%  mcfork
#     0   0.00% 100.00%   22 100.00%  record_rprof
#     0   0.00% 100.00%   22 100.00%  record_pprof
#     0   0.00% 100.00%   22 100.00%  proffer::pprof
#     0   0.00% 100.00%   22 100.00%  mclapply
#     0   0.00% 100.00%    1   4.55%  lapply
#     0   0.00% 100.00%    1   4.55%  FUN
In contrast to lapply(), which spends most of its time processing FUN, mclapply() comes with a lot of overhead from parallel orchestration (readChild(), unserialize(), and mcfork()).
In summary, the processing time in FUN is so short that it's not worth parallelizing.
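If the full products are not actually needed downstream, a variation on the OP's own fix is to have each task return a small summary, so that far less data flows through readChild()/unserialize(). A minimal sketch (my addition, not part of the answer, assuming per-column norms are the quantity of interest):

## Sketch (assumption: only the column norms of each product are needed).
## Each worker now serializes a length-50 numeric vector instead of a
## 15000-by-50 sparse matrix, which shrinks the orchestration overhead.
res <- parallel::mclapply(M_list, FUN = function(x, B) {
  sqrt(Matrix::colSums((x %*% B)^2))
}, B = B, mc.cores = 4L)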