How to convert foreach into a function?
Question
I am using foreach to calculate correlation coefficients and p-values, with mtcars as an example (foreach is overkill here, but the data frame I'm actually using has 450 observations of 3,400 variables). I use combn to get rid of duplicate correlations and self-correlations.
combo_cars <- data.frame(t(combn(names(mtcars),2)))
library(foreach)
cars_res <- foreach(i=1:nrow(combo_cars), .combine=rbind, .packages=c("magrittr", "dplyr")) %dopar% {
out2 <- broom::tidy(cor.test(mtcars[, combo_cars[i,1]],
mtcars[,combo_cars[i,2]],
method = "spearman")) %>%
mutate(Var1=combo_cars[i,1], Var2=combo_cars[i,2])
}
I would like to convert this into a function, as I would like to try using the future package because I need to run correlations on subsections of the original dataframe and it's more efficient than running in parallel. When trying to devise a function that replicates the above, I can use:
car_res2 <- data.frame(t(combn(names(mtcars), 2, function(x)
cor.test(mtcars[[x[1]]],
mtcars[[x[2]]], method="spearman"), simplify=TRUE)))
Ultimately, I would like to be able to have four futures running in parallel, each computing the above on a different fraction of the dataset.
However, the car_res2 output has 8 columns instead of 7 (the second one is completely empty). I had to use the output from cars_res to work out what the values were: they were in the order statistic, blank, p-value, estimate, etc., whereas cars_res had labeled columns (estimate, statistic, p-value).
- Why is the output in a different order, and unlabeled, with the second approach?
- Can I use one of the apply functions in place of the above function?
Any comments would be appreciated.
Answer 1
Score: 1
Without parallelization, you can first try RcppAlgos::comboGeneral, which works much like combn but is implemented in C++ and may therefore be faster (it also has a Parallel= option, but that is ignored when FUN is used). Moreover, I don't load broom and dplyr.
res <- RcppAlgos::comboGeneral(names(mtcars), 2, FUN=\(x) {
data.frame(cor.test(mtcars[, x[1]], mtcars[, x[2]], method="spearman")[c(4, 1, 3, 7, 6)], t(x))
}, Parallel=TRUE, nThreads=7) |> do.call(what=rbind) |> `rownames<-`(NULL)
head(res)
# estimate statistic p.value method alternative X1 X2
# 1 -0.9108013 10425.332 4.690287e-13 Spearman's rank correlation rho two.sided mpg cyl
# 2 -0.9088824 10414.862 6.370336e-13 Spearman's rank correlation rho two.sided mpg disp
# 3 -0.8946646 10337.290 5.085969e-12 Spearman's rank correlation rho two.sided mpg hp
# 4 0.6514555 1901.659 5.381347e-05 Spearman's rank correlation rho two.sided mpg drat
# 5 -0.8864220 10292.319 1.487595e-11 Spearman's rank correlation rho two.sided mpg wt
# 6 0.4669358 2908.399 7.055765e-03 Spearman's rank correlation rho two.sided mpg qsec
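The positional index c(4, 1, 3, 7, 6) selects components of the list that cor.test returns. As a minimal sketch (base R only, not part of the original answer), the same components can be selected by name, which is easier to read. For Spearman's test the second component, parameter, is NULL, which is probably why the combn(..., simplify=TRUE) attempt in the question shows an empty, unlabeled second column.

```r
# Run one test and inspect the components of the returned object.
# suppressWarnings() hides the "exact p-value with ties" warning.
ct <- suppressWarnings(cor.test(mtcars$mpg, mtcars$cyl, method = "spearman"))
names(ct)             # component names, in the order cor.test stores them
is.null(ct$parameter) # the second component is empty for Spearman

# Select the wanted components by name instead of by position.
keep <- c("estimate", "statistic", "p.value", "method", "alternative")
row <- data.frame(ct[keep], Var1 = "mpg", Var2 = "cyl")
rownames(row) <- NULL
row
```

Selecting by name also keeps the column labels, which the matrix built by combn(..., simplify=TRUE) does not carry over.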
Alternatively, if you're on Linux (or Mac, but not tested), you could use parallel::mclapply, which works like lapply but with multiple cores, and use combn beforehand. This gives you the freedom to choose an arbitrary subset of combinations.
ncomb <- as.data.frame(combn(names(mtcars), 2))
parallel::mclapply(ncomb[, c(1:2, 11:12)], \(x) {
data.frame(cor.test(mtcars[, x[1]], mtcars[, x[2]], method="spearman")[c(4, 1, 3, 7, 6)], t(x))
}, mc.cores=7) |> do.call(what=rbind) |> `rownames<-`(NULL)
# estimate statistic p.value method alternative X1 X2
# 1 -0.9108013 10425.3320 4.690287e-13 Spearman's rank correlation rho two.sided mpg cyl
# 2 -0.9088824 10414.8622 6.370336e-13 Spearman's rank correlation rho two.sided mpg disp
# 3 0.9276516 394.7330 2.275443e-14 Spearman's rank correlation rho two.sided cyl disp
# 4 0.9017909 535.8287 1.867686e-12 Spearman's rank correlation rho two.sided cyl hp
On Windows you can use parallel::parLapply.
library(parallel)
CL <- makeCluster(detectCores() - 1)
clusterExport(CL, c('ncomb', 'mtcars')) ## `mtcars` stands for your data
parLapply(CL, ncomb[, c(1:2, 11:12)], \(x) {
data.frame(cor.test(mtcars[, x[1]], mtcars[, x[2]], method="spearman")[c(4, 1, 3, 7, 6)], t(x))
}) |> do.call(what=rbind) |> `rownames<-`(NULL)
stopCluster(CL)
See this answer for more details on the use of parLapply vs mclapply.
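Since the question asks for a function that can be run on separate fractions of the work, here is a rough sketch of that idea in base R only. The names pairwise_spearman, pairs and chunks are hypothetical and not from the answer; the four-way split mirrors the "four futures" mentioned in the question.

```r
# Hypothetical helper: one result row per pair of column names.
pairwise_spearman <- function(df, pairs) {
  keep <- c("estimate", "statistic", "p.value", "method", "alternative")
  rows <- lapply(pairs, function(x) {
    # suppressWarnings() hides the "exact p-value with ties" warning
    ct <- suppressWarnings(cor.test(df[[x[1]]], df[[x[2]]], method = "spearman"))
    data.frame(ct[keep], Var1 = x[1], Var2 = x[2])
  })
  res <- do.call(rbind, rows)
  rownames(res) <- NULL
  res
}

# All unique pairs of column names, as a list of length-2 character vectors.
pairs <- combn(names(mtcars), 2, simplify = FALSE)

# Split the pairs into four contiguous chunks.
chunk_id <- cut(seq_along(pairs), 4, labels = FALSE)
chunks <- split(pairs, chunk_id)

# Process each chunk, then stack the per-chunk results.
res <- do.call(rbind, lapply(chunks, pairwise_spearman, df = mtcars))
rownames(res) <- NULL
head(res)
```

The outer lapply over chunks is the part that could be handed to a parallel backend, for example parallel::mclapply, parallel::parLapply, or future.apply::future_lapply, while the helper itself stays unchanged.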
Answer 2
Score: 1
First, you can easily use your existing foreach() construct for parallelizing via the future framework. Just add doFuture::registerDoFuture() and pick your parallel backend with plan(), e.g.
library(foreach)
library(future) ## provides plan()
doFuture::registerDoFuture() ## make %dopar% use futureverse
plan(multisession) ## parallel background workers
combo_cars <- data.frame(t(combn(names(mtcars),2)))
cars_res <- foreach(i=1:nrow(combo_cars), .combine=rbind, .packages=c("magrittr", "dplyr")) %dopar% {
broom::tidy(cor.test(mtcars[, combo_cars[i,1]],
mtcars[,combo_cars[i,2]],
method = "spearman")) %>%
mutate(Var1=combo_cars[i,1], Var2=combo_cars[i,2])
}
Second, analogously to jay.sf's solutions, you can use future.apply as:
library(future.apply)
plan(multisession)
ncomb <- as.data.frame(combn(names(mtcars), 2))
res <- future_lapply(ncomb[, c(1:2, 11:12)], \(x) {
data.frame(cor.test(mtcars[, x[1]], mtcars[, x[2]], method="spearman")[c(4, 1, 3, 7, 6)], t(x))
}) |> do.call(what=rbind) |> `rownames<-`(NULL)
plan(multisession) uses a PSOCK cluster from the parallel package, similar to parallel::makeCluster(). If you switch to plan(multicore), the parallelization will be done via forked processing, using the same framework as parallel::mclapply().
The advantage of using futureverse compared to parallel::mclapply() and parallel::parLapply() is that you get better error handling, and messages and warnings are relayed. For example, if you run the above, you'll get:
Warning messages:
1: In cor.test.default(mtcars[, x[1]], mtcars[, x[2]], method = "spearman") :
Cannot compute exact p-value with ties
2: In cor.test.default(mtcars[, x[1]], mtcars[, x[2]], method = "spearman") :
Cannot compute exact p-value with ties
3: In cor.test.default(mtcars[, x[1]], mtcars[, x[2]], method = "spearman") :
Cannot compute exact p-value with ties
4: In cor.test.default(mtcars[, x[1]], mtcars[, x[2]], method = "spearman") :
Cannot compute exact p-value with ties
Note that those warnings are completely muffled by other parallelization frameworks in R.
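The warning itself arises because cor.test tries to compute an exact p-value by default and falls back to an approximation when the data contain ties. A small sketch, assuming the approximate p-value is what you want anyway: passing exact = FALSE (a documented cor.test argument) requests the approximation directly, so no warning is signalled. The helper count_warnings is hypothetical and only there to make the difference visible.

```r
# Hypothetical helper: evaluate an expression and count the warnings it signals.
count_warnings <- function(expr) {
  n <- 0L
  withCallingHandlers(expr, warning = function(w) {
    n <<- n + 1L
    invokeRestart("muffleWarning")
  })
  n
}

# Default call: exact p-value requested, ties present.
n_default <- count_warnings(
  cor.test(mtcars$mpg, mtcars$cyl, method = "spearman")
)

# Approximate p-value requested explicitly.
n_approx <- count_warnings(
  cor.test(mtcars$mpg, mtcars$cyl, method = "spearman", exact = FALSE)
)

c(default = n_default, approx = n_approx)
```

Whether to silence the warning this way or keep it as a reminder about ties is a judgment call; the relayed warnings are still useful for spotting which pairs are affected.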