March 3, 2023, 22:33:36

How to convert foreach into a function?

Question

I am using foreach to calculate correlation coefficients and p-values, using mtcars as an example (foreach is overkill here, but the data frame I'm actually using has 450 observations of 3,400 variables). I use combn to avoid duplicate correlations and self-correlations.

combo_cars <- data.frame(t(combn(names(mtcars),2)))

library(foreach)
cars_res <-  foreach(i=1:nrow(combo_cars), .combine=rbind, .packages=c("magrittr", "dplyr"))     %dopar% {
  out2 <-  broom::tidy(cor.test(mtcars[, combo_cars[i,1]],
                                mtcars[,combo_cars[i,2]],
                                method = "spearman")) %>%
    mutate(Var1=combo_cars[i,1], Var2=combo_cars[i,2])
}

I would like to convert this into a function, as I would like to try using the future package because I need to run correlations on subsections of the original dataframe and it's more efficient than running in parallel. When trying to devise a function that replicates the above, I can use:

car_res2 <- data.frame(t(combn(names(mtcars), 2, function(x)  
  cor.test(mtcars[[x[1]]],
           mtcars[[x[2]]], method="spearman"), simplify=TRUE)))

Ultimately, I would like to be able to have four futures running in parallel, each computing the above on a different fraction of the dataset.

However, the car_res2 output has 8 columns instead of 7 (the second one is completely empty). I had to use the output from cars_res to work out what the values were: they are in the order statistic, blank, p-value, estimate, etc., whereas cars_res has labeled columns in the order estimate, statistic, p-value.

Why is the output of the second approach in a different order and unlabeled?
Can I use one of the apply functions in place of the above function?

Any comments would be appreciated.

Answer 1

Score: 1

Without parallelization, you can try RcppAlgos::comboGeneral first. It works very similarly to combn but is implemented in C++ and may therefore be faster (it also has a Parallel= option, but that is ignored when FUN is used). Also, I don't load broom or dplyr.

res <- RcppAlgos::comboGeneral(names(mtcars), 2, FUN=\(x) {
  data.frame(cor.test(mtcars[, x[1]], mtcars[, x[2]], method="spearman")[c(4, 1, 3, 7, 6)], t(x))
}, Parallel=TRUE, nThreads=7) |> do.call(what=rbind) |> `rownames<-`(NULL)

head(res)
#     estimate statistic      p.value                          method alternative  X1   X2
# 1 -0.9108013 10425.332 4.690287e-13 Spearman's rank correlation rho   two.sided mpg  cyl
# 2 -0.9088824 10414.862 6.370336e-13 Spearman's rank correlation rho   two.sided mpg disp
# 3 -0.8946646 10337.290 5.085969e-12 Spearman's rank correlation rho   two.sided mpg   hp
# 4  0.6514555  1901.659 5.381347e-05 Spearman's rank correlation rho   two.sided mpg drat
# 5 -0.8864220 10292.319 1.487595e-11 Spearman's rank correlation rho   two.sided mpg   wt
# 6  0.4669358  2908.399 7.055765e-03 Spearman's rank correlation rho   two.sided mpg qsec
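
The index vector c(4, 1, 3, 7, 6) used above selects elements of the list returned by cor.test() by position. The following sketch inspects that list for a single pair of columns; it is only an illustration (the names ct, picked and by_name are made up here), and it probably also explains the unlabeled eight-column result in the question, because the second element, parameter, is NULL for Spearman's test.

```r
# One Spearman test; ties in mtcars trigger a warning about exact p-values,
# which is silenced here only to keep the illustration quiet.
ct <- suppressWarnings(cor.test(mtcars$mpg, mtcars$cyl, method = "spearman"))

names(ct)              # element names, in the order cor.test() stores them
length(ct)             # number of list elements
is.null(ct$parameter)  # the second element carries no value for Spearman

# Selecting by position, as in the answer ...
picked <- ct[c(4, 1, 3, 7, 6)]
names(picked)

# ... or by name, which does not depend on the element order.
by_name <- ct[c("estimate", "statistic", "p.value", "method", "alternative")]
identical(picked, by_name)
```

Selecting by name is a little longer to type but keeps working if the element order ever differs.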

Alternatively, if you're on Linux (or Mac, but not tested), you could use parallel::mclapply, which works like lapply but with multiple cores, and use combn beforehand. This gives you the freedom to choose an arbitrary subset of combinations.

ncomb <- as.data.frame(combn(names(mtcars), 2))

parallel::mclapply(ncomb[, c(1:2, 11:12)], \(x) {
  data.frame(cor.test(mtcars[, x[1]], mtcars[, x[2]], method="spearman")[c(4, 1, 3, 7, 6)], t(x))
}, mc.cores=7) |> do.call(what=rbind) |> `rownames<-`(NULL)
#     estimate  statistic      p.value                          method alternative  X1   X2
# 1 -0.9108013 10425.3320 4.690287e-13 Spearman's rank correlation rho   two.sided mpg  cyl
# 2 -0.9088824 10414.8622 6.370336e-13 Spearman's rank correlation rho   two.sided mpg disp
# 3  0.9276516   394.7330 2.275443e-14 Spearman's rank correlation rho   two.sided cyl disp
# 4  0.9017909   535.8287 1.867686e-12 Spearman's rank correlation rho   two.sided cyl   hp
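
Why this works with lapply-style functions: a data frame is a list of columns, so iterating over ncomb hands the function one variable pair at a time. A small base-R sketch, assuming R 4.0 or later (where as.data.frame keeps strings as character vectors); the labels object is purely illustrative:

```r
# Each column of ncomb is one pair of variable names.
ncomb <- as.data.frame(combn(names(mtcars), 2))

dim(ncomb)   # 2 rows (the two members of a pair), one column per pair
ncomb[[1]]   # the first pair, as a character vector of length 2

# Any apply-style function can walk over the columns; here plain lapply
# builds a label for the first three pairs instead of running a test.
labels <- lapply(ncomb[, 1:3], function(x) paste(x, collapse = " ~ "))
labels[[1]]
```

Subsetting the columns of ncomb, as in ncomb[, c(1:2, 11:12)], is therefore just a way of choosing which pairs to process.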

On Windows you can use parallel::parLapply.

library(parallel)

CL <- makeCluster(detectCores() - 1)
clusterExport(CL, c('ncomb', 'mtcars'))  ## `mtcars` stands in for your data

parLapply(CL, ncomb[, c(1:2, 11:12)], \(x) {
  data.frame(cor.test(mtcars[, x[1]], mtcars[, x[2]], method="spearman")[c(4, 1, 3, 7, 6)], t(x))
}) |> do.call(what=rbind) |> `rownames<-`(NULL)

stopCluster(CL)

See this answer for more details on the use of parLapply vs mclapply.

Answer 2

Score: 1

First, you can easily use your existing foreach() construct for parallelizing via the future framework. Just add doFuture::registerDoFuture() and pick your parallel backend with plan(), e.g.

library(foreach)
doFuture::registerDoFuture() ## make %dopar% use futureverse
future::plan(future::multisession)  ## parallel background workers; future:: because the package is not attached here

combo_cars <- data.frame(t(combn(names(mtcars),2)))

cars_res <- foreach(i=1:nrow(combo_cars), .combine=rbind, .packages=c("magrittr", "dplyr")) %dopar% {
  broom::tidy(cor.test(mtcars[, combo_cars[i,1]],
                       mtcars[,combo_cars[i,2]],
                       method = "spearman")) %>%
  mutate(Var1=combo_cars[i,1], Var2=combo_cars[i,2])
}

Second, analogously to jay.sf's solutions, you can use future.apply as:

library(future.apply)
plan(multisession)

ncomb <- as.data.frame(combn(names(mtcars), 2))

res <- future_lapply(ncomb[, c(1:2, 11:12)], \(x) {
  data.frame(cor.test(mtcars[, x[1]], mtcars[, x[2]], method="spearman")[c(4, 1, 3, 7, 6)], t(x))
}) |> do.call(what=rbind) |> `rownames<-`(NULL)

plan(multisession) uses a PSOCK cluster from the parallel package, similarly to parallel::makeCluster(). If you switch to plan(multicore), the parallelization will be done via forked processing, using the same framework as parallel::mclapply().
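
To connect this with the goal stated in the question (a function applied to several subsections of the data, one future each), here is a rough sketch. The helper pairwise_spearman(), the blocks list and the column groupings are all made up for illustration; the code falls back to plain lapply when future.apply is not installed, and it runs sequentially unless a parallel plan has been set.

```r
# Hypothetical helper: all pairwise Spearman tests for the columns of `df`,
# returned as one labeled data frame.
pairwise_spearman <- function(df) {
  pairs <- combn(names(df), 2, simplify = FALSE)
  rows <- lapply(pairs, function(p) {
    ct <- cor.test(df[[p[1]]], df[[p[2]]], method = "spearman")
    data.frame(Var1 = p[1], Var2 = p[2],
               estimate = unname(ct$estimate),
               statistic = unname(ct$statistic),
               p.value = ct$p.value)
  })
  do.call(rbind, rows)
}

# Illustrative "subsections": four arbitrary column blocks of mtcars.
blocks <- list(a = mtcars[, 1:4],
               b = mtcars[, 5:8],
               c = mtcars[, 9:11],
               d = mtcars[, c(1, 6, 11)])

# One call per block. With future.apply installed and, for example,
# future::plan(future::multisession, workers = 4) set beforehand,
# each block is handled by its own background worker.
apply_fun <- if (requireNamespace("future.apply", quietly = TRUE)) {
  future.apply::future_lapply
} else {
  lapply
}
res_by_block <- apply_fun(blocks, pairwise_spearman)
```

Parallelizing over blocks rather than over single pairs keeps the number of futures small, which may matter when each individual test is cheap.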

The advantage of using futureverse compared to parallel::mclapply() and parallel::parLapply() is that you get better error handling, and messages and warnings are relayed. For example, if you run the above, you'll get:

Warning messages:
1: In cor.test.default(mtcars[, x[1]], mtcars[, x[2]], method = "spearman") :
  Cannot compute exact p-value with ties
2: In cor.test.default(mtcars[, x[1]], mtcars[, x[2]], method = "spearman") :
  Cannot compute exact p-value with ties
3: In cor.test.default(mtcars[, x[1]], mtcars[, x[2]], method = "spearman") :
  Cannot compute exact p-value with ties
4: In cor.test.default(mtcars[, x[1]], mtcars[, x[2]], method = "spearman") :
  Cannot compute exact p-value with ties

Note that those warnings are completely muffled by other parallelization frameworks in R.
