May 14, 2023, 01:08:40

Parallel computing for mediation analyses - foreach and dopar Error not finding assigned object within loop

Question

The issue you're facing is due to the scoping rules in R when using parallel processing. To solve this problem, you can modify your code inside the loop to explicitly pass the necessary variables as arguments to the parallelized function. Here's the modified code snippet:

# Define the worker function before the loop runs, so it exists when the loop calls it
calculate_mediation_results <- function(independent, mediator, dependent, df1, df2, df3) {
  # Your code for model fitting and mediation analysis here;
  # it must compute ab, ac, bc and p.value before they are returned below

  # Return the results as a one-row data frame
  data.frame(independent = independent,
             mediator = mediator,
             dependent = dependent,
             ab = ab,
             ac = ac,
             bc = bc,
             p.value = p.value,
             stringsAsFactors = FALSE)
}

results <- foreach(i = 1:nrow(combinations), .combine = rbind) %dopar% {
  independent <- as.character(combinations[i, 1])
  mediator <- as.character(combinations[i, 2])
  dependent <- as.character(combinations[i, 3])

  # Pass the necessary variables as arguments to the parallelized function
  calculate_mediation_results(independent, mediator, dependent, df1, df2, df3)
}

In this modified approach, the calculate_mediation_results function takes all the necessary variables as arguments, ensuring they are accessible within the worker function. The results are then returned and combined in the main loop.

Please note that you need to adjust the calculate_mediation_results function to include the actual code for model fitting and mediation analysis using the provided variables (independent, mediator, dependent, df1, df2, df3).

Original question:

I am trying to iterate mediation analyses over a large number of variable combinations (around 700,000). So far I have written code that produces exactly the output I need, and I have tested it on a subset of the combinations. The code looks (except for the dummy data frames) as follows:

library(mediation)
library(broom)

# Generate random data frames
df1 <- data.frame(x = rnorm(10), y = rnorm(10))
df2 <- data.frame(a = rnorm(10), b = rnorm(10))
df3 <- data.frame(q = rnorm(10), r = rnorm(10))

# Specify column names to generate the desired combinations
cols1 <- names(df1)
cols2 <- names(df2)
cols3 <- names(df3)

# Generate the combinations
combinations <- expand.grid(cols1, cols2, cols3)

# Initialize a data frame to store the results
results <- data.frame(independent = character(),
                      mediator = character(),
                      dependent = character(),
                      ab = numeric(),
                      ac = numeric(),
                      bc = numeric(),
                      p.value = numeric(),
                      stringsAsFactors = FALSE)

# Loop through each combination and perform a nonparametric causal mediation analysis
for (i in 1:nrow(combinations)) {
  independent <- as.character(combinations[i, 1])
  mediator <- as.character(combinations[i, 2])
  dependent <- as.character(combinations[i, 3])

  # Combine the independent, mediator and dependent variables into a single data frame
  m.data <- data.frame(df1[independent], df2[mediator], df3[dependent])

  independent <- names(m.data)[1]
  mediator <- names(m.data)[2]
  dependent <- names(m.data)[3]

  # Build the model formulas as strings and fit the two models
  model1 <- paste(mediator, dependent, sep = " ~ ")
  model2 <- paste(dependent, "~", independent, "+", mediator, sep = ' ')

  model.M <- lm(model1, data = m.data)
  model.Y <- lm(model2, data = m.data)

  fit <- mediate(model.M, model.Y, treat = dependent, mediator = mediator,
                 boot = TRUE, sims = 500)

  tidy_fit <- tidy(fit)
  # Extract the estimates of the total effect (ac) = fit$d0, the direct effect (ab),
  # and the indirect effect (bc) and their p-values
  ab <- tidy_fit$estimate[3]
  ac <- tidy_fit$estimate[1]
  bc <- tidy_fit$estimate[2]
  p.value <- tidy_fit$p.value[1]

  # Add the results to the data frame
  results <- rbind(results, data.frame(independent = independent,
                                       mediator = mediator,
                                       dependent = dependent,
                                       ab = ab,
                                       ac = ac,
                                       bc = bc,
                                       p.value = p.value,
                                       stringsAsFactors = FALSE))
}

print(results)

The code is basically optimal. The issue is that given the number of combinations, this runs forever. So I am trying to parallelize the process but have basically zero knowledge on this and my attempts thus far have failed. I have tried the following:

library(doParallel)
library(foreach)
cores <- 4

# Initialize a cluster
cl <- makeCluster(cores)

# Register the cluster
registerDoParallel(cl)
results <- foreach(i = 1:nrow(combinations), .combine = rbind) %dopar% {
  independent <- as.character(combinations[i, 1])
  mediator <- as.character(combinations[i, 2])
  dependent <- as.character(combinations[i, 3])

  independent_col <- df1[, independent]
  mediator_col <- df2[, mediator]
  dependent_col <- df3[, dependent]

  # Combine the independent, mediator and dependent variables into a single data frame
  m.data <- data.frame(df1[independent], df2[mediator], df3[dependent])
  # colnames(m.data) <- c("independent", "mediator", "dependent")

  independent <- names(m.data)[1]
  mediator <- names(m.data)[2]
  dependent <- names(m.data)[3]

  model1 <- paste(mediator, dependent, sep = " ~ ")
  model2 <- paste(dependent, "~", independent, "+", mediator, sep = ' ')

  model.M <- lm(model1, data = m.data)
  model.Y <- lm(model2, data = m.data)

  fit <- mediation::mediate(model.M, model.Y, treat = dependent, mediator = mediator,
                            boot = TRUE, sims = 500)

  tidy_fit <- tidy(fit)
  # Extract the estimates of the total effect (ac) = fit$d0, the direct effect (ab),
  # and the indirect effect (bc) and their p-values
  ab <- tidy_fit$estimate[3] #ADE
  ac <- tidy_fit$estimate[1] #ACME
  bc <- tidy_fit$estimate[2] #TOTAL EFFECT
  p.value <- tidy_fit$p.value[1]
  
  # Add the results to the data frame
  data.frame(independent = independent,
             mediator = mediator,
             dependent = dependent,
             ab = ab,
             ac = ac,
             bc = bc,
             p.value = p.value,
             stringsAsFactors = FALSE)
}

stopCluster(cl)

This throws an error: Error in { : task 1 failed - "object 'model1' not found". If I define i = 1 and run the loop body as a single iteration, the code works fine, so it seems to me that the variable model1 is not accessible within the worker function. Does anyone have an idea how to solve this? Any other suggestion for parallelization (bbapply or splitting into chunks...) is welcome (I am equally unfamiliar with these options :D). Thanks!

Technical info: macOS Big Sur, R version 4.1.1

Answer 1

Score: 1

OK, so I tried something else and this seems to work. Importantly, the small addition from Ben (fitting the two models via do.call) makes all the difference.

#after initialising the results dataframe as per the above 
# set the number of cores to use
cores <- 20
# Initialize a cluster
cl <- makeCluster(cores)
# Register the cluster
registerDoParallel(cl)
# define a custom function to perform mediation analysis on each combination of variables
mediation_analysis <- function(i) {
  independent <- as.character(combinations[i, 1])
  mediator <- as.character(combinations[i, 2])
  dependent <- as.character(combinations[i, 3])
  independent_col <- df1[, independent]
  mediator_col <- df2[, mediator]
  dependent_col <- df3[, dependent]
  # Combine the independent, mediator and dependent variables into a single data frame
  m.data <- data.frame(df1[independent], df2[mediator], df3[dependent])
  independent <- names(m.data)[1]
  mediator <- names(m.data)[2]
  dependent <- names(m.data)[3]
  model1 <- paste(mediator, dependent, sep = " ~ ")
  model2 <- paste(dependent, "~", independent, "+", mediator, sep = ' ')
  # do.call stores the formula string itself in the model call, not the name model1
  model.M <- do.call(lm, list(model1, data = m.data))
  model.Y <- do.call(lm, list(model2, data = m.data))
  fit <- mediation::mediate(model.M, model.Y, treat = dependent, mediator = mediator,
                            boot = TRUE, sims = 500)
  tidy_fit <- broom::tidy(fit)
  # Extract the estimates of the total effect (ac) = fit$d0, the direct effect (ab),
  # and the indirect effect (bc) and their p-values
  ab <- tidy_fit$estimate[3] #ADE
  ac <- tidy_fit$estimate[1] #ACME
  bc <- tidy_fit$estimate[2] #TOTAL EFFECT
  p.value <- tidy_fit$p.value[1] #p-value
  # Return the results as a one-row data frame
  data.frame(independent = independent,
             mediator = mediator,
             dependent = dependent,
             ab = ab,
             ac = ac,
             bc = bc,
             p.value = p.value,
             stringsAsFactors = FALSE)
}
# perform mediation analysis on each combination of variables using foreach
results <- foreach(i = 1:nrow(combinations), .combine = rbind) %dopar% {
  mediation_analysis(i)
}
# stop the cluster
stopCluster(cl)
# view the results
print(results)

I ran this in an interactive session on a compute server and the elapsed time is 7.839 for the same example given above. So it's a decent improvement. Thanks for any other helpful input.
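The question also asked about splitting the work into chunks. A minimal sketch of that idea, using the base parallel package rather than foreach, might look like the following. Everything in it is an assumption made for illustration: fit_one is a hypothetical stand-in that fits a single lm instead of running mediate, the toy data and the two workers are arbitrary, and the names combos, fit_chunk and results_chunked do not come from the code above.

```r
library(parallel)

set.seed(42)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), y1 = rnorm(50), y2 = rnorm(50))
combos <- expand.grid(x = c("x1", "x2"), y = c("y1", "y2"),
                      stringsAsFactors = FALSE)

# Hypothetical stand-in for the per-combination analysis: one lm slope
fit_one <- function(i) {
  f <- paste(combos$y[i], "~", combos$x[i])
  m <- do.call(lm, list(f, data = df))
  data.frame(x = combos$x[i], y = combos$y[i],
             slope = unname(coef(m)[2]), stringsAsFactors = FALSE)
}

# Each worker handles a whole chunk of row indices and returns one data frame
fit_chunk <- function(idx) do.call(rbind, lapply(idx, fit_one))

cl <- makeCluster(2)  # arbitrary number of workers for the sketch
# Copy the objects the workers need into each worker session
clusterExport(cl, c("df", "combos", "fit_one"), envir = environment())
chunks <- splitIndices(nrow(combos), length(cl))  # one chunk of indices per worker
chunk_results <- parLapply(cl, chunks, fit_chunk)
stopCluster(cl)

results_chunked <- do.call(rbind, chunk_results)
print(results_chunked)
```

Sending one chunk per worker keeps the communication overhead low, which may matter more as the number of combinations grows; whether it beats the foreach version here is something to measure rather than assume.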

Answer 2

Score: 0

I messed around with this for a while, but it's hard, due to something about the way that mediate works with environments. You can get around the first set of errors by replacing your model fits in the loop payload with:

model.M <- do.call(lm, list(model1, data = m.data))
model.Y <- do.call(lm, list(model2, data = m.data))
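Why this replacement helps can be illustrated without mediation at all. The sketch below rests on an assumption I have not verified against the mediate source, namely that the failure comes from a model being re-fitted from its stored call. lm() records its formula argument unevaluated, so a plain lm(model1, ...) stores the name model1, whereas do.call() evaluates the arguments first and stores the formula string itself. The names fit_plain, fit_docall and model_string are made up for the illustration.

```r
set.seed(1)
d <- data.frame(x = rnorm(20), y = rnorm(20))

# Fit inside functions so the formula string lives in a local variable,
# much as model1 does inside the foreach body
fit_plain <- function(dat) {
  model_string <- "y ~ x"
  lm(model_string, data = dat)
}
fit_docall <- function(dat) {
  model_string <- "y ~ x"
  do.call(lm, list(model_string, data = dat))
}

m_plain  <- fit_plain(d)
m_docall <- fit_docall(d)

# The plain fit stores the *name* of the variable; the do.call fit stores its value
plain_formula  <- m_plain$call$formula
docall_formula <- m_docall$call$formula
print(plain_formula)
print(docall_formula)

# Re-evaluating the stored call outside the function, as update() does,
# only works when the call does not depend on the local variable
refit_plain  <- tryCatch(update(m_plain, data = d),
                         error = function(e) conditionMessage(e))
refit_docall <- tryCatch(update(m_docall, data = d),
                         error = function(e) conditionMessage(e))
print(refit_plain)
```

Here refit_plain ends up holding an error message about the missing local variable, while refit_docall is an ordinary lm fit.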

Setting .errorhandling = "pass" gets you a little bit more information about what's going wrong, but not much: it returns

> <simpleError in if (xhat == 0) out <- 1 else { out <- 2 * min(sum(x > 0), sum(x < 0))/length(x)}: missing value where TRUE/FALSE needed>

Searching for "xhat" in the mediation package source code tells us that this is happening in mediation:::pval, but I can't get much farther than that (it doesn't really help that the mediate.R file is 2000 lines of R code ...)

A possible workaround (??): your original, non-parallelized R code is doing something that is known to be very slow, i.e. growing a data frame one row at a time (see chapter 2 of https://www.burns-stat.com/pages/Tutor/R_inferno.pdf). If your loop is instead structured as:

res_list <- list()
for (i in ...) {
   ...
   res_list[[i]] <- data.frame(...)
}
results <- do.call("rbind", res_list)

you may find that your code goes much faster. I would try running your loop on (say) the first 100, 200, 400 ... rows of your 'combinations' data frame and see how it scales ...
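A small timing sketch of the two loop structures, under toy assumptions, is given here. The per-iteration work is just a one-row data frame with a square root rather than a mediation fit, and grow_rbind and collect_list are hypothetical helper names. It only shows how one might check the scaling on 100, 200 and 400 rows; the actual timings depend on the machine and are not claimed here.

```r
# Slow pattern: rbind onto a growing data frame in every iteration
grow_rbind <- function(n) {
  out <- data.frame(i = integer(), value = numeric())
  for (i in seq_len(n)) {
    out <- rbind(out, data.frame(i = i, value = sqrt(i)))
  }
  out
}

# Alternative pattern: collect rows in a preallocated list, bind once at the end
collect_list <- function(n) {
  res_list <- vector("list", n)
  for (i in seq_len(n)) {
    res_list[[i]] <- data.frame(i = i, value = sqrt(i))
  }
  do.call("rbind", res_list)
}

# Time both versions on increasing sizes to see how each one scales
for (size in c(100, 200, 400)) {
  t_grow <- system.time(a <- grow_rbind(size))[["elapsed"]]
  t_list <- system.time(b <- collect_list(size))[["elapsed"]]
  cat(size, "rows: grow =", t_grow, "s, list =", t_list, "s\n")
}
```

If the elapsed time of the real loop roughly doubles when the number of rows doubles, the per-iteration model fitting dominates and the way results are collected is unlikely to be the main bottleneck.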

(Unasked-for statistical comment: although I admit I don't know any of the details, I have to admit that I am suspicious of any analysis that looks at possible effects of mediation over 700,000 predictor combinations ...)
