Parallel computing for mediation analyses - foreach and %dopar% error: assigned object not found within loop
Question
I am trying to iterate mediation analyses over a large number of variable combinations (around 700,000). So far I have written code that produces the exact output I need, and I have tested it on a subset of variable combinations. The code looks (except for the dummy data frames) as follows:
library(mediation)
library(broom)
# Generate random data frames
df1 <- data.frame(x = rnorm(10), y = rnorm(10))
df2 <- data.frame(a = rnorm(10), b = rnorm(10))
df3 <- data.frame(q = rnorm(10), r = rnorm(10))
# Specify column names to generate the desired combinations
cols1 <- names(df1)
cols2 <- names(df2)
cols3 <- names(df3)
# Generate the combinations
combinations <- expand.grid(cols1, cols2, cols3)
# Initialize a data frame to store the results
results <- data.frame(independent = character(),
                      mediator = character(),
                      dependent = character(),
                      ab = numeric(),
                      ac = numeric(),
                      bc = numeric(),
                      p.value = numeric(),
                      stringsAsFactors = FALSE)
# Loop through each combination and perform a nonparametric causal mediation analysis
for (i in 1:nrow(combinations)) {
  independent <- as.character(combinations[i, 1])
  mediator <- as.character(combinations[i, 2])
  dependent <- as.character(combinations[i, 3])
  # Combine the independent and mediator variables into a single data frame
  m.data <- data.frame(df1[independent], df2[mediator], df3[dependent])
  independent <- names(m.data)[1]
  mediator <- names(m.data)[2]
  dependent <- names(m.data)[3]
  # Fit a model for the independent variable and mediator
  model1 <- paste(mediator, dependent, sep = " ~ ")
  model2 <- paste(dependent, "~", independent, "+", mediator, sep = ' ')
  model.M <- lm(model1, data = m.data)
  model.Y <- lm(model2, data = m.data)
  fit <- mediate(model.M, model.Y, treat = dependent, mediator = mediator,
                 boot = TRUE, sims = 500)
  tidy_fit <- tidy(fit)
  # Extract the estimates of the total effect (ac) = fit$d0, the direct effect (ab),
  # and the indirect effect (bc) and their p-values
  ab <- tidy_fit$estimate[3]
  ac <- tidy_fit$estimate[1]
  bc <- tidy_fit$estimate[2]
  p.value <- tidy_fit$p.value[1]
  # Add the results to the data frame
  results <- rbind(results, data.frame(independent = independent,
                                       mediator = mediator,
                                       dependent = dependent,
                                       ab = ab,
                                       ac = ac,
                                       bc = bc,
                                       p.value = p.value,
                                       stringsAsFactors = FALSE))
}
print(results)
The code does exactly what I need. The issue is that, given the number of combinations, it runs forever. So I am trying to parallelize the process, but I have basically zero knowledge of this and my attempts thus far have failed. I have tried the following:
library(doParallel)
library(foreach)
cores <- 4
# Initialize a cluster
cl <- makeCluster(cores)
# Register the cluster
registerDoParallel(cl)
results <- foreach(i = 1:nrow(combinations), .combine = rbind) %dopar% {
  independent <- as.character(combinations[i, 1])
  mediator <- as.character(combinations[i, 2])
  dependent <- as.character(combinations[i, 3])
  independent_col <- df1[, independent]
  mediator_col <- df2[, mediator]
  dependent_col <- df3[, dependent]
  # Combine the independent and mediator variables into a single data frame
  m.data <- data.frame(df1[independent], df2[mediator], df3[dependent])
  # colnames(m.data) <- c("independent", "mediator", "dependent")
  independent <- names(m.data)[1]
  mediator <- names(m.data)[2]
  dependent <- names(m.data)[3]
  model1 <- paste(mediator, dependent, sep = " ~ ")
  model2 <- paste(dependent, "~", independent, "+", mediator, sep = ' ')
  model.M <- lm(model1, data = m.data)
  model.Y <- lm(model2, data = m.data)
  fit <- mediation::mediate(model.M, model.Y, treat = dependent, mediator = mediator,
                            boot = TRUE, sims = 500)
  tidy_fit <- tidy(fit)
  # Extract the estimates of the total effect (ac) = fit$d0, the direct effect (ab),
  # and the indirect effect (bc) and their p-values
  ab <- tidy_fit$estimate[3] # ADE
  ac <- tidy_fit$estimate[1] # ACME
  bc <- tidy_fit$estimate[2] # TOTAL EFFECT
  p.value <- tidy_fit$p.value[1]
  # Add the results to the data frame
  data.frame(independent = independent,
             mediator = mediator,
             dependent = dependent,
             ab = ab,
             ac = ac,
             bc = bc,
             p.value = p.value,
             stringsAsFactors = FALSE)
}
stopCluster(cl)
This throws an error: Error in { : task 1 failed - "object 'model1' not found". If I define i = 1 and run the code from inside the loop as a single iteration, it works fine, so it seems to me that the variable model1 is not accessible within the worker function. Does anyone have an idea how to go about solving this? Any other suggestion for parallelization (pbapply or splitting into chunks ...) is welcome (I am equally unfamiliar with these options :D). Thanks!
Technical info: macOS Big Sur, R version 4.1.1
Answer 1
Score: 1
OK, so I tried something else and this seems to work. Importantly, the small do.call() addition from Ben (Answer 2 below) makes all the difference.
# after initialising the results data frame as per the above
# set the number of cores to use
cores <- 20
# Initialize a cluster
cl <- makeCluster(cores)
# Register the cluster
registerDoParallel(cl)
# define a custom function to perform mediation analysis on each combination of variables
mediation_analysis <- function(i) {
  independent <- as.character(combinations[i, 1])
  mediator <- as.character(combinations[i, 2])
  dependent <- as.character(combinations[i, 3])
  independent_col <- df1[, independent]
  mediator_col <- df2[, mediator]
  dependent_col <- df3[, dependent]
  # Combine the independent and mediator variables into a single data frame
  m.data <- data.frame(df1[independent], df2[mediator], df3[dependent])
  independent <- names(m.data)[1]
  mediator <- names(m.data)[2]
  dependent <- names(m.data)[3]
  model1 <- paste(mediator, dependent, sep = " ~ ")
  model2 <- paste(dependent, "~", independent, "+", mediator, sep = ' ')
  model.M <- do.call(lm, list(model1, data = m.data))
  model.Y <- do.call(lm, list(model2, data = m.data))
  fit <- mediation::mediate(model.M, model.Y, treat = dependent, mediator = mediator,
                            boot = TRUE, sims = 500)
  tidy_fit <- broom::tidy(fit)
  # Extract the estimates of the total effect (ac) = fit$d0, the direct effect (ab),
  # and the indirect effect (bc) and their p-values
  ab <- tidy_fit$estimate[3] # ADE
  ac <- tidy_fit$estimate[1] # ACME
  bc <- tidy_fit$estimate[2] # TOTAL EFFECT
  p.value <- tidy_fit$p.value[1] # p-value
  # Return the results as a one-row data frame
  data.frame(independent = independent,
             mediator = mediator,
             dependent = dependent,
             ab = ab,
             ac = ac,
             bc = bc,
             p.value = p.value,
             stringsAsFactors = FALSE)
}
# perform mediation analysis on each combination of variables using foreach
results <- foreach(i = 1:nrow(combinations), .combine = rbind) %dopar% {
  mediation_analysis(i)
}
# stop the cluster
stopCluster(cl)
# view the results
print(results)
I ran this in an interactive session on a compute server and the elapsed time was 7.839 seconds for the same example given above, so it's a decent improvement. Thanks for any other helpful input.
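For completeness, here is a minimal sketch of how such a timing can be taken; the system.time() wrapper is my assumption, not part of the original run:

# hypothetical timing wrapper around the parallel loop
timing <- system.time({
  results <- foreach(i = 1:nrow(combinations), .combine = rbind) %dopar% {
    mediation_analysis(i)
  }
})
print(timing["elapsed"])  # wall-clock time in seconds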
Answer 2
Score: 0
I messed around with this for a while, but it's hard, due to something about the way that mediate works with environments. You can get around the first set of errors by replacing your model fits in the loop payload with:
model.M <- do.call(lm, list(model1, data=m.data))
model.Y <- do.call(lm, list(model2, data=m.data))
(Presumably this helps because do.call() records the evaluated formula in the call that lm() stores, so when mediate re-evaluates that call later it no longer needs to find model1 or model2 in scope.)
Setting .errorhandling = "pass"
gets you a little bit more information about what's going wrong, but not much: it returns
> <simpleError in if (xhat == 0) out <- 1 else { out <- 2 * min(sum(x > 0), sum(x < 0))/length(x)}: missing value where TRUE/FALSE needed>
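As a self-contained toy illustration of what .errorhandling = "pass" does (my example, not part of the original analysis):
library(doParallel)
registerDoParallel(2)
# with "pass", a failing task is returned as a condition object
# instead of aborting the remaining tasks
res <- foreach(i = 1:3, .errorhandling = "pass") %dopar% {
  if (i == 2) stop("boom") else i^2
}
res[[2]]  # <simpleError in ...: boom>
stopImplicitCluster()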
Searching for "xhat" in the mediation package source code tells us that this is happening in mediation:::pval
, but I can't get much farther than that (it doesn't really help that the mediate.R
file is 2000 lines of R code ...)
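If you want to poke at it yourself, one quick way to inspect the internal helper (assuming the mediation package is installed):
# print the body of the unexported helper where the error is raised
body(mediation:::pval)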
A possible workaround (??): your original, non-parallelized R code is doing something that is known to be very slow, i.e. growing a data frame one row at a time (see chapter 2 of https://www.burns-stat.com/pages/Tutor/R_inferno.pdf). If your loop is instead structured as:
res_list <- list()
for (i in ...) {
  ...
  res_list[[i]] <- data.frame(...)
}
results <- do.call("rbind", res_list)
you may find that your code goes much faster. I would try running your loop on (say) the first 100, 200, 400 ... rows of your 'combinations' data frame and see how it scales ...
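A minimal sketch of that scaling test, where run_loop() is a hypothetical stand-in for the loop above with the actual mediation call elided:

# hypothetical helper: the list-collecting loop over the first n combinations
run_loop <- function(n) {
  res_list <- list()
  for (i in seq_len(n)) {
    # ... fit the models and run mediate() for combination i, as in the question ...
    res_list[[i]] <- data.frame(row = i)  # placeholder for the real result row
  }
  do.call("rbind", res_list)
}
# time the loop on growing subsets to see how the cost scales
for (n in c(100, 200, 400)) {
  cat(n, "rows:", system.time(run_loop(n))["elapsed"], "seconds\n")
}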
(Unasked-for statistical comment: although I don't know any of the details, I have to admit that I am suspicious of any analysis that looks at possible effects of mediation over 700,000 predictor combinations ...)