Run nested function for multiple regressions on independent variables for each outcome variable and plot coefficients graphically

Question

I am trying to run a set of nested functions to fit multiple regressions and graph the results. I have a list of outcome variables and a list of independent variables. I want to run each independent variable as an interaction term (with factor(drat) in the example below) in its own regression, and do this for each outcome variable. My intention is to store the results as data frames, then create one ggplot graph per outcome variable, with each independent variable drawn as a different line.

My code right now looks something like the following (modelled using the mtcars dataset):

    library(ggplot2)
    library(tidyverse)
    outcome_var_list = mtcars %>% select(qsec,hp) %>% names()
    var_list = mtcars %>% select(cyl,wt) %>% names()
    iterate <- sapply(outcome_var_list,function(x){(one_df = sapply(var_list,function(k){
      mod <- lm(data=mtcars, paste(x," ~ ",k,":factor(drat)",sep = ""))
      mod_df <- as.data.frame(coef(mod))
    })
    #using rownames and coefficient as placeholder names for the actual column names
    mod_df %>% ggplot(aes(x = factor(mod_df), y= coefficient, color = variable)) + geom_line()
    })

The above does not run successfully. The problem I encounter is that the coefficients do not seem to be stored as data frames inside the sapply call; they come back as a matrix. Are nested functions perhaps the wrong tool here? Or is there something else I must change to make this work? I would like to store each line plot the function produces and title it with the corresponding outcome variable. Is this possible, or is there another approach to writing the function that I should consider?
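The matrix result described above is consistent with how sapply() behaves: it tries to simplify results of equal length into a vector or matrix, whereas lapply() always returns a list. A minimal sketch of the difference, using the same mtcars formulas; the names as_matrix and as_list are illustrative and not part of the original code:

```r
var_list <- c("cyl", "wt")

# Both models return coefficient vectors of the same length, so sapply()
# binds them into a matrix with one column per independent variable.
as_matrix <- sapply(var_list, function(k) {
  coef(lm(as.formula(paste0("qsec ~ ", k, ":factor(drat)")), data = mtcars))
})
is.matrix(as_matrix)

# lapply() never simplifies, so each element stays a data frame.
as_list <- lapply(setNames(var_list, var_list), function(k) {
  fit <- lm(as.formula(paste0("qsec ~ ", k, ":factor(drat)")), data = mtcars)
  data.frame(term = names(coef(fit)), estimate = unname(coef(fit)))
})
is.data.frame(as_list$cyl)
```

Calling sapply() with simplify = FALSE would also keep a list, but lapply() states the intent more directly.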

EDIT

Here is the code to generate a single plot for only 1 outcome variable and only 2 independent variables (manually)

    mod <- lm(data= mtcars, qsec ~ cyl:factor(drat))
    mod_df <- as.data.frame(coef(mod))
    mod_df <- tibble::rownames_to_column(mod_df)
    #rename rowname
    mod_df <- mod_df %>% mutate(rowname = str_replace(mod_df$rowname,pattern = "cyl:factor[(]drat[)]",replacement = ""))
    #rename
    names(mod_df) <- c("xvar","yvar")
    #remove intercept
    mod_df <- mod_df[2:nrow(mod_df),]
    #add label for independent variable
    mod_df$color = "cyl"
    #repeat for second independent variable
    mod2 <- lm(data= mtcars, qsec ~ wt:factor(drat))
    mod2_df <- as.data.frame(coef(mod2))
    mod2_df <- tibble::rownames_to_column(mod2_df)
    #rename rowname
    mod2_df <- mod2_df %>% mutate(rowname = str_replace(mod2_df$rowname,pattern = "wt:factor[(]drat[)]",replacement = ""))
    #rename
    names(mod2_df) <- c("xvar","yvar")
    #remove intercept
    mod2_df <- mod2_df[2:nrow(mod2_df),]
    #add label for independent variable
    mod2_df$color = "wt"
    combined_df <- rbind(mod_df,mod2_df)
    combined_df %>% ggplot(aes(x=as.numeric(xvar), y=yvar, color = color)) + geom_line() + ggtitle("qsec") + theme_light() + theme(plot.title = element_text(hjust = 0.5))

The resulting plot is the following:

[Image: the resulting line plot, titled "qsec", with one line per independent variable (cyl and wt)]

Is there a way to automate this over multiple outcome variables and independent variables for each one?

Answer 1

Score: 2

The code to obtain all the necessary coefficients will be:

    mtcars %>%
      select(all_of(c(outcome_var_list, var_list)), drat) %>%
      # stack the independent variables into name/value pairs
      pivot_longer(all_of(var_list)) %>%
      # one multi-response fit: a separate intercept and drat-specific slopes per variable
      reframe(broom::tidy(lm(as.matrix(pick(all_of(outcome_var_list))) ~
                               name/factor(drat):value + 0, data = .))) %>%
      select(response, term, estimate) %>%
      # parse the variable name and the drat level out of each term label
      mutate(color = str_extract(term, "(?<=name)[^:]+"),
             xvar = as.numeric(str_extract(term, "[0-9.]+"))) %>%
      # intercept rows have no drat level, so they drop out here
      drop_na()

Check the generated values against yours; they should match.

For example, if you only need the coefficients for qsec, you can filter the data frame on the response column.
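Once a long coefficient table of this shape exists, one plot per outcome can be produced by splitting on the response column. The following is a sketch under stated assumptions rather than part of the answer: it assumes columns named response, color, xvar and estimate as in the pipeline above, and, so that the block runs on its own, it rebuilds a comparable table with separate lm() fits. The names coef_table and plots are illustrative.

```r
library(ggplot2)

outcome_var_list <- c("qsec", "hp")
var_list <- c("cyl", "wt")

# One lm() per outcome/predictor pair; lapply() keeps every result a data frame.
coef_table <- do.call(rbind, lapply(outcome_var_list, function(y) {
  do.call(rbind, lapply(var_list, function(k) {
    fit <- lm(as.formula(paste0(y, " ~ ", k, ":factor(drat)")), data = mtcars)
    cf <- coef(fit)[-1]  # drop the intercept
    data.frame(
      response = y,
      color = k,
      # term labels look like "cyl:factor(drat)2.76"; keep the drat level
      xvar = as.numeric(sub(".*factor\\(drat\\)", "", names(cf))),
      estimate = unname(cf)
    )
  }))
}))

# One titled plot per outcome, stored in a list named by outcome.
plots <- lapply(split(coef_table, coef_table$response), function(d) {
  ggplot(d, aes(x = xvar, y = estimate, color = color)) +
    geom_line() +
    ggtitle(d$response[1]) +
    theme_light() +
    theme(plot.title = element_text(hjust = 0.5))
})
```

Printing plots$qsec or plots$hp draws the corresponding figure. Keeping the plots in a named list also makes it straightforward to save each one in a loop.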

Answer 2

Score: 2

Base R code:

    # the `_` placeholder of the native pipe requires R >= 4.2
    fm <- sprintf("cbind(%s)~time/factor(drat):%s + 0",
                  toString(outcome_var_list), var_list[1])
    cfs <- subset(mtcars, select = c(outcome_var_list, var_list, "drat")) |>
      reshape(varying = list(var_list), dir = 'long', times = var_list) |>
      lm(fm, data = _) |>
      coef()
    # strip non-digits from the term labels to recover the drat level;
    # intercept rows become NA and are removed by na.omit()
    final_df <- na.omit(data.frame(xvar = as.numeric(trimws(rownames(cfs),,"\\D")),
                   yvar = c(cfs), color = sub("time(.*?):.*", '\\1', rownames(cfs)),
                   response = colnames(cfs)[col(cfs)]))
    final_df

Output:

       xvar      yvar color response
    3  2.76 -1.286761   cyl     qsec
    4  2.76 -2.701420    wt     qsec
    5  2.93 -1.189608   cyl     qsec
    6  2.93 -1.900810    wt     qsec
    7  3.00 -1.209608   cyl     qsec
    8  3.00 -1.869331    wt     qsec
    9  3.07 -1.228775   cyl     qsec
    10 3.07 -2.664112    wt     qsec
    11 3.08 -1.319161   cyl     qsec
    12 3.08 -2.760143    wt     qsec
    13 3.15 -1.292108   cyl     qsec
    14 3.15 -3.141629    wt     qsec
    15 3.21 -1.457108   cyl     qsec
    16 3.21 -3.394749    wt     qsec
    17 3.23 -1.259608   cyl     qsec
    18 3.23 -1.971797    wt     qsec
    19 3.54 -1.612108   cyl     qsec

Compare the results to your combined_df.

Note that the method is generic: it should work for other choices of outcome_var_list and var_list.
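The suggested comparison can be sketched for a single outcome/predictor pair as follows. This is an illustration, not the answer's own code: the names long, stacked and separate are made up here, and v.names = "value" is set explicitly for readability, whereas the code above relies on reshape()'s default column naming.

```r
outcome_var_list <- c("qsec", "hp")
var_list <- c("cyl", "wt")

# Stack the data: one copy of the rows per independent variable, with the
# variable's name in `time` and its values in `value`.
long <- reshape(
  subset(mtcars, select = c(outcome_var_list, var_list, "drat")),
  varying = list(var_list), v.names = "value",
  timevar = "time", times = var_list, direction = "long"
)

# Single multi-response fit with the same structure as the formula above.
cfs <- coef(lm(cbind(qsec, hp) ~ time/factor(drat):value + 0, data = long))

# Separate fit for one pair, as in the question's EDIT.
separate <- coef(lm(qsec ~ cyl:factor(drat), data = mtcars))

# Rows of the stacked fit that belong to cyl: its intercept, then its slopes.
stacked <- cfs[grepl("^timecyl", rownames(cfs)), "qsec"]

all.equal(unname(stacked), unname(separate))
```

Because every term in the stacked model is interacted with the variable indicator, each block is estimated independently, which is why the two sets of coefficients are expected to agree up to numerical tolerance. Swapping in another outcome or variable name checks the other pairs.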

huangapple
  • Published on 2023-07-11 07:18:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76657847.html