Run nested function for multiple regressions on independent variables for each outcome variable and plot coefficients graphically

Question

I am trying to run a set of nested functions to perform multiple regressions and graph the results. For the regressions, I have a list of outcome variables and a list of independent variables. I want to run each independent variable as an interaction term in a separate regression, and do this for each outcome variable. My intention is to store the results as data frames, then create one ggplot graph per outcome variable, with each independent variable as a different predicted line.

My code right now looks something like the following (modelled using the mtcars dataset):

library(ggplot2)
library(tidyverse)

outcome_var_list = mtcars %>% select(qsec, hp) %>% names()
var_list = mtcars %>% select(cyl, wt) %>% names()

iterate <- sapply(outcome_var_list, function(x) {
  (one_df = sapply(var_list, function(k) {
    mod <- lm(data = mtcars, paste(x, " ~ ", k, ":factor(drat)", sep = ""))
    mod_df <- as.data.frame(coef(mod))
  }))

  # using rownames and coefficient as placeholder names for the actual column names
  mod_df %>% ggplot(aes(x = factor(mod_df), y = coefficient, color = variable)) + geom_line()
})

The above does not run successfully. The problem I encounter is that the coefficients do not seem to be stored as data frames inside the sapply calls; they end up as a matrix instead. Is this perhaps not the right case for nested functions? Or is there something else I must change to make this work? I would like to store each line plot the function produces and title it with the corresponding outcome variable. Is this possible, or is there another approach to writing the function that I should consider?
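A minimal sketch of one way to restructure the loop, assuming the goal is one stored plot per outcome variable: sapply() tries to simplify what it returns, which can strip the data frame structure, whereas lapply() always returns a list. The helper name coef_df and the list name plots are made up for this illustration and are not part of the original code.

```r
library(ggplot2)

outcome_var_list <- c("qsec", "hp")
var_list <- c("cyl", "wt")

# Hypothetical helper: fit outcome ~ predictor:factor(drat) and return the
# interaction coefficients as a data frame (intercept dropped).
coef_df <- function(predictor, outcome) {
  fml <- as.formula(paste0(outcome, " ~ ", predictor, ":factor(drat)"))
  cf <- coef(lm(fml, data = mtcars))
  cf <- cf[names(cf) != "(Intercept)"]
  data.frame(
    # the drat level is whatever follows "factor(drat)" in the coefficient name
    xvar = as.numeric(sub(".*factor\\(drat\\)", "", names(cf))),
    yvar = unname(cf),
    color = predictor
  )
}

# Outer loop over outcomes, inner loop over predictors; lapply() keeps lists.
plots <- lapply(setNames(outcome_var_list, outcome_var_list), function(outcome) {
  combined_df <- do.call(rbind, lapply(var_list, coef_df, outcome = outcome))
  ggplot(combined_df, aes(x = xvar, y = yvar, color = color)) +
    geom_line() +
    ggtitle(outcome) +
    theme_light() +
    theme(plot.title = element_text(hjust = 0.5))
})

plots$qsec  # each list element is a ggplot object titled with its outcome
```

Naming the list with setNames() means each plot can be retrieved by outcome, for example plots$hp, or written to disk in a further loop.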

EDIT

Here is the code to generate a single plot for only 1 outcome variable and only 2 independent variables (manually):

mod <- lm(data = mtcars, qsec ~ cyl:factor(drat))
mod_df <- as.data.frame(coef(mod))
mod_df <- tibble::rownames_to_column(mod_df)
# strip the term prefix from the row names, leaving the drat level
mod_df <- mod_df %>% mutate(rowname = str_replace(mod_df$rowname, pattern = "cyl:factor[(]drat[)]", replacement = ""))
# rename columns
names(mod_df) <- c("xvar", "yvar")
# remove intercept
mod_df <- mod_df[2:nrow(mod_df), ]
# add label for independent variable
mod_df$color = "cyl"

# repeat for second independent variable

mod2 <- lm(data = mtcars, qsec ~ wt:factor(drat))
mod2_df <- as.data.frame(coef(mod2))
mod2_df <- tibble::rownames_to_column(mod2_df)
# strip the term prefix from the row names, leaving the drat level
mod2_df <- mod2_df %>% mutate(rowname = str_replace(mod2_df$rowname, pattern = "wt:factor[(]drat[)]", replacement = ""))
# rename columns
names(mod2_df) <- c("xvar", "yvar")
# remove intercept
mod2_df <- mod2_df[2:nrow(mod2_df), ]
# add label for independent variable
mod2_df$color = "wt"

combined_df <- rbind(mod_df, mod2_df)
combined_df %>% ggplot(aes(x = as.numeric(xvar), y = yvar, color = color)) + geom_line() + ggtitle("qsec") + theme_light() + theme(plot.title = element_text(hjust = 0.5))

The resulting plot is the following:

[Figure: line plot titled "qsec", coefficient (yvar) against drat level (xvar), one colored line each for cyl and wt]

Is there a way to automate this over multiple outcome variables and independent variables for each one?

Answer 1

Score: 2

The code to obtain all the necessary coefficients will be:

mtcars %>%
  select(all_of(c(outcome_var_list, var_list)), drat) %>%
  pivot_longer(all_of(var_list)) %>%
  reframe(broom::tidy(lm(as.matrix(pick(all_of(outcome_var_list))) ~
               name/factor(drat):value + 0, data = .))) %>%
  select(response, term, estimate) %>%
  mutate(color = str_extract(term, "(?<=name)[^:]+"),
         xvar = as.numeric(str_extract(term, "[0-9.]+"))) %>%
  drop_na()

Check the generated values against your own; they should match.

For example, if you only need the coefficients for qsec, you can filter the data frame on the response column.
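One way to convince yourself that the values match: the formula name/factor(drat):value + 0 gives every stacked predictor its own intercept and its own set of per-drat slopes, so the stacked fit should reproduce the separate regressions from the question. The sketch below checks this for qsec and cyl only. The object names (long, stacked, separate, st_cyl) are made up for the illustration, and it fits a single outcome rather than the matrix response used above.

```r
library(tidyr)

var_list <- c("cyl", "wt")

# Stack the predictors: one row per car and predictor (columns name, value).
long <- pivot_longer(mtcars[c("qsec", var_list, "drat")], all_of(var_list))

# Stacked fit (single outcome) and the separate fit from the question.
stacked  <- lm(qsec ~ name / factor(drat):value + 0, data = long)
separate <- lm(qsec ~ cyl:factor(drat), data = mtcars)

# Keep the cyl block of the stacked fit: its intercept ("namecyl")
# followed by its per-drat slopes ("namecyl:factor(drat)...").
st <- coef(stacked)
st_cyl <- st[grepl("^namecyl(:|$)", names(st))]

# TRUE when the two fits agree up to numerical tolerance.
all.equal(unname(st_cyl), unname(coef(separate)))
```

The same comparison can be repeated for wt by swapping "namecyl" for "namewt" and refitting the separate model with wt:factor(drat).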

Answer 2

Score: 2

Base R code:

fm <- sprintf("cbind(%s)~time/factor(drat):%s + 0",
              toString(outcome_var_list), var_list[1])

cfs <- subset(mtcars, select = c(outcome_var_list, var_list, "drat")) |>
  reshape(varying = list(var_list), dir = 'long', times = var_list) |>
  lm(fm, data = _) |>
  coef()

final_df <- na.omit(data.frame(xvar = as.numeric(trimws(rownames(cfs),,"\\D")),
      yvar = c(cfs), color = sub("time(.*?):.*", '\\1', rownames(cfs)),
      response = colnames(cfs)[col(cfs)]))

  xvar       yvar color response
3  2.76  -1.286761   cyl     qsec
4  2.76  -2.701420    wt     qsec
5  2.93  -1.189608   cyl     qsec
6  2.93  -1.900810    wt     qsec
7  3.00  -1.209608   cyl     qsec
8  3.00  -1.869331    wt     qsec
9  3.07  -1.228775   cyl     qsec
10 3.07  -2.664112    wt     qsec
11 3.08  -1.319161   cyl     qsec
12 3.08  -2.760143    wt     qsec
13 3.15  -1.292108   cyl     qsec
14 3.15  -3.141629    wt     qsec
15 3.21  -1.457108   cyl     qsec
16 3.21  -3.394749    wt     qsec
17 3.23  -1.259608   cyl     qsec
18 3.23  -1.971797    wt     qsec
19 3.54  -1.612108   cyl     qsec

Compare the results to your combined_df.

Note that the method provided is generic.
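To get from a long table like final_df to one stored plot per outcome, a possible sketch is to split on the response column and build a ggplot for each piece. The stand-in data below copies the first six rows of the output printed above (qsec only) so that the block runs on its own; with the full final_df, the list would hold one plot per outcome variable. The list name plots is an assumption for the illustration.

```r
library(ggplot2)

# Stand-in for final_df: the first six rows of the output printed above.
# In practice, use the full final_df instead.
final_df <- data.frame(
  xvar = c(2.76, 2.76, 2.93, 2.93, 3.00, 3.00),
  yvar = c(-1.286761, -2.701420, -1.189608, -1.900810, -1.209608, -1.869331),
  color = rep(c("cyl", "wt"), times = 3),
  response = "qsec"
)

# One plot per outcome variable, stored in a list named by outcome.
plots <- lapply(split(final_df, final_df$response), function(d) {
  ggplot(d, aes(x = xvar, y = yvar, color = color)) +
    geom_line() +
    ggtitle(d$response[1]) +
    theme_light() +
    theme(plot.title = element_text(hjust = 0.5))
})

plots$qsec  # print the plot for one outcome
```

Because split() names its pieces after the values of response, each plot can be looked up by outcome name.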

huangapple
  • Posted on 2023-07-11 07:18:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76657847.html