March 4, 2023, 01:22:20

How to create formulas programmatically in R based on pattern of omitted/included variables

Question

I regularly have to create upwards of 10, sometimes more than 20, models for a single analysis. The underlying formulas for these models usually follow a clear pattern, but I often struggle to come up with a programmatic way to create them. When I need to add or change something in many models, not repeating myself for every model becomes all the more crucial.

Here is an example of what I mean.

Suppose we have this data:

library(dplyr)  # provides %>% and mutate()

mydata <- mtcars %>%
  mutate(standardized_mpg = scale(mpg) %>% as.numeric(),
         logged_mpg = log(mpg))

Now we want to create regression formulas for the three DVs mpg, standardized_mpg, and logged_mpg. For some arbitrary reason, we always want two IVs, where one of them is either gear or cyl and the other is either disp or hp. In other words, we want this:

mpg ~ gear + disp
mpg ~ cyl + disp
mpg ~ gear + hp
mpg ~ cyl + hp
logged_mpg ~ gear + disp
etc.
etc.
etc.
standardized_mpg ~ gear + disp
etc.
etc.
etc.

It's possible to achieve this with expand.grid():

specs_dv <- c("mpg", "standardized_mpg", "logged_mpg")
specs_iv1 <- c("gear", "cyl")
specs_iv2 <- c("disp", "hp")
f <- expand.grid(
  specs_dv = specs_dv,
  " ~ ",
  specs_iv1 = specs_iv1,
  " + ",
  specs_iv2 = specs_iv2
) %>%
  arrange(specs_dv)
# Collapse each data frame row to a character string, then convert it to a formula
# The result is a list in which every element is a formula
f <- apply(f, 1, function(x) {
  as.formula(paste0(as.character(x), collapse = ""))
})

From here, I can use for loops and functions to create models and tables. I will not add them here, as that digresses from the main point.
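
For readers who want to see that follow-up step, here is a minimal sketch of fitting one model per formula. It is not the asker's code: it rebuilds the same 12 formulas with `sprintf()` in base R, and the names `grid`, `formula_strings`, `formulas`, and `models` are chosen here for illustration only.

```r
# Build the 12 formulas as strings, then convert them to formula objects.
grid <- expand.grid(
  dv  = c("mpg", "standardized_mpg", "logged_mpg"),
  iv1 = c("gear", "cyl"),
  iv2 = c("disp", "hp"),
  stringsAsFactors = FALSE
)
formula_strings <- sprintf("%s ~ %s + %s", grid$dv, grid$iv1, grid$iv2)
formulas <- lapply(formula_strings, as.formula)
names(formulas) <- formula_strings

# Data with the two transformed outcomes (base R only, no dplyr needed).
mydata <- transform(
  mtcars,
  standardized_mpg = as.numeric(scale(mpg)),
  logged_mpg       = log(mpg)
)

# Fit one lm() per formula; the result is a named list of models.
models <- lapply(formulas, function(fm) lm(fm, data = mydata))

# Example: adjusted R-squared of every model, labelled by its formula.
sapply(models, function(m) summary(m)$adj.r.squared)
```

Keeping the models in a named list makes it easy to pass them on to table functions or to pick one by its formula string, e.g. `models[["mpg ~ gear + disp"]]`.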

Now, when the number of variables is fixed between models, the method above works nicely. However, more often, some models omit variables of other models. This is where I am stuck right now. Here is some sample data:

mydata <- ggplot2::economics %>%
  mutate(psavert_l1 = lag(psavert),
         standardized_psavert = scale(psavert) %>% as.numeric(),
         standardized_psavert_l1 = scale(psavert_l1) %>% as.numeric(),
         # named log_* (not logged_*) so the columns match the formulas below
         log_psavert = log(psavert),
         log_psavert_l1 = log(psavert_l1)
         )

And these are the formulas that I would like to generate:

uempmed ~ psavert + psavert_l1 + unemploy
uempmed ~ psavert + psavert_l1           
uempmed ~ psavert +              unemploy
uempmed ~           psavert_l1 + unemploy
uempmed ~ standardized_psavert + standardized_psavert_l1 + unemploy
uempmed ~ standardized_psavert + standardized_psavert_l1           
uempmed ~ standardized_psavert +                           unemploy
uempmed ~                        standardized_psavert_l1 + unemploy
uempmed ~ log_psavert + log_psavert_l1 + unemploy
uempmed ~ log_psavert + log_psavert_l1           
uempmed ~ log_psavert +                  unemploy
uempmed ~               log_psavert_l1 + unemploy

As you can see, this time I have two or three independent variables. The pattern here is to have one model with all three IVs and then three additional models, each omitting one of the IVs. Finally, I need this whole set for three versions (raw, standardized, and logged) of two of the three variables.

Adding or removing variables like this is a very normal thing when modelling. Therefore, I was wondering if there is a faster way to create these formulas, like there is when the number of variables is fixed. What is your approach to handling this problem, especially if there are more variables and more models to take care of? Writing everything by hand seems... just wrong.

Answer 1

Score: 2

You can use `combn` inside an `lapply` over the prefixes.

    lapply(c('', 'standardized_', 'log_'), \(z) {
      combn(c('psavert', 'psavert_l1', 'unemploy'), 2, FUN=\(x)
            reformulate(paste0(z, x), 'uempmed'), simplify=FALSE)
    })
    # [[1]]
    # [[1]][[1]]
    # uempmed ~ psavert + psavert_l1
    # <environment: 0x55c229491980>
    #
    # [[1]][[2]]
    # uempmed ~ psavert + unemploy
    # <environment: 0x55c22948f078>
    #
    # [[1]][[3]]
    # uempmed ~ psavert_l1 + unemploy
    # <environment: 0x55c2294883c8>
    #
    #
    # [[2]]
    # [[2]][[1]]
    # uempmed ~ standardized_psavert + standardized_psavert_l1
    # <environment: 0x55c22821cc60>
    #
    # [[2]][[2]]
    # uempmed ~ standardized_psavert + standardized_unemploy
    # <environment: 0x55c22820f3d0>
    #
    # [[2]][[3]]
    # uempmed ~ standardized_psavert_l1 + standardized_unemploy
    # <environment: 0x55c228206758>
    #
    #
    # [[3]]
    # [[3]][[1]]
    # uempmed ~ log_psavert + log_psavert_l1
    # <environment: 0x55c2281fed40>
    #
    # [[3]][[2]]
    # uempmed ~ log_psavert + log_unemploy
    # <environment: 0x55c2281fc438>
    #
    # [[3]][[3]]
    # uempmed ~ log_psavert_l1 + log_unemploy
    # <environment: 0x55c2281f37f8>

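
Note that the `combn` call in this answer prefixes `unemploy` as well and returns only the two-variable models, whereas the question asks for an unprefixed `unemploy` plus a full three-variable model per set. A possible variant that produces exactly the 12 requested formulas is sketched below. It is an adaptation, not part of the original answer; `prefixes`, `core_ivs`, and `fixed_iv` are names introduced here for illustration, and `deparse1()` assumes R 4.0.0 or later.

```r
prefixes <- c("", "standardized_", "log_")
core_ivs <- c("psavert", "psavert_l1")  # variables that receive a prefix
fixed_iv <- "unemploy"                  # variable that is never prefixed

formulas <- unlist(lapply(prefixes, function(p) {
  ivs <- c(paste0(p, core_ivs), fixed_iv)
  # Subsets of size 3 (the full model) and of size 2 (one IV dropped each time).
  subsets <- c(combn(ivs, 3, simplify = FALSE),
               combn(ivs, 2, simplify = FALSE))
  lapply(subsets, reformulate, response = "uempmed")
}), recursive = FALSE)

# Show the formulas as plain strings to check the pattern.
vapply(formulas, deparse1, character(1))
```

The formulas only name columns; fitting them still requires a data frame whose column names match the prefixed variables.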
Answer 2

Score: 0

After some more testing, I found a way that is reasonably programmatic.

My approach can be used if you have a formula and want to create many new formulas based on subtracting individual variables of that formula and some changes in variable names.

Note that I am using various tidyverse functions for this to work.

Create a df with one row that contains the formula, with formula elements as columns. Add a prefix if you need to modify the beginning of certain variables (like "logged_" or "standardized_").

library(tidyverse)  # tibble(), slice(), str_replace_all(), str_pad()

f <- tibble(
  dv = "uempmed",
  tilde = "~",
  iv1 = "PREFIXpsavert",
  plus = "+",
  iv2 = "PREFIXpsavert_l1",
  plus2 = "+",
  iv3 = "unemploy"
)

Duplicate rows for as many times as the desired number of formulas. Here we need 3 sets of 4 formulas, so 12.

f <- f %>% slice(rep(1:n(), each = 12))

Replace the variables that should be left out in certain formulas with blank spaces

# This replaces the third IV with a blank space for every fourth model, counting from the second
# (Set "by" to the number of formulas per set. Here there are three sets of four, so it is four.)
f[seq(2, nrow(f), by = 4), c("iv3", "plus2")] <- " "
# This replaces the second IV with a blank space for every fourth model, counting from the third
f[seq(3, nrow(f), by = 4), c("iv2", "plus")] <- " "
# This replaces the first IV with a blank space for every fourth model, counting from the fourth
f[seq(4, nrow(f), by = 4), c("iv1", "plus")] <- " "

Collapse the data frame rows to character strings

f <- apply(f, 1, function(x) {
  paste0(as.character(x), collapse = "")
})

Change the prefix of variables for the first, second, third set of four models

f[1:4] <- str_replace_all(f[1:4], "PREFIX", "")
f[5:8] <- str_replace_all(f[5:8], "PREFIX", "standardized_")
f[9:12] <- str_replace_all(f[9:12], "PREFIX", "log_")

Finally, change the vector of strings to a list of formulas

f <- lapply(f, as.formula)

We end up with a list of formulas f that can be used with something like lm(formula = f[[1]], data = mydata).

Based on this structure, you could easily wrap everything in a function to make creating formulas even more programmatic. Of course, the actual models can also be created in a loop, for example:

# seq_along(f) visits every formula; length(f) alone would only run the last one
for (i in seq_along(f)) {
  assign(
    x     = paste0("m", str_pad(i, width = 2, pad = "0")),
    value = lm(
      formula = f[[i]],
      data    = mydata
    )
  )
}
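
One way such a wrapper function might look is sketched below. It swaps the string-template technique for base R's `update()`, which removes a term from an existing formula, so it is an alternative under stated assumptions and not the code from this answer. The helper name `drop_one_formulas` is hypothetical, and `deparse1()` assumes R 4.0.0 or later.

```r
# Hypothetical helper: returns the full formula followed by one copy per
# right-hand-side term with that term removed.
drop_one_formulas <- function(full) {
  terms_rhs <- attr(terms(full), "term.labels")
  # rev() drops the last term first, matching the order used in the question.
  reduced <- lapply(rev(terms_rhs), function(term) {
    update(full, as.formula(paste(". ~ . -", term)))
  })
  c(list(full), reduced)
}

# Build one full formula per prefix, then expand each into its set of four.
prefixes <- c("", "standardized_", "log_")
all_formulas <- unlist(lapply(prefixes, function(p) {
  full <- reformulate(c(paste0(p, c("psavert", "psavert_l1")), "unemploy"),
                      response = "uempmed")
  drop_one_formulas(full)
}), recursive = FALSE)

# Show the formulas as plain strings to check the pattern.
vapply(all_formulas, deparse1, character(1))
```

With a data frame whose columns match these names, `lapply(all_formulas, lm, data = mydata)` would return the fitted models as a list, which avoids creating numbered objects with `assign()`.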

Creating stargazer tables based on this is also possible, but it's too far from the original question, so I won't say more about it here. I hope this post helps people who run into a similar issue!
