Different ways of specifying the same formula in a regression give different results

Question

<details>
<summary>English:</summary>

Recently I came across a very weird problem that I could not solve, even with ChatGPT-4. I want to write a loop to run repeated regressions, because I have different combinations of independent and dependent variables. Here is my code and a demo dataset:

    library(mice)     # mice(), complete()
    library(geepack)  # geeglm()

    ### create a dataset with repeated measurements: d is the ID, Cal is the dependent variable, a, b, c are independent variables
    n_participants <- 50
    n_measurements <- 3
    a <- rnorm(n_participants * n_measurements, mean = 10, sd = 2)
    b <- rnorm(n_participants * n_measurements, mean = 5, sd = 1)
    c <- rnorm(n_participants * n_measurements, mean = 20, sd = 3)
    d <- rep(1:n_participants, each = n_measurements)
    Cal <- rbinom(n_participants * n_measurements, size = 1, prob = 0.5)

    ### add NA values to the dataset
    missing_prop <- 0.2
    missing_index <- sample(length(a), size = ceiling(length(a) * missing_prop))
    a[missing_index] <- NA
    b[missing_index] <- NA
    c[missing_index] <- NA
    Cal[missing_index] <- NA
    data <- data.frame(Cal, a, b, c, d)

    ### impute the missing values with mice
    imputed_data <- mice(data, m = 5, maxit = 50, seed = 123)
    ### check whether the data were imputed completely
    complete_data <- complete(imputed_data)
    summary(complete_data)

    ### regression on the imputed data, so with() is needed
    ### solution 1
    rg1 <- with(imputed_data, geeglm(Cal ~ a, family = binomial, id = d, corstr = "independence"))
    summary(rg1)
    ### solution 2
    var <- "a"
    formula <- as.formula(paste0("Cal ~ ", var))
    rg2 <- with(imputed_data, geeglm(formula, family = binomial, id = d, corstr = "independence"))
    summary(rg2)

As shown in solution 2, I first build the formula with `as.formula()`. I double-checked that the formula is the same as "Cal ~ a", in other words the same as the formula typed directly into the regression model in solution 1. But the coefficients from solution 2 differ from those of solution 1.
What happens when I use `as.formula()` rather than typing the formula directly into the regression model?

Maybe the way I wrap the regression into a loop is not appropriate. Could any expert share some experience with wrapping a regression into a loop? Many thanks in advance!

</details>


# Answer 1
**Score**: 0

<details>
<summary>English:</summary>

I think the issue is where you call `as.formula()`. It seems to record the environment in which it is called, and with your solution 2 I get the results for the original dataset `data`.
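
The environment capture can be reproduced in base R without `mice` or `geepack`. The sketch below is only an illustration under stated assumptions: `env_raw` and `env_imp` are hypothetical environments standing in for the global workspace (where the variables still contain NAs) and for one completed dataset. When no `data` argument is supplied, `model.frame()` resolves the variables through the environment stored in the formula.

```r
# Hypothetical stand-in for the global workspace: y still has a missing value
env_raw <- new.env()
env_raw$y <- c(1, 2, NA, 4)
env_raw$x <- c(2, 4, 6, 8)

# Hypothetical stand-in for one completed (imputed) dataset: no missing values
env_imp <- new.env()
env_imp$y <- c(1, 2, 3, 4)
env_imp$x <- c(2, 4, 6, 8)

# Same formula text, but each formula remembers a different environment
f_raw <- as.formula("y ~ x", env = env_raw)
f_imp <- as.formula("y ~ x", env = env_imp)

# Without a data argument, variables are looked up in the formula's environment;
# the default na.action (na.omit) drops the row with the NA
n_raw <- nrow(model.frame(f_raw))
n_imp <- nrow(model.frame(f_imp))

print(c(raw = n_raw, imputed = n_imp))
```

The two formulas print identically, yet they yield model frames of different sizes, which mirrors the 120 versus 150 observations reported for solutions 2 and 1.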

I'm not that experienced with this and don't know how to specify a proper environment, but you can call `as.formula()` inside the `with()` function, and that seems to work for me.

    ### solution 1
    rg1 <- with(imputed_data, geeglm(Cal ~ a, family = binomial, id = d, corstr = "independence"))
    summary(rg1)
    #> # A tibble: 10 x 6
    #>    term        estimate std.error statistic p.value  nobs
    #>    <chr>          <dbl>     <dbl>     <dbl>   <dbl> <int>
    #>  1 (Intercept) -0.340      0.880   0.150      0.699   150
    #>  2 a            0.0629     0.0862  0.533      0.465   150
    #>  3 (Intercept) -0.253      0.802   0.0995     0.752   150
    #>  4 a            0.0568     0.0779  0.531      0.466   150
    #>  5 (Intercept)  0.512      0.834   0.376      0.540   150
    #>  6 a           -0.0423     0.0810  0.272      0.602   150
    #>  7 (Intercept)  0.0251     0.796   0.000992   0.975   150
    #>  8 a            0.0265     0.0783  0.115      0.735   150
    #>  9 (Intercept)  0.297      0.912   0.106      0.745   150
    #> 10 a           -0.00538    0.0873  0.00380    0.951   150

    ### solution 2
    var <- "a"
    formula <- as.formula(paste0("Cal ~ ", var))
    rg2 <- with(imputed_data, geeglm(formula, family = binomial, id = d, corstr = "independence"))
    summary(rg2)
    #> # A tibble: 10 x 6
    #>    term        estimate std.error statistic p.value  nobs
    #>    <chr>          <dbl>     <dbl>     <dbl>   <dbl> <int>
    #>  1 (Intercept)   0.0688    0.879    0.00614   0.938   120
    #>  2 a             0.0230    0.0859   0.0715    0.789   120
    #>  3 (Intercept)   0.0688    0.879    0.00614   0.938   120
    #>  4 a             0.0230    0.0859   0.0715    0.789   120
    #>  5 (Intercept)   0.0688    0.879    0.00614   0.938   120
    #>  6 a             0.0230    0.0859   0.0715    0.789   120
    #>  7 (Intercept)   0.0688    0.879    0.00614   0.938   120
    #>  8 a             0.0230    0.0859   0.0715    0.789   120
    #>  9 (Intercept)   0.0688    0.879    0.00614   0.938   120
    #> 10 a             0.0230    0.0859   0.0715    0.789   120

    # Fitting on the original (non-imputed) data produces the same estimates as solution 2
    summary(geeglm(Cal ~ a, data = data, family = binomial, id = d, corstr = "independence"))
    #>
    #> Call:
    #> geeglm(formula = Cal ~ a, family = binomial, data = data, id = d,
    #>     corstr = "independence")
    #>
    #>  Coefficients:
    #>             Estimate Std.err  Wald Pr(>|W|)
    #> (Intercept)  0.06884 0.87884 0.006    0.938
    #> a            0.02296 0.08586 0.071    0.789
    #>
    #> Correlation structure = independence
    #> Estimated Scale Parameters:
    #>
    #>             Estimate Std.err
    #> (Intercept)        1 0.03166
    #> Number of clusters:   50  Maximum cluster size: 3

    # Solution 3 (same results as solution 1)
    formula <- paste0("Cal ~ ", var)
    rg3 <- with(imputed_data, geeglm(as.formula(formula), family = binomial, id = d, corstr = "independence"))
    summary(rg3)
    #> # A tibble: 10 x 6
    #>    term        estimate std.error statistic p.value  nobs
    #>    <chr>          <dbl>     <dbl>     <dbl>   <dbl> <int>
    #>  1 (Intercept) -0.340      0.880   0.150      0.699   150
    #>  2 a            0.0629     0.0862  0.533      0.465   150
    #>  3 (Intercept) -0.253      0.802   0.0995     0.752   150
    #>  4 a            0.0568     0.0779  0.531      0.466   150
    #>  5 (Intercept)  0.512      0.834   0.376      0.540   150
    #>  6 a           -0.0423     0.0810  0.272      0.602   150
    #>  7 (Intercept)  0.0251     0.796   0.000992   0.975   150
    #>  8 a            0.0265     0.0783  0.115      0.735   150
    #>  9 (Intercept)  0.297      0.912   0.106      0.745   150
    #> 10 a           -0.00538    0.0873  0.00380    0.951   150
        
My take is that when you call `as.formula()` in the global environment, the formula "remembers" that environment, so `Cal` and `a` are looked up there, where the original vectors with their missing values still exist. That would explain why solution 2 reports 120 observations and the same estimates for all five imputations. When the call is moved inside `with()`, the environment is no longer the global one, and the completed data in the mice object are used.
I hope this is not too clumsy and that it helps. Maybe someone with more knowledge can explain this quirk so that we both understand it!
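
For the loop the question asks about, one pattern that avoids the environment issue is to build the formula per predictor and pass each dataset explicitly through `data =`. The sketch below is a base-R illustration under assumptions, not the mice workflow itself: `completed` is a hypothetical list of simulated data frames standing in for the completed datasets, `fit_one` is a hypothetical helper, and plain `glm()` replaces `geeglm()` (so no `id` clustering) to keep the example free of extra packages.

```r
set.seed(1)

# Hypothetical stand-ins for completed (imputed) datasets
completed <- lapply(1:3, function(i) {
  data.frame(
    Cal = rbinom(60, size = 1, prob = 0.5),
    a   = rnorm(60, mean = 10, sd = 2),
    b   = rnorm(60, mean = 5,  sd = 1),
    c   = rnorm(60, mean = 20, sd = 3)
  )
})

predictors <- c("a", "b", "c")

# Hypothetical helper: fit Cal ~ <v> on one dataset.
# Passing data explicitly means the variables are taken from df first,
# regardless of the environment stored in the formula.
fit_one <- function(v, df) {
  f <- reformulate(v, response = "Cal")
  glm(f, family = binomial, data = df)
}

# One 2 x 3 coefficient matrix per predictor (rows: terms, columns: datasets)
results <- lapply(predictors, function(v) {
  sapply(completed, function(df) coef(fit_one(v, df)))
})
names(results) <- predictors

print(results$a)
```

With a real mice object, each completed dataset could presumably be obtained with `complete(imputed_data, i)` and supplied the same way. Alternatively, the loop can wrap solution 3 above, calling `as.formula()` inside `with()`.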


</details>



huangapple
  • Posted on April 6, 2023, 19:03:32
  • Please keep the link to this article when reposting: https://go.coder-hub.com/75948777.html