Different ways of specifying the same formula in a regression give different results
Question
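Recently I ran into a very strange problem that I could not solve even with ChatGPT-4. I want to write a loop that repeats a regression over different combinations of independent and dependent variables. Here is my code and a demo dataset:
library(mice)
library(geepack)
### Create a dataset with repeated measurements: d is the participant ID, Cal is the dependent variable, and a, b, c are independent variables
n_participants <- 50
n_measurements <- 3
a <- rnorm(n_participants * n_measurements, mean = 10, sd = 2)
b <- rnorm(n_participants * n_measurements, mean = 5, sd = 1)
c <- rnorm(n_participants * n_measurements, mean = 20, sd = 3)
d <- rep(1:n_participants, each = n_measurements)
Cal <- rbinom(n_participants * n_measurements, size = 1, prob = 0.5)
### Add NA values to the dataset
missing_prop <- 0.2
missing_index <- sample(length(a), size = ceiling(length(a) * missing_prop))
a[missing_index] <- NA
b[missing_index] <- NA
c[missing_index] <- NA
Cal[missing_index] <- NA
data <- data.frame(Cal, a, b, c, d)
### Impute the missing values with mice
imputed_data <- mice(data, m = 5, maxit = 50, seed = 123)
### Check whether the data were imputed completely
complete_data <- complete(imputed_data)
summary(complete_data)
### Run the regression on the imputed data, so with() is needed
### Solution 1
rg1 <- with(imputed_data, geeglm(Cal ~ a, family = binomial, id = d, corstr = "independence"))
summary(rg1)
### Solution 2
var <- "a"
formula <- as.formula(paste0("Cal ~ ", var))
rg2 <- with(imputed_data, geeglm(formula, family = binomial, id = d, corstr = "independence"))
summary(rg2)
In solution 2, I first build the formula with as.formula. I double-checked that it is the same as "Cal ~ a", that is, the same formula that solution 1 types directly into the regression model. However, the coefficients from solution 2 differ from those of solution 1.
What happens when I use as.formula instead of typing the formula directly into the regression model?
Maybe the way I wrap the regression in a loop is not appropriate. Could anyone share some experience with wrapping a regression in a loop? Many thanks in advance!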
<details>
<summary>English original:</summary>
Recently, I came across a very weird problem that I cannot solve even with ChatGPT-4. I want to write a loop to repeat a regression because I have different combinations of independent and dependent variables. Here is my code and demo dataset:
###create a dataset with repeated measurement, var d stands for ID, Cal is dependent variable, a,b,c are independent variables
n_participants <- 50
n_measurements <- 3
a <- rnorm(n_participants * n_measurements, mean = 10, sd = 2)
b <- rnorm(n_participants * n_measurements, mean = 5, sd = 1)
c <- rnorm(n_participants * n_measurements, mean = 20, sd = 3)
d <- rep(1:n_participants, each = n_measurements)
Cal <- rbinom(n_participants * n_measurements, size = 1, prob = 0.5)
###add NA to dataset
missing_prop <- 0.2
missing_index <- sample(length(a), size = ceiling(length(a)*missing_prop))
a[missing_index] <- NA
b[missing_index] <- NA
c[missing_index] <- NA
Cal[missing_index] <- NA
data <- data.frame(Cal, a, b, c, d)
###mice imputed
imputed_data <- mice(data, m = 5, maxit = 50, seed = 123)
###check whether the data were imputed completely
complete_data <- complete(imputed_data)
summary(complete_data)
###regression on the imputed data, so with() is needed
###solution 1
rg1 <- with(imputed_data, geeglm(Cal ~ a, family = binomial, id = d, corstr = "independence"))
summary(rg1)
###solution 2
var <- "a"
formula <- as.formula(paste0("Cal ~ ", var))
rg2 <- with(imputed_data, geeglm(formula, family = binomial, id = d, corstr = "independence"))
summary(rg2)
As shown in solution 2, I first build the formula with as.formula. I double-checked that the formula is the same as "Cal ~ a", in other words the same as the formula typed directly into the regression model in solution 1. But the coefficients from solution 2 are different from those of solution 1.
What happens when I use as.formula rather than typing the formula directly into the regression model?
Maybe the way I wrap the regression in a loop is not appropriate. Could any expert share some experience with wrapping a regression in a loop? Many thanks in advance!
</details>
# Answer 1
**Score**: 0
I think the cause is the place where you call `as.formula()`. It seems to record the environment in which it is called, and with your solution 2 I get the results for the original dataset `data` (120 observations, identical across all five imputations) instead of the imputed data. Calling `as.formula()` inside `with()` (solution 3) gives the same results as solution 1.
My take is that when you call `as.formula` in the global environment, it "remembers" that it was called there and searches that environment for the variables `Cal` and `a`.
I hope this helps.
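The environment effect described above can be reproduced in base R without mice or geepack. The following is only a minimal sketch: the vectors `x` and `y`, the data frame `completed`, and the names `f_outside`, `mf_outside` and `mf_inside` are made up for illustration, and `model.frame()` stands in for the model-fitting call.

```r
# Two vectors at top level; x contains missing values.
x <- c(1, 2, NA, 4, 5, NA)
y <- c(0, 1, 0, 1, 1, 0)

# A "completed" data frame in which the gaps in x have been filled in.
completed <- data.frame(x = c(1, 2, 3, 4, 5, 6), y = y)

# Formula built outside with(): it keeps a reference to the environment
# where it was created, which holds the incomplete x and y.
f_outside <- as.formula("y ~ x")

# No data argument is given, so model.frame() looks up the variables in
# the formula's own environment.
mf_outside <- with(completed, model.frame(f_outside))

# Formula built inside with(): its environment is the one made from
# the data frame, so the completed columns are used.
mf_inside <- with(completed, model.frame(as.formula("y ~ x")))

nrow(mf_outside)  # 4: the two rows with NA in the top-level x are dropped
nrow(mf_inside)   # 6: all rows of the completed data frame
```

The same formula text therefore yields two different model frames, depending only on where the formula object was created.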
<details>
<summary>English original:</summary>
I think the cause is the place where you call `as.formula()`. It seems to record the environment in which it is called, and with your solution 2 I get the results on the original dataset `data`.
I'm not that experienced with these and don't know how to specify a proper environment, but you can call `as.formula()` inside the `with()` function, and that seems to work for me.
###solution 1
rg1 <- with(imputed_data, geeglm(Cal ~ a, family = binomial, id = d, corstr = "independence"))
summary(rg1)
# A tibble: 10 x 6
term estimate std.error statistic p.value nobs
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 (Intercept) -0.340 0.880 0.150 0.699 150
2 a 0.0629 0.0862 0.533 0.465 150
3 (Intercept) -0.253 0.802 0.0995 0.752 150
4 a 0.0568 0.0779 0.531 0.466 150
5 (Intercept) 0.512 0.834 0.376 0.540 150
6 a -0.0423 0.0810 0.272 0.602 150
7 (Intercept) 0.0251 0.796 0.000992 0.975 150
8 a 0.0265 0.0783 0.115 0.735 150
9 (Intercept) 0.297 0.912 0.106 0.745 150
10 a -0.00538 0.0873 0.00380 0.951 150
###solution 2
var <- "a"
formula <- as.formula(paste0("Cal ~ ", var))
rg2 <- with(imputed_data, geeglm(formula, family = binomial, id = d, corstr = "independence"))
summary(rg2)
# A tibble: 10 x 6
term estimate std.error statistic p.value nobs
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 (Intercept) 0.0688 0.879 0.00614 0.938 120
2 a 0.0230 0.0859 0.0715 0.789 120
3 (Intercept) 0.0688 0.879 0.00614 0.938 120
4 a 0.0230 0.0859 0.0715 0.789 120
5 (Intercept) 0.0688 0.879 0.00614 0.938 120
6 a 0.0230 0.0859 0.0715 0.789 120
7 (Intercept) 0.0688 0.879 0.00614 0.938 120
8 a 0.0230 0.0859 0.0715 0.789 120
9 (Intercept) 0.0688 0.879 0.00614 0.938 120
10 a 0.0230 0.0859 0.0715 0.789 120
# Produces the same
summary(geeglm(Cal ~ a, data = data, family = binomial, id = d, corstr = "independence"))
Call:
geeglm(formula = Cal ~ a, family = binomial, data = data, id = d,
corstr = "independence")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.06884 0.87884 0.006 0.938
a 0.02296 0.08586 0.071 0.789
Correlation structure = independence
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 1 0.03166
Number of clusters: 50 Maximum cluster size: 3
# Solution 3 (same results as solution 1)
formula <- paste0("Cal ~ ", var)
rg3 <- with(imputed_data, geeglm(as.formula(formula), family = binomial, id = d, corstr = "independence"))
summary(rg3)
# A tibble: 10 x 6
term estimate std.error statistic p.value nobs
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 (Intercept) -0.340 0.880 0.150 0.699 150
2 a 0.0629 0.0862 0.533 0.465 150
3 (Intercept) -0.253 0.802 0.0995 0.752 150
4 a 0.0568 0.0779 0.531 0.466 150
5 (Intercept) 0.512 0.834 0.376 0.540 150
6 a -0.0423 0.0810 0.272 0.602 150
7 (Intercept) 0.0251 0.796 0.000992 0.975 150
8 a 0.0265 0.0783 0.115 0.735 150
9 (Intercept) 0.297 0.912 0.106 0.745 150
10 a -0.00538 0.0873 0.00380 0.951 150
My take is that when you call `as.formula` in the global environment, it "remembers" that it was called there and searches that environment for the variables Cal and a. When the call is moved inside `with()`, the environment is no longer the global one, and it uses the data frame from the mice object.
I hope this is not too clumsy and that it helps. Maybe someone with more knowledge can explain this quirk so that we both understand it!
</details>
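For the loop the question asks about, one possible sketch is to build each formula inside `with()`, as in solution 3, and to iterate over the variable names with `lapply()`. This assumes the mice and geepack packages are installed. The `predictors` vector and the `fits` list are hypothetical names, and `maxit` is lowered to 5 only to keep the demo quick.

```r
library(mice)     # multiple imputation
library(geepack)  # geeglm()

set.seed(123)
n_participants <- 50
n_measurements <- 3
n <- n_participants * n_measurements

# Build the columns inside data.frame() so that no stray vectors named
# Cal, a, b or c are left in the global environment.
data <- data.frame(
  Cal = rbinom(n, size = 1, prob = 0.5),
  a = rnorm(n, mean = 10, sd = 2),
  b = rnorm(n, mean = 5, sd = 1),
  c = rnorm(n, mean = 20, sd = 3),
  d = rep(seq_len(n_participants), each = n_measurements)
)
missing_index <- sample(n, size = ceiling(n * 0.2))
data[missing_index, c("Cal", "a", "b", "c")] <- NA

imputed_data <- mice(data, m = 5, maxit = 5, seed = 123, printFlag = FALSE)

# Hypothetical set of predictors to loop over.
predictors <- c("a", "b", "c")

# The formula is created inside with(), so its variables are looked up
# in each completed dataset rather than in the calling environment.
fits <- lapply(predictors, function(v) {
  with(imputed_data,
       geeglm(as.formula(paste0("Cal ~ ", v)),
              family = binomial, id = d, corstr = "independence"))
})
names(fits) <- predictors

# Coefficients of the first imputed-data fit for predictor "a".
coef(fits[["a"]]$analyses[[1]])
```

Each element of `fits` holds one fitted model per imputed dataset in its `analyses` component, so the results can be inspected per predictor in the same way as `rg1` above.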