Linear regression using a dataset with missing values
Question
I have data on the effect sizes for 14 variables (var1-var14). Each value is the effect size of a specific treatment on a certain variable. Values are missing because some articles did not consider certain variables. A positive value indicates that the treatment promotes the variable, while a negative value indicates that it inhibits it. I want to (1) run a pairwise linear regression across every pair of variables to check whether the variables are associated, and (2) treat var1 as the dependent variable and var2-var14 as independent variables to find the best-fit model (maybe using the glmulti package?) and show which variables matter most for changes in var1.
Here is some sample data:
set.seed(123)
# Create the dataset with effect sizes and missing values
mydata <- data.frame(
Var1 = sample(c(-20:14, NA), 64, replace = TRUE),
Var2 = sample(c(-20:14, NA), 64, replace = TRUE),
Var3 = sample(c(-20:14, NA), 64, replace = TRUE),
Var4 = sample(c(-20:14, NA), 64, replace = TRUE),
Var5 = sample(c(-20:14, NA), 64, replace = TRUE),
Var6 = sample(c(-20:14, NA), 64, replace = TRUE),
Var7 = sample(c(-20:14, NA), 64, replace = TRUE),
Var8 = sample(c(-20:14, NA), 64, replace = TRUE),
Var9 = sample(c(-20:14, NA), 64, replace = TRUE),
Var10 = sample(c(-20:14, NA), 64, replace = TRUE),
Var11 = sample(c(-20:14, NA), 64, replace = TRUE),
Var12 = sample(c(-20:14, NA), 64, replace = TRUE),
Var13 = sample(c(-20:14, NA), 64, replace = TRUE),
Var14 = sample(c(-20:14, NA), 64, replace = TRUE)
)
# Set at least 50% of the values in each column to NA
for (col in 1:14) {
missing_indices <- sample(1:64, size = 32)
mydata[missing_indices, col] <- NA
}
Is it possible to do all this with such a dataset (i.e., one with missing values)? Thanks!
Answer 1
Score: 1
With d being your example data:
d <-
paste0('Var_', 1:14) |>
Map(f = \(.) sample(c(-20:14, NA),
size = 64,
prob = c(rep(.49/35, 35), .51),
replace = TRUE
)
) |>
as.data.frame()
... you get the pairwise associations in terms of the correlation matrix like so:
d |> cor(use = 'pairwise.complete.obs')
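If slopes and p-values are wanted rather than correlations, the same pairwise idea can be sketched with lm(), fitting each pair only on the rows where both variables are observed. This is a minimal sketch, not part of the original answer: pairwise_lm is a hypothetical helper name, and toy is made-up data used only to show the call.

```r
# Hypothetical helper: simple regression for every pair of columns,
# using only the rows where both variables are present.
pairwise_lm <- function(d) {
  pairs <- t(combn(names(d), 2))            # one row per unordered pair
  res <- apply(pairs, 1, function(p) {
    ok  <- complete.cases(d[, p])           # rows with both values observed
    out <- c(n = sum(ok), slope = NA_real_, p_value = NA_real_)
    # need at least 3 points and a non-constant predictor to test a slope
    if (sum(ok) >= 3 && var(d[ok, p[2]]) > 0) {
      fit <- summary(lm(d[ok, p[1]] ~ d[ok, p[2]]))
      out[c('slope', 'p_value')] <- coef(fit)[2, c('Estimate', 'Pr(>|t|)')]
    }
    out
  })
  data.frame(y = pairs[, 1], x = pairs[, 2], t(res))
}

# Toy data (illustrative only)
toy <- data.frame(a = c(1, 2, 3, 4, NA),
                  b = c(2, 4.1, 5.9, 8, 1),
                  c = c(NA, NA, 1, NA, 2))
res <- pairwise_lm(toy)
res
```

The n column matters here: with roughly half of each column missing, many pairs rest on few observations, and pairs with fewer than 3 shared rows are left as NA.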
... and a basic column-wise imputation (replacing NA with the mean value) this way:
d_imputed <- d |>
apply(2, \(var) replace(var, is.na(var), mean(var, na.rm = TRUE)))
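As a side note on what mean imputation does to a column, here is a small sketch on a made-up vector (x is illustrative data, not taken from the question): the column mean is preserved, but the variance shrinks because every imputed value sits exactly at the mean.

```r
# Toy column with missing values (illustrative data only)
x <- c(-12, 5, NA, 9, NA, -3, 14, NA)
x_imputed <- replace(x, is.na(x), mean(x, na.rm = TRUE))

# The mean of the imputed column equals the mean of the observed values ...
isTRUE(all.equal(mean(x, na.rm = TRUE), mean(x_imputed)))
# ... while the variance is smaller than that of the observed values
var(x, na.rm = TRUE) > var(x_imputed)
```

With about half of each column missing, this shrinkage is substantial, which is one reason associations estimated on mean-imputed data should be read with care.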
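Finally you can obtain, for each column as the response, the regression coefficients of the remaining columns (predictors) like so:
# Build each formula by name so that '.' excludes the response column
colnames(d_imputed) |>
  sapply(simplify = FALSE, FUN = \(v)
    coef(lm(as.formula(paste(v, '~ .')), as.data.frame(d_imputed))))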
A word of caution: the above is just a technical answer to your literal question. For a statistically sound solution, I'd recommend researching imputation, dimensionality reduction, predictor selection and the like over at Cross Validated (see Ben Bolker's comment).
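For the predictor-selection part of the question, one possible starting point is backward stepwise selection by AIC with step() from base R. This is a sketch under stated assumptions, not the answer's method and not glmulti: toy is simulated data standing in for an imputed data frame, and stepwise selection has well-known limitations of its own.

```r
set.seed(1)
# Simulated stand-in for an imputed data frame (assumption for illustration):
# y depends on x1 only; x2 and x3 are pure noise.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)

full <- lm(y ~ ., data = toy)                           # start from all predictors
best <- step(full, direction = 'backward', trace = 0)   # drop terms while AIC improves
formula(best)                                           # predictors that were kept
```

On the question's data the call would look like step(lm(Var_1 ~ ., data = as.data.frame(d_imputed)), trace = 0), with the caveat that the result inherits whatever distortion the imputation introduced.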