Linear regression using a dataset with missing values

Question

I have data on the effect sizes for 14 variables (var1-var14). Each value is the effect size of a specific treatment on a given variable. Values are missing because some articles did not consider certain variables. A positive value indicates that the treatment promotes the variable, while a negative value indicates that it inhibits it. I want to (1) run pairwise linear regressions across all variables to check whether any of them are associated, and (2) treat var1 as the dependent variable and var2-var14 as independent variables, find the best-fit model (maybe using the glmulti package?), and show which variables matter most for changes in var1.

Here is some sample data:

    set.seed(123)
    # Create the dataset with effect sizes and missing values
    mydata <- data.frame(
      Var1 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var2 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var3 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var4 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var5 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var6 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var7 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var8 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var9 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var10 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var11 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var12 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var13 = sample(c(-20:14, NA), 64, replace = TRUE),
      Var14 = sample(c(-20:14, NA), 64, replace = TRUE)
    )
    # Set at least 50% of the values in each column to NA
    for (col in 1:14) {
      missing_indices <- sample(1:64, size = 32)
      mydata[missing_indices, col] <- NA
    }

Is it possible to do all this with such a dataset (i.e., one with missing values)? Thanks!

Answer 1

Score: 1

With d being your example data:

    d <-
      paste0('Var_', 1:14) |>
      Map(f = \(.) sample(c(-20:14, NA),
                          size = 64,
                          prob = c(rep(.49/35, 35), .51),
                          replace = TRUE
                          )
          ) |>
      as.data.frame()

... you get the pairwise associations as a correlation matrix like so:

    d |> cor(use = 'pairwise.complete.obs')
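Since the question asks for pairwise regressions and not only correlations, here is a minimal sketch of how that might look. The helper `fit_pair()`, its `min_n` cutoff and the regenerated `d` are assumptions made for illustration, not part of the original answer. Each pair is fitted only on the rows where both values are present.

```r
# Regenerate example data similar to d (about 51% NA per column)
set.seed(123)
d <- as.data.frame(replicate(14, sample(c(-20:14, NA), 64, replace = TRUE,
                                        prob = c(rep(.49 / 35, 35), .51))))
names(d) <- paste0("Var_", 1:14)

# Hypothetical helper: fit y ~ x on the rows where both are observed
fit_pair <- function(y, x, data, min_n = 3) {
  ok <- complete.cases(data[, c(y, x)])
  out <- data.frame(y = y, x = x, n = sum(ok),
                    slope = NA_real_, p_value = NA_real_)
  if (sum(ok) < min_n) return(out)  # too few shared rows to fit a line
  fit <- lm(reformulate(x, response = y), data = data[ok, ])
  est <- coef(summary(fit))
  if (x %in% rownames(est)) {       # the row is absent if x is constant
    out$slope <- est[x, "Estimate"]
    out$p_value <- est[x, "Pr(>|t|)"]
  }
  out
}

# All ordered pairs of distinct columns (14 * 13 = 182 regressions)
var_pairs <- subset(expand.grid(y = names(d), x = names(d),
                                stringsAsFactors = FALSE), x != y)

results <- do.call(rbind, lapply(seq_len(nrow(var_pairs)), function(i) {
  fit_pair(var_pairs$y[i], var_pairs$x[i], d)
}))

# Adjust for multiple testing (Benjamini-Hochberg)
results$p_adj <- p.adjust(results$p_value, method = "BH")
head(results[order(results$p_value), ])
```

The `n` column shows how few rows each fit actually rests on, which is worth checking before reading much into any single slope.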

... and a basic column-wise imputation (replacing NA with the mean value) this way:

    d_imputed <- d |>
      apply(2, \(var) replace(var, is.na(var), mean(var, na.rm = TRUE)))
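One property of mean imputation is worth seeing in numbers: it keeps the column mean but shrinks the variance, because every filled-in value sits exactly at the mean. A small sketch with made-up data (the vector `x` is purely illustrative):

```r
set.seed(1)
# 32 observed values and 32 missing ones, mimicking the 50% missingness
x <- c(rnorm(32), rep(NA, 32))
x_imp <- replace(x, is.na(x), mean(x, na.rm = TRUE))

mean(x, na.rm = TRUE)  # mean of the observed values
mean(x_imp)            # same mean after imputation
var(x, na.rm = TRUE)   # variance of the observed values
var(x_imp)             # smaller: the imputed values add no spread
```

Any standard errors or p-values computed from data imputed this way will tend to look more precise than the data justify.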

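Finally, you can obtain the coefficients from regressing each column on all the other columns like so:

    d_df <- as.data.frame(d_imputed)
    # Build Var_k ~ . by name so that the response is not also a predictor
    names(d_df) |>
      sapply(\(v) coef(lm(reformulate('.', response = v), data = d_df)), simplify = FALSE)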

A word of caution: the above is just a technical answer to your literal question. For a statistically sound solution, I'd recommend researching imputation, dimensionality reduction, predictor selection and the like over at Cross Validated (see Ben Bolker's comment).
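For part (2) of the question, one possible illustration of predictor selection is stepwise AIC selection with base R's `step()`. Note that this is a swapped-in technique, not the glmulti package mentioned in the question, and it runs on mean-imputed data, so the caveats above apply in full. The regenerated `d` and the object names are assumptions for the sketch.

```r
# Regenerate example data similar to d (about 51% NA per column)
set.seed(123)
d <- as.data.frame(replicate(14, sample(c(-20:14, NA), 64, replace = TRUE,
                                        prob = c(rep(.49 / 35, 35), .51))))
names(d) <- paste0("Var_", 1:14)

# Mean imputation, kept as a data frame
d_imputed <- as.data.frame(
  lapply(d, \(v) replace(v, is.na(v), mean(v, na.rm = TRUE)))
)

# Smallest and largest candidate models for Var_1
null <- lm(Var_1 ~ 1, data = d_imputed)
full <- lm(Var_1 ~ ., data = d_imputed)

# Stepwise search in both directions, guided by AIC
best <- step(null,
             scope = list(lower = formula(null), upper = formula(full)),
             direction = "both", trace = 0)

formula(best)        # predictors retained by the search
coef(summary(best))  # their estimates and p-values
```

With purely random example data, the search will often keep few or no predictors, which is the expected outcome here and not a sign that something went wrong.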

huangapple
  • Published on July 5, 2023, 00:10:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76614365.html