July 5, 2023 00:10:37

linear regression using dataset with missing values

Question

I have effect-size data for 14 variables (var1-var14). Each value is the effect size of a specific treatment on a given variable. Values are missing because some articles did not consider certain variables. A positive value indicates that the treatment promotes the variable, and a negative value indicates that it inhibits it. I want to (1) run pairwise linear regressions across all variables to check whether the variables are associated with one another, and (2) treat var1 as the dependent variable and var2-var14 as independent variables, find the best-fit model (maybe using the glmulti package?), and show which variables matter most for changes in var1.

Here is some sample data:

set.seed(123)
# Create the dataset with effect sizes and missing values
mydata <- data.frame(
  Var1 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var2 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var3 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var4 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var5 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var6 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var7 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var8 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var9 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var10 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var11 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var12 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var13 = sample(c(-20:14, NA), 64, replace = TRUE),
  Var14 = sample(c(-20:14, NA), 64, replace = TRUE)
)
# Set more than 50% missing values in each column
for (col in 1:14) {
  missing_indices <- sample(1:64, size = 32)
  mydata[missing_indices, col] <- NA
}

Is it possible to do all this with such a dataset (i.e., one with missing values)? Thanks!

Answer 1

Score: 1

d being your example data:

d <- 
  paste0('Var_', 1:14) |>
  Map(f = \(.) sample(c(-20:14, NA),
                      size = 64,
                      prob = c(rep(.49/35, 35), .51),
                      replace = TRUE
                      )
      ) |>
  as.data.frame()

... you get the pairwise associations in terms of the correlation matrix like so:

d |> cor(use = 'pairwise.complete.obs')
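If actual pairwise regressions are wanted rather than correlations, the same pairwise-complete idea can be sketched with lm(), fitting each pair on the rows where both variables are observed. This is only an illustration: pairs and pairwise_fits are made-up names, the data is regenerated as a stand-in for d, and with this much missingness each fit rests on few observations.

```r
set.seed(1)
# Hypothetical stand-in for `d`: 14 columns, roughly half of the values missing
d <- as.data.frame(replicate(14, sample(c(-20:14, NA), 64, replace = TRUE,
                                        prob = c(rep(.49/35, 35), .51))))
names(d) <- paste0('Var_', 1:14)

# All unordered pairs of column names
pairs <- combn(names(d), 2, simplify = FALSE)

pairwise_fits <- do.call(rbind, lapply(pairs, \(p) {
  ok <- complete.cases(d[, p])                 # rows where both variables are observed
  out <- data.frame(y = p[1], x = p[2], n = sum(ok),
                    slope = NA_real_, p_value = NA_real_)
  # Skip pairs with too few rows or a constant predictor
  if (sum(ok) < 3 || length(unique(d[ok, p[2]])) < 2) return(out)
  fit <- lm(reformulate(p[2], response = p[1]), data = d[ok, ])
  cf <- summary(fit)$coefficients
  out$slope <- cf[2, 'Estimate']
  out$p_value <- cf[2, 'Pr(>|t|)']
  out
}))

head(pairwise_fits)
```

The n column shows how many rows each fit is based on, which is worth checking before reading anything into a slope or p-value. With 91 pairs, some correction for multiple testing would also be needed.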

... and a basic column-wise imputation (replacing NA with the mean value) this way:

d_imputed <- d |>
  apply(2, \(var) replace(var, is.na(var), mean(var, na.rm = TRUE)))
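One side effect of mean imputation is easy to check on a single column: the mean is preserved, but the spread shrinks, because every imputed value sits exactly at the mean. A minimal sketch, where x is a made-up vector generated the same way as one column of d:

```r
set.seed(1)
# Hypothetical single column with roughly half of the values missing
x <- sample(c(-20:14, NA), 64, replace = TRUE, prob = c(rep(.49/35, 35), .51))

# Same replacement rule as in the answer
x_imputed <- replace(x, is.na(x), mean(x, na.rm = TRUE))

sd_observed <- sd(x, na.rm = TRUE)   # spread of the observed values only
sd_imputed  <- sd(x_imputed)         # spread after filling the gaps with the mean

c(observed = sd_observed, imputed = sd_imputed)
```

The reduced spread carries over into standard errors and p-values, which tend to look better than the data justify.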

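Finally, you can obtain the regression coefficients from regressing each column on all the other columns like so:

# Name the response in the formula so that `.` expands to the other 13 columns only
d_imputed_df <- as.data.frame(d_imputed)
names(d_imputed_df) |>
  sapply(\(v) coef(lm(as.formula(paste(v, '~ .')), data = d_imputed_df)), simplify = FALSE)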

A word of caution: the above is just a technical answer to your literal question. For a statistically sound solution, I'd recommend researching imputation, dimensionality reduction, predictor selection and the like over at Cross Validated (see Ben Bolker's comment).
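To see why the missingness matters so much for the multi-predictor model, it may help to count how many rows are usable. The sketch below regenerates a stand-in for d and uses made-up names (n_complete, n_joint). It counts the rows with no missing value at all, and the number of jointly observed rows for every pair of variables.

```r
set.seed(1)
# Hypothetical stand-in for `d`: 14 columns, roughly half of the values missing
d <- as.data.frame(replicate(14, sample(c(-20:14, NA), 64, replace = TRUE,
                                        prob = c(rep(.49/35, 35), .51))))
names(d) <- paste0('Var_', 1:14)

# Rows usable by lm(Var_1 ~ .) without imputation (listwise deletion)
n_complete <- sum(complete.cases(d))

# 14 x 14 matrix: entry [i, j] is the number of rows where both i and j are observed
obs <- !is.na(d)
n_joint <- crossprod(obs * 1)

n_complete
n_joint[1:4, 1:4]
```

When about half of each column is missing independently, a row with all 14 values present is very unlikely, so a full model on the raw data has essentially nothing to fit on. That is the point where imputation or a reduced predictor set comes in.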
