英文:
Automatically exclude single-level factor variables from regression
问题
以下是代码的翻译部分:
我有一个自定义函数,其中包括根据指定的字符串输入创建回归公式并运行回归(`brm`,但对于基本的`lm`应该类似)的功能:
model_predict < - function(.data,dep_var,model ...) {
form < - as.formula(str_glue("{dep_var} ~ {model}"))
form_vars < - all.vars(form)
... # some other stuff
fit < - brm(form,.data,...)
... # some other stuff
}
我使用这个函数来拟合大量预先指定的模型,作为更大工作流的一部分。
有时,`model`中的某些变量是因子变量,有时这些因子在数据中只有一个水平。这会导致在尝试拟合模型时出现“只能对具有2个或更多水平的因子应用对比”错误。
由于较大的工作流和因为对于任何给定的迭代,数据和模型是否会遇到这个问题并不总是清楚*先验*,我宁愿不手动从模型中删除因子变量,当这些变量在相关数据子集中只有一个水平时。
**是否有`lm`或`brm`上的某些设置可以用来告诉模型拟合过程本身忽略单水平因子变量?**这将是最简单的解决方案,但我不确定是否存在。
或者,我希望有一种自动化解决方案,可以识别单因子水平情况何时出现,并在出现时从公式中删除相关变量(也许还会发出警告消息),例如:
# 主公式
form
> outcome ~ predictor_1 + predictor_2 * interactor
# 期望的输出...
# 如果predictor_1在数据中只有一个水平
> outcome ~ predictor_2 * interactor
# 如果predictor_2在数据中只有一个水平
> outcome ~ predictor_1
# 如果interactor在数据中只有一个水平
> outcome ~ predictor_1 + predictor_2
我尝试过tryCatch
,如此处建议,但尽管它抑制了对比...
错误,但返回的是NULL
,而不是忽略了有问题的变量的拟合模型,这是我需要的。
此外,有时这些变量在公式中以+
和以*
作为交互效应,这使得动态构建公式变得困难。
<details>
<summary>英文:</summary>
I have a custom function which, among other things, creates a regression formula based on specified string inputs and runs a regression (`brm`, but should work similarly for basic `lm`):
model_predict <- function(.data, dep_var, model ...) {
form <- as.formula(str_glue("{dep_var} ~ {model}"))
form_vars <- all.vars(form)
... # some other stuff
fit <- brm(form, .data, ...)
... # some other stuff
}
I use this to fit a large number of pre-specified models as part of a larger workflow.
Sometimes, some of the variables in `model` are factor variables, and sometimes those factors have only one level in the data. This results in the `contrasts can be applied only to factors with 2 or more levels` error when trying to fit the model.
Because of the larger workflow and because it's not always clear *a priori* if the data and model for any given iteration is going to encounter this problem, I'd rather not manually remove the factor vars from the model specified when those vars only have one level in the relevant data subset.
**Is there is some setting on `lm` or `brm` that can be used to tell the model fitting process itself to ignore single-level factor vars?** that would be the easiest solution, but I'm not sure it exists.
Alternatively, I'd like an automated solution that identifies when the single factor level situation arises, and drops the relevant vars from the formula when it does (maybe giving a warning message too), for instance:
main formula
form
> outcome ~ predictor_1 + predictor_2 * interactor
Desired outputs...
if predictor_1 has only one level in data
> outcome ~ predictor_2 * interactor
if predictor_2 has only one level in data
> outcome ~ predictor_1
if interactor has only one level in data
> outcome ~ predictor_1 + predictor_2
I've tried `tryCatch` as suggested [here][1] but while that suppresses the `contrasts...` error, it returns `NULL` instead of a fitted model ignoring the offending vars, which is what I need.
Additionally, sometimes those variables are in the formula with `+` and sometimes with `*` as interaction effects, which makes [dynamically building the formula][2] difficult.
[1]: https://stackoverflow.com/questions/49865845/how-to-get-regression-output-in-r-ignoring-one-factor-level-in-data
[2]: https://stackoverflow.com/questions/39870739/convincing-r-to-exclude-single-level-factors-when-using-lm-in-a-for-loop-in-so
</details>
# 答案1
**得分**: 1
以下是翻译好的代码部分:
``` r
# 为了管道运算符
library(magrittr)
# 定义用于更新模型的函数
update_formula <- function(data, dep_var, model) {
# 提取模型变量
model_vars <- stringr::str_split_1(model, '\\+|\\*|:') %>%
stringr::str_trim()
model_terms <- stringr::str_split_1(model, '\\+') %>%
stringr::str_trim()
# 获取所有因子变量的水平数
lev_leng <- .data %>%
dplyr::select(where(is.factor) & any_of(model_vars)) %>%
purrr::map_int(
~ length(levels(droplevels(.x)))
)
# 检查长度是否为一
invalid <- names(which(lev_leng == 1))
# 如果有任何无效变量,它将具有长度 > 0
if (length(invalid) != 0) {
for (i in invalid) {
if (any(stringr::str_detect(model_terms, paste0('*[:space:]?', i)))) {
# 从公式中移除交互项
model_terms <- stringr::str_remove(
model_terms,
paste0('[*,:]?[:space:]?',i,'[:space:]?[*,:]?$')
)
} else{
# 从公式中移除整个项
model_terms <- model_terms[stringr::str_detect(model_terms, i, T)]
}
}
model <- stringr::str_flatten(model_terms, '+')
}
# 定义新的公式
as.formula(glue::glue('{dep_var} ~ {model}'))
}
# 因变量定义
dep_var <- 'y'
# 模型示例
model <- "predictor_1 + predictor_2 * interactor"
# 示例 predictor_1
.data <- tibble::tibble(
y = runif(10),
predictor_1 = factor(1),
predictor_2 = factor(letters[1:10]),
interactor = factor(letters[1:10])
)
# 更新模型
form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_2 * interactor
#> <environment: 0x0000015b95006a58>
# 拟合模型
fit <- lm(form, .data)
# 示例 predictor_2
.data <- tibble::tibble(
y = runif(10),
predictor_1 = runif(10),
predictor_2 = factor(letters[rep(1, 10)]),
interactor = factor(letters[1:10])
)
form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_1
#> <environment: 0x0000015b95cdd218>
# 示例 interactor
.data <- tibble::tibble(
y = runif(10),
predictor_1 = runif(10),
predictor_2 = factor(letters[1:10]),
interactor = factor(letters[rep(1, 10)])
)
form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_1 + predictor_2
#> <environment: 0x0000015b96072878>
# 示例 predictor_1 & interactor
.data <- tibble::tibble(
y = runif(10),
predictor_1 = factor(10),
predictor_2 = factor(letters[1:10]),
interactor = factor(letters[rep(1, 10)])
)
form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_2
#> <environment: 0x0000015b9669d448>
创建于2023年06月13日,使用 reprex v2.0.2
英文:
Something like this should work:
# For pipe operator
library(magrittr)
# Define function for updating the model
update_formula <- function(data, dep_var, model) {
# Extract model variables
model_vars <- stringr::str_split_1(model, '\\+|\\*|:') %>%
stringr::str_trim()
model_terms <- stringr::str_split_1(model, '\\+') %>%
stringr::str_trim()
# Get the number of levels for all factor variables
lev_leng <- .data %>%
dplyr::select(where(is.factor) & any_of(model_vars)) %>%
purrr::map_int(
~ length(levels(droplevels(.x)))
)
# Check if length is one
invalid <- names(which(lev_leng == 1))
# If any is invalid, it will have length > 0
if (length(invalid) != 0) {
for (i in invalid) {
if (any(stringr::str_detect(model_terms, paste0('\\*[:space:]?', i)))) {
# Remove interactor from formula
model_terms <- stringr::str_remove(
model_terms,
paste0('[\\*,:]?[:space:]?',i,'[:space:]?[\\*,:]?')
)
} else{
# Remove entire term from formula
model_terms <- model_terms[stringr::str_detect(model_terms, i, T)]
}
}
model <- stringr::str_flatten(model_terms, '+')
}
# Define the new formula
as.formula(glue::glue('{dep_var} ~ {model}'))
}
# Dep var defition
dep_var <- 'y'
# Model example
model <- "predictor_1 + predictor_2 * interactor"
# Example predictor_1
.data <- tibble::tibble(
y = runif(10),
predictor_1 = factor(1),
predictor_2 = factor(letters[1:10]),
interactor = factor(letters[1:10])
)
# Update model
form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_2 * interactor
#> <environment: 0x0000015b95006a58>
# Fit the model
fit <- lm(form, .data)
# Example predictor_2
.data <- tibble::tibble(
y = runif(10),
predictor_1 = runif(10),
predictor_2 = factor(letters[rep(1, 10)]),
interactor = factor(letters[1:10])
)
form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_1
#> <environment: 0x0000015b95cdd218>
# Example interactor
.data <- tibble::tibble(
y = runif(10),
predictor_1 = runif(10),
predictor_2 = factor(letters[1:10]),
interactor = factor(letters[rep(1, 10)])
)
form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_1 + predictor_2
#> <environment: 0x0000015b96072878>
# Example predictor_1 & interactor
.data <- tibble::tibble(
y = runif(10),
predictor_1 = factor(10),
predictor_2 = factor(letters[1:10]),
interactor = factor(letters[rep(1, 10)])
)
form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_2
#> <environment: 0x0000015b9669d448>
<sup>Created on 2023-06-13 with reprex v2.0.2</sup>
答案2
得分: 0
编辑
根据有效列重新制定公式的方法(假设响应变量保持不变):
- 示例数据:
d <- data.frame(Y = rnorm(2),
x1 = gl(1, 1),
x2 = gl(1, 2),
x3 = LETTERS[1:2],
x4 = rnorm(2)
)
- 用于检测数据框
d
中不需要的变量的辅助函数:
get_vars_to_omit <- function(d){
names(d) |
sapply(FUN = function(name){xs <- d[[name]]
is.numeric(xs) | (length(levels(as.factor(xs))) > 1)
}) |
Filter(f = function(xs) !xs) |> names()
}
> get_vars_to_omit(d)
[1] "x1" "x2"
- 用于通过移除包含不需要的变量的项来更新现有模型公式的辅助函数:
update_formula <- function(old_formula, vars_to_omit){
var_names <- all.vars(old_formula)
old_terms <- terms(old_formula)
terms_to_drop <- attr(old_terms, 'term.labels') |>
grep(pattern = paste(vars_to_omit, collapse = '|'))
new_terms <- drop.terms(old_terms, terms_to_drop)
attr(new_terms, 'term.labels') |>
reformulate(response = var_names[1])
}
- 包含不需要变量的示例公式:
old_formula <- formula(Y ~ x1 + x2 + x3 + x4)
- 更新公式:
new_formula <- update_formula(old_formula, get_vars_to_omit(d))
## 不需要的变量已被移除:
## > new_formula
Y ~ x3 + x4
- 结果:
## > lm(new_formula, data = d)
Call:
lm(formula = new_formula, data = d)
Coefficients:
(Intercept) x3B x4
0.901 -1.222 NA
(编辑到此结束)
您可以在开头添加一些预处理步骤,例如:
d |> select_if(~ is.numeric(.x) | (length(levels(as.factor(.x))) > 1))
以上是一种“整洁”的变体,也可以用基本的R中的 lapply
或 Map
表示。对于更自动化的特征选择,可能值得尝试 {caret} 或 {tidymodels}。
英文:
Edit
An approach to reformulate the formula based on valid columns (assuming the response variable stays the same):
- example data:
d <- data.frame(Y = rnorm(2),
x1 = gl(1, 1),
x2 = gl(1, 2),
x3 = LETTERS[1:2],
x4 = rnorm(2)
)
- helper function to detect unwanted variables in dataframe
d
get_vars_to_omit <- \(d){
names(d) |>
sapply(FUN = \(name){xs <- d[[name]]
is.numeric(xs) | (length(levels(as.factor(xs))) > 1)
}) |>
Filter(f = \(xs) !xs) |> names()
}
> get_vars_to_omit(d)
[1] "x1" "x2"
- helper function to update an existing model formula by removing terms containing unwanted variables:
update_formula <- function(old_formula, vars_to_omit){
var_names <- all.vars(old_formula)
old_terms <- terms(old_formula)
terms_to_drop <- attr(old_terms, 'term.labels') |>
grep(pattern = paste(vars_to_omit, collapse = '|'))
new_terms <- drop.terms(old_terms, terms_to_drop)
attr(new_terms, 'term.labels') |>
reformulate(response = var_names[1])
}
- example formula containing unwanted variables:
old_formula <- formula(Y ~ x1 + x2 + x3 + x4)
- update formula:
new_formula <- update_formula(old_formula, get_vars_to_omit(d))
## terms with unwanted variables have been removed:
## > new_formula
Y ~ x3 + x4
- result:
## > lm(new_formula, data = d)
Call:
lm(formula = new_formula, data = d)
Coefficients:
(Intercept) x3B x4
0.901 -1.222 NA
(edit ends here)
Could you add some a knock-out step at the start? Like:
d |> select_if(~ is.numeric(.x) | (length(levels(as.factor(.x))) > 1))
Above is a "tidy" variant which can also expressed with lapply
or Map
of base R. For a more automated feature selection, {caret} or {tidymodels} might be worth a try.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论