英文:
How to dynamically create case_when() expressions based on variable names?
问题
以下是您要翻译的内容:
"I have a dataframe with a set of revenue variables and a set of sales variables. I want to create a new set of variables that takes the value of the revenue variable and checks which sales variable it matches, and if it matches, it returns the corresponding year.
Here's a simplified example of what I have and what I'm trying to achieve:
library(tidyverse)
library(rlang)
# Creating an example dataframe
df <- tibble(
revenues_1 = c(1, 3, 4),
revenues_2 = c(2, 4, 5),
sales_2005 = c(1, 2, 3),
sales_2006 = c(2, 3, 4),
sales_2007 = c(3, 4, 5)
)
# What I want to do:
# df <- df |>
# rowwise() |>
# mutate(
# year_1 = case_when(
# revenues_1 == sales_2005 ~ 2005,
# revenues_1 == sales_2006 ~ 2006,
# revenues_1 == sales_2007 ~ 2007,
# .default = NA
# ),
# year_2 = case_when(
# revenues_2 == sales_2005 ~ 2005,
# revenues_2 == sales_2006 ~ 2006,
# revenues_2 == sales_2007 ~ 2007,
# .default = NA
# )
# )
However, this is not scalable if I have many years and many revenue variables.
How can I make this more efficient?
One idea is (not working):
# Create a list of expressions for each year
years_exprs <- map(2005:2007, ~parse_expr(paste0("var == 'sales_", .x, "' ~ ", .x)))
# Use this list in a case_when statement
df <- df |>
rowwise() |>
mutate(across(starts_with("revenues_"), function(var) {
case_when(
!!!years_exprs,
.default = NA
)
}, .names = "year_{.col}")) |>
ungroup()
```"
<details>
<summary>英文:</summary>
I have a dataframe with a set of revenue variables and a set of sales variables. I want to create a new set of variables that takes the value of the revenue variable and checks which sales variable it matches, and if it matches, it returns the corresponding year.
Here's a simplified example of what I have and what I'm trying to achieve:
``` r
library(tidyverse)
library(rlang)
# Creating an example dataframe
df <- tibble(
revenues_1 = c(1, 3, 4),
revenues_2 = c(2, 4, 5),
sales_2005 = c(1, 2, 3),
sales_2006 = c(2, 3, 4),
sales_2007 = c(3, 4, 5)
)
# What I want to do:
# df <- df |>
# rowwise() |>
# mutate(
# year_1 = case_when(
# revenues_1 == sales_2005 ~ 2005,
# revenues_1 == sales_2006 ~ 2006,
# revenues_1 == sales_2007 ~ 2007,
# .default = NA
# ),
# year_2 = case_when(
# revenues_2 == sales_2005 ~ 2005,
# revenues_2 == sales_2006 ~ 2006,
# revenues_2 == sales_2007 ~ 2007,
# .default = NA
# )
# )
However, this is not scalable if I have many years and many revenue variables.
How can I make this more efficient?
One idea is (not working):
# Create a list of expressions for each year
years_exprs <- map(2005:2007, ~parse_expr(paste0("var == 'sales_", .x, "' ~ ", .x)))
# Use this list in a case_when statement
df <- df |>
rowwise() |>
mutate(across(starts_with("revenues_"), function(var) {
case_when(
!!!years_exprs,
.default = NA
)
}, .names = "year_{.col}")) |>
ungroup()
答案1
得分: 2
如果我理解你的意思正确,你可以使用现有的代码,使用dplyr
的mutate
、across
和starts_with
,然后将函数放在一个命名列表中来创建新的变量。
为了减少为多个年份创建单独的case_when
逻辑的工作量,你可以首先使用rlang::parse_exprs()
和paste0
:
yrs <- 2005:2007
years_exprs <- rlang::parse_exprs(paste0(".x == sales_", yrs, " ~ ", yrs))
然后使用!!!
来评估它:
library(dplyr)
df %>% mutate(across(starts_with("revenues_"),
list(year = ~case_when(!!!years_exprs))))
输出:
revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 revenues_1_year revenues_2_year
1 1 2 1 2 3 2005 2006
2 3 4 2 3 4 2006 2007
3 4 5 3 4 5 2006 2007
注意:如果你的变量不一定都以"revenues"开头,你可以创建一个包含所需列的向量,无论它们是否共享相同的模式(这里是mut_vars
),然后使用all_of()
:
mut_vars <- c("revenues_1", "revenues_2")
df %>% mutate(across(all_of(mut_vars),
list(year = ~case_when(!!!years_exprs))))
英文:
If I understand you correctly, using your existing code you could use dplyr
's mutate
, across
and starts_with
, then place the function in a named list to create the new variables.
In order to reduce the effort to create individual case_when
logics for multiple years, you could first use rlang::parse_exprs()
with paste0
:
yrs <- 2005:2007
years_exprs <- rlang::parse_exprs(paste0(".x == sales_", yrs, " ~ ", yrs))
Then evaluate it using !!!
:
library(dplyr)
df %>% mutate(across(starts_with("revenues_"),
list(year = ~case_when(!!!years_exprs))))
output:
revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 revenues_1_year revenues_2_year
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 2 3 2005 2006
2 3 4 2 3 4 2006 2007
3 4 5 3 4 5 2006 2007
Note: if your variables don't necessarily all start with "revenues", you could create a vector of desired columns regardless if they share the same pattern (here, mut_vars
), then use all_of()
:
mut_vars <- c("revenues_1", "revenues_2")
df %>% mutate(across(all_of(mut_vars),
list(year = ~case_when(!!!years_exprs))))
答案2
得分: 0
以下是翻译好的部分:
如果您绝对需要生成表达式,可以像这样操作:
library(glue)
library(tidyverse)
years <- 2005:2007
idx <- 1:2
years_exprs <- map_chr(idx, ~ {
str_flatten_comma(c(glue("revenues_{.x} == sales_{years} ~ {years}"),
".default = NA")) %>%
str_c("case_when(", ., ")")
}) |>
set_names(paste0("year_", idx)) |>
parse_exprs()
df |>
rowwise() |>
mutate(!!! years_exprs) |>
ungroup()
如果您要使用拆分切片操作符!!!
,则需要为mutate
提供命名表达式,然后这些名称将成为变量名称。
您可以看到在使用rlang::qq_show
注入后,此表达式将如何解析:
qq_show(
df |>
rowwise() |>
mutate(!!! years_exprs) |>
ungroup()
)
ungroup(mutate(rowwise(df), year_1 = case_when(revenues_1 == sales_2005 ~ 2005,
revenues_1 == sales_2006 ~ 2006, revenues_1 == sales_2007 ~ 2007, .default = NA),
year_2 = case_when(revenues_2 == sales_2005 ~ 2005, revenues_2 == sales_2006 ~
2006, revenues_2 == sales_2007 ~ 2007, .default = NA)))
话虽如此,如果您能找到实施@jpsmith提供的解决方案的方法,那将是更好的做法。根据我的经验,人们在数据以宽格式存在时才会以这种方式使用表达式。通常更容易将其转换为长格式或使用可用的工具,如across
。
输出
revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 year_1 year_2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 2 3 2005 2006
2 3 4 2 3 4 2006 2007
3 4 5 3 4 5 2006 2007
英文:
If you absolutely needed to generate expressions you could do so like:
library(glue)
library(tidyverse)
years <- 2005:2007
idx <- 1:2
years_exprs <- map_chr(idx, ~ {
str_flatten_comma(c(glue("revenues_{.x} == sales_{years} ~ {years}"),
".default = NA")) %>%
str_c("case_when(", ., ")")
}) |>
set_names(paste0("year_", idx)) |>
parse_exprs()
df |>
rowwise() |>
mutate(!!! years_exprs) |>
ungroup()
You need to provide named expressions to mutate
if you are going to use the split-slice operator !!!
. The names will then become the variable names.
You can see how this expression will resolve after injection using rlang::qq_show
:
qq_show(
df |>
rowwise() |>
mutate(!!! years_exprs) |>
ungroup()
)
ungroup(mutate(rowwise(df), year_1 = case_when(revenues_1 == sales_2005 ~ 2005,
revenues_1 == sales_2006 ~ 2006, revenues_1 == sales_2007 ~ 2007, .default = NA),
year_2 = case_when(revenues_2 == sales_2005 ~ 2005, revenues_2 == sales_2006 ~
2006, revenues_2 == sales_2007 ~ 2007, .default = NA)))
All that being said, if you could find a way to implement the solution provided by @jpsmith that would be a better practice. In my experience, people resort to using expressions in this way when they have data in a wide format. Often it is easier to pivot to a longer format or use available tools like across
.
Output
revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 year_1 year_2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 2 3 2005 2006
2 3 4 2 3 4 2006 2007
3 4 5 3 4 5 2006 2007
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论