如何根据变量名动态创建case_when()表达式?

huangapple go评论82阅读模式
英文:

How to dynamically create case_when() expressions based on variable names?

问题

以下是您要翻译的内容:

"I have a dataframe with a set of revenue variables and a set of sales variables. I want to create a new set of variables that takes the value of the revenue variable and checks which sales variable it matches, and if it matches, it returns the corresponding year.

Here's a simplified example of what I have and what I'm trying to achieve:

library(tidyverse)
library(rlang)

# Creating an example dataframe
df <- tibble(
  revenues_1 = c(1, 3, 4),
  revenues_2 = c(2, 4, 5),
  sales_2005 = c(1, 2, 3),
  sales_2006 = c(2, 3, 4),
  sales_2007 = c(3, 4, 5)
)

# What I want to do:
# df <- df |>
#  rowwise() |>
#  mutate(
#    year_1 = case_when(
#      revenues_1 == sales_2005 ~ 2005,
#      revenues_1 == sales_2006 ~ 2006,
#      revenues_1 == sales_2007 ~ 2007,
#      .default = NA
#    ),
#    year_2 = case_when(
#      revenues_2 == sales_2005 ~ 2005,
#      revenues_2 == sales_2006 ~ 2006,
#      revenues_2 == sales_2007 ~ 2007,
#      .default = NA
#    )
#  )

However, this is not scalable if I have many years and many revenue variables.
How can I make this more efficient?

One idea is (not working):

# Create a list of expressions for each year
years_exprs <- map(2005:2007, ~parse_expr(paste0("var == 'sales_", .x, "' ~ ", .x)))

# Use this list in a case_when statement
df <- df |>
  rowwise() |>
  mutate(across(starts_with("revenues_"), function(var) {
    case_when(
      !!!years_exprs,
      .default = NA
    )
  }, .names = "year_{.col}")) |>
  ungroup()
```"

<details>
<summary>英文:</summary>

I have a dataframe with a set of revenue variables and a set of sales variables. I want to create a new set of variables that takes the value of the revenue variable and checks which sales variable it matches, and if it matches, it returns the corresponding year. 

Here&#39;s a simplified example of what I have and what I&#39;m trying to achieve:

``` r
library(tidyverse)
library(rlang)

# Creating an example dataframe
df &lt;- tibble(
  revenues_1 = c(1, 3, 4),
  revenues_2 = c(2, 4, 5),
  sales_2005 = c(1, 2, 3),
  sales_2006 = c(2, 3, 4),
  sales_2007 = c(3, 4, 5)
)

# What I want to do:
# df &lt;- df |&gt; 
#  rowwise() |&gt;
#  mutate(
#    year_1 = case_when(
#      revenues_1 == sales_2005 ~ 2005,
#      revenues_1 == sales_2006 ~ 2006,
#      revenues_1 == sales_2007 ~ 2007,
#      .default = NA
#    ),
#    year_2 = case_when(
#      revenues_2 == sales_2005 ~ 2005,
#      revenues_2 == sales_2006 ~ 2006,
#      revenues_2 == sales_2007 ~ 2007,
#      .default = NA
#    )
#  )

However, this is not scalable if I have many years and many revenue variables.
How can I make this more efficient?

One idea is (not working):

# Create a list of expressions for each year
years_exprs &lt;- map(2005:2007, ~parse_expr(paste0(&quot;var == &#39;sales_&quot;, .x, &quot;&#39; ~ &quot;, .x)))

# Use this list in a case_when statement
df &lt;- df |&gt; 
  rowwise() |&gt;
  mutate(across(starts_with(&quot;revenues_&quot;), function(var) {
    case_when(
      !!!years_exprs,
      .default = NA
    )
  }, .names = &quot;year_{.col}&quot;)) |&gt; 
  ungroup()

答案1

得分: 2

如果我理解你的意思正确,你可以使用现有的代码,使用dplyrmutateacrossstarts_with,然后将函数放在一个命名列表中来创建新的变量。

为了减少为多个年份创建单独的case_when逻辑的工作量,你可以首先使用rlang::parse_exprs()paste0

yrs <- 2005:2007
years_exprs <- rlang::parse_exprs(paste0(".x == sales_", yrs, " ~ ", yrs))

然后使用!!!来评估它:

library(dplyr)

df %>% mutate(across(starts_with("revenues_"), 
                     list(year = ~case_when(!!!years_exprs))))

输出:

  revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 revenues_1_year revenues_2_year
1          1          2          1          2          3            2005            2006
2          3          4          2          3          4            2006            2007
3          4          5          3          4          5            2006            2007

注意:如果你的变量不一定都以"revenues"开头,你可以创建一个包含所需列的向量,无论它们是否共享相同的模式(这里是mut_vars),然后使用all_of()

mut_vars <- c("revenues_1", "revenues_2")
df %>% mutate(across(all_of(mut_vars), 
                     list(year = ~case_when(!!!years_exprs))))
英文:

If I understand you correctly, using your existing code you could use dplyr's mutate, across and starts_with, then place the function in a named list to create the new variables.

In order to reduce the effort to create individual case_when logics for multiple years, you could first use rlang::parse_exprs() with paste0:

yrs &lt;- 2005:2007
years_exprs &lt;- rlang::parse_exprs(paste0(&quot;.x == sales_&quot;, yrs, &quot; ~ &quot;, yrs))

Then evaluate it using !!!:

library(dplyr)

df %&gt;% mutate(across(starts_with(&quot;revenues_&quot;), 
                     list(year = ~case_when(!!!years_exprs))))

output:

  revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 revenues_1_year revenues_2_year
       &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;           &lt;dbl&gt;           &lt;dbl&gt;
1          1          2          1          2          3            2005            2006
2          3          4          2          3          4            2006            2007
3          4          5          3          4          5            2006            2007

Note: if your variables don't necessarily all start with "revenues", you could create a vector of desired columns regardless if they share the same pattern (here, mut_vars), then use all_of():

mut_vars &lt;- c(&quot;revenues_1&quot;, &quot;revenues_2&quot;)
df %&gt;% mutate(across(all_of(mut_vars), 
                     list(year = ~case_when(!!!years_exprs))))


答案2

得分: 0

以下是翻译好的部分:

如果您绝对需要生成表达式,可以像这样操作:

library(glue)
library(tidyverse)

years <- 2005:2007
idx <- 1:2
years_exprs <- map_chr(idx, ~ {
  str_flatten_comma(c(glue("revenues_{.x} == sales_{years} ~ {years}"),
                      ".default = NA")) %>%
    str_c("case_when(", ., ")")
                    }) |>
  set_names(paste0("year_", idx)) |>
  parse_exprs()

df |>
  rowwise() |>
  mutate(!!! years_exprs) |>
  ungroup()

如果您要使用拆分切片操作符!!!,则需要为mutate提供命名表达式,然后这些名称将成为变量名称。

您可以看到在使用rlang::qq_show注入后,此表达式将如何解析:

qq_show(
  df |>
    rowwise() |>
    mutate(!!! years_exprs) |>
    ungroup()
)

ungroup(mutate(rowwise(df), year_1 = case_when(revenues_1 == sales_2005 ~ 2005,
revenues_1 == sales_2006 ~ 2006, revenues_1 == sales_2007 ~ 2007, .default = NA),
year_2 = case_when(revenues_2 == sales_2005 ~ 2005, revenues_2 == sales_2006 ~
  2006, revenues_2 == sales_2007 ~ 2007, .default = NA)))

话虽如此,如果您能找到实施@jpsmith提供的解决方案的方法,那将是更好的做法。根据我的经验,人们在数据以宽格式存在时才会以这种方式使用表达式。通常更容易将其转换为长格式或使用可用的工具,如across

输出

  revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 year_1 year_2
       <dbl>      <dbl>      <dbl>      <dbl>      <dbl>  <dbl>  <dbl>
1          1          2          1          2          3   2005   2006
2          3          4          2          3          4   2006   2007
3          4          5          3          4          5   2006   2007
英文:

If you absolutely needed to generate expressions you could do so like:

library(glue)
library(tidyverse)

years &lt;- 2005:2007
idx &lt;- 1:2
years_exprs &lt;- map_chr(idx, ~ {
  str_flatten_comma(c(glue(&quot;revenues_{.x} == sales_{years} ~ {years}&quot;),
                      &quot;.default = NA&quot;)) %&gt;% 
    str_c(&quot;case_when(&quot;, ., &quot;)&quot;)
                    }) |&gt;
  set_names(paste0(&quot;year_&quot;, idx)) |&gt;
  parse_exprs()


df |&gt;
  rowwise() |&gt;
  mutate(!!! years_exprs) |&gt;
  ungroup()

You need to provide named expressions to mutate if you are going to use the split-slice operator !!!. The names will then become the variable names.

You can see how this expression will resolve after injection using rlang::qq_show:

qq_show(
  df |&gt;
    rowwise() |&gt;
    mutate(!!! years_exprs) |&gt;
    ungroup()
)

ungroup(mutate(rowwise(df), year_1 = case_when(revenues_1 == sales_2005 ~ 2005,
revenues_1 == sales_2006 ~ 2006, revenues_1 == sales_2007 ~ 2007, .default = NA),
year_2 = case_when(revenues_2 == sales_2005 ~ 2005, revenues_2 == sales_2006 ~
  2006, revenues_2 == sales_2007 ~ 2007, .default = NA)))

All that being said, if you could find a way to implement the solution provided by @jpsmith that would be a better practice. In my experience, people resort to using expressions in this way when they have data in a wide format. Often it is easier to pivot to a longer format or use available tools like across.

Output

  revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 year_1 year_2
       &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;
1          1          2          1          2          3   2005   2006
2          3          4          2          3          4   2006   2007
3          4          5          3          4          5   2006   2007

huangapple
  • 本文由 发表于 2023年7月28日 00:59:06
  • 转载请务必保留本文链接:https://go.coder-hub.com/76781965.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定