如何根据变量名动态创建case_when()表达式?

huangapple go评论109阅读模式
英文:

How to dynamically create case_when() expressions based on variable names?

问题

以下是您要翻译的内容:

"I have a dataframe with a set of revenue variables and a set of sales variables. I want to create a new set of variables that takes the value of the revenue variable and checks which sales variable it matches, and if it matches, it returns the corresponding year.

Here's a simplified example of what I have and what I'm trying to achieve:

  1. library(tidyverse)
  2. library(rlang)
  3. # Creating an example dataframe
  4. df <- tibble(
  5. revenues_1 = c(1, 3, 4),
  6. revenues_2 = c(2, 4, 5),
  7. sales_2005 = c(1, 2, 3),
  8. sales_2006 = c(2, 3, 4),
  9. sales_2007 = c(3, 4, 5)
  10. )
  11. # What I want to do:
  12. # df <- df |>
  13. # rowwise() |>
  14. # mutate(
  15. # year_1 = case_when(
  16. # revenues_1 == sales_2005 ~ 2005,
  17. # revenues_1 == sales_2006 ~ 2006,
  18. # revenues_1 == sales_2007 ~ 2007,
  19. # .default = NA
  20. # ),
  21. # year_2 = case_when(
  22. # revenues_2 == sales_2005 ~ 2005,
  23. # revenues_2 == sales_2006 ~ 2006,
  24. # revenues_2 == sales_2007 ~ 2007,
  25. # .default = NA
  26. # )
  27. # )

However, this is not scalable if I have many years and many revenue variables.
How can I make this more efficient?

One idea is (not working):

  1. # Create a list of expressions for each year
  2. years_exprs <- map(2005:2007, ~parse_expr(paste0("var == 'sales_", .x, "' ~ ", .x)))
  3. # Use this list in a case_when statement
  4. df <- df |>
  5. rowwise() |>
  6. mutate(across(starts_with("revenues_"), function(var) {
  7. case_when(
  8. !!!years_exprs,
  9. .default = NA
  10. )
  11. }, .names = "year_{.col}")) |>
  12. ungroup()
  13. ```"
  14. <details>
  15. <summary>英文:</summary>
  16. I have a dataframe with a set of revenue variables and a set of sales variables. I want to create a new set of variables that takes the value of the revenue variable and checks which sales variable it matches, and if it matches, it returns the corresponding year.
  17. Here&#39;s a simplified example of what I have and what I&#39;m trying to achieve:
  18. ``` r
  19. library(tidyverse)
  20. library(rlang)
  21. # Creating an example dataframe
  22. df &lt;- tibble(
  23. revenues_1 = c(1, 3, 4),
  24. revenues_2 = c(2, 4, 5),
  25. sales_2005 = c(1, 2, 3),
  26. sales_2006 = c(2, 3, 4),
  27. sales_2007 = c(3, 4, 5)
  28. )
  29. # What I want to do:
  30. # df &lt;- df |&gt;
  31. # rowwise() |&gt;
  32. # mutate(
  33. # year_1 = case_when(
  34. # revenues_1 == sales_2005 ~ 2005,
  35. # revenues_1 == sales_2006 ~ 2006,
  36. # revenues_1 == sales_2007 ~ 2007,
  37. # .default = NA
  38. # ),
  39. # year_2 = case_when(
  40. # revenues_2 == sales_2005 ~ 2005,
  41. # revenues_2 == sales_2006 ~ 2006,
  42. # revenues_2 == sales_2007 ~ 2007,
  43. # .default = NA
  44. # )
  45. # )

However, this is not scalable if I have many years and many revenue variables.
How can I make this more efficient?

One idea is (not working):

  1. # Create a list of expressions for each year
  2. years_exprs &lt;- map(2005:2007, ~parse_expr(paste0(&quot;var == &#39;sales_&quot;, .x, &quot;&#39; ~ &quot;, .x)))
  3. # Use this list in a case_when statement
  4. df &lt;- df |&gt;
  5. rowwise() |&gt;
  6. mutate(across(starts_with(&quot;revenues_&quot;), function(var) {
  7. case_when(
  8. !!!years_exprs,
  9. .default = NA
  10. )
  11. }, .names = &quot;year_{.col}&quot;)) |&gt;
  12. ungroup()

答案1

得分: 2

如果我理解你的意思正确,你可以使用现有的代码,使用dplyrmutateacrossstarts_with,然后将函数放在一个命名列表中来创建新的变量。

为了减少为多个年份创建单独的case_when逻辑的工作量,你可以首先使用rlang::parse_exprs()paste0

  1. yrs <- 2005:2007
  2. years_exprs <- rlang::parse_exprs(paste0(".x == sales_", yrs, " ~ ", yrs))

然后使用!!!来评估它:

  1. library(dplyr)
  2. df %>% mutate(across(starts_with("revenues_"),
  3. list(year = ~case_when(!!!years_exprs))))

输出:

  1. revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 revenues_1_year revenues_2_year
  2. 1 1 2 1 2 3 2005 2006
  3. 2 3 4 2 3 4 2006 2007
  4. 3 4 5 3 4 5 2006 2007

注意:如果你的变量不一定都以"revenues"开头,你可以创建一个包含所需列的向量,无论它们是否共享相同的模式(这里是mut_vars),然后使用all_of()

  1. mut_vars <- c("revenues_1", "revenues_2")
  2. df %>% mutate(across(all_of(mut_vars),
  3. list(year = ~case_when(!!!years_exprs))))
英文:

If I understand you correctly, using your existing code you could use dplyr's mutate, across and starts_with, then place the function in a named list to create the new variables.

In order to reduce the effort to create individual case_when logics for multiple years, you could first use rlang::parse_exprs() with paste0:

  1. yrs &lt;- 2005:2007
  2. years_exprs &lt;- rlang::parse_exprs(paste0(&quot;.x == sales_&quot;, yrs, &quot; ~ &quot;, yrs))

Then evaluate it using !!!:

  1. library(dplyr)
  2. df %&gt;% mutate(across(starts_with(&quot;revenues_&quot;),
  3. list(year = ~case_when(!!!years_exprs))))

output:

  1. revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 revenues_1_year revenues_2_year
  2. &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
  3. 1 1 2 1 2 3 2005 2006
  4. 2 3 4 2 3 4 2006 2007
  5. 3 4 5 3 4 5 2006 2007

Note: if your variables don't necessarily all start with "revenues", you could create a vector of desired columns regardless if they share the same pattern (here, mut_vars), then use all_of():

  1. mut_vars &lt;- c(&quot;revenues_1&quot;, &quot;revenues_2&quot;)
  2. df %&gt;% mutate(across(all_of(mut_vars),
  3. list(year = ~case_when(!!!years_exprs))))

答案2

得分: 0

以下是翻译好的部分:

如果您绝对需要生成表达式,可以像这样操作:

  1. library(glue)
  2. library(tidyverse)
  3. years <- 2005:2007
  4. idx <- 1:2
  5. years_exprs <- map_chr(idx, ~ {
  6. str_flatten_comma(c(glue("revenues_{.x} == sales_{years} ~ {years}"),
  7. ".default = NA")) %>%
  8. str_c("case_when(", ., ")")
  9. }) |>
  10. set_names(paste0("year_", idx)) |>
  11. parse_exprs()
  12. df |>
  13. rowwise() |>
  14. mutate(!!! years_exprs) |>
  15. ungroup()

如果您要使用拆分切片操作符!!!,则需要为mutate提供命名表达式,然后这些名称将成为变量名称。

您可以看到在使用rlang::qq_show注入后,此表达式将如何解析:

  1. qq_show(
  2. df |>
  3. rowwise() |>
  4. mutate(!!! years_exprs) |>
  5. ungroup()
  6. )
  7. ungroup(mutate(rowwise(df), year_1 = case_when(revenues_1 == sales_2005 ~ 2005,
  8. revenues_1 == sales_2006 ~ 2006, revenues_1 == sales_2007 ~ 2007, .default = NA),
  9. year_2 = case_when(revenues_2 == sales_2005 ~ 2005, revenues_2 == sales_2006 ~
  10. 2006, revenues_2 == sales_2007 ~ 2007, .default = NA)))

话虽如此,如果您能找到实施@jpsmith提供的解决方案的方法,那将是更好的做法。根据我的经验,人们在数据以宽格式存在时才会以这种方式使用表达式。通常更容易将其转换为长格式或使用可用的工具,如across

输出

  1. revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 year_1 year_2
  2. <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
  3. 1 1 2 1 2 3 2005 2006
  4. 2 3 4 2 3 4 2006 2007
  5. 3 4 5 3 4 5 2006 2007
英文:

If you absolutely needed to generate expressions you could do so like:

  1. library(glue)
  2. library(tidyverse)
  3. years &lt;- 2005:2007
  4. idx &lt;- 1:2
  5. years_exprs &lt;- map_chr(idx, ~ {
  6. str_flatten_comma(c(glue(&quot;revenues_{.x} == sales_{years} ~ {years}&quot;),
  7. &quot;.default = NA&quot;)) %&gt;%
  8. str_c(&quot;case_when(&quot;, ., &quot;)&quot;)
  9. }) |&gt;
  10. set_names(paste0(&quot;year_&quot;, idx)) |&gt;
  11. parse_exprs()
  12. df |&gt;
  13. rowwise() |&gt;
  14. mutate(!!! years_exprs) |&gt;
  15. ungroup()

You need to provide named expressions to mutate if you are going to use the split-slice operator !!!. The names will then become the variable names.

You can see how this expression will resolve after injection using rlang::qq_show:

  1. qq_show(
  2. df |&gt;
  3. rowwise() |&gt;
  4. mutate(!!! years_exprs) |&gt;
  5. ungroup()
  6. )
  7. ungroup(mutate(rowwise(df), year_1 = case_when(revenues_1 == sales_2005 ~ 2005,
  8. revenues_1 == sales_2006 ~ 2006, revenues_1 == sales_2007 ~ 2007, .default = NA),
  9. year_2 = case_when(revenues_2 == sales_2005 ~ 2005, revenues_2 == sales_2006 ~
  10. 2006, revenues_2 == sales_2007 ~ 2007, .default = NA)))

All that being said, if you could find a way to implement the solution provided by @jpsmith that would be a better practice. In my experience, people resort to using expressions in this way when they have data in a wide format. Often it is easier to pivot to a longer format or use available tools like across.

Output

  1. revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 year_1 year_2
  2. &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
  3. 1 1 2 1 2 3 2005 2006
  4. 2 3 4 2 3 4 2006 2007
  5. 3 4 5 3 4 5 2006 2007

huangapple
  • 本文由 发表于 2023年7月28日 00:59:06
  • 转载请务必保留本文链接:https://go.coder-hub.com/76781965.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定