2023年7月28日 00:59:06go评论110阅读模式

英文:

How to dynamically create case_when() expressions based on variable names?

问题

以下是您要翻译的内容：

"I have a dataframe with a set of revenue variables and a set of sales variables. I want to create a new set of variables that takes the value of the revenue variable and checks which sales variable it matches, and if it matches, it returns the corresponding year.

Here's a simplified example of what I have and what I'm trying to achieve:

library(tidyverse)
library(rlang)
# Creating an example dataframe
df <- tibble(
  revenues_1 = c(1, 3, 4),
  revenues_2 = c(2, 4, 5),
  sales_2005 = c(1, 2, 3),
  sales_2006 = c(2, 3, 4),
  sales_2007 = c(3, 4, 5)
)
# What I want to do:
# df <- df |>
#  rowwise() |>
#  mutate(
#    year_1 = case_when(
#      revenues_1 == sales_2005 ~ 2005,
#      revenues_1 == sales_2006 ~ 2006,
#      revenues_1 == sales_2007 ~ 2007,
#      .default = NA
#    ),
#    year_2 = case_when(
#      revenues_2 == sales_2005 ~ 2005,
#      revenues_2 == sales_2006 ~ 2006,
#      revenues_2 == sales_2007 ~ 2007,
#      .default = NA
#    )
#  )

However, this is not scalable if I have many years and many revenue variables.
How can I make this more efficient?

One idea is (not working):

# Create a list of expressions for each year
years_exprs <- map(2005:2007, ~parse_expr(paste0("var == 'sales_", .x, "' ~ ", .x)))
# Use this list in a case_when statement
df <- df |>
  rowwise() |>
  mutate(across(starts_with("revenues_"), function(var) {
    case_when(
      !!!years_exprs,
      .default = NA
    )
  }, .names = "year_{.col}")) |>
  ungroup()
```"
<details>
<summary>英文:</summary>
I have a dataframe with a set of revenue variables and a set of sales variables. I want to create a new set of variables that takes the value of the revenue variable and checks which sales variable it matches, and if it matches, it returns the corresponding year. 
Here&#39;s a simplified example of what I have and what I&#39;m trying to achieve:
``` r
library(tidyverse)
library(rlang)
# Creating an example dataframe
df &lt;- tibble(
  revenues_1 = c(1, 3, 4),
  revenues_2 = c(2, 4, 5),
  sales_2005 = c(1, 2, 3),
  sales_2006 = c(2, 3, 4),
  sales_2007 = c(3, 4, 5)
)
# What I want to do:
# df &lt;- df |&gt; 
#  rowwise() |&gt;
#  mutate(
#    year_1 = case_when(
#      revenues_1 == sales_2005 ~ 2005,
#      revenues_1 == sales_2006 ~ 2006,
#      revenues_1 == sales_2007 ~ 2007,
#      .default = NA
#    ),
#    year_2 = case_when(
#      revenues_2 == sales_2005 ~ 2005,
#      revenues_2 == sales_2006 ~ 2006,
#      revenues_2 == sales_2007 ~ 2007,
#      .default = NA
#    )
#  )

However, this is not scalable if I have many years and many revenue variables.
How can I make this more efficient?

One idea is (not working):

# Create a list of expressions for each year
years_exprs &lt;- map(2005:2007, ~parse_expr(paste0(&quot;var == &#39;sales_&quot;, .x, &quot;&#39; ~ &quot;, .x)))
# Use this list in a case_when statement
df &lt;- df |&gt; 
  rowwise() |&gt;
  mutate(across(starts_with(&quot;revenues_&quot;), function(var) {
    case_when(
      !!!years_exprs,
      .default = NA
    )
  }, .names = &quot;year_{.col}&quot;)) |&gt; 
  ungroup()

答案1

得分: 2

如果我理解你的意思正确，你可以使用现有的代码，使用dplyr的mutate、across和starts_with，然后将函数放在一个命名列表中来创建新的变量。

为了减少为多个年份创建单独的case_when逻辑的工作量，你可以首先使用rlang::parse_exprs()和paste0：

yrs <- 2005:2007
years_exprs <- rlang::parse_exprs(paste0(".x == sales_", yrs, " ~ ", yrs))

然后使用!!!来评估它：

library(dplyr)
df %>% mutate(across(starts_with("revenues_"), 
                     list(year = ~case_when(!!!years_exprs))))

输出：

  revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 revenues_1_year revenues_2_year
1          1          2          1          2          3            2005            2006
2          3          4          2          3          4            2006            2007
3          4          5          3          4          5            2006            2007

注意：如果你的变量不一定都以"revenues"开头，你可以创建一个包含所需列的向量，无论它们是否共享相同的模式（这里是mut_vars），然后使用all_of()：

mut_vars <- c("revenues_1", "revenues_2")
df %>% mutate(across(all_of(mut_vars), 
                     list(year = ~case_when(!!!years_exprs))))

英文:

If I understand you correctly, using your existing code you could use dplyr's mutate, across and starts_with, then place the function in a named list to create the new variables.

In order to reduce the effort to create individual case_when logics for multiple years, you could first use rlang::parse_exprs() with paste0:

yrs &lt;- 2005:2007
years_exprs &lt;- rlang::parse_exprs(paste0(&quot;.x == sales_&quot;, yrs, &quot; ~ &quot;, yrs))

Then evaluate it using !!!:

library(dplyr)
df %&gt;% mutate(across(starts_with(&quot;revenues_&quot;), 
                     list(year = ~case_when(!!!years_exprs))))

output:

  revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 revenues_1_year revenues_2_year
       &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;           &lt;dbl&gt;           &lt;dbl&gt;
1          1          2          1          2          3            2005            2006
2          3          4          2          3          4            2006            2007
3          4          5          3          4          5            2006            2007

Note: if your variables don't necessarily all start with "revenues", you could create a vector of desired columns regardless if they share the same pattern (here, mut_vars), then use all_of():

mut_vars &lt;- c(&quot;revenues_1&quot;, &quot;revenues_2&quot;)
df %&gt;% mutate(across(all_of(mut_vars), 
                     list(year = ~case_when(!!!years_exprs))))

答案2

得分: 0

以下是翻译好的部分：

如果您绝对需要生成表达式，可以像这样操作：

library(glue)
library(tidyverse)
years <- 2005:2007
idx <- 1:2
years_exprs <- map_chr(idx, ~ {
  str_flatten_comma(c(glue("revenues_{.x} == sales_{years} ~ {years}"),
                      ".default = NA")) %>%
    str_c("case_when(", ., ")")
                    }) |>
  set_names(paste0("year_", idx)) |>
  parse_exprs()
df |>
  rowwise() |>
  mutate(!!! years_exprs) |>
  ungroup()

如果您要使用拆分切片操作符!!!，则需要为mutate提供命名表达式，然后这些名称将成为变量名称。

您可以看到在使用rlang::qq_show注入后，此表达式将如何解析：

qq_show(
  df |>
    rowwise() |>
    mutate(!!! years_exprs) |>
    ungroup()
)
ungroup(mutate(rowwise(df), year_1 = case_when(revenues_1 == sales_2005 ~ 2005,
revenues_1 == sales_2006 ~ 2006, revenues_1 == sales_2007 ~ 2007, .default = NA),
year_2 = case_when(revenues_2 == sales_2005 ~ 2005, revenues_2 == sales_2006 ~
  2006, revenues_2 == sales_2007 ~ 2007, .default = NA)))

话虽如此，如果您能找到实施@jpsmith提供的解决方案的方法，那将是更好的做法。根据我的经验，人们在数据以宽格式存在时才会以这种方式使用表达式。通常更容易将其转换为长格式或使用可用的工具，如across。

输出

  revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 year_1 year_2
       <dbl>      <dbl>      <dbl>      <dbl>      <dbl>  <dbl>  <dbl>
1          1          2          1          2          3   2005   2006
2          3          4          2          3          4   2006   2007
3          4          5          3          4          5   2006   2007

英文:

If you absolutely needed to generate expressions you could do so like:

library(glue)
library(tidyverse)
years &lt;- 2005:2007
idx &lt;- 1:2
years_exprs &lt;- map_chr(idx, ~ {
  str_flatten_comma(c(glue(&quot;revenues_{.x} == sales_{years} ~ {years}&quot;),
                      &quot;.default = NA&quot;)) %&gt;% 
    str_c(&quot;case_when(&quot;, ., &quot;)&quot;)
                    }) |&gt;
  set_names(paste0(&quot;year_&quot;, idx)) |&gt;
  parse_exprs()
df |&gt;
  rowwise() |&gt;
  mutate(!!! years_exprs) |&gt;
  ungroup()

You need to provide named expressions to mutate if you are going to use the split-slice operator !!!. The names will then become the variable names.

You can see how this expression will resolve after injection using rlang::qq_show:

qq_show(
  df |&gt;
    rowwise() |&gt;
    mutate(!!! years_exprs) |&gt;
    ungroup()
)
ungroup(mutate(rowwise(df), year_1 = case_when(revenues_1 == sales_2005 ~ 2005,
revenues_1 == sales_2006 ~ 2006, revenues_1 == sales_2007 ~ 2007, .default = NA),
year_2 = case_when(revenues_2 == sales_2005 ~ 2005, revenues_2 == sales_2006 ~
  2006, revenues_2 == sales_2007 ~ 2007, .default = NA)))

All that being said, if you could find a way to implement the solution provided by @jpsmith that would be a better practice. In my experience, people resort to using expressions in this way when they have data in a wide format. Often it is easier to pivot to a longer format or use available tools like across.

Output

  revenues_1 revenues_2 sales_2005 sales_2006 sales_2007 year_1 year_2
       &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;
1          1          2          1          2          3   2005   2006
2          3          4          2          3          4   2006   2007
3          4          5          3          4          5   2006   2007

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何根据变量名动态创建case_when()表达式？

问题

答案1

答案2

如何在ggplot中制作基本的R双条形图

NAs produced using indexing to calculate RMSE.

运行`lapply`函数，对列表的列表使用`lmer`函数。

Logistic Regression中只有一个列的固定系数

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。