April 10, 2023, 18:22:24

How to write anonymous functions in R arrow across

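Question

I have opened a .parquet dataset with the open_dataset function from the arrow package. I want to use across to clean several numeric columns at once. However, when I run this code: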

start_numeric_cols = "sum"
sales <- sales %>% mutate(
  across(starts_with(start_numeric_cols) & (!where(is.numeric)), 
         \(col) {replace(col, col == "NULL", 0) %>% as.numeric()}),
  across(starts_with(start_numeric_cols) & (where(is.numeric)),
         \(col) {replace(col, is.na(col), 0)})
)
#> Error in `across_setup()`:
#> ! Anonymous functions are not yet supported in Arrow

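The error message is informative, but I am wondering whether there is a way to do the same thing using only dplyr verbs within across (or another workaround that does not require typing each column name).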

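Answer 1

Score: 3

arrow has a growing set of functions that can be used without pulling the data into R (listed in the reference at [1]), but replace() is not yet supported. However, you can use ifelse()/if_else()/case_when(). Note also that purrr-style lambda functions (~ f(.x)) are supported where regular anonymous functions (\(x) or function(x)) are not.

I don't have your data, so I will use the iris dataset to demonstrate that the query builds successfully, even though it doesn't make complete sense for this data (in iris, no non-numeric column starts with "P", so the first across() selects nothing).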

library(arrow)
library(dplyr)
start_numeric_cols <- "P"
iris %>%
  as_arrow_table() %>%
  mutate(
    across(
      starts_with(start_numeric_cols) & (!where(is.numeric)),
      # "0" is quoted so both if_else() branches are strings before as.numeric()
      ~ as.numeric(if_else(.x == "NULL", "0", .x))
    ),
    across(
      starts_with(start_numeric_cols) & (where(is.numeric)),
      ~ if_else(is.na(.x), 0, .x)
    )
  )

The resulting query prints as:

Table (query)
Sepal.Length: double
Sepal.Width: double
Petal.Length: double (if_else(is_null(Petal.Length, {nan_is_null=true}), 0, Petal.Length))
Petal.Width: double (if_else(is_null(Petal.Width, {nan_is_null=true}), 0, Petal.Width))
Species: dictionary<values=string, indices=int8>
See $.data for the source Arrow object

[1]: https://arrow.apache.org/docs/r/reference/acero.html
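
To check that this pattern also evaluates end to end, here is a minimal sketch on a hypothetical three-row table. The column names id, sum_a and sum_b and their values are invented for illustration and are not the asker's data. It assumes an arrow version in which across() accepts where() and purrr-style lambdas. The replacement value is the string "0" so that both branches of if_else() have the same type before the cast.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the asker's `sales` dataset:
# sum_a is a character column that uses the string "NULL" for missing values,
# sum_b is a numeric column with a real NA.
sales <- as_arrow_table(
  data.frame(
    id    = 1:3,
    sum_a = c("1.5", "NULL", "3"),
    sum_b = c(10, NA, 30),
    stringsAsFactors = FALSE
  )
)

start_numeric_cols <- "sum"

query <- sales %>%
  mutate(
    # Non-numeric "sum*" columns: swap "NULL" for "0", then cast to double.
    across(
      starts_with(start_numeric_cols) & (!where(is.numeric)),
      ~ as.numeric(if_else(.x == "NULL", "0", .x))
    ),
    # Numeric "sum*" columns: replace missing values with 0.
    across(
      starts_with(start_numeric_cols) & where(is.numeric),
      ~ if_else(is.na(.x), 0, .x)
    )
  )

# Nothing is computed until collect() pulls the result into R as a tibble.
cleaned <- collect(query)

# Expected: sum_a is 1.5, 0, 3 and sum_b is 10, 0, 30, both as doubles.
print(cleaned)
```

The same mutate() call should apply unchanged to an object returned by open_dataset(), since the query is built from the schema rather than from the data, but that case is not exercised here.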
