How to write anonymous functions in across() with the R arrow package
Question
I have opened a .parquet dataset through the open_dataset function of the arrow package. I want to use across to clean several numeric columns at a time. However, when I run this code:
start_numeric_cols = "sum"
sales <- sales %>% mutate(
  across(starts_with(start_numeric_cols) & (!where(is.numeric)),
         \(col) {replace(col, col == "NULL", 0) %>% as.numeric()}),
  across(starts_with(start_numeric_cols) & (where(is.numeric)),
         \(col) {replace(col, is.na(col), 0)})
)
#> Error in `across_setup()`:
#> ! Anonymous functions are not yet supported in Arrow
The error message is pretty informative, but I am wondering whether there is any way to do the same only with dplyr verbs within across (or another workaround without having to type each column name).
Answer 1
Score: 3
[1]: https://arrow.apache.org/docs/r/reference/acero.html
The arrow package supports a growing set of functions that can be used in queries without pulling the data into R (see the list of supported functions [1]), but replace() is not yet among them. You can use ifelse(), if_else(), or case_when() instead. Note also that across() accepts purrr-style lambda functions (~ .x), whereas regular anonymous functions (\(x) ...) are not yet supported.
I don't have your data so will use the iris dataset as an example to demonstrate that the query builds successfully, even if it doesn't make complete sense in the context of this data.
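library(arrow)
library(dplyr)

start_numeric_cols <- "P"

iris %>%
  as_arrow_table() %>%
  mutate(
    # Non-numeric columns: swap the text "NULL" for "0", then cast to numeric.
    # Both if_else() branches are strings so that their types match.
    across(
      starts_with(start_numeric_cols) & (!where(is.numeric)),
      ~ as.numeric(if_else(.x == "NULL", "0", .x))
    ),
    # Numeric columns: replace missing values with 0.
    across(
      starts_with(start_numeric_cols) & (where(is.numeric)),
      ~ if_else(is.na(.x), 0, .x)
    )
  )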
Table (query)
Sepal.Length: double
Sepal.Width: double
Petal.Length: double (if_else(is_null(Petal.Length, {nan_is_null=true}), 0, Petal.Length))
Petal.Width: double (if_else(is_null(Petal.Width, {nan_is_null=true}), 0, Petal.Width))
Species: dictionary<values=string, indices=int8>
See $.data for the source Arrow object
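Because iris has no text columns starting with "P", the first across() above never actually runs. As a minimal sketch on data shaped more like the question's, the following builds a small toy table and collects the result. The sales table, its column names (id, sum_a, sum_b) and its values are hypothetical stand-ins, not the asker's real data.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for the `sales` dataset (assumed, not the real data):
# sum_a is stored as text with the literal string "NULL",
# sum_b is stored as numeric with a genuine missing value.
sales <- as_arrow_table(data.frame(
  id = 1:3,
  sum_a = c("1.5", "NULL", "3"),
  sum_b = c(NA, 2, 4),
  stringsAsFactors = FALSE
))

start_numeric_cols <- "sum"

cleaned <- sales %>%
  mutate(
    # Text columns: replace "NULL" with "0" (same type in both branches),
    # then cast the whole column to numeric.
    across(
      starts_with(start_numeric_cols) & !where(is.numeric),
      ~ as.numeric(if_else(.x == "NULL", "0", .x))
    ),
    # Numeric columns: replace missing values with 0.
    across(
      starts_with(start_numeric_cols) & where(is.numeric),
      ~ if_else(is.na(.x), 0, .x)
    )
  ) %>%
  collect()  # run the query and pull the result into R as a data frame
```

With a dataset opened through open_dataset(), the same mutate() call should apply unchanged; collect() (or write_dataset()) is what triggers the computation.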