英文:
How do I delete all columns that contain NA only with dbplyr?
问题
我想删除一个数据框中所有只包含NA值的列,在`dbplyr`语法中。
使用普通的R语法没有问题,参见这里:https://stackoverflow.com/questions/2643939/remove-columns-from-dataframe-where-all-values-are-na
这是在R中的一个可能性:
library(tidyverse)
library(DBI)
library(dbplyr)
df <- data.frame(a = NA,
b = seq(1:5),
c = c(rep(1, 4), NA))
df %>% purrr::discard(~all(is.na(.)))
# 期望输出:
# A tibble: 5 × 2
b c
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 NA
但现在我想在`dbplyr`语法中做到这一点。以下是一个示例:
# 创建测试数据库
con <- dbConnect(RSQLite::SQLite(), "")
# 插入测试表
dbWriteTable(con, "df", df)
con %>%
tbl("df") %>%
purrr::discard(~all(is.na(.))) %>%
collect()
# 错误信息:Error in UseMethod("collect") :
# no applicable method for 'collect' applied to an object of class "list"
英文:
I want to delete all columns of a dataframe which contain NA values only, within the dbplyr
syntax.
With the normal R syntax this is no problem, see here: https://stackoverflow.com/questions/2643939/remove-columns-from-dataframe-where-all-values-are-na
Here is a possibility in R:
library(tidyverse)
library(DBI)
library(dbplyr)
df <- data.frame(a = NA,
b = seq(1:5),
c = c(rep(1, 4), NA))
df %>% purrr::discard(~all(is.na(.)))
# desired output:
# A tibble: 5 × 2
b c
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 NA
But now I want to do this in the dbplyr
syntax. For this the following example:
# Creating test database
con <- dbConnect(RSQLite::SQLite(), "")
# Inserting test table
dbWriteTable(con, "df", df)
con %>%
tbl("df") %>%
purrr::discard(~all(is.na(.))) %>%
collect()
#Error in UseMethod("collect") :
# no applicable method for 'collect' applied to an object of class "list"
Can anyone help me find a solution within dbplyr
?
答案1
得分: 3
修订后的答案 - 删除所有值都为NA的列
这里的想法是检查每一列的NA值的数量并与行数进行比较:
df = con %>% tbl("df")
num_rows = df %>%
ungroup() %>%
summarise(num = n()) %>%
collect() %>%
pull(num)
cols_to_remove = c()
for(col in colnames(df)){
num_na = df %>%
ungroup() %>%
mutate(is_na = ifelse(is.na(!!sym(col)), 1, 0)) %>%
summarise(num_na = sum(is_na)) %>%
collect() %>%
pull(num_na)
if(num_na == num_rows){
cols_to_remove = c(cols_to_remove, col)
}
}
local_df = df %>%
select(-all_of(cols_to_remove)) %>%
collect()
原始答案 - 删除所有值都为NA的行
这是一种方法:
df = con %>% tbl("df")
df = df %>% mutate(all_na = TRUE)
for(col in colnames(df)){
df = df %>%
mutate(all_na = ifelse(!is.na(!!sym(col)), FALSE, all_na))
}
df %>%
filter(all_na != TRUE) %>%
select(-all_na)
这个想法是创建一个新的列来存储行中是否存在NA值。通过迭代所有行并更新这列,我们创建了一个指示器列,表示是否存在所有NA值。
语法解释
!!sym(.)
是 rlang 的特性。它将包含字符串 "my_column"
的变量 col
转换为列名 my_column
。
在 select
中使用 !!sym(.)
的地方,我们使用了 all_of
来实现类似的功能。
英文:
Revised answer - remove columns where every value is NA
The idea here is check each column for the number of NAs and compare against the number of rows:
df = con %>% tbl("df")
num_rows = df %>%
ungroup() %>%
summarise(num = n()) %>%
collect() %>%
pull(num)
cols_to_remove = c()
for(col in colnames(df)){
num_na = df %>%
ungroup() %>%
mutate(is_na = ifelse(is.na(!!sym(col)), 1, 0)) %>%
summarise(num_na = sum(is_na)) %>%
collect() %>%
pull(num_na)
if(num_na == num_rows){
cols_to_remove = c(cols_to_remove, col)
}
}
local_df = df %>%
select(-all_of(cols_to_remove)) %>%
collect()
Original answer - removes rows where every value is NA
Here is one approach:
df = con %>% tbl("df")
df = df %>% mutate(all_na = TRUE)
for(col in colnames(df)){
df = df %>%
mutate(all_na = ifelse(!is.na(!!sym(col)), FALSE, all_na))
}
df %>%
filter(all_na != TRUE) %>%
select(-all_na)
The idea is to create a new column to store whether NAs are found in the row. By iterating through all the rows and updating this column, we create an indicator column for the presence of all NA values.
Syntax explanation
!!sym(.)
is an rlang feature. It converts from the variable col
containing the string "my_column"
to the column name my_column
.
Instead of using !!sym(.)
inside select
we use all_of
for a similar purpose.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论