使用dbplyr如何删除所有只包含NA的列?

huangapple go评论64阅读模式
英文:

How do I delete all columns that contain NA only with dbplyr?

问题

我想删除一个数据框中所有只包含NA值的列,在`dbplyr`语法中。
使用普通的R语法没有问题,参见这里:https://stackoverflow.com/questions/2643939/remove-columns-from-dataframe-where-all-values-are-na
这是在R中的一个可能性:

library(tidyverse)
library(DBI)
library(dbplyr)

df <- data.frame(a = NA,
                 b = seq(1:5), 
                 c = c(rep(1, 4), NA))

df %>% purrr::discard(~all(is.na(.)))
# 期望输出:
# A tibble: 5 × 2
      b     c
  <int> <dbl>
1     1     1
2     2     1
3     3     1
4     4     1
5     5    NA

但现在我想在`dbplyr`语法中做到这一点。以下是一个示例:

# 创建测试数据库
con <- dbConnect(RSQLite::SQLite(), "")
# 插入测试表
dbWriteTable(con, "df", df)

con %>% 
  tbl("df") %>% 
  purrr::discard(~all(is.na(.))) %>% 
  collect()

# 错误信息:Error in UseMethod("collect") : 
# no applicable method for 'collect' applied to an object of class "list"
英文:

I want to delete all columns of a dataframe which contain NA values only, within the dbplyr syntax.
With the normal R syntax this is no problem, see here: https://stackoverflow.com/questions/2643939/remove-columns-from-dataframe-where-all-values-are-na
Here is a possibility in R:

library(tidyverse)
library(DBI)
library(dbplyr)

df &lt;- data.frame(a = NA,
                 b = seq(1:5), 
                 c = c(rep(1, 4), NA))

df %&gt;% purrr::discard(~all(is.na(.)))
# desired output:
# A tibble: 5 &#215; 2
      b     c
  &lt;int&gt; &lt;dbl&gt;
1     1     1
2     2     1
3     3     1
4     4     1
5     5    NA

But now I want to do this in the dbplyr syntax. For this the following example:

# Creating test database
con &lt;- dbConnect(RSQLite::SQLite(), &quot;&quot;)
# Inserting test table
dbWriteTable(con, &quot;df&quot;, df)

con %&gt;% 
  tbl(&quot;df&quot;) %&gt;% 
  purrr::discard(~all(is.na(.))) %&gt;% 
  collect()

#Error in UseMethod(&quot;collect&quot;) : 
# no applicable method for &#39;collect&#39; applied to an object of class &quot;list&quot;

Can anyone help me find a solution within dbplyr?

答案1

得分: 3

修订后的答案 - 删除所有值都为NA的列

这里的想法是检查每一列的NA值的数量并与行数进行比较:

df = con %>% tbl("df")

num_rows = df %>%
  ungroup() %>%
  summarise(num = n()) %>%
  collect() %>%
  pull(num)

cols_to_remove = c()

for(col in colnames(df)){
  num_na = df %>%
    ungroup() %>%
    mutate(is_na = ifelse(is.na(!!sym(col)), 1, 0)) %>%
    summarise(num_na = sum(is_na)) %>%
    collect() %>%
    pull(num_na)

  if(num_na == num_rows){
    cols_to_remove = c(cols_to_remove, col)
  }
}

local_df = df %>%
  select(-all_of(cols_to_remove)) %>%
  collect()

原始答案 - 删除所有值都为NA的行

这是一种方法:

df = con %>% tbl("df")

df = df %>% mutate(all_na = TRUE)

for(col in colnames(df)){
  df = df %>%
    mutate(all_na = ifelse(!is.na(!!sym(col)), FALSE, all_na))
}

df %>%
  filter(all_na != TRUE) %>%
  select(-all_na)

这个想法是创建一个新的列来存储行中是否存在NA值。通过迭代所有行并更新这列,我们创建了一个指示器列,表示是否存在所有NA值。


语法解释

!!sym(.) 是 rlang 的特性。它将包含字符串 &quot;my_column&quot; 的变量 col 转换为列名 my_column

select 中使用 !!sym(.) 的地方,我们使用了 all_of 来实现类似的功能。

英文:

Revised answer - remove columns where every value is NA

The idea here is check each column for the number of NAs and compare against the number of rows:

df = con %&gt;% tbl(&quot;df&quot;)

num_rows = df %&gt;%
  ungroup() %&gt;%
  summarise(num = n()) %&gt;%
  collect() %&gt;%
  pull(num)

cols_to_remove = c()

for(col in colnames(df)){
  num_na = df %&gt;%
    ungroup() %&gt;%
    mutate(is_na = ifelse(is.na(!!sym(col)), 1, 0)) %&gt;%
    summarise(num_na = sum(is_na)) %&gt;%
    collect() %&gt;%
    pull(num_na)

  if(num_na == num_rows){
    cols_to_remove = c(cols_to_remove, col)
  }
}

local_df = df %&gt;%
  select(-all_of(cols_to_remove)) %&gt;%
  collect()

Original answer - removes rows where every value is NA

Here is one approach:

df = con %&gt;% tbl(&quot;df&quot;)

df = df %&gt;% mutate(all_na = TRUE)

for(col in colnames(df)){
  df = df %&gt;%
    mutate(all_na = ifelse(!is.na(!!sym(col)), FALSE, all_na))
}

df %&gt;%
  filter(all_na != TRUE) %&gt;%
  select(-all_na)

The idea is to create a new column to store whether NAs are found in the row. By iterating through all the rows and updating this column, we create an indicator column for the presence of all NA values.


Syntax explanation

!!sym(.) is an rlang feature. It converts from the variable col containing the string &quot;my_column&quot; to the column name my_column.

Instead of using !!sym(.) inside select we use all_of for a similar purpose.

huangapple
  • 本文由 发表于 2023年7月4日 23:08:24
  • 转载请务必保留本文链接:https://go.coder-hub.com/76613918.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定