Is there a way to summarise by percentage in R while including the data in a new data frame?

huangapple go评论77阅读模式
英文:

Is there a way to summarise by percentage in R while including the data in a new data frame?

问题

我在我的工作中经常使用Excel和R,我一直在尝试自动化一种我的老板要求我检查数据质量的表格。我最近才开始使用R,所以我的代码还不是很好。

我的想法是创建一个数据框,每一列都汇总了这些向量的信息。数据框中所有NA的总数,数据框中NA的百分比,然后根据某些列对NA在某个水平上的数量进行过滤。

我尝试过的代码如下:

rowsna <- c("总NA数", "% NA占比", "变量1的NA数量,在水平1上", ...)
na_count <- df %>% summarise_all(~sum(is.na(.)))
na_count[2, ] <- df %>% summarise_all(~mean(is.na(.)))
na_count[3, ] <- df %>% filter(变量 == 值) %>% summarise_all(~sum(is.na(.)))
...
row.names(na_count) <- rowsna
na_count <- as.data.frame(t(na_count))
na_count$variable

问题是,我不知道如何计算na_count[2, ]部分的缺失百分比。如果可能的话,我想要一些帮助。

英文:

I'm working a lot using Excel and R in my job and I've been trying to automatize a type of form my Boss asks me about the data quality. I've just recently started working with R so my code isn't the best.

The idea is to do a data.frame that summarizes in each column these vectors. Sum of all na's in the data.frame, percentage of NA in the data.frame and then filtering by some columns is the n of NAs in a level.

The code I've tried is the following one:

rowsna &lt;- c(&quot;Total NA&quot;, &quot;% NA&quot;, &quot;n NA Variable 1, level 1&quot;,...)
na_count &lt;- df %&gt;% summarise_all(~sum(is.na(.)))
na_count[2, ] &lt;- df %&gt;% summarise_all(~mean(is.na(.)))
na_count[3, ] &lt;- df %&gt;% filter(variable == value) %&gt;% summarise_all(~sum(is.na(.)))
...
row.names(na_count) &lt;- rowsna
na_count &lt;- as.data.frame(t(na_count))
na_count$variable

The thing is, I've got no idea how to calc the percentage of missing in the na_count[2 , ] part. I would like some help if possible.

答案1

得分: 1

这似乎是你想要的:

library(tidyverse)

# 虚构数据集
df <- tibble(
  id = 1:10,
  x = c(1:9, NA),
  y = c(1:5, rep(NA, 5)),
  z = rep(NA, 10)
)

NA_df <- df %>%
  # 计算每列中NA的数量
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  
  # 然后将其长格式化
  pivot_longer(cols = everything()) %>%
  
  # 接着计算每列中NA的百分比
  mutate(mean = 100*value/nrow(df))

# 假设我们只想获取NA少于5个的列
threshold <- 5

good_columns <- NA_df %>%
  filter(value < threshold) %>%
  pull(name)

# 现在我们可以使用good_columns向量来子集化原始数据框
df %>%
  select(all_of(good_columns))

# 一个tibble: 10 × 2
      id     x
   <int> <int>
 1     1     1
 2     2     2
 3     3     3
 4     4     4
 5     5     5
 6     6     6
 7     7     7
 8     8     8
 9     9     9
10    10    NA
英文:

It sounds like this is what you want:

library(tidyverse)

# toy dataset
df &lt;- tibble(
  id = 1:10,
  x = c(1:9, NA),
  y = c(1:5, rep(NA, 5)),
  z = rep(NA, 10)
)

NA_df &lt;- df %&gt;%
  # we find the number of NAs in each column
  summarise(across(everything(), ~ sum(is.na(.x)))) %&gt;%

  # then we pivot it longer
  pivot_longer(cols = everything()) %&gt;%

  # then find the percentage of NAs in each column
  mutate(mean = 100*value/nrow(df))

# let&#39;s say for the sake of argument that we only want to get columns with less than 5 NAs
threshold &lt;- 5

good_columns &lt;- NA_df %&gt;%
  filter(value &lt; threshold) %&gt;%
  pull(name)

# now we can use the good_columns vector to subset the original dataframe
df %&gt;%
  select(all_of(good_columns))

# A tibble: 10 &#215; 2
      id     x
   &lt;int&gt; &lt;int&gt;
 1     1     1
 2     2     2
 3     3     3
 4     4     4
 5     5     5
 6     6     6
 7     7     7
 8     8     8
 9     9     9
10    10    NA

huangapple
  • 本文由 发表于 2023年6月1日 15:51:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76379753.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定