Is there a way to summarise by percentage in R while including the data in a new data frame?
Question
I use Excel and R a lot in my job, and I've been trying to automate a data-quality report that my boss asks me for. I only recently started working with R, so my code isn't the best.
The idea is to build a data.frame whose columns hold the following summaries for each variable: the total number of NAs, the percentage of NAs, and then, after filtering on some columns, the number of NAs within a given level.
The code I've tried is the following:
rowsna <- c("Total NA", "% NA", "n NA Variable 1, level 1",...)
na_count <- df %>% summarise_all(~sum(is.na(.)))
na_count[2, ] <- df %>% summarise_all(~mean(is.na(.)))
na_count[3, ] <- df %>% filter(variable == value) %>% summarise_all(~sum(is.na(.)))
...
row.names(na_count) <- rowsna
na_count <- as.data.frame(t(na_count))
na_count$variable
The thing is, I have no idea how to calculate the percentage of missing values in the na_count[2, ] part. I would appreciate some help if possible.
Answer 1
Score: 1
It sounds like this is what you want:
library(tidyverse)

# toy dataset
df <- tibble(
  id = 1:10,
  x = c(1:9, NA),
  y = c(1:5, rep(NA, 5)),
  z = rep(NA, 10)
)

NA_df <- df %>%
  # count the NAs in each column
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  # pivot longer: one row per original column (columns `name` and `value`)
  pivot_longer(cols = everything()) %>%
  # percentage of NAs in each column (note: stored in a column called `mean`)
  mutate(mean = 100 * value / nrow(df))

# let's say, for the sake of argument, that we only want columns with fewer than 5 NAs
threshold <- 5

good_columns <- NA_df %>%
  filter(value < threshold) %>%
  pull(name)

# now we can use the good_columns vector to subset the original data frame
df %>%
  select(all_of(good_columns))
# A tibble: 10 × 2
      id     x
   <int> <int>
 1     1     1
 2     2     2
 3     3     3
 4     4     4
 5     5     5
 6     6     6
 7     7     7
 8     8     8
 9     9     9
10    10    NA
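The question also asks for a third summary, the number of NAs within one level of a variable, which the code above does not cover. A minimal sketch of one way to put all three summaries in a single table follows. It assumes a hypothetical grouping column `group` and level `"b"` standing in for the asker's `variable == value` filter; the column names `n_na`, `pct_na` and `n_na_group_b` are also made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): `group` stands in for the asker's filter variable
df <- tibble(
  group = rep(c("a", "b"), each = 5),
  x = c(1:9, NA),
  y = c(1:5, rep(NA, 5)),
  z = rep(NA, 10)
)

# One row per column of df: total NAs and percentage of NAs
overall <- df %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_na") %>%
  mutate(pct_na = 100 * n_na / nrow(df))

# NAs counted only in the rows where group == "b" (assumed level)
in_level <- df %>%
  filter(group == "b") %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_na_group_b")

# Combine both summaries into one table keyed by variable name
na_summary <- left_join(overall, in_level, by = "variable")
na_summary
```

Building the table in long format (one row per variable) avoids assigning rows into `na_count` by index and transposing afterwards. Each additional level of interest can be added as another summarise-and-join step.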