Is there a way to summarise by percentage in R while including the data in a new data frame?

huangapple go评论103阅读模式
英文:

Is there a way to summarise by percentage in R while including the data in a new data frame?

问题

我在我的工作中经常使用Excel和R,我一直在尝试自动化一种我的老板要求我检查数据质量的表格。我最近才开始使用R,所以我的代码还不是很好。

我的想法是创建一个数据框,每一列都汇总了这些向量的信息。数据框中所有NA的总数,数据框中NA的百分比,然后根据某些列对NA在某个水平上的数量进行过滤。

我尝试过的代码如下:

  1. rowsna <- c("总NA数", "% NA占比", "变量1的NA数量,在水平1上", ...)
  2. na_count <- df %>% summarise_all(~sum(is.na(.)))
  3. na_count[2, ] <- df %>% summarise_all(~mean(is.na(.)))
  4. na_count[3, ] <- df %>% filter(变量 == 值) %>% summarise_all(~sum(is.na(.)))
  5. ...
  6. row.names(na_count) <- rowsna
  7. na_count <- as.data.frame(t(na_count))
  8. na_count$variable

问题是,我不知道如何计算na_count[2, ]部分的缺失百分比。如果可能的话,我想要一些帮助。

英文:

I'm working a lot using Excel and R in my job and I've been trying to automatize a type of form my Boss asks me about the data quality. I've just recently started working with R so my code isn't the best.

The idea is to do a data.frame that summarizes in each column these vectors. Sum of all na's in the data.frame, percentage of NA in the data.frame and then filtering by some columns is the n of NAs in a level.

The code I've tried is the following one:

  1. rowsna &lt;- c(&quot;Total NA&quot;, &quot;% NA&quot;, &quot;n NA Variable 1, level 1&quot;,...)
  2. na_count &lt;- df %&gt;% summarise_all(~sum(is.na(.)))
  3. na_count[2, ] &lt;- df %&gt;% summarise_all(~mean(is.na(.)))
  4. na_count[3, ] &lt;- df %&gt;% filter(variable == value) %&gt;% summarise_all(~sum(is.na(.)))
  5. ...
  6. row.names(na_count) &lt;- rowsna
  7. na_count &lt;- as.data.frame(t(na_count))
  8. na_count$variable

The thing is, I've got no idea how to calc the percentage of missing in the na_count[2 , ] part. I would like some help if possible.

答案1

得分: 1

这似乎是你想要的:

  1. library(tidyverse)
  2. # 虚构数据集
  3. df <- tibble(
  4. id = 1:10,
  5. x = c(1:9, NA),
  6. y = c(1:5, rep(NA, 5)),
  7. z = rep(NA, 10)
  8. )
  9. NA_df <- df %>%
  10. # 计算每列中NA的数量
  11. summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  12. # 然后将其长格式化
  13. pivot_longer(cols = everything()) %>%
  14. # 接着计算每列中NA的百分比
  15. mutate(mean = 100*value/nrow(df))
  16. # 假设我们只想获取NA少于5个的列
  17. threshold <- 5
  18. good_columns <- NA_df %>%
  19. filter(value < threshold) %>%
  20. pull(name)
  21. # 现在我们可以使用good_columns向量来子集化原始数据框
  22. df %>%
  23. select(all_of(good_columns))
  24. # 一个tibble: 10 × 2
  25. id x
  26. <int> <int>
  27. 1 1 1
  28. 2 2 2
  29. 3 3 3
  30. 4 4 4
  31. 5 5 5
  32. 6 6 6
  33. 7 7 7
  34. 8 8 8
  35. 9 9 9
  36. 10 10 NA
英文:

It sounds like this is what you want:

  1. library(tidyverse)
  2. # toy dataset
  3. df &lt;- tibble(
  4. id = 1:10,
  5. x = c(1:9, NA),
  6. y = c(1:5, rep(NA, 5)),
  7. z = rep(NA, 10)
  8. )
  9. NA_df &lt;- df %&gt;%
  10. # we find the number of NAs in each column
  11. summarise(across(everything(), ~ sum(is.na(.x)))) %&gt;%
  12. # then we pivot it longer
  13. pivot_longer(cols = everything()) %&gt;%
  14. # then find the percentage of NAs in each column
  15. mutate(mean = 100*value/nrow(df))
  16. # let&#39;s say for the sake of argument that we only want to get columns with less than 5 NAs
  17. threshold &lt;- 5
  18. good_columns &lt;- NA_df %&gt;%
  19. filter(value &lt; threshold) %&gt;%
  20. pull(name)
  21. # now we can use the good_columns vector to subset the original dataframe
  22. df %&gt;%
  23. select(all_of(good_columns))
  24. # A tibble: 10 &#215; 2
  25. id x
  26. &lt;int&gt; &lt;int&gt;
  27. 1 1 1
  28. 2 2 2
  29. 3 3 3
  30. 4 4 4
  31. 5 5 5
  32. 6 6 6
  33. 7 7 7
  34. 8 8 8
  35. 9 9 9
  36. 10 10 NA

huangapple
  • 本文由 发表于 2023年6月1日 15:51:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76379753.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定