June 1, 2023 15:51:18

Is there a way to summarise by percentage in R while including the data in a new data frame?

Question

I work a lot with Excel and R in my job, and I have been trying to automate a kind of data-quality report my boss asks me for. I have only recently started working with R, so my code isn't the best.

The idea is to build a data.frame whose columns hold these summaries: the total number of NAs, the percentage of NAs, and then, filtering on some columns, the number of NAs within a given level.

The code I've tried is the following:

rowsna <- c("Total NA", "% NA", "n NA Variable 1, level 1",...)
na_count <- df %>% summarise_all(~sum(is.na(.)))
na_count[2, ] <- df %>% summarise_all(~mean(is.na(.)))
na_count[3, ] <- df %>% filter(variable == value) %>% summarise_all(~sum(is.na(.)))
...
row.names(na_count) <- rowsna
na_count <- as.data.frame(t(na_count))
na_count$variable

The thing is, I have no idea how to calculate the percentage of missing values in the na_count[2, ] part. I would appreciate some help if possible.

Answer 1

Score: 1

It sounds like this is what you want:

library(tidyverse)
# toy dataset
df <- tibble(
  id = 1:10,
  x = c(1:9, NA),
  y = c(1:5, rep(NA, 5)),
  z = rep(NA, 10)
)
NA_df <- df %>%
  # we find the number of NAs in each column
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  # then we pivot it longer
  pivot_longer(cols = everything()) %>%
  # then find the percentage of NAs in each column
  mutate(mean = 100*value/nrow(df))
# let's say for the sake of argument that we only want to get columns with less than 5 NAs
threshold <- 5
good_columns <- NA_df %>%
  filter(value < threshold) %>%
  pull(name)
# now we can use the good_columns vector to subset the original dataframe
df %>%
  select(all_of(good_columns))
# A tibble: 10 × 2
      id     x
   <int> <int>
 1     1     1
 2     2     2
 3     3     3
 4     4     4
 5     5     5
 6     6     6
 7     7     7
 8     8     8
 9     9     9
10    10    NA
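
The question also asks for a third summary: the number of NAs within one level of some variable. The following is only a sketch of how that might be added as a column. It assumes a hypothetical grouping column `group` and the level `"b"`. The helper `count_na` and the column names `n_na`, `pct_na` and `n_na_group_b` are illustrative and do not come from the original posts.

```r
library(dplyr)
library(tidyr)

# Toy data. `group` is a hypothetical column standing in for the
# asker's "Variable 1"; it is not part of the original example.
df <- data.frame(
  group = rep(c("a", "b"), each = 5),
  x = c(1:9, NA),
  y = c(1:5, rep(NA, 5)),
  z = rep(NA_real_, 10)
)

# Hypothetical helper: count the NAs in every column and return one
# row per column, storing the counts in a column named `col_name`.
count_na <- function(data, col_name) {
  data %>%
    summarise(across(everything(), ~ sum(is.na(.x)))) %>%
    pivot_longer(everything(), names_to = "variable", values_to = col_name)
}

na_summary <- count_na(df, "n_na") %>%
  # percentage of NAs per column over all rows
  mutate(pct_na = 100 * n_na / nrow(df)) %>%
  # NA counts restricted to the rows where group == "b"
  left_join(
    count_na(filter(df, group == "b"), "n_na_group_b"),
    by = "variable"
  )

print(na_summary)
```

Each row of `na_summary` describes one column of `df`, so further levels can be added with additional `left_join()` calls instead of assigning into `na_count[3, ]` by position.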
