How to process multiple csv files for identifying null values in R?

Question

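I have several .csv files, each with multiple columns. I am using the R code below as a quality check: for a given column, it reports how many rows have valid values and how many are null. The code works well for a single csv file, but I want to run it on all the csv files and get output for each one. I would also like a log file. Could anyone help me modify the code so that it can process multiple csv files?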

    install.packages("readr")  # only needed once
    library(readr)

    check_column <- function(df, column) {
      valid_values <- !is.na(df[[column]])
      num_valid <- sum(valid_values)
      num_null <- nrow(df) - num_valid
      return(c(num_valid, num_null))
    }

    # Read the CSV file
    df <- read_csv("data.csv")
    for (column in names(df)) {
      results <- check_column(df, column)
      print(paste(column, ": ", results[1], " valid, ", results[2], " null"))
    }

Sample data (not all files have the same number of columns):

Csv1.csv

    D_T                     Temp (°C)  Press (Pa)  ...
    2021-03-01 00:00:00+00  28         1018        ...
    2021-03-02 00:00:00+00  27         1017        ...
    2021-03-03 00:00:00+00  28         1019        ...
    ..
    ..

Csv2.csv

    D_T                     Temp (°C)  Vel (m/s)  Press (Pa)  ...
    2022-03-01 00:00:00+00  28         118        1018        ...
    2022-03-02 00:00:00+00  27         117        1019        ...
    2022-03-03 00:00:00+00  28         119        1018        ...
    ..
    ..

# Answer 1

**Score**: 1

How about something like this? It writes the counts straight to a log file instead of storing them in a variable. Let me know if you need help with it.

```R
library(readr)

# Write one log per CSV file in the working directory, e.g. "Csv1.csv.log"
for (csv_path in list.files(pattern = "\\.csv$")) {
  df <- read_csv(csv_path)
  out <- file(paste0(csv_path, ".log"), open = "w")
  for (column in colnames(df)) {
    cat(
      paste0(column, ":"),
      sum(!is.na(df[[column]])),
      "valid,",
      sum(is.na(df[[column]])),
      "null\n",
      file = out
    )
  }
  close(out)
}
```

To write into one file only:

```R
library(readr)

# Write the counts for every CSV file into a single "output.log"
out <- file("output.log", open = "w")
for (csv_path in list.files(pattern = "\\.csv$")) {
  df <- read_csv(csv_path)
  cat(csv_path, "\n", file = out)  # file name as a section header
  for (column in colnames(df)) {
    cat(
      paste0(column, ":"),
      sum(!is.na(df[[column]])),
      "valid,",
      sum(is.na(df[[column]])),
      "null\n",
      file = out
    )
  }
}
close(out)
```
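If the counts are also needed for later inspection, one possible variation is to collect them in a data frame while still writing a log. The sketch below is only an illustration under a few assumptions: `summarise_nulls` and `check_all_csv` are hypothetical helper names, the log file name `null_check.log` is made up, and base R's `read.csv` is swapped in for `readr::read_csv` so the example runs without extra packages. Empty cells and the literal string `NA` are treated as null here, which may or may not match your data.

```r
# Hypothetical helper: count valid (non-NA) and null (NA) values per column.
summarise_nulls <- function(path) {
  # Assumption: base R reader instead of readr::read_csv.
  # check.names = FALSE keeps headers such as "Temp (°C)" unchanged.
  df <- read.csv(path, check.names = FALSE, na.strings = c("", "NA"))
  data.frame(
    file   = basename(path),
    column = names(df),
    valid  = unname(colSums(!is.na(df))),
    null   = unname(colSums(is.na(df))),
    stringsAsFactors = FALSE
  )
}

# Hypothetical driver: process every CSV in `dir`, log one line per file,
# and return all per-column counts as a single data frame.
check_all_csv <- function(dir = ".", log_path = "null_check.log") {
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  log_con <- file(log_path, open = "w")
  on.exit(close(log_con))

  results <- lapply(paths, function(p) {
    # A file that cannot be read is logged and skipped, not fatal.
    res <- tryCatch(summarise_nulls(p), error = function(e) {
      cat(format(Sys.time()), "ERROR", basename(p), conditionMessage(e), "\n",
          file = log_con)
      NULL
    })
    if (!is.null(res)) {
      cat(format(Sys.time()), "OK", basename(p), nrow(res), "columns\n",
          file = log_con)
    }
    res
  })

  do.call(rbind, results)
}
```

Calling `check_all_csv("path/to/csv/folder")` would return one row per file and column, which can then be filtered (for example, `subset(res, null > 0)`) or saved with `write.csv`. If you save such a summary, writing it outside the scanned folder avoids it being picked up as input on the next run.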


huangapple
  • Posted on 2023-07-07 06:21:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76632850.html