How to process multiple CSV files to identify null values in R?
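Question
I have several .csv files, each with multiple columns. I am using the R code below as a quality check: for each column, it counts how many rows hold a valid value and how many are null. The code works well for a single CSV file, but I want to run it on all the CSV files and get output for each one. I would also like a log file. Could anyone help me modify the code so that it processes multiple CSV files?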
install.packages("readr")
library(readr)
# Return the number of valid (non-NA) and null (NA) values in one column
check_column <- function(df, column) {
  valid_values <- !is.na(df[[column]])
  num_valid <- sum(valid_values)
  num_null <- nrow(df) - num_valid
  return(c(num_valid, num_null))
}
# Read the CSV file
df <- read_csv("data.csv")
for (column in names(df)) {
  results <- check_column(df, column)
  print(paste(column, ": ", results[1], " valid, ", results[2], " null"))
}
Sample data (not all files have the same number of columns):
Csv1.csv
D_T                     Temp (°C)  Press (Pa)  ...
2021-03-01 00:00:00+00  28         1018        ...
2021-03-02 00:00:00+00  27         1017        ...
2021-03-03 00:00:00+00  28         1019        ...
..
..
Csv2.csv
D_T                     Temp (°C)  Vel (m/s)  Press (Pa)  ...
2022-03-01 00:00:00+00  28         118        1018        ...
2022-03-02 00:00:00+00  27         117        1019        ...
2022-03-03 00:00:00+00  28         119        1018        ...
..
..
# Answer 1
**Score**: 1
How about something like this? It does not store the results in a variable; the counts are written straight to a log file. Let me know if you need help with it.

```R
library(readr)

# Write one log per CSV file, e.g. Csv1.csv.log
for (csv_file in list.files(pattern = "\\.csv$")) {  # names ending in .csv
  df <- read_csv(csv_file)
  out <- file(paste0(csv_file, ".log"), open = "w")
  for (x in colnames(df)) {
    cat(
      paste0(x, ":"),
      sum(!is.na(df[[x]])),
      "valid,",
      sum(is.na(df[[x]])),
      "null\n",
      file = out
    )
  }
  close(out)
}
```

To write everything into a single file:

```R
library(readr)

# One shared log: each file name is followed by its per-column counts
out <- file("output.log", open = "w")
for (csv_file in list.files(pattern = "\\.csv$")) {
  df <- read_csv(csv_file)
  cat(csv_file, "\n", file = out)
  for (x in colnames(df)) {
    cat(
      paste0(x, ":"),
      sum(!is.na(df[[x]])),
      "valid,",
      sum(is.na(df[[x]])),
      "null\n",
      file = out
    )
  }
}
close(out)
```
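If the counts are also needed inside R (for example to filter columns with many nulls), one possible variation is to collect them into a data frame and write the log from that. This is only a minimal sketch under assumptions, not the answerer's code: the helper names `summarise_csv` and `check_all_csv` and the log name `quality_check.log` are made up for illustration, and it uses base R's `read.csv` rather than `readr::read_csv` so that it runs without extra packages.

```r
# Count non-NA ("valid") and NA ("null") values in every column of one CSV.
# Hypothetical helper, not part of any package.
summarise_csv <- function(path) {
  # check.names = FALSE keeps headers such as "Temp (°C)" unchanged;
  # na.strings treats empty fields as NA, as readr::read_csv does by default.
  df <- read.csv(path, check.names = FALSE, na.strings = c("", "NA"))
  valid <- vapply(df, function(col) sum(!is.na(col)), integer(1))
  data.frame(
    file = basename(path),
    column = names(df),
    valid = unname(valid),
    null = nrow(df) - unname(valid),
    stringsAsFactors = FALSE
  )
}

# Run the check on every CSV in `dir`, write one log, and return all counts.
# Hypothetical helper; `log_path` is an assumed file name.
check_all_csv <- function(dir = ".", log_path = "quality_check.log") {
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  report <- do.call(rbind, lapply(paths, summarise_csv))
  if (!is.null(report)) {  # NULL when no CSV files were found
    writeLines(
      sprintf("%s | %s: %d valid, %d null",
              report$file, report$column, report$valid, report$null),
      log_path
    )
  }
  report
}
```

Calling `report <- check_all_csv("path/to/csv/folder")` returns one row per file and column, so files with different numbers of columns are handled naturally; `split(report, report$file)` gives the per-file view if separate output is preferred.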