Is it possible to delete the first few rows of xlsx files (over 100 files) with multiple sheets in R?
Question
I have a series of xlsx files (> 200 MB each) with multiple sheets. Only the first sheet of each file contains an introduction, something like:
| This table is designed for balabala etc... | balabala |
| --- | --- |
| Reference Key | date |
| 1 | 01/01/1999 |
The number of introduction lines is not the same in each file, but all the datasets start with the `Reference Key` variable.
Is it possible to delete the introductions without reading the whole datasets, and then merge the sheets from the same file into one xlsx file?
Answer 1
Score: 2
To expand on my comment above: untested code, since you haven't given us a reproducible example.
```r
library(readxl)
library(tidyverse)

# Make the obvious edit here; full.names = TRUE keeps the path with each
# file name so the files can be read from any working directory
myFiles <- list.files(path="<your path>", pattern="xlsx$", full.names=TRUE)

# Read one file
readFile <- function(f) {
  sheets <- excel_sheets(f)
  lapply(
    seq_along(sheets),
    function(x) read_excel(f, sheet=x, skip=ifelse(x == 1, 1, 0))
  ) %>%
    # Combine all sheets in the file into a single data frame
    bind_rows()
}

# Process your files
excelFiles <- lapply(myFiles, readFile)
```
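Since the question also asks for the merged sheets to end up in one xlsx file per input, here is a minimal follow-up sketch, assuming the `writexl` package is acceptable (it is not part of the answer above):

```r
library(writexl)  # assumption: writexl is used for writing; any xlsx writer would do

# Write each merged data frame back out as "<original name>_merged.xlsx"
for (i in seq_along(myFiles)) {
  outName <- sub("\\.xlsx$", "_merged.xlsx", myFiles[i])
  write_xlsx(excelFiles[[i]], outName)
}
```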
Answer 2
Score: 1
Here's a quick process for converting a set of `xlsx` files with one or more sheets each into a directory tree of CSV files. The loop finds a line that starts with `"Reference Key"` and, if found, skips to that row; if not found, it skips nothing, assuming that `readxl::read_excel` will guess appropriately.
```r
files <- list.files(pattern = "xlsx$", full.names = TRUE)
for (fn in files) {
  # One output directory per workbook, named after the file
  dirnm <- tools::file_path_sans_ext(fn)
  dir.create(dirnm, showWarnings = FALSE)
  for (sht in readxl::excel_sheets(fn)) {
    # Peek at the first few rows as text to locate the header row;
    # n_max = 4 assumes the introduction is at most a handful of lines
    dat <- readxl::read_excel(fn, sht, n_max = 4, col_types = "text")
    skip <- grep("Reference Key", dat[[1]])[1]
    if (is.na(skip)) skip <- 0L
    newname <- file.path(dirnm, paste0(sht, ".csv"))
    readxl::read_excel(fn, sht, skip = skip) |>
      write.csv(newname, row.names = FALSE)
  }
}
```
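To make the skip arithmetic concrete, here is a small worked illustration with hypothetical sheet contents (not from the original answer):

```r
# Suppose a sheet's rows are: intro line 1, intro line 2, "Reference Key", data...
# read_excel() consumes row 1 as column names, so dat[[1]] holds rows 2 onward:
dat1 <- c("intro line 2", "Reference Key", "1", "2")
grep("Reference Key", dat1)[1]
# [1] 2
# skip = 2 drops the two intro rows, so "Reference Key" (sheet row 3)
# becomes the header on the second read_excel() call
```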
This works for me given two files:

- `Book1.xlsx` with sheets `Sheet1` and `Sheet2`;
- `Book2.xlsx` with sheets `Sheet1` and `Sheet2`.
After this, we now have subdirs with CSV files:

```r
files
# [1] "./Book1.xlsx" "./Book2.xlsx"
list.files(pattern = "csv$", recursive = TRUE, full.names = TRUE)
# [1] "./Book1/Sheet1.csv" "./Book1/Sheet2.csv" "./Book2/Sheet1.csv" "./Book2/Sheet2.csv"
```
This should work well for whatever your purpose is. If you're dealing with large amounts of data, there are many reasons why you may prefer to write directly to parquet format instead of CSV:
- lazy reading: in a dplyr pipe, you can use a somewhat-reduced set of `mutate`, `filter`, `select`, and such, and none of the data is read into memory until you finally `%>% collect()` the data, similar to `dbplyr` and `dtplyr` (see the sketch after this list);
- combine files: if the schema (columns/types) are all the same, then you can use `arrow::open_dataset` once with all (or a subset of) files, and it will virtually combine them, optionally using hive-partitioning (not necessarily used here, but it can be added if applicable);
- native types: the classes (and attributes) of the R `data.frame` produced by `read_excel` are preserved in the parquet file, so if your Excel data includes dates, timestamps, etc., you can set these before saving to parquet, and when you read the parquet files they will have the correct classes again.
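A minimal sketch of that lazy-reading point, assuming a dataset `ds` opened as shown further below, with hypothetical columns `Reference Key` and `date`:

```r
library(dplyr)
library(arrow)

# filter()/select() are only recorded here; no data is read yet
result <- ds %>%
  filter(!is.na(`Reference Key`)) %>%
  select(`Reference Key`, date) %>%
  collect()  # only now are the matching rows actually read into memory
```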
For that, I think I would modify the innermost portion of the loop to be something like:
```r
newname <- file.path(dirnm, paste0(sht, ".pq"))
readxl::read_excel(fn, sht, skip = skip) %>%
  mutate(
    thedata = as.Date(somedate, format = "....."),
    thetime = as.POSIXct(somestamp, format = ".....")
  ) %>%
  arrow::write_parquet(newname)
```
and then use `arrow::open_dataset` on each file (if desired), or something like this if the schemas are all the same:
```r
# The .pq files live in the per-workbook directories created above,
# i.e. the file paths without their .xlsx extension
ds <- list.files(tools::file_path_sans_ext(files), pattern = "pq$",
                 full.names = TRUE, recursive = TRUE) |>
  arrow::open_dataset()
```
and have lazy access to all of the data in one object.
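As a quick check of the native-types point, reading one parquet file back (hypothetical path, reusing the `thedata` column from the sketch above) restores the classes set before writing:

```r
one <- arrow::read_parquet(file.path("Book1", "Sheet1.pq"))  # hypothetical path
class(one$thedata)
# [1] "Date"  -- the Date class survives the round trip, unlike with CSV
```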