Is it possible to delete the first few row of xlsx files (over 100 files) with multiple sheets in r?

huangapple go评论119阅读模式
英文:

Is it possible to delete the first few row of xlsx files (over 100 files) with multiple sheets in r?

问题

只有文件的第一个表格包含引言,其中包括"Reference Key"变量。是否有可能避免读取整个数据集并删除引言,然后将来自同一文件的表格合并到一个xlsx文件中?

英文:

I have a series of xlsx files (> 200mb each) with multiple sheets. Only the first sheet of the files contain an introduction, something likes:

This table is designed for balabala etc... balabala
Reference Key date
1 01/01/1999

The number of lines of introductions from each files are not the same, but all the datasets start with Reference Key variable.

Is it possible to avoid reading the whole datasets and deleting the introductions, then merging the sheets from same file into one xlsx file?

答案1

得分: 2

以下是代码的翻译部分:

# 对我上面的评论进行进一步扩展。 未经测试的代码,因为您没有给我们一个可重现的示例。

library(readxl)
library(tidyverse)

# 在这里进行明显的编辑
myFiles <- list.files(path="<your path>", pattern="xlsx")

# 读取一个文件
readFile <- function(f) {
  sheets <- excel_sheets(f)
  lapply(
    seq_along(sheets),
    function(x) read_excel(f, sheet=x, skip=ifelse(x == 1, 1, 0))
  ) %>%
  # 将文件中的所有工作表合并成一个单一的数据框
  bind_rows()
}

# 处理您的文件
excelFiles <- lapply(myFiles, readFile)

请注意,这是您提供的代码的翻译版本。如果您需要进一步的帮助或有任何问题,请随时提出。

英文:

To expand on my comment above. Untested code, since you haven't given us a reproducible example.

library(readxl)
library(tidyverse)

# Make the obvious edit here
myFiles &lt;- list.files(path=&quot;&lt;your path&gt;&quot;, pattern=&quot;xlsx&quot;)

# Read one file
readFile &lt;- function(f) {
  sheets &lt;- excel_sheets(f)
  lapply(
    seq_along(sheets),
    function(x) read_excel(f, sheet=x, skip=ifelse(x == 1, 1, 0))
  ) %&gt;% 
  # Combine all sheets in the file into a single data frame
  bind_rows()
}

# Process your files
excelFiles &lt;- lapply(myFiles, readFile)

答案2

得分: 1

这是将一组 xlsx 文件(每个文件可以有一个或多个工作表)转换为CSV文件目录树的快速过程。循环查找以 &quot;Reference Key&quot; 开头的行,如果找到,就跳到该行;如果找不到,就不跳过,假设 readxl::read_excel 会做适当的猜测。

files <- list.files(pattern = "xlsx$", full.names = TRUE)
for (fn in files) {
  dirnm <- tools::file_path_sans_ext(fn)
  dir.create(dirnm, showWarnings = FALSE)
  for (sht in readxl::excel_sheets(fn)) {
    dat <- readxl::read_excel(fn, sht, n_max = 4, col_types = "text")
    skip <- grep("Reference Key", dat[[1]])[1]
    if (is.na(skip)) skip <- 0L
    newname <- file.path(dirnm, paste0(sht, ".csv"))
    readxl::read_excel(fn, sht, skip = skip) %>|
      write.csv(newname, row.names = FALSE)
  }
}

对于我来说,这对两个文件有效:

  • Book1.xlsx 包含工作表 Sheet1Sheet2
  • Book2.xlsx 包含工作表 Sheet1Sheet2

执行完毕后,我们现在有包含CSV文件的子目录:

files
# [1] "./Book1.xlsx" "./Book2.xlsx"
list.files(pattern = "csv$", recursive = TRUE, full.names = TRUE)
# [1] "./Book1/Sheet1.csv" "./Book1/Sheet2.csv" "./Book2/Sheet1.csv" "./Book2/Sheet2.csv"

这对于您的任何用途都应该很好用。如果您处理大量数据,有很多原因可以让您更喜欢直接写入parquet格式而不是CSV:

  • 延迟读取:在dplyr管道中,您可以使用一组较小的 mutatefilterselect 等操作,直到最终使用 %>% collect() 读取数据之前,不会将数据读入内存,类似于 dbplyrdtplyr
  • 合并文件:如果模式(列/类型)都相同,那么您可以一次使用 arrow::open_dataset 打开 所有(或一部分)文件,并且它将虚拟组合它们,可选择使用Hive分区(这里没有必要使用,但如果适用,可以添加);
  • 本机数据类型read_excel 生成的R data.frame 的类(和属性)在parquet文件中得到保留,因此如果您的Excel数据包括日期、时间戳等,您可以在保存为parquet文件之前设置它们,然后在读取parquet文件时它们将恢复为正确的类别。

为此,我认为我会修改循环中最内层的部分,类似于以下内容:

    newname <- file.path(dirnm, paste0(sht, ".pq"))
    readxl::read_excel(fn, sht, skip = skip) %>%
      mutate(
        thedata = as.Date(somedate, format = "....."),
        thetime = as.POSIXct(somestamp, format = ".....")
      ) %>%
      arrow::write_parquet(newname)

然后对每个文件使用 arrow::open_dataset(如果需要的话),或者如果模式都相同,可以使用类似以下内容:

ds <- list.files(basename(files), pattern = "pq$", recursive = TRUE) %>|
  arrow::open_dataset()

这样您可以在一个对象中以_延迟_方式访问所有数据。

英文:

Here's a quick process for converting a set of xlsx files with one or more sheets each into a directory tree of CSV files. The loop finds a line that starts with &quot;Reference Key&quot; and, if found, skips to that row; if not found, it skips nothing, assuming that readxl::read_excel will guess appropriately.

files &lt;- list.files(pattern = &quot;xlsx$&quot;, full.names = TRUE)
for (fn in files) {
  dirnm &lt;- tools::file_path_sans_ext(fn)
  dir.create(dirnm, showWarnings = FALSE)
  for (sht in readxl::excel_sheets(fn)) {
    dat &lt;- readxl::read_excel(fn, sht, n_max = 4, col_types = &quot;text&quot;)
    skip &lt;- grep(&quot;Reference Key&quot;, dat[[1]])[1]
    if (is.na(skip)) skip &lt;- 0L
    newname &lt;- file.path(dirnm, paste0(sht, &quot;.csv&quot;))
    readxl::read_excel(fn, sht, skip = skip) |&gt;
      write.csv(newname, row.names = FALSE)
  }
}

This works for me given two files:

  • Book1.xlsx with sheets Sheet1 and Sheet2;
  • Book2.xlsx with sheets Sheet1 and Sheet2.

After this, we now have subdirs with CSV files:

files
# [1] &quot;./Book1.xlsx&quot; &quot;./Book2.xlsx&quot;
list.files(pattern = &quot;csv$&quot;, recursive = TRUE, full.names = TRUE)
# [1] &quot;./Book1/Sheet1.csv&quot; &quot;./Book1/Sheet2.csv&quot; &quot;./Book2/Sheet1.csv&quot; &quot;./Book2/Sheet2.csv&quot;

This should work well for whatever your purpose. If you're dealing with large amounts of data, there are many reasons why you many prefer to write directly to parquet format instead of CSV:

  • lazy reading: in a dplyr pipe, you can use a somewhat-reduced set of mutate, filter, select, and such, and none of the data is read into memory until you finally %&gt;% collect() the data, similar to dbplyr and dtplyr;
  • combine files: if the schema (columns/types) are all the same, then you can use arrow::open_dataset once with all (or a subset of) files, and it will virtually combine them, optionally using hive-partitioning (not used here necessarily, but can be added if applicable);
  • native types: the classes (and attributes) of the R data.frame produced by read_excel are preserved in the parquet file, so if your excel data includes dates, timestamps, etc, you can set this before saving to parquets, and when you read the parquets they will be the correct classes again.

For that, I think I would modify the inner-most portion of the loop to be something like:

    newname &lt;- file.path(dirnm, paste0(sht, &quot;.pq&quot;))
    readxl::read_excel(fn, sht, skip = skip) %&gt;%
      mutate(
        thedata = as.Date(somedate, format = &quot;.....&quot;),
        thetime = as.POSIXct(somestamp, format = &quot;.....&quot;)
      ) %&gt;%
      arrow::write_parquet(newname)

and then use arrow::open_dataset on each file (if desired) or something like this if the schema are all the same:

ds &lt;- list.files(basename(files), pattern = &quot;pq$&quot;, recursive = TRUE) |&gt;
  arrow::open_dataset()

and have lazy access to all of the data in one object.

huangapple
  • 本文由 发表于 2023年2月24日 16:06:14
  • 转载请务必保留本文链接:https://go.coder-hub.com/75553994.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定