Is it possible to delete the first few rows of xlsx files (over 100 files) with multiple sheets in R?


Question

I have a series of xlsx files (> 200 MB each) with multiple sheets. Only the first sheet of each file contains an introduction, something like:

This table is designed for balabala etc... balabala
Reference Key date
1 01/01/1999

The number of introduction lines is not the same in each file, but all the datasets start with the Reference Key variable.

Is it possible to remove the introductions without reading the whole datasets, and then merge the sheets from the same file into one xlsx file?

Answer 1

Score: 2

To expand on my comment above. This code is untested, since you haven't given us a reproducible example.

library(readxl)
library(tidyverse)

# Make the obvious edit here; full.names = TRUE keeps the path on each file name
myFiles <- list.files(path="<your path>", pattern="xlsx", full.names=TRUE)

# Read one file
readFile <- function(f) {
  sheets <- excel_sheets(f)
  lapply(
    seq_along(sheets),
    # Skip the single introduction row on the first sheet only
    function(x) read_excel(f, sheet=x, skip=ifelse(x == 1, 1, 0))
  ) %>%
  # Combine all sheets in the file into a single data frame
  bind_rows()
}

# Process your files
excelFiles <- lapply(myFiles, readFile)
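
Since the question also asks for the merged sheets to end up in one xlsx file per input, here is a minimal sketch of that last step, assuming the writexl package is available; the "_merged" output file names are illustrative, not prescribed by the question:

library(writexl)

# Name each combined data frame after its source workbook, then write
# one single-sheet xlsx per input file, e.g. "Book1_merged.xlsx"
names(excelFiles) <- tools::file_path_sans_ext(basename(myFiles))
for (nm in names(excelFiles)) {
  write_xlsx(excelFiles[[nm]], paste0(nm, "_merged.xlsx"))
}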

Answer 2

Score: 1

Here's a quick process for converting a set of xlsx files with one or more sheets each into a directory tree of CSV files. The loop finds a line that starts with &quot;Reference Key&quot; and, if found, skips to that row; if not found, it skips nothing, assuming that readxl::read_excel will guess appropriately.

files <- list.files(pattern = "xlsx$", full.names = TRUE)
for (fn in files) {
  # One output directory per workbook, named after the file
  dirnm <- tools::file_path_sans_ext(fn)
  dir.create(dirnm, showWarnings = FALSE)
  for (sht in readxl::excel_sheets(fn)) {
    # Peek at the first rows as text to locate the real header row;
    # increase n_max if an introduction can be longer than this
    dat <- readxl::read_excel(fn, sht, n_max = 4, col_types = "text")
    skip <- grep("Reference Key", dat[[1]])[1]
    if (is.na(skip)) skip <- 0L
    newname <- file.path(dirnm, paste0(sht, ".csv"))
    readxl::read_excel(fn, sht, skip = skip) |>
      write.csv(newname, row.names = FALSE)
  }
}

This works for me given two files:

  • Book1.xlsx with sheets Sheet1 and Sheet2;
  • Book2.xlsx with sheets Sheet1 and Sheet2.

After this, we now have subdirs with CSV files:

files
# [1] "./Book1.xlsx" "./Book2.xlsx"
list.files(pattern = "csv$", recursive = TRUE, full.names = TRUE)
# [1] "./Book1/Sheet1.csv" "./Book1/Sheet2.csv" "./Book2/Sheet1.csv" "./Book2/Sheet2.csv"
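
If you then want the sheets of one workbook merged into a single table, as the question asks, a small sketch that row-binds the CSVs from one of the directories above, assuming the sheets share the same columns:

# Read every CSV written for Book1 and stack them into one data frame
book1 <- list.files("Book1", pattern = "csv$", full.names = TRUE) |>
  lapply(read.csv) |>
  dplyr::bind_rows()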

This should work well whatever your purpose. If you're dealing with large amounts of data, there are many reasons why you may prefer to write directly to parquet format instead of CSV:

  • lazy reading: in a dplyr pipe, you can use a somewhat-reduced set of mutate, filter, select, and such, and none of the data is read into memory until you finally %>% collect() the data, similar to dbplyr and dtplyr;
  • combine files: if the schema (columns/types) are all the same, then you can use arrow::open_dataset once with all (or a subset of) the files, and it will virtually combine them, optionally using hive-partitioning (not used here, but it can be added if applicable);
  • native types: the classes (and attributes) of the R data.frame produced by read_excel are preserved in the parquet file, so if your Excel data includes dates, timestamps, etc., you can set these before saving to parquet, and when you read the parquet files back they will have the correct classes again.

For that, I think I would modify the inner-most portion of the loop to be something like:

    newname <- file.path(dirnm, paste0(sht, ".pq"))
    readxl::read_excel(fn, sht, skip = skip) %>%
      mutate(
        # placeholder column names and formats; adjust to your data
        thedata = as.Date(somedate, format = "....."),
        thetime = as.POSIXct(somestamp, format = ".....")
      ) %>%
      arrow::write_parquet(newname)

and then use arrow::open_dataset on each file (if desired) or something like this if the schema are all the same:

ds <- list.files(tools::file_path_sans_ext(files), pattern = "pq$",
                 recursive = TRUE, full.names = TRUE) |>
  arrow::open_dataset()

and have lazy access to all of the data in one object.
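
To illustrate that lazy access, a hedged sketch of querying ds with dplyr verbs; only Reference Key and date are column names taken from the question, and the filter condition is made up:

library(dplyr)

# No data is read until collect(); arrow pushes the filter and
# column selection down to the parquet scan
result <- ds %>%
  filter(!is.na(`Reference Key`)) %>%
  select(`Reference Key`, date) %>%
  collect()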
