reading in local files with different extensions - efficient way?

Question

I have a list of 20+ data files with different file extensions (.csv, .xls, .xlsx). I want to store each data file as a data frame in a list. I wrote a for loop to do the task.

My "data" folder contains subfolders with different data files. Within the loop, I call a different reader function depending on the file extension. Is there a way to read in the files and store them in a single list without using a for loop?

Below is my code:

    library(dplyr)
    library(readxl)  # read_excel() for genuine .xlsx workbooks
    library(XML)     # readHTMLTable() for HTML saved with an .xls extension

    # grab all file names (the dot is escaped so only real extensions match)
    files <- list.files(path = "data", recursive = TRUE,
                        pattern = "\\.(csv|xls|xlsx)$", full.names = TRUE)

    ### example imported file names from above
    files <- c("data/Bills/BillHistory_Account1.csv", "data/Bills/BillHistory_2.xls",
               "data/Usages/UsageHistory_3.xls", "data/Usages/UsageHistory_4.xls.xlsx")

    # store each data frame in a list, pre-allocated to the number of files
    df_list <- vector("list", length(files))
    for (i in seq_along(files)) {
      if (grepl("\\.csv$", files[[i]])) {
        # sep = "\t": the files are read as tab-separated
        df_list[[i]] <- read.csv(files[[i]], sep = "\t")
      } else if (grepl("\\.xls$", files[[i]])) {
        ### some .xls files actually contain html codes
        ### need to read using XML::readHTMLTable
        df_list[[i]] <- readHTMLTable(files[[i]])$tblMain
      } else if (grepl("\\.xlsx$", files[[i]])) {
        df_list[[i]] <- read_excel(files[[i]])
      }
    }

Thank you in advance for helping!

P.S. I was also wondering if there is a way to create a reproducible example for reading in local data files? I can't think of any way other than having the reader create arbitrary local CSV or Excel files.
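One possible approach to the P.S., sketched here under assumptions rather than as an established recipe: let the example write its own small input files into tempdir(), so a reader can run it without preparing anything by hand. The toy data frame, the folder layout under tempdir(), and the file name UsageHistory_3.csv are hypothetical stand-ins, since the real files are not shown.

```r
# Build a throwaway folder tree that mimics data/Bills and data/Usages.
root <- file.path(tempdir(), "data")
dir.create(file.path(root, "Bills"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "Usages"), recursive = TRUE, showWarnings = FALSE)

# Hypothetical toy contents (assumption: the real files are tab-separated tables).
toy <- data.frame(account = c("A1", "A2"), amount = c(12.5, 30))
write.table(toy, file.path(root, "Bills", "BillHistory_Account1.csv"),
            sep = "\t", row.names = FALSE)
write.table(toy, file.path(root, "Usages", "UsageHistory_3.csv"),
            sep = "\t", row.names = FALSE)

# From here on, the example behaves like the real workflow.
files <- list.files(path = root, recursive = TRUE,
                    pattern = "\\.csv$", full.names = TRUE)
df_list <- lapply(files, read.csv, sep = "\t")
```

Excel inputs could probably be generated the same way with a writer package such as writexl, at the cost of an extra dependency for whoever runs the example.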

Answer 1

Score: 2

The {rio} package is a general-purpose wrapper around various import packages. A plain import(filename.ext) guesses the file format from the extension, so the whole task might boil down to:

    # grab all file names (the dot is escaped so only real extensions match)
    files <- list.files(path = "data", recursive = TRUE,
                        pattern = "\\.(csv|xls|xlsx)$", full.names = TRUE)
    # import each file; Map() returns a list named by file path
    dataframe_list <- Map(files, f = \(filename) rio::import(filename))

If needed, you can still add reader-specific named arguments (sep, header, skip ...) to import.
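If guessing by extension is not enough (for instance, the question's .xls files that actually hold HTML), the same loop-free pattern can dispatch on the extension by hand. Below is a minimal sketch in base R, not a definitive implementation: the lookup table `readers` and the helper `read_any` are hypothetical names, the reader calls are the ones from the question, and the demo only exercises the CSV branch on throwaway files so that it runs without readxl or XML installed.

```r
# Hypothetical lookup table: file extension -> reader function.
# The package calls are only resolved when a file of that type is read.
readers <- list(
  csv  = function(path) read.csv(path, sep = "\t"),
  xls  = function(path) XML::readHTMLTable(path)$tblMain,  # HTML saved as .xls
  xlsx = function(path) readxl::read_excel(path)
)

# Hypothetical helper: pick the reader by extension, fail loudly if none fits.
read_any <- function(path) {
  ext <- tolower(tools::file_ext(path))
  reader <- readers[[ext]]
  if (is.null(reader)) stop("no reader registered for extension: ", ext)
  reader(path)
}

# Demo on throwaway tab-separated files so the sketch is self-contained.
demo_dir <- file.path(tempdir(), "dispatch_demo")
dir.create(demo_dir, showWarnings = FALSE)
toy <- data.frame(id = 1:2, value = c(10, 20))
write.table(toy, file.path(demo_dir, "a.csv"), sep = "\t", row.names = FALSE)
write.table(toy, file.path(demo_dir, "b.csv"), sep = "\t", row.names = FALSE)

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
df_list <- lapply(files, read_any)   # one data frame per file, no for loop
names(df_list) <- basename(files)
```

Keeping the readers in a named list means a new format only needs one more entry, and an unexpected extension raises an error instead of silently leaving a gap in the result.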

huangapple
  • Posted on June 6, 2023, 05:19:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76410075.html