Read all csv files in a directory and add the name of each file in a new column

Question

I have this code that reads all CSV files in a directory.

nm <- list.files()

df <- do.call(rbind, lapply(nm, function(x) read_delim(x, ';', col_names = TRUE)))

I want to modify it so that the file name is added to the data. The result should be a single data frame containing all the CSV files, with a column that specifies which file each row came from. How can I do this?

Answer 1

Score: 5

Instead of do.call(rbind, lapply(...)), you can use purrr::map_dfr() with the .id argument:

library(readr)
library(purrr)

df <- list.files() |>
  set_names() |>
  map_dfr(read_delim, .id = "file")

df
# A tibble: 9 × 3
  file    col1  col2
  <chr>  <dbl> <dbl>
1 f1.csv     1     4
2 f1.csv     2     5
3 f1.csv     3     6
4 f2.csv     1     4
5 f2.csv     2     5
6 f2.csv     3     6
7 f3.csv     1     4
8 f3.csv     2     5
9 f3.csv     3     6

Example data:

for (f in c("f1.csv", "f2.csv", "f3.csv")) {
  readr::write_delim(data.frame(col1 = 1:3, col2 = 4:6), f, ";")
}

Answer 2

Score: 4

readr::read_csv() can accept a vector of file names. The id parameter is "the name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date."

nm |>
  readr::read_csv(
    id = "file_path"
  )

I see other answers use the file name without the directory. If that's desired, consider using functions built for file manipulation instead of regexes, unless you're sure the file names and paths are always well-behaved.

nm |>
  readr::read_csv(
    id = "file_path"
  ) |>
  dplyr::mutate(
    file_name_1 = basename(file_path),                     # If you want the extension
    file_name_2 = tools::file_path_sans_ext(file_name_1),  # If you don't
  )
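
To make the point about file helpers versus regexes concrete, here is a small base R sketch. The paths are made up for illustration and do not come from the question; they are chosen only to show one edge case (an upper-case extension) where a simple case-sensitive regex and the file helpers disagree.

```r
# Hypothetical paths, chosen only to illustrate edge cases.
paths <- c("data/2023-01-09/f1.csv", "data/archive.csv.bak/f2.CSV", "f3.csv")

# basename() drops the directory part of each path.
file_name <- basename(paths)

# file_path_sans_ext() drops the final extension, whatever its case.
file_stem <- tools::file_path_sans_ext(file_name)

# A case-sensitive regex such as "\\.csv$" leaves the upper-case ".CSV" in place.
regex_stem <- sub("\\.csv$", "", file_name)

print(data.frame(paths, file_name, file_stem, regex_stem))
```

Whether the regex result is acceptable depends on how uniform the file names really are; the helpers simply remove the need to think about it.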

Answer 3

Score: 3

For conventional scenarios, I prefer to let readr loop through the CSVs by itself. But there are some scenarios where it helps to process files individually before stacking them together.

A few weeks ago, in purrr 1.0, the map_dfr() function was "superseded in favour of using the appropriate map function along with list_rbind()".

@zephryl's snippet is slightly modified to become

list.files() |>
  rlang::set_names() |>
  purrr::map(readr::read_delim) |>
  # { possibly process files here before stacking/binding } |>
  purrr::list_rbind(names_to = "file")

> The functions were superseded in purrr 1.0.0 because their names suggest they work like _lgl(), _int(), etc which require length 1 outputs, but actually they return results of any size because the results are combined without any size checks. Additionally, they use dplyr::bind_rows() and dplyr::bind_cols() which require dplyr to be installed and have confusing semantics with edge cases. Superseded functions will not go away, but will only receive critical bug fixes.
>
> Instead, we recommend using map(), map2(), etc with list_rbind() and list_cbind(). These use vctrs::vec_rbind() and vctrs::vec_cbind() under the hood, and have names that more clearly reflect their semantics.

Source: <https://purrr.tidyverse.org/reference/map_dfr.html>
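
As a rough illustration of what "process files individually before stacking" could look like, the sketch below writes two small sample files to a temporary directory, reads each one, applies a per-file step, and then binds the results. The directory name, the sample data, and the filtering step are all assumptions made for the example; it also assumes purrr 1.0.0 or later (for list_rbind()), readr 2.0 or later, and R 4.1 or later (for `|>` and `\(x)`).

```r
library(readr)
library(purrr)

# Work in a temporary directory so the sketch does not touch real files.
dir <- file.path(tempdir(), "csv_demo")
dir.create(dir, showWarnings = FALSE)
for (f in c("f1.csv", "f2.csv")) {
  write_delim(data.frame(col1 = 1:3, col2 = 4:6), file.path(dir, f), delim = ";")
}

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

df <- set_names(files, basename(files)) |>
  # Read each file into its own tibble; the list keeps the file names.
  map(\(f) read_delim(f, delim = ";", show_col_types = FALSE)) |>
  # Hypothetical per-file step: keep only rows where col1 is greater than 1.
  map(\(d) d[d$col1 > 1, ]) |>
  # Stack the tibbles and store the list names in a "file" column.
  list_rbind(names_to = "file")

print(df)
```

Naming the list with basename(files) rather than the full paths keeps the "file" column short; use the full paths instead if the directory itself carries information.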

Answer 4

Score: 2

Here is another solution using purrr, which removes the file extension from the values in the filename column.

library(tidyverse)

nm <- list.files(pattern = "\\.csv$")

df <- map_dfr(
  .x = nm,
  ~ read.csv(.x) %>%
    mutate(
      filename = stringr::str_replace(
        .x,
        "\\.csv$",
        ""
      )
    )
)

View(df)

EDIT

Actually, you can still remove the file extension from the file name column when you apply @zephryl's approach by adding a mutate() step as follows:

df <- nm %>%
  set_names() %>%
  map_dfr(read_delim, .id = "file") %>%
  mutate(
    file = stringr::str_replace(
      file,
      "\\.csv$",
      ""
    )
  )

Answer 5

Score: 2

You can use bind_rows() from dplyr and supply the argument .id that creates a new column of identifiers to link each row to its original data frame.

df <- dplyr::bind_rows(
  lapply(setNames(nm, basename(nm)), readr::read_csv2),
  .id = 'src'
)

The use of basename() removes the directory paths prepended to the file names.
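
For comparison, the same idea can be expressed in base R alone, staying close to the do.call(rbind, lapply(...)) pattern from the question. This is only a sketch: the temporary directory and sample files are made up for illustration, and it assumes semicolon-delimited files that read.csv2() can parse.

```r
# Write two small semicolon-delimited files to a temporary directory (made-up data).
dir <- file.path(tempdir(), "csv_base_demo")
dir.create(dir, showWarnings = FALSE)
for (f in c("f1.csv", "f2.csv")) {
  write.csv2(data.frame(col1 = 1:3, col2 = 4:6), file.path(dir, f), row.names = FALSE)
}

nm <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file and prepend its name as a column, then stack with rbind().
df <- do.call(rbind, lapply(nm, function(x) {
  cbind(src = basename(x), read.csv2(x))
}))

print(df)
```

Because the file name is attached inside the lapply() call, each data frame already carries its source before rbind() combines them, so no .id argument is needed.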

huangapple
  • Posted on January 9, 2023, 12:07:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75053103.html