将大文件在R中使用循环分割成小文件

huangapple go评论71阅读模式
英文:

Split large file in R into smaller files with a loop

问题

我有一个包含12,626,756行的CSV文件,我需要将其拆分成较小的文件,以便同事可以在Excel中打开。我想创建一个循环,将文件拆分为适合Excel行限制的文件,并将它们导出为CSV文件,直到达到文件末尾(应生成13个文件)。

#步骤1:加载数据
data <- read.csv(".../Desktop/Data/file.csv", header = TRUE)

#步骤2:计算行数
totalrows <- nrow(data)

#步骤3:确定需要多少个拆分文件
excelrowlimit <- 1048576 - 5
filesrequired <- ceiling(totalrows / excelrowlimit)

例如:

csvfile 1应包含行1:1048571
csvfile 2应包含行1048572:2097143
csvfile 3应包含行2097144:3145715
csvfile 4应包含行3145716:4194287
...等等

如何编写一个循环语句,以(1)按所需的文件数进行拆分,(2)为每个CSV导出提供不同的文件名?

英文:

I have a csv file with 12,626,756 rows that I need to split into smaller files so a colleague can open them in Excel. I want to create a loop that splits the file into files that fit within Excel's row limit and exports them as CSV files until it reaches the end (it should produce 13 files)

#STEP 1: load data
data &lt;- read.csv(&quot;.../Desktop/Data/file.csv&quot;, header = TRUE)

#STEP 2: count rows
totalrows &lt;- nrow(data)

#STEP 3: determine how many splits you need 
excelrowlimit &lt;- 1048576 - 5
filesrequired &lt;- ceiling(totalrows/ excelrowlimit)

for example:

csvfile 1 should contain rows 1:1048571
csvfile 2 should contain rows 1048572:2097143
csvfile 3 should contain rows 2097144:3145715
csvfile 4 should contain rows 3145716:4194287
... and so on

how can I write a loop statement that (1) splits by number of files needed and (2) gives a different file name to each csv export?

答案1

得分: 1

这是扩展我上面评论的解决方案。与其他任何解决方案相比,这应该具有更小的内存需求,因为它不需要复制原始数据帧的全部或部分。

library(tidyverse)

rowCount <- 1048571
data %>%
  mutate(Group = ceiling((row_number()) / rowCount)) %>%
  group_by(Group) %>%
  group_walk(
    function(.x, .y) {
      write.csv(.x, file = paste0("file", .y$Group, ".csv"))
    }
  )
英文:

Here's a solution expanding my comment above. This should have a smaller memory requirement than any other solution as it does not require copying all or part of the original data frame.

library(tidyverse)

rowCount &lt;- 1048571
data %&gt;% 
  mutate(Group = ceiling((row_number()) / rowCount)) %&gt;% 
  group_by(Group) %&gt;% 
  group_walk(
    function(.x, .y) {
      write.csv(.x, file = paste0(&quot;file&quot;, .y$Group, &quot;.csv&quot;))
    }
  )

答案2

得分: 0

这里是一个示例,演示如何使用 split_at 来设置所需的文件大小。

在最后部分,你当然可以根据需要更改 write_csv 的参数,例如设置路径、分隔符等。

library(tidyverse)

split_at <- 5

data.frame(x = 1:19) %>%
  mutate(group = (row_number() - 1) %/% !! split_at) %>%
  group_split(group) %>%
  map(.f = ~write_csv(.x, file = paste0('file ', unique(.x$group), '.csv')))
英文:

Here‘s an example of how to achieve this where you can set the desired file size with split_at.

In the last part, you can of course change the write_csv arguments as you want, e.g. to set a path, a delimiter etc.

library(tidyverse)

split_at &lt;- 5

data.frame(x = 1:19) %&gt;%
  mutate(group = (row_number() - 1) %/% !! split_at) %&gt;%
  group_split(group) %&gt;%
  map(.f = ~write_csv(.x, file = paste0(&#39;file &#39;, unique(.x$group), &#39;.csv&#39;)))

答案3

得分: 0

#STEP 1: 加载数据
data <- read.csv(".../Desktop/Data/file.csv", header = TRUE)

对数据进行分组,每500行一个分组

data <- data %>% mutate(Group = ceiling(1:nrow(.)/500))

按照分组写出CSV文件

for(i in unique(data$Group)){
data %>% filter(Group == i) %>% select(-Group) %>%
write.csv(paste0("/your/path/",i,".csv"))
}

英文:

I assume that split data by every 500 rows.You can mutate a column to lable group.Then put in for loop to write out csv according to this column.

#STEP 1: load data
data &lt;- read.csv(&quot;.../Desktop/Data/file.csv&quot;, header = TRUE)

# mutate a column to lable the group
data &lt;- data %&gt;% mutate(Group = ceiling(1:nrow(.)/500))

# write out csv by group
for(i in unique(data$Group)){
  data %&gt;% filter(Group == i) %&gt;% select(-Group) %&gt;%
    write.csv(paste0(&quot;/your/path/&quot;,i,&quot;.csv&quot;))
}

huangapple
  • 本文由 发表于 2023年7月11日 12:53:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76658802.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定