July 11, 2023, 12:53:18

Split large file in R into smaller files with a loop

Question

I have a CSV file with 12,626,756 rows that I need to split into smaller files so a colleague can open them in Excel. I want to write a loop that splits the data into chunks that fit within Excel's row limit and exports each chunk as a CSV file until it reaches the end of the data (this should produce 13 files).

#STEP 1: load data
data <- read.csv(".../Desktop/Data/file.csv", header = TRUE)
#STEP 2: count rows
totalrows <- nrow(data)
#STEP 3: determine how many splits you need
excelrowlimit <- 1048576 - 5
filesrequired <- ceiling(totalrows / excelrowlimit)

For example:

csvfile 1 should contain rows 1:1048571
csvfile 2 should contain rows 1048572:2097142
csvfile 3 should contain rows 2097143:3145713
csvfile 4 should contain rows 3145714:4194284
... and so on

How can I write a loop that (1) splits the data into the required number of files and (2) gives each CSV export a different file name?
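
As a quick check of the arithmetic in the question, the first and last row of each file can be computed from the chunk size instead of being typed by hand. This is only a sketch: chunk_bounds is a hypothetical helper name that does not appear in the original post. With a chunk size of 1048576 - 5 = 1048571, every full file holds exactly 1,048,571 rows, so the second file ends at row 2,097,142.

```r
# Hypothetical helper (not from the original post): first and last row per file
chunk_bounds <- function(total_rows, chunk_size) {
  starts <- seq(1, total_rows, by = chunk_size)
  # The last chunk is cut off at the end of the data
  ends <- pmin(starts + chunk_size - 1, total_rows)
  data.frame(file = seq_along(starts), start = starts, end = ends)
}

bounds <- chunk_bounds(12626756, 1048576 - 5)
nrow(bounds)      # 13 files, matching ceiling(totalrows / excelrowlimit)
head(bounds, 4)   # row ranges for the first four files
```

The last row of the table shows the final, shorter file, which ends at row 12,626,756.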

Answer 1

Score: 1

Here's a solution expanding on my comment above. It should need less memory than the other solutions, because it does not require copying all or part of the original data frame.

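library(tidyverse)
# Rows per file: Excel's limit of 1,048,576 minus 5
rowCount <- 1048571
data %>%
  # Label each row with the number of the file it belongs to
  mutate(Group = ceiling((row_number()) / rowCount)) %>%
  group_by(Group) %>%
  group_walk(
    # .x holds the rows of one group, .y is the one-row key with its Group value
    function(.x, .y) {
      write.csv(.x, file = paste0("file", .y$Group, ".csv"))
    }
  )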
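
Since the question asks for an explicit loop, the same split can also be sketched in base R. This is a minimal illustration and not the answer author's method: the small stand-in data frame, chunk_size, out_dir, and written are all assumptions made for the example, and the real chunk size would be 1048571. Note that data[first:last, ] copies each slice, so this version trades some memory for simplicity.

```r
# Small stand-in data frame; the real `data` would come from read.csv()
data <- data.frame(id = 1:23, value = letters[1:23])
chunk_size <- 10        # assumed for the demo; 1048571 in the Excel case
out_dir <- tempdir()    # assumed output location for this sketch

files_required <- ceiling(nrow(data) / chunk_size)
written <- character(files_required)

for (i in seq_len(files_required)) {
  first <- (i - 1) * chunk_size + 1
  last <- min(i * chunk_size, nrow(data))
  # A different file name per chunk: file1.csv, file2.csv, ...
  written[i] <- file.path(out_dir, paste0("file", i, ".csv"))
  # row.names = FALSE keeps R's row numbers out of the exported file
  write.csv(data[first:last, , drop = FALSE], written[i], row.names = FALSE)
}

basename(written)
```

With 23 rows and a chunk size of 10, the loop writes three files, the last of which holds the remaining 3 rows.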

Answer 2

Score: 0

Here's an example of how to achieve this, where split_at sets the number of rows per file.

In the last step, you can of course change the write_csv arguments as needed, e.g. to set a path. For a different delimiter, readr's write_delim takes a delim argument.

library(tidyverse)
split_at <- 5
data.frame(x = 1:19) %>%
  mutate(group = (row_number() - 1) %/% !! split_at) %>%
  group_split(group) %>%
  map(.f = ~write_csv(.x, file = paste0('file ', unique(.x$group), '.csv')))
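
To see what the group column looks like, the integer-division step can be tried on its own in base R. This is a small sketch with the same numbers as the answer (19 rows, split_at of 5); row_id is a name introduced here for illustration and stands in for row_number().

```r
split_at <- 5
row_id <- seq_len(19)   # stand-in for row_number() on a 19-row data frame

# Zero-based group index: rows 1-5 give 0, rows 6-10 give 1, and so on
group <- (row_id - 1) %/% split_at

table(group)   # three groups of 5 rows and a final group of 4
```

Because the index starts at 0, the answer's code names its files 'file 0.csv' through 'file 3.csv'. Using ceiling(row_id / split_at) instead would give one-based groups.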

Answer 3

Score: 0

I assume you want to split the data every 500 rows. You can use mutate to add a column that labels the group of each row, then use a for loop to write one CSV file per group based on that column.

library(dplyr)
#STEP 1: load data
data <- read.csv(".../Desktop/Data/file.csv", header = TRUE)
# mutate a column to label the group
data <- data %>% mutate(Group = ceiling(1:nrow(.)/500))
# write out csv by group
for(i in unique(data$Group)){
  data %>% filter(Group == i) %>% select(-Group) %>%
    write.csv(paste0("/your/path/",i,".csv"))
}
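
After a split like this, it can be worth checking that no rows were lost. The sketch below uses the same 500-row labelling but plain base R, writes to a temporary folder, and reads the pieces back to compare row counts. The stand-in data frame, the split_check folder name, and the variables pieces, files, and rows_back are assumptions made for the example.

```r
# Stand-in data; the real `data` would come from read.csv()
data <- data.frame(id = 1:1234, value = sqrt(1:1234))
group <- ceiling(seq_len(nrow(data)) / 500)      # same labelling as above
out_dir <- file.path(tempdir(), "split_check")   # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)

# split() returns one data frame per group label
pieces <- split(data, group)
for (g in names(pieces)) {
  write.csv(pieces[[g]], file.path(out_dir, paste0(g, ".csv")), row.names = FALSE)
}

# Read the pieces back in numeric order and count their rows
files <- file.path(out_dir, paste0(sort(unique(group)), ".csv"))
rows_back <- vapply(files, function(f) nrow(read.csv(f)), integer(1))
sum(rows_back) == nrow(data)   # TRUE when every row landed in exactly one file
```

Here the three files hold 500, 500, and 234 rows, which add up to the original 1,234.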
