2023-05-28 07:38:52

How to split a massive text file with header while deleting original file?

Question

I have a massive, pipe-delimited .txt file (300 GB) that I'm trying to split into 1 GB files for further analysis in Python. My PC does not have enough space for another 300 GB, though, so I would like to delete chunks of the original file as I split it. The file also has a header that I would like to keep in all the split files.

I have tried splitting it in Bash, but cannot figure out a way to do this while deleting the original file. The file is too big to load into Python in full.

Edit: I want to do something like this, but with a header:

https://unix.stackexchange.com/questions/628747/split-large-file-into-chunks-and-delete-original

Answer 1

Score: 2

Assumptions:

Data fields do not include embedded linefeeds; otherwise the head and/or tail commands could (erroneously) split data lines.
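
One rough way to test this assumption before splitting is to compare each line's field count with the header's. This is only a sketch: it assumes the file is pipe-delimited as in the question and that fields are not quoted, and `check_field_counts` is a hypothetical helper name, not part of any existing tool. It reads the whole file once, so on 300 GB it will take a while.

```shell
# Hypothetical helper: print the line numbers of records whose number of
# pipe-separated fields differs from the header's (the first line).
# A mismatch hints at an embedded linefeed or a stray delimiter.
# Usage: check_field_counts FILE
check_field_counts() {
    awk -F'|' '
        NR == 1        { expected = NF; next }               # remember header field count
        NF != expected { print NR ": " NF " fields"; bad = 1 }
        END            { exit (bad ? 1 : 0) }                # non-zero exit on any mismatch
    ' "$1"
}
```

If this prints nothing and exits with status 0, every line has the same number of fields as the header, which makes embedded linefeeds unlikely (though not impossible).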

Expanding on this answer to the unix.stackexchange.com link provided by OP:

numfiles=100                                       # number of files to create, chosen beforehand by OP
numlines=100000                                    # number of data lines to move to each new file
head -n 1 bigfile > header                         # make a copy of the header line
hdrsize=$(wc -c < header)                          # header size in bytes
for ((i=numfiles; i>1; i--))
do
    newf=newfile.$i
    cp header "${newf}"
    tail -n "${numlines}" bigfile >> "${newf}"     # append the last numlines lines of bigfile
    # cut only the moved data off bigfile; the chunk's size includes the header, so subtract it
    truncate -s -$(( $(wc -c < "${newf}") - hdrsize )) bigfile
done
mv bigfile newfile.1                               # rename what's left of the original file (it already starts with the header)

NOTE: requires truncate (part of GNU coreutils, e.g., sudo apt-get install coreutils)
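
The question asks for chunks of about 1 GB, while the script above works with a line count. A rough way to pick numlines and numfiles is to sample the first lines of the file and derive the average number of bytes per line. This is a sketch under the assumption that line lengths are fairly uniform; `estimate_split` and the default sample size are hypothetical choices for illustration, and the result is only an estimate.

```shell
# Hypothetical helper: estimate numlines (lines per chunk) and numfiles
# for a target chunk size, from the average line length of a sample.
# Usage: estimate_split FILE TARGET_BYTES [SAMPLE_LINES]
estimate_split() (
    file=$1 target=$2 sample=${3:-100000}
    total_bytes=$(wc -c < "$file")
    # head may return fewer lines than requested on a small file,
    # so measure what was actually sampled.
    sample_bytes=$(head -n "$sample" "$file" | wc -c)
    sample_lines=$(head -n "$sample" "$file" | wc -l)
    # Integer arithmetic: numlines is rounded down, numfiles is rounded up.
    numlines=$(( target * sample_lines / sample_bytes ))
    numfiles=$(( (total_bytes + target - 1) / target ))
    echo "numlines=$numlines numfiles=$numfiles"
)
```

For example, `estimate_split bigfile 1000000000` prints two assignments that can be copied into the script. Because the loop moves lines to numfiles-1 new files and the remainder becomes newfile.1, that last file absorbs any estimation error, so it may end up smaller or larger than the others.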

Performance:

bigfile : 10 million lines, 810 MBytes
10 seconds: Cygwin running in a Win10 virtual machine (Ubuntu host, NVMe Gen4 PCIe drive)
2 seconds: running directly on the same Ubuntu host
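
Since the original file is consumed during the split, it may be worth checking the result afterwards. The sketch below assumes the header file and the newfile.N names produced by the script above; `verify_chunks` is a hypothetical helper. It checks that every chunk starts with the header line and prints the total number of data lines, which can be compared with a line count taken before the split (if one was recorded).

```shell
# Hypothetical helper: check that every chunk starts with the header line
# and print the total number of data lines across all chunks.
# Usage: verify_chunks HEADER_FILE CHUNK...
verify_chunks() (
    header=$(head -n 1 "$1")
    shift
    total=0
    for f in "$@"; do
        if [ "$(head -n 1 "$f")" != "$header" ]; then
            echo "missing header: $f" >&2
            exit 1
        fi
        lines=$(wc -l < "$f")
        total=$(( total + lines - 1 ))   # exclude the header line
    done
    echo "$total"
)
```

A typical call would be `verify_chunks header newfile.*`.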
