How to split a massive text file with header while deleting original file?
Question
I have a massive, pipe-delimited .txt file (300 GB) that I'm trying to split into 1 GB files for further analysis in Python. My PC does not have enough space for another 300 GB, though, so I would like to delete chunks of the original file as I split it. The file also has a header that I would like to keep in all the split files.
I have tried splitting it in Bash, but cannot figure out a way to do this while deleting the original file. The file is too big to load into Python in full.
Edit: I want to do something like this, but with a header:
https://unix.stackexchange.com/questions/628747/split-large-file-into-chunks-and-delete-original
Answer 1
Score: 2
Assumptions:
- data fields do not include embedded linefeeds; otherwise the `head` and/or `tail` commands could (erroneously) split data lines
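One rough way to test that assumption before touching the file is to compare each line's field count against the header's. The sketch below is only a heuristic: `check_field_counts` is a made-up helper name, it assumes `|` never appears inside a field, and it needs a full read of the file, which will take a while at 300 GB. An embedded linefeed that happens to leave every fragment with the right number of fields would go unnoticed.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original answer):
# report data lines whose pipe-delimited field count differs from the header's.
# Returns 0 if every line matches, 1 otherwise.
check_field_counts() {
    awk -F'|' '
        NR == 1 { n = NF; next }          # remember the header field count
        NF != n {
            bad++
            if (bad <= 5)                 # print only the first few offenders
                printf "line %d: %d fields, expected %d\n", NR, NF, n
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

Usage would be something like `check_field_counts bigfile && echo "field counts look consistent"`.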
Expanding on this answer to the unix.stackexchange.com link provided by OP:
    numfiles=100        # OP determines beforehand how many files to create
    numlines=100000     # OP determines beforehand how many lines to move to each new file

    head -n 1 bigfile > header      # make a copy of the header line
    hdrsize=$(wc -c < header)       # header bytes, excluded from the truncate below

    for ((i=numfiles; i>1; i--))
    do
        newf=newfile.$i
        cp header "${newf}"
        tail -n "${numlines}" bigfile >> "${newf}"
        # shrink bigfile by the bytes just copied (new file size minus the header)
        truncate -s -$(( $(wc -c < "${newf}") - hdrsize )) bigfile
    done

    mv bigfile newfile.1            # rename what's left of the original file
NOTE: requires `truncate` (part of the GNU coreutils, e.g., `sudo apt-get install coreutils`)
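For picking `numfiles` and `numlines` to land near a target chunk size (1 GB in OP's case), something along these lines might help. This is only a sketch: `estimate_split` and its `sample` parameter are names I made up, and it assumes the first few thousand lines are representative of the average line length. If later lines are much longer or shorter, the chunks will drift from the target and what remains for `newfile.1` may be larger or smaller than the rest.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original answer):
# estimate numlines and numfiles for a target chunk size in bytes.
# Usage: estimate_split FILE TARGET_BYTES [SAMPLE_LINES]
estimate_split() {
    local file=$1 target_bytes=$2 sample=${3:-10000}
    local total_bytes sample_lines sample_bytes numlines numfiles

    total_bytes=$(wc -c < "$file")
    # Measure the first $sample lines to get an average line length
    sample_lines=$(head -n "$sample" "$file" | wc -l)
    sample_bytes=$(head -n "$sample" "$file" | wc -c)
    (( sample_lines > 0 )) || return 1

    # lines per chunk = target size / average bytes per line
    numlines=$(( target_bytes * sample_lines / sample_bytes ))
    (( numlines > 0 )) || numlines=1
    # number of chunks = total size / target size, rounded up
    numfiles=$(( (total_bytes + target_bytes - 1) / target_bytes ))

    echo "numlines=${numlines} numfiles=${numfiles}"
}
```

For example, `estimate_split bigfile $((1024**3))` prints a `numlines=... numfiles=...` pair that can be copied into the script. Treat the numbers as a starting point, not an exact answer.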
Performance:
- `bigfile`: 10 million lines, 810 MBytes
- 10 seconds: `cygwin` running in Win10 virtual machine (Ubuntu host, NVME Gen4 PCIe drive)
- 2 seconds: running directly on the same Ubuntu host