How to split a massive text file with header while deleting original file?

Question

I have a massive, pipe-delimited .txt file (300 GB) that I'm trying to split into 1 GB files for further analysis in Python. My PC does not have enough space for another 300 GB, though, so I would like to delete chunks of the original file as I split it. The file also has a header that I would like to keep in all the split files.

I have tried splitting it in Bash, but cannot figure out a way to do this while deleting the original file. The file is too big to load into Python in full.

Edit: I want to do something like this, but with a header:

https://unix.stackexchange.com/questions/628747/split-large-file-into-chunks-and-delete-original

Answer 1

Score: 2

Assumptions:

  • data fields do not include embedded linefeeds; otherwise the head and/or tail commands could (erroneously) split data lines
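
One rough way to test this assumption before splitting is to compare each line's pipe-delimited field count with the header's. This is only a sketch and not part of the original answer: the function name `count_bad_lines` is hypothetical, and it assumes fields never contain a literal `|`. A non-zero result suggests embedded linefeeds (or embedded delimiters) somewhere in the file. Note that it reads the whole file once, which will take a while at 300 GB.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many lines have a different number of
# pipe-delimited fields than the header (line 1).
count_bad_lines() {
    awk -F'|' 'NR == 1    { want = NF; next }   # remember the header field count
               NF != want { bad++ }             # count lines that disagree
               END        { print bad + 0 }' "$1"
}

# Example (assumes the file is named bigfile, as in the answer):
#   count_bad_lines bigfile
```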

Expanding on this answer to the unix.stackexchange.com link provided by OP:

numfiles=100                                       # OP determines beforehand how many files to create
numlines=100000                                    # OP determines beforehand how many lines to move to each new file

head -n 1 bigfile > header                         # make a copy of the header line
hdrbytes=$(wc -c < header)                         # header size in bytes

for ((i=numfiles; i>1; i--))
do
    newf=newfile.$i
    cp header "${newf}"
    tail -n "${numlines}" bigfile >> "${newf}"
    # remove only the data bytes just copied; newf also contains the header, so subtract its size
    truncate -s -$(( $(wc -c < "${newf}") - hdrbytes )) bigfile
done

mv bigfile newfile.1                               # rename what's left of the original file

NOTE: requires truncate (part of GNU coreutils; install with, e.g., sudo apt-get install coreutils)
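
The values of numfiles and numlines have to be picked before the loop runs. One possible way to estimate them for a target chunk size is sketched below. It is not from the original answer: the helper names `estimate_numlines` and `estimate_numfiles` are hypothetical, and the estimate assumes the first lines of the file are representative of the average line length. numfiles is rounded down on purpose, so that newfile.1 absorbs the remainder rather than the loop asking for more lines than the file holds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate how many lines add up to about target_bytes,
# using the average line length of the first sample_lines lines.
estimate_numlines() {
    local file=$1
    local target_bytes=$2
    local sample_lines=${3:-100000}
    local got_lines got_bytes
    got_lines=$(( $(head -n "$sample_lines" "$file" | wc -l) ))
    got_bytes=$(( $(head -n "$sample_lines" "$file" | wc -c) ))
    if [ "$got_bytes" -eq 0 ]; then
        echo 0
        return
    fi
    echo $(( target_bytes * got_lines / got_bytes ))
}

# Hypothetical helper: number of chunks of about target_bytes, rounded down
# (but at least 1), so the leftover data ends up in the first file.
estimate_numfiles() {
    local file=$1
    local target_bytes=$2
    local total_bytes n
    total_bytes=$(( $(wc -c < "$file") ))
    n=$(( total_bytes / target_bytes ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Example for 1 GiB chunks (file name bigfile as in the answer):
#   numlines=$(estimate_numlines bigfile $((1024 * 1024 * 1024)))
#   numfiles=$(estimate_numfiles bigfile $((1024 * 1024 * 1024)))
```

Because the line count is only an estimate, the resulting files will be close to, not exactly, the target size.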

Performance:

  • bigfile : 10 million lines, 810 MBytes
  • 10 seconds: cygwin running in Win10 virtual machine (Ubuntu host, NVME Gen4 PCIe drive)
  • 2 seconds: running directly on the same Ubuntu host
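
Since the original file is consumed by the split, it may be worth checking the result afterwards. The sketch below is an illustration rather than part of the original answer; the helper names `check_headers` and `count_data_lines` are hypothetical. It confirms that every output file starts with the saved header and totals the data lines across all outputs, which can be compared with a line count taken before splitting (if one was recorded).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every given file starts with the
# header line stored in the file named by the first argument.
check_headers() {
    local header=$1
    shift
    local f
    for f in "$@"; do
        if [ "$(head -n 1 "$f")" != "$(cat "$header")" ]; then
            echo "header mismatch: $f" >&2
            return 1
        fi
    done
}

# Hypothetical helper: total number of data lines across the given files,
# not counting the one header line at the top of each file.
count_data_lines() {
    local total=0
    local f n
    for f in "$@"; do
        n=$(( $(wc -l < "$f") ))
        total=$(( total + n - 1 ))
    done
    echo "$total"
}

# Example (file names as in the answer):
#   check_headers header newfile.* && count_data_lines newfile.*
```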

Posted by huangapple on 2023-05-28 07:38:52
Source link: https://go.coder-hub.com/76349434.html