How to get a list of the files in an archive file without downloading it
Question
I am trying to get the list of file names in large archive files (zip, 7z, tar, rar, etc.) located on a remote server. I want to avoid downloading the files because of the network cost.
One option is to use HTTP range requests (1, 2, 3); however, each archive format stores its directory of entries in a different place and layout. The Apache commons-compress library supports most of these formats, so I would like to use it to solve this problem. How can I use it on remote archive files without downloading them?
As with the Python libraries (1, 2), do you have any advice for Java?
Answer 1
Score: 1
If you can't run anything on the server side, you can issue a range request for just the end of the zip file and reconstruct a zip file locally with no contents, on which you can run unzip to list the entries. You would write zeros in place of the content.
I just tried zeroing out everything before the central directory on a large zip file, and unzip listed the contents just fine.
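For Java, the range request itself can be sketched with the JDK's java.net.http client (Java 11+). This is only a sketch under assumptions: the class name ZipTailFetcher and its method names are made up for illustration, and it assumes the server honors range requests. A suffix range such as bytes=-1024 asks for the last 1024 bytes, and the Content-Range header of a 206 response normally reports the total size of the file after the slash.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: fetches only the tail of a remote archive.
final class ZipTailFetcher {

    // Builds a GET request for the last tailBytes bytes (HTTP suffix range).
    static HttpRequest tailRequest(URI archive, int tailBytes) {
        return HttpRequest.newBuilder(archive)
                .header("Range", "bytes=-" + tailBytes)
                .GET()
                .build();
    }

    // Extracts the total size from a Content-Range value such as
    // "bytes 1000-1999/2000"; returns -1 when the server reports "*".
    static long totalSize(String contentRange) {
        int slash = contentRange.lastIndexOf('/');
        String total = contentRange.substring(slash + 1).trim();
        return total.equals("*") ? -1 : Long.parseLong(total);
    }

    // Sends the request and returns the tail bytes. Anything other than
    // 206 Partial Content is treated as "range requests not supported".
    static byte[] fetchTail(HttpClient client, URI archive, int tailBytes)
            throws IOException, InterruptedException {
        HttpResponse<byte[]> resp = client.send(
                tailRequest(archive, tailBytes),
                HttpResponse.BodyHandlers.ofByteArray());
        if (resp.statusCode() != 206) {
            throw new IOException("range request not honored: HTTP " + resp.statusCode());
        }
        return resp.body();
    }
}
```

A call would look like ZipTailFetcher.fetchTail(HttpClient.newHttpClient(), URI.create("https://example.com/big.zip"), 65557), where the URL is a placeholder and 65557 is the largest possible size of the end of central directory record (22 bytes plus a 65535-byte comment).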
To do this, you could either a) search for the end of central directory record, and then possibly the zip64 end of central directory locator and the zip64 end of central directory record, in order to determine the offset of the central directory, and read from there, or b) read larger and larger portions of the end of the zip file, say doubling each time, until unzip -l works. If you have not captured the entire central directory, unzip -l will report "start of central directory not found".
To use range requests, you will need to know the size of the zip file. Then for b), you can read, say, the last 1K, the 1K before that, the 2K before that, the 4K before that, and so on, until unzip -l works. Each time, you would update a file that consists of zeros up to the start of what you have accumulated from the end of the zip file so far, followed by the accumulated data. To do this efficiently, start with a file of all zeros that has the same length as the zip file on the server. Then, as you accumulate more data from the end, write over the end of that file, rerunning unzip -l each time.
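The bookkeeping for b) could look roughly like this in Java. It is a sketch, not a tested tool: TailAccumulator, tailWindows and patch are hypothetical names, and the download of each window is left out. One caveat is labeled in the code: the Javadoc for RandomAccessFile.setLength leaves the contents of the extended region unspecified, so treating it as zeros is an assumption about the platform.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for approach b).
final class TailAccumulator {

    // Returns inclusive {start, end} byte ranges, walking backwards from the
    // end of the file: firstChunk bytes, then as many bytes again as have
    // been fetched so far (1K, 1K, 2K, 4K, ... when firstChunk is 1024).
    static List<long[]> tailWindows(long fileSize, long firstChunk) {
        List<long[]> windows = new ArrayList<>();
        long end = fileSize - 1;
        long len = firstChunk;
        long fetched = 0;
        while (end >= 0) {
            long start = Math.max(0, end - len + 1);
            windows.add(new long[] {start, end});
            fetched += end - start + 1;
            end = start - 1;
            len = fetched; // the next window doubles the accumulated total
        }
        return windows;
    }

    // Writes one downloaded chunk at its original offset in a local
    // placeholder file that has the same length as the remote archive.
    // ASSUMPTION: the region created by setLength reads back as zeros; that
    // is common in practice but not guaranteed by the Javadoc.
    static void patch(Path placeholder, long fileSize, long start, byte[] chunk)
            throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(placeholder.toFile(), "rw")) {
            raf.setLength(fileSize);
            raf.seek(start);
            raf.write(chunk);
        }
    }
}
```

Each window maps to a request header of the form Range: bytes=start-end. After every patch you would run unzip -l on the placeholder file and stop as soon as it succeeds.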
If you want to try a), then you'll need to read and understand the zip file format appnote.
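As a starting point for a), here is a minimal Java sketch that works purely on the bytes fetched from the end of the file. It assumes a plain (non-zip64), single-disk archive, that the fetched tail ends exactly at the end of the file, and that names are UTF-8 (the format specifies CP437 when general-purpose flag bit 11 is not set). ZipTailReader and its methods are hypothetical names; the field offsets follow the end of central directory record and central directory file header layouts in the appnote.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for approach a); no zip64 support.
final class ZipTailReader {
    static final int EOCD_SIG = 0x06054b50; // "PK\5\6"
    static final int CEN_SIG = 0x02014b50;  // "PK\1\2"
    static final int EOCD_MIN = 22;         // record size without comment
    static final int CEN_FIXED = 46;        // fixed part of a directory entry

    // Scans backwards for the end of central directory record and checks
    // that its comment length matches the bytes remaining after it.
    static int findEocd(byte[] tail) {
        ByteBuffer buf = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = tail.length - EOCD_MIN; i >= 0; i--) {
            if (buf.getInt(i) == EOCD_SIG
                    && (buf.getShort(i + 20) & 0xFFFF) == tail.length - i - EOCD_MIN) {
                return i;
            }
        }
        return -1;
    }

    // Lists entry names from a tail that covers the whole central directory.
    static List<String> listNames(byte[] tail, long fileSize) {
        ByteBuffer buf = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
        int eocd = findEocd(tail);
        if (eocd < 0) {
            throw new IllegalArgumentException("end of central directory record not found");
        }
        long cdSize = buf.getInt(eocd + 12) & 0xFFFFFFFFL;
        long cdOffset = buf.getInt(eocd + 16) & 0xFFFFFFFFL;
        if (cdSize == 0xFFFFFFFFL || cdOffset == 0xFFFFFFFFL) {
            throw new UnsupportedOperationException("zip64 archive: read the zip64 records");
        }
        long tailStart = fileSize - tail.length; // absolute offset of tail[0]
        if (cdOffset < tailStart) {
            throw new IllegalArgumentException("tail too short, need bytes from offset " + cdOffset);
        }
        int pos = (int) (cdOffset - tailStart);
        int end = (int) Math.min(tail.length, pos + cdSize);
        List<String> names = new ArrayList<>();
        while (pos + CEN_FIXED <= end && buf.getInt(pos) == CEN_SIG) {
            int nameLen = buf.getShort(pos + 28) & 0xFFFF;
            int extraLen = buf.getShort(pos + 30) & 0xFFFF;
            int commentLen = buf.getShort(pos + 32) & 0xFFFF;
            if (pos + CEN_FIXED + nameLen > end) break; // truncated entry
            names.add(new String(tail, pos + CEN_FIXED, nameLen, StandardCharsets.UTF_8));
            pos += CEN_FIXED + nameLen + extraLen + commentLen;
        }
        return names;
    }
}
```

If listNames reports that the tail is too short, the offset in the message tells you where the next range request has to start, so two requests are usually enough: one for the end record and one for the central directory itself.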