How to extract many files from multiple 7z using Python?
Question
I need to extract 700k jpg files that are dispersed among 50 7z files. I have a txt file that has one row for each file I need. The row contains the target 7z file and location and file name.
This is what the txt file looks like:
A20000to22000.7z, A20000to22000/rows/A21673.Lo1sign.jpg
B20000to22000.7z, B20000to22000/rows/B21673.Lo1sign.jpg
I currently am able to extract files with Python but only from one 7z at a time. I use this command to do that:
7zz e A0000to22000.7z @f1.txt
This takes far too long, though. Is there any way to edit the command, or use another approach, so that I can extract many different files from many different 7z files at once?
Answer 1
Score: 2
Updated Answer
With the new information that there are lots of files to retrieve from each archive, a modified approach is needed.
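First we must generate a list of the files needed from each 7z archive, then process those lists in parallel. Something like this should build the lists, splitting each row on the comma and any spaces after it so that the stored paths carry no leading space:
awk -F', *' '{list = $1; sub(/\.7z$/, ".txt", list); print $2 > list}' joblist.txt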
That should make a file called A20000to22000.txt that contains all the files to be extracted from the archive A20000to22000.7z and similarly for B20000to22000.7z it should produce B20000to22000.txt.
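Since the question asks about Python, the same grouping step can be sketched there as well. This is only a minimal sketch under the assumption that every row of joblist.txt has the form "archive.7z, path/inside/archive"; the function names group_by_archive and write_list_files are made up for illustration.

```python
from collections import defaultdict
from pathlib import Path


def group_by_archive(lines):
    """Map each archive name to the list of member paths requested from it."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first comma only, then trim the space that follows it.
        # Assumption: every non-blank row contains a comma.
        archive, member = (part.strip() for part in line.split(",", 1))
        groups[archive].append(member)
    return dict(groups)


def write_list_files(groups, out_dir="."):
    """Write one list file per archive (A20000to22000.7z -> A20000to22000.txt)."""
    out_dir = Path(out_dir)
    written = []
    for archive, members in groups.items():
        list_path = out_dir / (Path(archive).stem + ".txt")
        list_path.write_text("\n".join(members) + "\n", encoding="utf-8")
        written.append(list_path)
    return written
```

It could be driven with something like write_list_files(group_by_archive(open("joblist.txt", encoding="utf-8"))), which should leave one .txt list file per archive in the current directory.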
Don't proceed until the generated .txt files look correct: each should hold one archive-internal path per line, with no stray spaces.
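With hundreds of thousands of lines, eyeballing the list files is tedious, so a small check may help. The sketch below assumes, as in the sample rows, that every path inside an archive starts with a folder named after the archive; find_suspect_lines is a hypothetical helper, not part of any tool.

```python
from pathlib import Path


def find_suspect_lines(list_path):
    """Return (line_number, text) pairs that look wrong in one list file."""
    list_path = Path(list_path)
    # Assumption taken from the sample rows: members of A20000to22000.7z
    # all live under "A20000to22000/".
    expected_prefix = list_path.stem + "/"
    suspects = []
    with list_path.open(encoding="utf-8") as fh:
        for number, raw in enumerate(fh, start=1):
            text = raw.rstrip("\n")
            # Flag surrounding whitespace, empty lines, and unexpected prefixes.
            if text != text.strip() or not text.startswith(expected_prefix):
                suspects.append((number, text))
    return suspects
```

An empty result for every list file suggests the lists are in the shape 7zz expects; any reported line is worth inspecting before extracting.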
Now we need to process the .txt files in parallel with GNU Parallel. That should look something like this:
parallel --dry-run 7zz e {.}.7z @{} ::: *to*.txt
I used *to*.txt in order to avoid processing the original joblist.txt.
If that command looks correct, remove --dry-run and run for real.
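If GNU Parallel is not available, roughly the same thing can be done from Python by starting one 7zz process per list file through a thread pool. This is a sketch rather than a tuned solution: it assumes 7zz is on the PATH and that each list file sits next to its archive, and the names build_command and extract_all, as well as the default of 4 workers, are arbitrary choices for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(list_path):
    """Build the 7zz call for one list file: 7zz e <stem>.7z @<stem>.txt"""
    list_path = Path(list_path)
    archive = list_path.with_suffix(".7z")
    return ["7zz", "e", str(archive), f"@{list_path}"]


def extract_all(list_paths, max_workers=4, runner=subprocess.run):
    """Run one 7zz process per list file, a few at a time; return exit codes."""
    list_paths = list(list_paths)

    def run_one(list_path):
        # Threads are enough here: the real work happens in the 7zz child process.
        return runner(build_command(list_path)).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = pool.map(run_one, list_paths)
        return dict(zip(map(str, list_paths), codes))
```

A call such as extract_all(sorted(Path(".").glob("*to*.txt"))) mirrors the parallel command above; a non-zero exit code in the returned dictionary points at an archive that needs a closer look. How many workers actually help depends on the disk and CPU, so that number is worth experimenting with.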
Original Answer
Assuming joblist.txt looks like this:
A20000to22000.7z, A20000to22000/rows/A21673.Lo1sign.jpg
B20000to22000.7z, B20000to22000/rows/B21673.Lo1sign.jpg
and that corresponds to needing to run a command like:
7zz e A20000to22000.7z A20000to22000/rows/A21673.Lo1sign.jpg
you can do that in parallel with GNU Parallel like this:
parallel --dry-run --colsep ', ' 7zz e {1} {2} :::: joblist.txt
If it looks right, remove --dry-run and run for real.
Note that this is done in the terminal/shell and without Python, so it falls under the "another approach" you mentioned.