How to extract many files from multiple 7z using Python?
Question
I need to extract 700k jpg files that are dispersed among 50 7z files. I have a txt file that has one row for each file I need. The row contains the target 7z file and location and file name.
This is what the txt file looks like:
A20000to22000.7z, A20000to22000/rows/A21673.Lo1sign.jpg
B20000to22000.7z, B20000to22000/rows/B21673.Lo1sign.jpg
I currently am able to extract files with Python but only from one 7z at a time. I use this command to do that:
7zz e A0000to22000.7z @f1.txt
This is taking far too long, though. Is there any way to edit the command or use another approach so I can extract many different files from many different 7z files at once?
Answer 1
Score: 2
Updated Answer
With the new information that there are lots of files to retrieve from each archive, a modified approach is needed.
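First we must generate a list of the files needed from each 7z archive, then process those lists in parallel. This command should do the first part. The field separator ', *' also swallows the space after the comma, so it does not end up in front of each path in the list files:
awk -F', *' '{list = $1; sub(/\.7z$/, ".txt", list); print $2 > list}' joblist.txt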
That should make a file called A20000to22000.txt that contains all the files to be extracted from the archive A20000to22000.7z, and similarly B20000to22000.txt for B20000to22000.7z.
Don't proceed past here until the files ending in .txt look correct.
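Since the question mentions Python, the same grouping step can be sketched there as well. This is only a minimal sketch: the function names group_by_archive and write_list_files are made up for illustration, and it assumes every line of joblist.txt has the form "archive, path" with a single comma separating the two.

```python
from collections import defaultdict
from pathlib import Path


def group_by_archive(lines):
    """Map each archive name to the list of member paths requested from it."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first comma only; strip the space that follows it.
        archive, member = line.split(",", 1)
        groups[archive.strip()].append(member.strip())
    return dict(groups)


def write_list_files(groups, out_dir="."):
    """Write one list file per archive (A20000to22000.7z -> A20000to22000.txt)."""
    paths = []
    for archive, members in groups.items():
        list_path = Path(out_dir) / (Path(archive).stem + ".txt")
        list_path.write_text("\n".join(members) + "\n", encoding="utf-8")
        paths.append(list_path)
    return paths


# Small demonstration using the two sample rows from the question.
sample = [
    "A20000to22000.7z, A20000to22000/rows/A21673.Lo1sign.jpg",
    "B20000to22000.7z, B20000to22000/rows/B21673.Lo1sign.jpg",
]
print(group_by_archive(sample))
```

For the real job, pass the open joblist.txt file object to group_by_archive and hand the result to write_list_files, then inspect the generated .txt files just as you would after the awk command.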
Now we need to process the .txt files in parallel with GNU Parallel. That should look something like this:
parallel --dry-run 7zz e {.}.7z @{} ::: *to*.txt
I used *to*.txt in order to avoid processing the original joblist.txt.
If that command looks correct, remove --dry-run and run for real.
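If you would rather drive this step from Python than install GNU Parallel, a rough equivalent is to start one 7zz process per list file from a thread pool. The sketch below assumes 7zz is on your PATH and that each list file sits next to an archive with the same base name. The names build_command and extract_all, and the default of 4 workers, are illustrative choices rather than anything prescribed by 7-Zip.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(list_file, exe="7zz"):
    """Return the 7zz command that extracts the members named in list_file."""
    list_file = Path(list_file)
    archive = list_file.with_suffix(".7z")  # A20000to22000.txt -> A20000to22000.7z
    return [exe, "e", str(archive), f"@{list_file}"]


def extract_all(list_files, workers=4, dry_run=True):
    """Run one 7zz process per list file, at most `workers` at a time.

    Returns a list of (command, return_code) pairs; the return code is
    None in dry-run mode because nothing is executed.
    """
    commands = [build_command(p) for p in list_files]
    if dry_run:
        for cmd in commands:
            print(" ".join(cmd))
        return [(cmd, None) for cmd in commands]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads are enough here: each one just waits on an external process.
        codes = list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
    return list(zip(commands, codes))


if __name__ == "__main__":
    # Same glob as above, so the original joblist.txt is not picked up.
    extract_all(sorted(Path(".").glob("*to*.txt")), dry_run=True)
```

As with the parallel command, check the printed commands first and then call extract_all with dry_run=False. How many workers actually help depends on your disk and CPU, so it is worth trying a few values.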
Original Answer
Assuming joblist.txt looks like this:
A20000to22000.7z, A20000to22000/rows/A21673.Lo1sign.jpg
B20000to22000.7z, B20000to22000/rows/B21673.Lo1sign.jpg
and that corresponds to needing to run a command like:
7zz e A20000to22000.7z A20000to22000/rows/A21673.Lo1sign.jpg
you can do that in parallel with GNU Parallel like this:
parallel --dry-run --colsep , 7zz e {1} {2} :::: joblist.txt
If it looks right, remove --dry-run and run for real.
Note that this is done in the terminal/shell and without Python, so it falls under the "another approach" you mentioned.