How to extract many files from multiple 7z using Python?

Question

I need to extract 700k jpg files that are spread across 50 7z files. I have a txt file with one row for each file I need. Each row contains the target 7z file, followed by the location and file name inside it.

This is what the txt file looks like:

A20000to22000.7z, A20000to22000/rows/A21673.Lo1sign.jpg
B20000to22000.7z, B20000to22000/rows/B21673.Lo1sign.jpg

I currently am able to extract files with Python but only from one 7z at a time. I use this command to do that:

7zz e A0000to22000.7z @f1.txt

This is taking way too long, though. Is there any way to edit the command, or use another approach, so I can extract many different files from many different 7z files at once?

Answer 1

Score: 2

Updated Answer

With the new information that there are lots of files to retrieve from each archive, a modified approach is needed.

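First we must generate a list of the files needed from each 7z archive, then process those lists in parallel. This code should do that (the field separator ', *' also swallows the space after each comma, so the listed paths have no leading blank):

awk -F', *' '{list=$1; sub(/\.7z$/, ".txt", list); print $2 > list}' joblist.txt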

That should create a file called A20000to22000.txt listing every file to be extracted from the archive A20000to22000.7z, and similarly B20000to22000.txt for B20000to22000.7z.
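Since the question asks about Python, the same grouping step can be sketched there too. This is only an illustration, assuming every row of joblist.txt has the form "archive.7z, path/inside/archive"; the function name split_joblist and its out_dir parameter are made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def split_joblist(joblist_path, out_dir="."):
    """Group the wanted paths by archive and write one list file per archive."""
    wanted = defaultdict(list)
    with open(joblist_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first comma only and drop the space that follows it.
            # A row without a comma raises ValueError here.
            archive, member = (part.strip() for part in line.split(",", 1))
            wanted[archive].append(member)

    list_files = {}
    for archive, members in wanted.items():
        # A20000to22000.7z -> A20000to22000.txt
        list_path = Path(out_dir) / Path(archive).with_suffix(".txt").name
        list_path.write_text("\n".join(members) + "\n", encoding="utf-8")
        list_files[archive] = list_path
    return list_files
```

Calling split_joblist("joblist.txt") returns a dict mapping each archive name to the list file written for it, which makes it easy to inspect the result before extracting anything.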

Don't proceed until the generated .txt files look correct.

Now we need to process the .txt files in parallel with GNU Parallel. That should look something like this:

parallel --dry-run 7zz e {.}.7z @{} ::: *to*.txt

I used *to*.txt in order to avoid processing the original joblist.txt.

If that command looks correct, remove --dry-run and run for real.
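If you would rather drive this step from Python, a rough equivalent might look like the sketch below. It assumes 7zz is on your PATH and that each list file sits next to its archive with the same base name; build_command, extract_all and the runner parameter are names invented for this example, and the default of 4 workers is arbitrary.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(list_file):
    """Return the 7zz command for one list file, e.g. A20000to22000.txt."""
    list_file = Path(list_file)
    archive = list_file.with_suffix(".7z")  # A20000to22000.txt -> A20000to22000.7z
    return ["7zz", "e", str(archive), f"@{list_file}"]


def extract_all(list_files, max_workers=4, runner=subprocess.run):
    """Run one 7zz process per list file, at most max_workers at a time."""
    commands = [build_command(lf) for lf in list_files]
    # Threads are enough here: each one only waits on an external 7zz process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(runner, commands))
    return list(zip(commands, results))


# Possible usage, mirroring the *to*.txt glob above:
#   lists = sorted(Path(".").glob("*to*.txt"))
#   for command in map(build_command, lists):
#       print(" ".join(command))          # dry run: inspect first
#   for command, result in extract_all(lists):
#       print(result.returncode, " ".join(command))
```

Printing the output of build_command first plays the same role as --dry-run: nothing is extracted until you call extract_all.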

Original Answer

Assuming joblist.txt looks like this:

A20000to22000.7z, A20000to22000/rows/A21673.Lo1sign.jpg
B20000to22000.7z, B20000to22000/rows/B21673.Lo1sign.jpg

and that corresponds to needing to run a command like:

7zz e A20000to22000.7z A20000to22000/rows/A21673.Lo1sign.jpg

you can do that in parallel with GNU Parallel like this (--colsep takes a regular expression, so ', *' also consumes the space after the comma):

parallel --dry-run --colsep ', *' 7zz e {1} {2} :::: joblist.txt

If it looks right, remove --dry-run and run for real.
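For comparison, here is a small Python sketch of the same row-to-command mapping, useful mainly as a dry run. It assumes the two-column "archive, path" layout shown above; row_to_command and dry_run are hypothetical helper names, and shlex.join needs Python 3.8 or newer.

```python
import shlex


def row_to_command(row):
    """Turn 'archive.7z, path/in/archive.jpg' into a 7zz argument list."""
    archive, member = (part.strip() for part in row.split(",", 1))
    return ["7zz", "e", archive, member]


def dry_run(joblist_path):
    """Return the shell-quoted command for every non-blank row; runs nothing."""
    with open(joblist_path, encoding="utf-8") as fh:
        return [shlex.join(row_to_command(line)) for line in fh if line.strip()]
```

Bear in mind that this starts one 7zz process per wanted file, so with many files per archive the per-archive list files from the updated answer are likely the better fit.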


Note that this is done in the terminal/shell and without Python, so it falls under the "another approach" you mentioned.

huangapple
  • Published on January 9, 2023, 08:23:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75052238.html