How to merge 2 large pdf files in .Net Core without large memory utilization

Question


When combining PDF files via code, every implementation I've seen requires all files to be opened before being combined into one. This works just fine for small files, but with large enough files we can exceed the resources available to an Azure Function, causing the function to restart.

I am wondering if there is a way to simply append one file to another from the outside, without having to open them. As an analogy, consider this command that can be executed on a command line: "copy *.txt newfile.txt". It takes all files ending in .txt and creates a new file that combines them. If I understand correctly, I don't believe it loads the entire contents of any individual file into memory; it is merely doing a file-system copy, joining the files where one ends and the next begins. Is there a similar approach that can be taken with PDFs?
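For what it's worth, the file-system-level concatenation that `copy` performs can be sketched in a few lines (Python here purely for illustration; the helper name is made up). Because it streams fixed-size chunks, memory use stays flat no matter how large the inputs are — the catch is that the result is only valid for formats like plain text that have no global structure:

```python
import shutil

def concat_files(source_paths, dest_path, chunk_size=64 * 1024):
    """Append the raw bytes of each source file to dest_path.

    Only chunk_size bytes are held in memory at any moment,
    regardless of how large the input files are.
    """
    with open(dest_path, "wb") as dest:
        for src_path in source_paths:
            with open(src_path, "rb") as src:
                shutil.copyfileobj(src, dest, chunk_size)
```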

We've tried various merge implementations where you load the pages of each file in a loop and add each page to a renderer, one by one. Obviously, with this approach the memory usage is directly proportional to the size of the files.

Answer 1

Score: 4


Actually, this is possible using the incremental update feature: you can modify a PDF document without affecting its original content by appending just the changes at the end of the file.

Here is the Incremental Updates specification.
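To make the idea concrete, here is a rough, hypothetical sketch (my own helper, not GemBox's implementation) of what an incremental update appends: the original bytes stay untouched, and the new objects, a new xref section, and a trailer whose /Prev points back at the previous xref go on the end. For brevity it hardcodes /Root 1 0 R and writes a bare, simplified xref section:

```python
def append_incremental_update(pdf_bytes, new_objects):
    """Sketch of an incremental update (illustrative, not a full PDF
    writer). new_objects maps object number -> serialized object bytes.
    Assumes the catalog is object 1, hence the hardcoded /Root 1 0 R.
    """
    # Locate the previous startxref value at the end of the file.
    tail = pdf_bytes.rstrip()
    sx = tail.rindex(b"startxref")
    prev_xref = int(tail[sx + len(b"startxref"):].split(b"%%EOF")[0])

    out = bytearray(pdf_bytes)  # original content is never modified
    offsets = {}
    for num, obj in sorted(new_objects.items()):
        offsets[num] = len(out)  # byte offset where this object lands
        out += obj

    # Append a new xref section covering only the new objects.
    new_xref_pos = len(out)
    out += b"xref\n"
    for num, off in sorted(offsets.items()):
        out += b"%d 1\n%010d 00000 n \n" % (num, off)

    # New trailer chains back to the previous xref via /Prev.
    out += b"trailer\n<< /Size %d /Root 1 0 R /Prev %d >>\n" % (
        max(new_objects) + 1, prev_xref)
    out += b"startxref\n%d\n%%%%EOF\n" % new_xref_pos
    return bytes(out)
```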

Also, here is an example of merging PDF files with incremental updates using GemBox.Pdf: Merge a large number of PDF files.

The main point is to use the PdfDocument.Save() overload that takes no parameters, followed by the PdfDocument.Unload() method. That way, only one file is loaded at any given time.

using (var document = new PdfDocument())
{
    document.Save("Merged Files.pdf");

    foreach (var file in files)
    {
        using (var source = PdfDocument.Load(file))
            document.Pages.Kids.AddClone(source.Pages);

        // Save the new pages.
        document.Save();

        // Clear previously parsed pages and thus free the memory.
        document.Unload();
    }
}

We can go even further with this and unload any parsed objects after every X pages.

int counter = 0;
using (var destination = new PdfDocument())
{
    destination.Save("Merged Files.pdf");

    foreach (var file in files)
        using (var source = PdfDocument.Load(file))
            for (int index = 0, count = source.Pages.Count; index < count; counter++, index++)
            {
                destination.Pages.AddClone(source.Pages[index]);
                // Unload after 100 pages.
                if (counter % 100 == 0)
                {
                    destination.Save();
                    destination.Unload();
                    source.Unload();
                }
            }

    destination.Save();
}

Answer 2

Score: -1


TL;DR: Unfortunately, no.

Long version: PDF is a file format structured as follows:

[File header]
PDF objects
[XRef Table]
[Trailer object]
[startxref offset]
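As an illustration of that layout, a deliberately minimal single-page file (no content streams or fonts; Python used just to do the byte bookkeeping) could be assembled like this:

```python
def build_minimal_pdf():
    """Assemble a bare-bones single-page PDF to show the layout:
    header, objects, xref table, trailer, final startxref offset."""
    header = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"  # binary-marker comment
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n",
        b"3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>\nendobj\n",
    ]
    body, offsets, pos = b"", [], len(header)
    for obj in objects:
        offsets.append(pos)  # byte offset recorded for the xref table
        body += obj
        pos += len(obj)
    # xref: one free entry (object 0) plus one entry per object above.
    xref = b"xref\n0 4\n0000000000 65535 f \n"
    for off in offsets:
        xref += b"%010d 00000 n \n" % off
    # pos is now exactly the byte offset of the xref keyword.
    trailer = b"trailer\n<< /Size 4 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % pos
    return header + body + xref + trailer
```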

The file header is a few simple bytes indicating that it is a PDF file, followed by a comment containing a few bytes with values above 127 to mark the file as binary.

All PDF objects look like this:

1 0 obj
[object]
endobj

Where 1 is the object number; object 2 would start with 2 0 obj, and so on.

The XRef table is a list of objects and their byte offset in the file.

Objects can reference other objects. For example, the trailer object is a dictionary that contains a reference to the root element. The root element contains a reference to the pages object. The pages object contains one or more (might be an array) references to page object(s).

So when reading a PDF file, you start at the end and read the startxref value. That points to the start of the xref table, which tells you at what byte offset each object is located, starting at object 1 0 obj.

So if you were to concatenate 2 files, when reading the result you would read the startxref value, but the offset would be incorrect, because it is off by the length of the first file. Even after correcting for that, the entire xref table is off by the first file's length. And even after correcting that, you can still only read the last PDF. To actually concatenate the pages, you would need to edit all the object numbers (and every reference in the file that uses them) to become [last object number from file 1 + n]. Then you would need to overwrite the previous pages object to also include all the pages of the second file.
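That failure can be demonstrated on crafted byte strings (hypothetical helpers; locate_xref assumes a classic trailer at the very end of the file rather than a cross-reference stream):

```python
def locate_xref(data):
    """Read the trailing 'startxref <offset> %%EOF' and return the offset."""
    tail = data.rstrip()
    i = tail.rindex(b"startxref")
    return int(tail[i + len(b"startxref"):].split(b"%%EOF")[0])

def naive_concat(pdf_a, pdf_b):
    """Glue two PDFs together and report where the stored startxref
    points versus where pdf_b's xref table actually ended up."""
    combined = pdf_a + pdf_b
    stored = locate_xref(combined)            # value copied verbatim from pdf_b
    actual = len(pdf_a) + locate_xref(pdf_b)  # shifted by the first file's length
    return combined, stored, actual
```

Running this on two small stand-in files shows the stored offset landing len(pdf_a) bytes short of the real table — exactly the off-by-the-first-file's-length problem described above.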

That part is not the hardest, because you can just create a new page tree object that references both existing page tree object numbers. You do, however, need to update/overwrite the Catalog object to reference the newly created page tree as the new root of the tree.

Finally, you need to create a new xref table that covers all the objects from both files, listing them in the correct order with their byte offsets.
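For a taste of that bookkeeping, the renumbering step alone might be sketched as a naive regex pass (a hypothetical helper: a real implementation would also have to skip streams and strings that happen to contain look-alike bytes, and handle generation numbers other than 0):

```python
import re

def shift_object_numbers(body, delta):
    """Relabel 'N 0 obj' headers and 'N 0 R' indirect references by
    delta, so a second file's objects can follow the first file's."""
    body = re.sub(
        rb"(\d+) 0 obj",
        lambda m: b"%d 0 obj" % (int(m.group(1)) + delta),
        body,
    )
    body = re.sub(
        rb"(\d+) 0 R\b",
        lambda m: b"%d 0 R" % (int(m.group(1)) + delta),
        body,
    )
    return body
```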

Published 2023-03-20 23:06:05. Original link: https://go.coder-hub.com/75791984.html