
How to merge 2 large PDF files in .NET Core without large memory utilization



When combining PDF files via code, every implementation I've seen requires all files to be opened before they are combined into one. This works just fine for small files, but with large enough files we can exceed the resources available to an Azure Function, causing the function to restart.

I am wondering if there is a way to simply append one file to another from the outside, without having to open them. As an analogy, consider this command that can be executed on a command line: "copy *.txt newfile.txt". It takes all files ending in .txt and creates a new file that combines them all. If I understand correctly, it does not need to load the entire contents of any individual file into memory, because it is merely doing a file system copy, joining the files where one ends and the next begins. Is there a similar approach that can be taken with PDFs?

We've tried various merge implementations where you loop over the pages of each file and add each page to a renderer, one by one. With this approach the memory usage is directly proportional to the size of the files.


Score: 4


Actually, this is possible using the incremental update feature: you can modify a PDF document without touching its original content by appending just the changes at the end of the file.

Here is the Incremental Updates specification.
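To make the idea concrete, here is a rough sketch of what an appended incremental-update section could look like when written by hand. It is not how GemBox.Pdf does it internally; the class and parameter names are made up for illustration, and it assumes the original file uses a classic xref table (not a cross-reference stream) and that the object body is plain ASCII, so character counts equal byte counts.

```csharp
using System;
using System.Text;

public static class IncrementalUpdateSketch
{
    // Builds the text that would be appended to an existing PDF to add
    // (or replace) a single object. All names here are hypothetical.
    public static string BuildAppendedSection(
        long originalLength,      // size of the existing file in bytes
        long previousStartXref,   // startxref value found in the existing file
        int objectNumber,         // number of the object being added/replaced
        string objectBody,        // ASCII-only object content, e.g. a dictionary
        int newSize,              // highest object number in the file + 1
        int rootObjectNumber)     // object number of the document catalog
    {
        var sb = new StringBuilder();
        sb.Append('\n'); // make sure the new object starts on its own line

        // Offsets are absolute positions in the final file.
        long objectOffset = originalLength + sb.Length;
        sb.Append($"{objectNumber} 0 obj\n{objectBody}\nendobj\n");

        long xrefOffset = originalLength + sb.Length;
        sb.Append("xref\n");
        sb.Append($"{objectNumber} 1\n");              // subsection: one entry
        sb.Append($"{objectOffset:D10} 00000 n \n");   // 20-byte xref entry

        // /Prev chains this xref section to the one in the original file.
        sb.Append("trailer\n");
        sb.Append($"<< /Size {newSize} /Root {rootObjectNumber} 0 R /Prev {previousStartXref} >>\n");
        sb.Append("startxref\n");
        sb.Append($"{xrefOffset}\n");
        sb.Append("%%EOF\n");
        return sb.ToString();
    }
}
```

The original bytes are never rewritten; a reader starts from the last startxref, finds the new section, and follows /Prev back to the older xref data for everything that was not changed.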

Also, here is an example of merging PDF files with incremental updates using GemBox.Pdf:
Merge a large number of PDF files

The main point is to use the parameterless PdfDocument.Save() overload and then call PdfDocument.Unload(). That way only one source file is loaded at any given time.

    using (var document = new PdfDocument())
    {
        document.Save("Merged Files.pdf");

        foreach (var file in files)
        {
            using (var source = PdfDocument.Load(file))
                document.Pages.Kids.AddClone(source.Pages);

            // Save the new pages.
            document.Save();

            // Clear previously parsed pages and thus free the memory.
            document.Unload();
        }
    }

We can go even further with this and unload any parsed objects after X pages.

    int counter = 0;
    using (var destination = new PdfDocument())
    {
        destination.Save("Merged Files.pdf");

        foreach (var file in files)
            using (var source = PdfDocument.Load(file))
                for (int index = 0, count = source.Pages.Count; index < count; counter++, index++)
                {
                    destination.Pages.AddClone(source.Pages[index]);

                    // Unload after 100 pages.
                    if (counter % 100 == 0)
                    {
                        destination.Save();
                        destination.Unload();
                        source.Unload();
                    }
                }

        destination.Save();
    }


Score: -1

TL;DR: Unfortunately, no.

Long version: PDF is a file format that is structured as follows:

    [File header]
    PDF objects
    [XRef Table]
    [Trailer object]
    [startxref offset]

The file header is a few simple bytes indicating that it is a PDF file, followed by a few bytes with values of 128 or higher to signal that the file contains binary data.
All PDF objects look like this:

    1 0 obj
    [object]
    endobj

Here 1 is the object number (the 0 after it is the generation number), so object 2 would start with 2 0 obj, and so on.

The XRef table is a list of objects and their byte offset in the file.
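As an illustration, a single entry in a classic xref table is a fixed-width line such as "0000000017 00000 n": a 10-digit byte offset, a 5-digit generation number, and n (in use) or f (free). A minimal sketch of parsing one entry could look like this; the class and method names are made up for this example.

```csharp
using System;
using System.Globalization;

public static class XRefEntryParser
{
    // Parses one classic xref entry, e.g. "0000000017 00000 n \n".
    public static (long Offset, int Generation, bool InUse) Parse(string line)
    {
        // Trim removes the trailing space and end-of-line characters.
        string[] parts = line.Trim().Split(' ');
        if (parts.Length != 3)
            throw new FormatException("Expected '<offset> <generation> <n|f>'.");

        long offset = long.Parse(parts[0], CultureInfo.InvariantCulture);
        int generation = int.Parse(parts[1], CultureInfo.InvariantCulture);
        bool inUse = parts[2] == "n";
        return (offset, generation, inUse);
    }
}
```

Note that the offset is absolute, counted from the start of the file, which is exactly why the table stops being valid once bytes are placed in front of it.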

Objects can reference other objects. For example, the trailer object is a dictionary that contains a reference to the root element. The root element contains a reference to the pages object. The pages object contains one or more (might be an array) references to page object(s).

So when reading a PDF file, you start at the end and read the startxref value. That value points to the start of the xref table, and the table tells you the byte offset of each object, starting at object 1 0 obj.
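A minimal sketch of that first step, assuming a seekable stream and that the startxref keyword sits within the last 1024 bytes of the file (the tail size and all names are assumptions for this example):

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

public static class StartXrefReader
{
    // Reads only the tail of the stream and returns the number after "startxref".
    public static long ReadStartXref(Stream pdf, int tailSize = 1024)
    {
        int toRead = (int)Math.Min(tailSize, pdf.Length);
        pdf.Seek(-toRead, SeekOrigin.End);

        var buffer = new byte[toRead];
        int read = 0;
        while (read < toRead)
        {
            int n = pdf.Read(buffer, read, toRead - read);
            if (n == 0) break; // unexpected end of stream
            read += n;
        }

        // The keyword and the number are ASCII; other bytes become '?'.
        string tail = Encoding.ASCII.GetString(buffer, 0, read);
        int keyword = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0)
            throw new InvalidDataException("startxref keyword not found.");

        int i = keyword + "startxref".Length;
        while (i < tail.Length && char.IsWhiteSpace(tail[i])) i++;
        int start = i;
        while (i < tail.Length && char.IsDigit(tail[i])) i++;
        if (i == start)
            throw new InvalidDataException("No offset after startxref.");

        return long.Parse(tail.Substring(start, i - start), CultureInfo.InvariantCulture);
    }
}
```

Only the tail of the file is read here, so this step costs almost no memory regardless of file size. The memory pressure comes later, when objects are parsed and cloned.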

So if you concatenate two files and then read the result, you start by reading the last startxref value, which belongs to the second file. That offset is now wrong, because it is off by the length of the first file. Even if you correct it, every entry in the second file's xref table is off by the same amount. And even if you correct those too, a reader will still only see the second PDF. To actually combine the pages, you need to renumber all the objects of the second file (and every reference in the file that uses those object numbers) to [last object number from file 1 + n]. Then you need to overwrite the previous pages object so that it also includes all the pages of the second file.
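The offset problem can be shown with toy data. The strings below are not valid PDFs; they only mimic the bookkeeping (a body, an "xref" marker, and a startxref value that records where that marker begins). All names are invented for this sketch.

```csharp
using System;

public static class ConcatOffsetDemo
{
    // Builds a toy "file" whose startxref records the position of "xref".
    public static string BuildToyFile(string body)
    {
        int xrefOffset = body.Length;
        return body + "xref\n" + "startxref\n" + xrefOffset + "\n%%EOF\n";
    }

    // Returns the number that follows the last "startxref" line.
    public static int LastStartXref(string file)
    {
        const string keyword = "startxref\n";
        int start = file.LastIndexOf(keyword, StringComparison.Ordinal) + keyword.Length;
        int end = file.IndexOf('\n', start);
        return int.Parse(file.Substring(start, end - start));
    }

    // True if the text at the given offset really is the "xref" marker.
    public static bool PointsAtXref(string file, int offset) =>
        offset >= 0 && offset + 4 <= file.Length &&
        string.CompareOrdinal(file, offset, "xref", 0, 4) == 0;
}
```

With `a = BuildToyFile("%PDF-1.4\nAAAAAAAAAAAAAAAA\n")` and `b = BuildToyFile("%PDF-1.4\nBBBB\n")`, the value from `LastStartXref(a + b)` points at the marker inside `b` on its own, but not inside `a + b`; it only becomes valid again after adding `a.Length`.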

That last step is not the hardest part, because you can just create a new page tree object that references the page tree objects of both files. You do need to update/overwrite the Catalog object so that it references the newly created page tree as the new root of the tree.
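A sketch of what such a combined page tree node might look like when generated as text. The names are hypothetical, and the sketch leaves out one thing you would also have to handle: each existing page tree root would need a /Parent entry pointing at the new node, since every page tree node except the root requires one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageTreeSketch
{
    // Builds a new /Pages node whose kids are the existing page tree roots.
    // childTrees: object number and page count of each existing root.
    public static string BuildCombinedPagesObject(
        int newObjectNumber,
        IReadOnlyList<(int ObjectNumber, int PageCount)> childTrees)
    {
        // Indirect references look like "<object number> <generation> R".
        string kids = string.Join(" ", childTrees.Select(t => $"{t.ObjectNumber} 0 R"));

        // /Count is the total number of pages below this node.
        int total = childTrees.Sum(t => t.PageCount);

        return $"{newObjectNumber} 0 obj\n" +
               $"<< /Type /Pages /Kids [{kids}] /Count {total} >>\n" +
               "endobj\n";
    }
}
```

The Catalog's /Pages entry would then be changed to reference `newObjectNumber`.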

Then you need to create a new xref table that covers all the objects from both files, listed in object-number order with their byte offsets.
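Writing the table itself is the easy part once the offsets are known. A minimal sketch, assuming objects are numbered 1..N without gaps and all have generation 0 (names are made up for this example):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class XRefTableWriter
{
    // offsets[i] is the byte offset of object number i + 1 in the output file.
    public static string Build(IReadOnlyList<long> offsets)
    {
        var sb = new StringBuilder();
        sb.Append("xref\n");

        // One subsection starting at object 0, with N + 1 entries.
        sb.Append($"0 {offsets.Count + 1}\n");

        // Object 0 is always the head of the free list.
        sb.Append("0000000000 65535 f \n");

        // Each entry is exactly 20 bytes: offset, generation, flag, EOL.
        foreach (long offset in offsets)
            sb.Append($"{offset:D10} 00000 n \n");

        return sb.ToString();
    }
}
```

The hard part is everything before this: getting the renumbered objects and their final offsets without holding both documents fully in memory.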

  • This article was published on March 20, 2023 at 23:06:05
  • When reposting, please be sure to keep the link to this article: https://go.coder-hub.com/75791984.html



