2020年8月10日 14:36:38go评论82阅读模式

英文:

Identifying page break from byte[]

问题

是否可以从字节流中识别页面分隔符？
如果可以，正确的方法是什么？

编辑
该文件是使用Apache PDFBox创建和存储的。

英文:

I have a use case where I am downloading a large file by writing the bytes to ServletOutputStream, and I would like to return some specified pages without completely loading file in-memory and using a library.

Is it possible to identify the page break from the byte stream?
If yes, what should be the correct approach?

Edit
The file was created and stored using Apache PDFBox.

答案1

得分: 5

无。简单来说，字节流中没有页面分隔符。PDF文件包含多个对象（字体、颜色空间、位图等），这些对象可以在多个页面上使用。在某些PDF文件中，甚至所有页面共享所有资源。因此，在PDF字节数组中没有专门用于单个页面的部分。

此外，这些对象是通过文件中的偏移量通过交叉引用流或表引用的。因此，仅为某些给定页面提供字节流的部分是行不通的，因为偏移量会错误。

从理论上讲，可以确定PDF字节流中未被这些给定页面使用的区域，并传输0。如果使用了某种传输压缩，这些区域将被压缩得很好。但要确定这些区域，您需要使用PDF库，而您不希望这样做。

或者，有一种特殊的方式可以保存针对部分文件访问进行了优化的PDF文件（保存成这种格式的文件称为“线性化”），但这对您也没有帮助，因为PDFBox不提供保存这种类型的PDF文件，而且要利用这种优化需要支持HTTP范围，这在Servlet容器或Servlet本身中很少支持。

总之，我认为您最好的选择是修改生成大文件的过程，以便生成您想要的较小文件，而不是生成大文件或者与大文件一起生成较小文件。

英文:

> Is it possible to identify the page break from the byte stream?

No. For the simple reason that there is no page break in the byte stream.

PDF files contain numerous objects (fonts, colorspaces, bitmaps, ...) which can be used on multiple pages. In some PDFs all pages even share all resources. Thus, you don't have a section in the PDF byte array used for a page and only that page.

Furthermore, those objects are referenced via cross reference streams or tables by their offset in the file. So only serving the regions of the byte stream that are needed for some given pages cannot work to start with as the offsets would be wrong then.

Theoretically one could determine the regions in a PDF byte stream which are not used by those given pages and transfer 0s instead. If you employ some transport compression, these regions would compress quite well. But to determine those regions, you'd need a PDF library which you don't want to do.

Alternatively, there is a special way to save PDF files optimized for partial file access (files so saved are called "linearized"), but that doesn't help you either as PDFBox does not offer saving PDFs like that and because making use of that optimization requires support of HTTP ranges which are seldom supported in servlet containers or servlets themselves.

IMO your best option is to change the production of the large file to produce the smaller files you want instead of (or in addition to) the large file.

答案2

得分: 3

你所询问的内容

拥有PDF文档后，您可以编写代码来创建一个仅包含单个页面的小型PDF文档。一个包含10页的PDF会生成10个单独的PDF文件，总字节数远多于原始PDF。

这令人失望，我不知道有什么简单的分页系统。

关于PDF流式传输

可以生成用于Web流式传输的PDF：

顺序、有序呈现元素
图像数据在使用之前放在前面
最好使用标准字体，它们已经在PDF查看器中存在。
嵌入字体仅传输使用的字符次之，但不适用于PDF表单。
我不知道PDFBox及其线性化PDF的功能，但它可能足以按顺序创建PDF。

当然，页面徽标等只需定义一次。

图像必须有适当的打印解决方案。

矢量图形可以是理想的选择（eps、svg）。

英文:

What you asked

Having the PDF document, you can write code that creates a small PDF document with just one single page. A 10 page PDF would give 10 single PDFs, together much more bytes than the original PFD.

This is disappointing, there is no easy paging system I am aware of.

Around PDF streaming

One can generate a PDF optimized for the web streaming:

sequential, in-order presentation of elements
image data in front before it is used
best use the standard fonts, they are already present with the PDF viewer.
embedded fonts only transmitting the used characters comes second best, but
is not suited for PFD forms.
PDFBox and its capability for linearized PDFs I unaware of, but it might be sufficient to create the PDF in-order.

An of course page logos and such need only be defined once.

Images must have an adequate solution for printing.

Vector graphics can be ideal (eps, svg).

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

识别来自字节数组的分页符

问题

答案1

答案2

Java中同时执行if和else语句。

Camel-case fields are not recognized by API.

无法匹配AES-GCM规范向量与Java加密库。

怎样在调用静态方法时指定泛型类型？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论