Identifying page breaks from a byte[]

Question

I have a use case where I am downloading a large file by writing its bytes to a ServletOutputStream, and I would like to return only some specified pages, without loading the complete file into memory and without using a library.

  1. Is it possible to identify the page break from the byte stream?
  2. If yes, what should be the correct approach?

Edit
The file was created and stored using Apache PDFBox.

Answer 1

Score: 5

> Is it possible to identify the page break from the byte stream?

No, for the simple reason that there is no page break in the byte stream.

PDF files contain numerous objects (fonts, colorspaces, bitmaps, ...) which can be used on multiple pages. In some PDFs all pages even share all resources. Thus, there is no section of the PDF byte array that is used by one page and only that page.

Furthermore, those objects are referenced via cross-reference streams or tables by their offset in the file. So serving only the regions of the byte stream that are needed for some given pages cannot work in the first place, as the offsets would then be wrong.
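To illustrate the offset problem, here is a minimal sketch in plain Java (no PDF library). It reads the number that follows the last startxref keyword, which is where a reader expects to find the cross-reference data. The class name, the method name and the tiny sample bytes are made up for illustration; the sample only imitates the shape of a PDF tail and is not a valid PDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StartXrefSketch {

    /**
     * Returns the byte offset written after the last "startxref" keyword,
     * or -1 if the keyword is missing or not followed by a number.
     */
    static long findStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        return i == start ? -1 : Long.parseLong(text.substring(start, i));
    }

    public static void main(String[] args) {
        // Hypothetical, heavily shortened sample: only the shape of a PDF, not a valid one.
        String head = "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n";
        String tail = "xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
                + "trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n" + head.length() + "\n%%EOF\n";
        byte[] whole = (head + tail).getBytes(StandardCharsets.ISO_8859_1);

        int xref = (int) findStartXref(whole);
        System.out.println("recorded offset: " + xref);
        System.out.println("bytes at that offset in the whole file: "
                + new String(whole, xref, 4, StandardCharsets.ISO_8859_1));

        // Drop the first 9 bytes, as if an earlier region had been left out of the response.
        // The recorded offset is unchanged, but the data it pointed to has moved.
        byte[] cut = Arrays.copyOfRange(whole, 9, whole.length);
        System.out.println("bytes at that offset after cutting: "
                + new String(cut, xref, 4, StandardCharsets.ISO_8859_1));
    }
}
```

The same applies to every object offset listed in the cross-reference data itself, so leaving out any region in front of a referenced object invalidates the references.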

Theoretically, one could determine the regions of a PDF byte stream that are not used by the given pages and transfer 0s in their place. If you employ some transport compression, these regions would compress quite well. But to determine those regions, you'd need a PDF library, which you don't want to use.
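The zero-filling idea can be sketched as follows, assuming the byte ranges to keep are already known (finding them is exactly the part that needs a PDF library). The class, the method names and the ranges are hypothetical, and random bytes stand in for already-compressed PDF content; gzip stands in for whatever transport compression is in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

final class BlankRegionsSketch {

    /**
     * Returns a copy of data in which every byte outside the kept ranges is 0.
     * Each range is {fromInclusive, toExclusive}. The length is unchanged,
     * so all offsets into the kept regions stay valid.
     */
    static byte[] blankExcept(byte[] data, int[][] keep) {
        byte[] out = new byte[data.length]; // Java arrays start out zero-filled
        for (int[] range : keep) {
            int from = Math.max(0, range[0]);
            int to = Math.min(data.length, range[1]);
            if (from < to) {
                System.arraycopy(data, from, out, from, to - from);
            }
        }
        return out;
    }

    /** Size of the data after gzip compression, as a stand-in for transport compression. */
    static int gzipSize(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        byte[] file = new byte[1_000_000];
        new Random(42).nextBytes(file); // stand-in for incompressible PDF streams

        // Hypothetical regions needed for the requested pages.
        int[][] keep = { {0, 1_000}, {400_000, 450_000}, {990_000, 1_000_000} };
        byte[] blanked = blankExcept(file, keep);

        System.out.println("gzip size of original: " + gzipSize(file));
        System.out.println("gzip size of blanked:  " + gzipSize(blanked));
    }
}
```

Because the blanked copy has the same length as the original, the cross-reference offsets remain correct for the regions that were kept.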

Alternatively, there is a special way to save PDF files that is optimized for partial file access (files saved this way are called "linearized"). That doesn't help you either, though: PDFBox does not offer saving PDFs like that, and making use of the optimization requires support for HTTP range requests, which is seldom available in servlet containers or in servlets themselves.
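For a sense of what supporting HTTP ranges would involve on the server side, here is a rough sketch that parses a single-range Range header value in plain Java. The class and method names are hypothetical, it handles only the single-range bytes forms, and it is not taken from any servlet container.

```java
final class ByteRangeSketch {

    /**
     * Parses a single-range header value such as "bytes=0-1023", "bytes=500-" or "bytes=-200".
     * Returns {first, last} as inclusive byte positions, or null if the header is missing,
     * malformed, lists several ranges, or cannot be satisfied for the given file length.
     */
    static long[] parse(String header, long fileLength) {
        if (header == null || !header.startsWith("bytes=") || header.contains(",")) {
            return null;
        }
        String spec = header.substring("bytes=".length()).trim();
        int dash = spec.indexOf('-');
        if (dash < 0) {
            return null;
        }
        String firstText = spec.substring(0, dash).trim();
        String lastText = spec.substring(dash + 1).trim();
        try {
            long first;
            long last;
            if (firstText.isEmpty()) {
                // Suffix form: the last N bytes of the file.
                if (lastText.isEmpty()) {
                    return null;
                }
                long suffix = Long.parseLong(lastText);
                if (suffix <= 0) {
                    return null;
                }
                first = Math.max(0, fileLength - suffix);
                last = fileLength - 1;
            } else {
                first = Long.parseLong(firstText);
                last = lastText.isEmpty()
                        ? fileLength - 1
                        : Math.min(Long.parseLong(lastText), fileLength - 1);
            }
            if (first < 0 || first > last) {
                return null;
            }
            return new long[] { first, last };
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        long fileLength = 5_000;
        long[] range = parse("bytes=1000-1999", fileLength);
        if (range != null) {
            // A server answering with 206 Partial Content would describe the part like this.
            System.out.println("Content-Range: bytes " + range[0] + "-" + range[1] + "/" + fileLength);
        }
    }
}
```

A servlet would additionally have to answer with status 206, set the Content-Range header and write only those bytes, and the viewer on the other end would have to know which ranges to ask for, which is what linearization is meant to enable.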


IMO your best option is to change the production of the large file to produce the smaller files you want instead of (or in addition to) the large file.

Answer 2

Score: 3

What you asked

Given the PDF document, you can write code that creates a small PDF document containing just a single page. A 10-page PDF would yield 10 single-page PDFs, which together take up many more bytes than the original PDF.

This is disappointing, but there is no easy paging system that I am aware of.

Around PDF streaming

One can generate a PDF optimized for web streaming:

  • sequential, in-order presentation of elements
  • image data placed up front, before it is used
  • preferably use the standard fonts, as they are already present in the PDF viewer.
    Embedded fonts that transmit only the characters used come second best, but
    are not suited for PDF forms.
  • I am unaware of PDFBox's capabilities for linearized PDFs, but it might be sufficient to create the PDF in order.

And of course page logos and the like need only be defined once.

Images must have an adequate resolution for printing.

Vector graphics can be ideal (eps, svg).

huangapple
  • Published on August 10, 2020 at 14:36:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63335244.html