Missing descendant font dictionary
Question
<details>
<summary>English:</summary>
Apologies in advance if I am breaking some process here.
I am aware that there is a question about exactly the same problem,
https://stackoverflow.com/questions/60317866/pdfbox-returns-missing-descendant-font-dictionary, but that thread ends abruptly because its author was unable to provide the details. Because of my low reputation, I could not continue that thread either.
It describes the problem of the missing composite font very well. I would like to know whether there is some way to fix it, since the PDF opens fine in our browser but we are not able to process it programmatically.
I have tried several versions, including the latest, 2.0.21.
I will share the PDF.
Looking forward to your reply,
@mkl, @Tilman Hausherr
Please let me know if you need more details.
My code, which tries to convert the PDF to images:
    PDDocument document = PDDocument.load(new File(pdfPath + "//" + fileName));
    PDFRenderer pdfRenderer = new PDFRenderer(document);
    for (int page = 0; page < document.getNumberOfPages(); ++page) {
        BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
    }
</details>
# Answer 1
**Score**: 2
<details>
<summary>English:</summary>
Having downloaded the file when the link was available, I analyzed it.
Adobe Acrobat Reader shows error messages when opening the document. iText RUPS reports cross reference issues. First impression, therefore: That PDF is broken.
Nonetheless, I looked closer, but the result of that closer look was no better...
According to the cross references and trailers, the PDF should contain 58 indirect objects with IDs 1 through 58. It turned out, though, that objects 32 through 49 are missing, although most of them are referenced, some as descendant fonts. This explains why PDFBox reports missing descendant fonts.
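
A rough way to look for this kind of damage without a PDF library is to scan the raw bytes for indirect references (`N G R`) whose object number never appears in an object header (`N G obj`). The following is only a heuristic sketch, not the procedure used for the analysis above: the class and method names are made up for illustration, it assumes the objects are stored uncompressed (no object streams), and stream data that happens to look like PDF syntax can fool it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox or iText.
final class DanglingReferenceCheck {
    // "N G obj" starts an indirect object; "N G R" is a reference to one.
    private static final Pattern DEFINITION =
            Pattern.compile("\\b(\\d{1,9})\\s+\\d{1,5}\\s+obj\\b");
    private static final Pattern REFERENCE =
            Pattern.compile("\\b(\\d{1,9})\\s+\\d{1,5}\\s+R\\b");

    /** Returns the object numbers that are referenced but never defined. */
    static SortedSet<Integer> danglingReferences(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);

        Set<Integer> defined = new HashSet<>();
        Matcher definition = DEFINITION.matcher(text);
        while (definition.find()) {
            defined.add(Integer.parseInt(definition.group(1)));
        }

        SortedSet<Integer> dangling = new TreeSet<>();
        Matcher reference = REFERENCE.matcher(text);
        while (reference.find()) {
            int number = Integer.parseInt(reference.group(1));
            if (!defined.contains(number)) {
                dangling.add(number);
            }
        }
        return dangling;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DanglingReferenceCheck <file.pdf>");
            return;
        }
        System.out.println(danglingReferences(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

For a file damaged in the way described here, such a scan would be expected to list object numbers in the missing range, but a proper PDF inspection tool remains the more reliable source.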
Furthermore, objects 50 through 57 and 1 through 10 are not at the locations they should be at according to the cross reference tables. Also, the second cross reference table is at the wrong location, and the file length is incorrect according to the linearization dictionary.
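
The offset mismatch can be checked mechanically as well: every in-use entry of a classic cross reference table gives a byte offset, and the bytes at that offset should start with the matching `N G obj` header. The sketch below illustrates that check under stated assumptions: the names are hypothetical, it handles only classic `xref` tables (not cross reference streams), and it assumes the keyword `xref` does not occur inside stream data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox or iText.
final class XrefOffsetCheck {
    // Start of a classic cross reference table; the lookbehind skips "startxref".
    private static final Pattern TABLE = Pattern.compile("(?<!start)xref\\s+");
    // Subsection header: first object number and entry count on their own line.
    private static final Pattern HEADER =
            Pattern.compile("(\\d{1,9}) (\\d{1,9}) *[\\r\\n]+");
    // Entry: 10-digit byte offset, 5-digit generation, 'n' (in use) or 'f' (free).
    private static final Pattern ENTRY =
            Pattern.compile("(\\d{10}) (\\d{5}) ([nf])\\s+");
    // What should be found at the offset of an in-use entry.
    private static final Pattern OBJECT =
            Pattern.compile("(\\d{1,9})\\s+(\\d{1,5})\\s+obj\\b");

    /** Returns the numbers of in-use objects that are not at their listed offset. */
    static List<Integer> misplacedObjects(byte[] pdf) {
        // ISO-8859-1 keeps char indices identical to byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Integer> misplaced = new ArrayList<>();
        Matcher table = TABLE.matcher(text);
        Matcher header = HEADER.matcher(text);
        Matcher entry = ENTRY.matcher(text);
        while (table.find()) {
            int pos = table.end();
            // A table may consist of several subsections; stop at "trailer".
            while (header.region(pos, text.length()).lookingAt()) {
                int number = Integer.parseInt(header.group(1));
                int count = Integer.parseInt(header.group(2));
                pos = header.end();
                for (int i = 0; i < count
                        && entry.region(pos, text.length()).lookingAt(); i++, number++) {
                    pos = entry.end();
                    if (entry.group(3).equals("n")
                            && !objectStartsAt(text, Long.parseLong(entry.group(1)),
                                    number, Integer.parseInt(entry.group(2)))) {
                        misplaced.add(number);
                    }
                }
            }
        }
        return misplaced;
    }

    private static boolean objectStartsAt(String text, long offset, int number, int generation) {
        if (offset >= text.length()) {
            return false; // offset points past the end of the file
        }
        Matcher m = OBJECT.matcher(text).region((int) offset, text.length());
        return m.lookingAt()
                && Integer.parseInt(m.group(1)) == number
                && Integer.parseInt(m.group(2)) == generation;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: XrefOffsetCheck <file.pdf>");
            return;
        }
        System.out.println(misplacedObjects(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

An object that is missing entirely is also reported as misplaced by this check, because nothing matching its header is found at the listed offset.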
The way this is broken leaves the impression that the file is a mix of two slightly different versions of the same file; as if a download of the file was attempted but interrupted at some point and continued from a new version of the file; or as if some PDF processor somehow changed the file and tried to save the changed copy into the same file but was interrupted.
Summarized: The PDF is utterly broken.
If a PDF processor tries to repair it, you cannot be sure which version of the file the information you get comes from; different PDF processors (if they can make sense of it at all) are likely to interpret the file differently.
If possible, you should reject the file and request a non-broken version of it.
If that is not possible, copy the data from the viewer that appears to repair it best, manually check the copy for accuracy, and then check the whole extracted data for plausibility against the other information you have on the accounts in question. A little prayer won't hurt either.
</details>