PDFBox Newer version extracts data in jumbled order

Question

I am trying to extract text from a particular region of a PDF using PDFTextStripperByArea. The text in the one region I am interested in comes out in jumbled order, while the rest of the page is extracted correctly. This is with PDFBox 2.0.7.

When I try the same with the legacy 1.8.x version, the text is extracted properly.

The region I am extracting appears to use a different font from the rest of the PDF.
I am a little confused about what is going wrong. Is there any way to extract the text correctly with the newer versions? I cannot fall back to the older version because of other dependencies.

What I have tried:

  1. Running the PDF through the latest PDFBox version, 2.0.20: still no luck.
  2. Debugging, which showed that setSortByPosition is doing the swapping in the initial step of processing the page. However, I cannot set it to false, or I lose the newline characters (and the older version works fine with setSortByPosition set to true).

The code snippet:

// Region of interest on the first page (x, y, width, height)
Rectangle region = new Rectangle();
region.setRect(55, 75.80, 160, 100);
PDDocument pdfDoc = PDDocument.load(new File(pdfFilePath));
PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
// Sorting by position is kept on so that line breaks are preserved
stripperByArea.setSortByPosition(true);
stripperByArea.addRegion("CVAM", region);
stripperByArea.extractRegions(pdfDoc.getPages().get(0));
return stripperByArea.getTextForRegion("CVAM");

I am sharing the link to the PDF file in the comments.
Thanks in advance!

Answer 1

Score: 2

The fonts in your PDF have very unrealistic metadata. In particular, their Ascent, Descent, CapHeight, and FontBBox entries contain values claiming that the glyphs are about twice as high as they actually are. Because the visual text lines in your PDF are set quite tightly, a PDF tool that trusts this metadata has to assume that there are not three text lines but one or, more likely, two, with some letters raised a bit and some lowered a bit. Sorting by position therefore produces a hodgepodge.
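To make the effect concrete, here is a minimal toy model in plain Java. It is not PDFBox's actual sorting code; the class `LineGroupingDemo`, its `Glyph` type, the greedy grouping rule, and all coordinates are assumptions made up for illustration. Each glyph is treated as occupying the vertical range from its baseline up to the assumed font height, glyphs whose ranges overlap are put on the same line, and each line is then sorted by x.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of position-based line grouping; NOT PDFBox's actual algorithm. */
class LineGroupingDemo {

    /** One glyph: its character, x position, and baseline y (y grows downward). */
    static final class Glyph {
        final char c;
        final float x;
        final float baseline;

        Glyph(char c, float x, float baseline) {
            this.c = c;
            this.x = x;
            this.baseline = baseline;
        }
    }

    /** Three tightly set lines "AB", "CD", "EF" with a line pitch of 10 units (made-up values). */
    static List<Glyph> sampleGlyphs() {
        List<Glyph> glyphs = new ArrayList<>();
        String[] lines = {"AB", "CD", "EF"};
        for (int row = 0; row < lines.length; row++) {
            for (int col = 0; col < lines[row].length(); col++) {
                glyphs.add(new Glyph(lines[row].charAt(col), col, 10f * (row + 1)));
            }
        }
        return glyphs;
    }

    /**
     * Groups glyphs into lines, treating each glyph as occupying the vertical
     * range [baseline - fontHeight, baseline], then sorts each line by x.
     */
    static String extract(List<Glyph> glyphs, float fontHeight) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.baseline));

        List<List<Glyph>> lines = new ArrayList<>();
        float lineBottom = Float.NEGATIVE_INFINITY;
        for (Glyph g : sorted) {
            float top = g.baseline - fontHeight;
            // Start a new line only if the glyph's assumed top does not reach
            // above the bottom of the current line.
            if (lines.isEmpty() || top >= lineBottom) {
                lines.add(new ArrayList<>());
            }
            lines.get(lines.size() - 1).add(g);
            lineBottom = Math.max(lineBottom, g.baseline);
        }

        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            // List.sort is stable, so glyphs with equal x keep their top-to-bottom order.
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            for (Glyph g : line) {
                out.append(g.c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(sampleGlyphs(), 14f)); // overstated font height
        System.out.println(extract(sampleGlyphs(), 5f));  // height smaller than the line pitch
    }
}
```

With the overstated height of 14 units, all three lines overlap and collapse into the single interleaved line `ACEBDF`. With a height of 5 units, below the line pitch, the model yields `AB`, `CD`, and `EF` on separate lines.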

You can check that PDFBox is not the only tool that has issues with these fonts. For example, if you open the PDF in Adobe Reader and click into the text, you get a giant cursor bar, and copying and pasting the address results in

>1D4A0N0I EHL IDD DEPNO WELALKLES DR MT PLEASANT SC 29464-9473

Nonetheless, following @Tilman's remark that 2.0.21 will make it possible to supply your own font height calculation, I used that feature in the current PDFBox development head to return a constant, low font height:

PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea() {
    @Override
    protected float computeFontHeight(PDFont font) throws IOException {
        return .5f;
    }
};
stripperByArea.setSortByPosition(false);
stripperByArea.addRegion("CVAM", region);
stripperByArea.extractRegions(pdfDoc.getPages().get(0));
String text = stripperByArea.getTextForRegion("CVAM");

(from ExtractText test testCustomFontHeightYOYO)

With SortByPosition set to either true or false, the result now is:

>DANIEL D POWELL<br>
>1400 HIDDEN LAKES DR<br>
>MT PLEASANT SC 29464-9473

huangapple
  • Posted on July 23, 2020 at 13:32:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63047485.html