PDF Box getXObjectNames() does not recognize bar code on my PDF, however it does recognize it on a PDF file I got off the internet


Question

I am trying to fetch and read bar codes from my PDF using getXObjectNames() of PDResources.

  • My code is very similar to this: https://issues.apache.org/jira/browse/PDFBOX-2124

If you look at the JIRA item above, you will see a PDF file attached to it.
When I run the code on that PDF file, I get the desired output (i.e. the bar code type is printed).

However, when I run it on my PDF, it does not recognize the bar code in it. (I have checked that the bar code is in fact an image and not text.)

It may sound weird, but it did work on my PDF once, and I haven't made any changes since then, yet it definitely does not work now. (I cannot share the PDF for some reason.)

Has anyone faced a similar issue?

Also this is my first question on Stack Overflow. Please tell me if I am wrong anywhere.

Here is a link to that pdf:
https://drive.google.com/file/d/1PzVApIePg4U9XL399BpAd2oeY6Q2tLEB/view?usp=drivesdk

Answer 1

Score: 2

In General

As you don't show your code but only describe it as very similar to that in PDFBOX-2124, and as you say you cannot share the PDF for some reason, I only have that code to analyze. Thus, I cannot tell what the issue really is; I can merely enumerate some possible problems.

First of all, that code only inspects the immediate resources of the given page for bitmap images:

PDResources pdResources = pdPage.getResources();
Map<String, PDXObject> xobjects = (Map<String, PDXObject>) pdResources.getXObjects();
if (xobjects != null)
{
    for (String key : xobjects.keySet())
    {
        PDXObject xobject = xobjects.get(key);
        if (xobject instanceof PDImageXObject)
        {
            PDImageXObject imageObject = (PDImageXObject) xobject;
            String suffix = imageObject.getSuffix();
            if (suffix != null)
            {
                BufferedImage image = imageObject.getImage();
                extractBarcodeArrayByAreas(image, this.maximumBlankPixelDelimiterCount);
            }
        }
    }
}

(PDFBOX-2124 PdPageBarcodeScanner method scan)

Bitmap images can also be stored elsewhere, e.g.

  • in the separate resources of form xobjects, patterns, or Type 3 fonts used on the page; to find them one has to inspect other page resources, too, even recursively as the image might be a resource of a pattern used in a form xobject used on the page;
  • in the separate resources of annotations of the page; thus, one has to recurse into annotation resources, too;
  • inlined in some content stream; thus, one also has to search the content streams of the page itself, of page resources (recursively), and page annotations and their resources (recursively).
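The recursion described in the list above can be sketched independently of PDFBox. The following is only a minimal illustration of the traversal pattern: Node is a hypothetical, simplified stand-in for a PDF resource (page, form xobject, pattern, annotation appearance or image) and is not a PDFBox class. The identity-based visited set is an assumption of this sketch; it guards against resources that are shared or that reference each other.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified stand-in for a PDF resource (not a PDFBox class). */
final class Node {
    final String name;
    final boolean image; // true if this resource is a bitmap image
    final List<Node> resources = new ArrayList<Node>(); // nested resources

    Node(String name, boolean image) {
        this.name = name;
        this.image = image;
    }

    Node add(Node child) {
        resources.add(child);
        return this;
    }
}

final class RecursiveImageSearch {

    /** Returns the names of all images reachable from root, depth first. */
    static List<String> findImages(Node root) {
        List<String> found = new ArrayList<String>();
        Set<Node> visited =
                Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        collect(root, visited, found);
        return found;
    }

    private static void collect(Node node, Set<Node> visited, List<String> found) {
        if (!visited.add(node)) {
            return; // shared or cyclic resource, already handled
        }
        if (node.image) {
            found.add(node.name);
        }
        for (Node child : node.resources) {
            collect(child, visited, found);
        }
    }

    public static void main(String[] args) {
        // Page -> form xobject -> pattern -> image, plus one direct image.
        Node nested = new Node("Im1", true);
        Node pattern = new Node("P1", false).add(nested);
        Node form = new Node("Fm1", false).add(pattern);
        Node page = new Node("Page", false).add(form).add(new Node("Im0", true));
        System.out.println(findImages(page));
    }
}
```

With PDFBox, the same walk would descend from the page resources into the resources of form xobjects, patterns, Type 3 fonts and annotation appearances instead of the hypothetical Node.resources list. Inlined images are not covered by such a walk; they require parsing the content streams.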

Furthermore, the bitmap might be given in some format (in particular, in some color space) that PDFBox does not know how to export as a BufferedImage.

Also, the bar code may be constructed by applying some mask to a purely black bitmap, in which case your code probably only tries to scan that purely black image.
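One cheap way to spot the masked case is to test whether an extracted BufferedImage has the same color in every pixel before handing it to the bar code scanner. The helper below is a sketch using only the JDK; the class and method names are made up for this illustration, and ignoring the alpha channel is an assumption of the sketch (the bar code shape may live only in a mask or in the alpha channel).

```java
import java.awt.image.BufferedImage;

/** Illustrative helper (names are hypothetical): detects single-color images. */
final class UniformImageCheck {

    /** Returns true if every pixel has the same RGB value; alpha is ignored. */
    static boolean isUniform(BufferedImage image) {
        int first = image.getRGB(0, 0) & 0xFFFFFF;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                if ((image.getRGB(x, y) & 0xFFFFFF) != first) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A freshly created TYPE_INT_RGB image is entirely black.
        BufferedImage image = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        System.out.println("all black: " + isUniform(image));

        // One white pixel makes the image non-uniform.
        image.setRGB(3, 3, 0xFFFFFF);
        System.out.println("with white pixel: " + isUniform(image));
    }
}
```

If such a check reports a uniform image for the xobject that should contain the bar code, the visible pattern most likely comes from a mask, and scanning the plain bitmap cannot succeed.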

Furthermore, you say

> I have checked that the bar code is in fact an image and not text.

If you only checked that the bar code is not text, it is not necessarily a bitmap image; it may also be drawn by vector graphics instructions. Thus, you also have to check all content streams for vector graphics instructions drawing a bar code.

There may also be combinations, e.g. a soft mask of vector graphics may be active when drawing a purely black inlined bitmap image.

And I'm sure I've missed a number of options here.


As a next step, you may want to analyze the PDF you cannot share to find out how exactly that bar code is drawn.

Alternatively, you can render the page as a bitmap image and search that large bitmap for bar codes using zxing.


Sample PDF.pdf

You provided a link to a sample PDF, so I tried to extract the bar code using code very similar to that from PDFBOX-2124. Apparently, the code there was written for some PDFBox 2.0.0-SNAPSHOT, so it had to be corrected a bit. In particular, the method getXObjectNames() that you mention in the question title is now used:

PDResources pdResources = pdPage.getResources();
int index = 0;

for (COSName name : pdResources.getXObjectNames()) {
    PDXObject xobject = pdResources.getXObject(name);
    if (xobject instanceof PDImageXObject)
    {
        PDImageXObject imageObject = (PDImageXObject) xobject;
        String suffix = imageObject.getSuffix();
        if (suffix != null)
        {
            BufferedImage image = imageObject.getImage();

            File file = new File(RESULT_FOLDER, String.format("Sample PDF-1-%s.%s", index, imageObject.getSuffix()));
            ImageIO.write(image, imageObject.getSuffix(), file);
            index++;
            System.out.println(file);
        }
    }
}

(ExtractImages test testExtractSamplePDFJayshreeAtak)

The output: One bitmap image is exported as "Sample PDF-1-0.tiff", which looks like this:

[Image: the exported bitmap, Sample PDF-1-0.tiff]

Thus, I cannot reproduce your issue:

> PDF Box getXObjectNames() does not recognize bar code on my PDF, however it does recognize it on a PDF file I got off the internet

Obviously, getXObjectNames() does return the name of the bitmap image xobject resource, and PDFBox exports it just fine.

Please check in your code whether the image really is not extracted, as you claim, or whether some later step simply cannot deal with it.

If in your case indeed the image is not extracted,

  • update your PDFBox version (I used the current development head but the newest released version should return the same),
  • update your Java,
  • check whether you have extra JAI jars that might cause trouble.
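When checking the Java side, note that ImageIO.write does not throw if no writer for the requested format is installed; it just returns false, and the extraction code above ignores that return value. As far as I know, a TIFF writer is part of the standard ImageIO plugins only from Java 9 on. The following sketch, using only the JDK, reports which formats the running Java can encode; the class and method names are made up for this illustration.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** Illustrative helper (names are hypothetical): checks ImageIO writer support. */
final class ImageWriterCheck {

    /** Returns true if this Java runtime has an ImageIO writer for the suffix. */
    static boolean hasWriter(String suffix) {
        return ImageIO.getImageWritersBySuffix(suffix).hasNext();
    }

    /** Encodes in memory; ImageIO.write returns false if no writer fits. */
    static boolean canEncode(BufferedImage image, String format) {
        ImageIO.setUseCache(false); // keep everything in memory, no temp files
        try {
            return ImageIO.write(image, format, new ByteArrayOutputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        for (String format : new String[] {"png", "jpg", "tiff"}) {
            System.out.println(format + ": writer=" + hasWriter(format)
                    + ", encoded=" + canEncode(image, format));
        }
    }
}
```

If "tiff" is reported without a writer, an image with that suffix is silently not written by the extraction loop, which can look as if the image had not been found at all.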

If in your case the image is extracted but not analyzed as expected by later code,

  • debug more thoroughly to find out where the analysis fails,
  • create a new question here focusing on the QR code image analysis,
  • and provide enough code and the tiff file to allow people to actually reproduce the issue.

huangapple
  • Posted on August 25, 2020 at 21:29:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63579973.html