Images inverted and split when extracting images from a PDF document using PDFBox or Poppler

Question

I want to extract the whole images on each page of a PDF document using PDFBox in Java, but all the extracted images come out inverted and split into pieces. Note that this is not a bug in PDFBox or Poppler; it is caused by the way the PDF document itself is built.
How can I piece each whole image back together and get every image in the correct orientation? Could anybody give me some advice? A snippet of Java code is preferred.
My PDF link: download

Answer 1

Score: 0

Here are the first 6 images, and we can see they are simply the text on the right, whereas the artwork is specified as single vector line paths (as shown on the left).

Extracting hundreds or thousands of such images is more work than it's worth.
Page 1 alone has 115, at an unusually high density of 1200 ppi:

C:\Apps\PDF\poppler\poppler-23.05.0\Library\bin>pdfimages -list -f 1 -l 1 my.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 stencil   144   468  -       1   1  ccitt  no       348  0  1200  1200  197B 2.3%
   1     1 stencil    64   456  -       1   1  image  no       349  0  1200  1200  165B 4.5%
   1     2 stencil    64   456  -       1   1  image  no       349  0  1200  1200  165B 4.5%
   1     3 stencil    72   468  -       1   1  ccitt  no       350  0  1200  1200  154B 3.7%
   1     4 stencil   192   468  -       1   1  ccitt  no       351  0  1200  1200  264B 2.4%
   1     5 stencil    96   456  -       1   1  ccitt  no       352  0  1200  1200  142B 2.6%
   1     6 stencil   136   570  -       1   1  ccitt  no       353  0  1200  1200  192B 2.0%
   1     7 stencil   224   582  -       1   1  ccitt  no       419  0  1200  1200  329B 2.0%
   1     8 stencil   104   582  -       1   1  ccitt  no       420  0  1200  1200  194B 2.6%
   1     9 stencil   192   582  -       1   1  ccitt  no       345  0  1200  1200  306B 2.2%
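
To get a feel for how small these fragments are on the page, the listing can be converted from pixels to PDF points (1 pt = 1/72 inch). The following is only a sketch: the class and method names are made up for illustration, and it assumes the 16-column layout shown in the listing above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PdfImagesListing {

    /** One row of `pdfimages -list` output, reduced to the columns used here. */
    static final class Row {
        final int num;
        final String type;
        final int widthPx, heightPx, xPpi, yPpi;

        Row(int num, String type, int widthPx, int heightPx, int xPpi, int yPpi) {
            this.num = num;
            this.type = type;
            this.widthPx = widthPx;
            this.heightPx = heightPx;
            this.xPpi = xPpi;
            this.yPpi = yPpi;
        }

        // Size on the page in PDF points: pixels / (pixels per inch) * 72.
        double widthPt()  { return widthPx  * 72.0 / xPpi; }
        double heightPt() { return heightPx * 72.0 / yPpi; }
    }

    /** Parses the data rows of a listing; the header and the dashed ruler are skipped. */
    static List<Row> parse(String listing) {
        List<Row> rows = new ArrayList<>();
        for (String line : listing.split("\\R")) {
            String[] t = line.trim().split("\\s+");
            // Assumption: data rows have exactly 16 columns and start with the page number.
            if (t.length != 16 || !t[0].matches("\\d+")) {
                continue;
            }
            rows.add(new Row(Integer.parseInt(t[1]), t[2],
                    Integer.parseInt(t[3]), Integer.parseInt(t[4]),
                    Integer.parseInt(t[12]), Integer.parseInt(t[13])));
        }
        return rows;
    }

    public static void main(String[] args) {
        String listing = String.join("\n",
                "page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio",
                "   1     0 stencil   144   468  -       1   1  ccitt  no       348  0  1200  1200  197B 2.3%",
                "   1     6 stencil   136   570  -       1   1  ccitt  no       353  0  1200  1200  192B 2.0%");
        for (Row r : parse(listing)) {
            System.out.printf(Locale.ROOT, "image %d (%s): %.2f x %.2f pt%n",
                    r.num, r.type, r.widthPt(), r.heightPt());
        }
    }
}
```

For image 0 this works out to 144 * 72 / 1200 = 8.64 pt wide and 28.08 pt high, which is about the size of a few characters of text rather than a whole figure.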

So export each marquee area as an image.

It is possible to define the areas programmatically as vectors, but as fast as you can see them (about 4 xy rect values each) you could click to copy each one to the clipboard and automate saving them as image6.png, 7.png, 8.png, etc.

Some tools attempt to let you specify how white space defines a capturable area, but it depends on whether you have the time to write a custom detector: search for "6. blah" or "7. blah" (not 1. to 5.), then take the full page width for a set height under that match. Here is one such region cut out using Poppler:

pdftoppm -f 1 -l 1 -r 300 -x 360 -W 1750 -y 375 -H 360 -png my.pdf out6

Now that we have the measure of it, we can apply the Y distance uplift between 6. and 7.
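
Applying that Y offset repeatedly could be scripted along the following lines. This is only a sketch that generates pdftoppm command lines in the same form as the one above; the class name, the method names, and the step of 400 pixels are hypothetical, and the real step has to be measured on the page at the chosen resolution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class CropCommands {

    /** Builds one pdftoppm call that renders a pixel rectangle of a single page to PNG. */
    static String cropCommand(String pdf, int page, int dpi,
                              int x, int y, int w, int h, String outRoot) {
        return String.format(Locale.ROOT,
                "pdftoppm -f %d -l %d -r %d -x %d -W %d -y %d -H %d -png %s %s",
                page, page, dpi, x, w, y, h, pdf, outRoot);
    }

    /** Stacks `count` equally sized regions, each `stepY` pixels below the previous one. */
    static List<String> stackedCrops(String pdf, int page, int dpi, int x, int firstY,
                                     int w, int h, int stepY, int firstNumber, int count) {
        List<String> commands = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            commands.add(cropCommand(pdf, page, dpi, x, firstY + i * stepY, w, h,
                    "out" + (firstNumber + i)));
        }
        return commands;
    }

    public static void main(String[] args) {
        // stepY = 400 is a placeholder, not a value measured from the document.
        for (String command : stackedCrops("my.pdf", 1, 300, 360, 375, 1750, 360, 400, 6, 2)) {
            System.out.println(command);
        }
    }
}
```

The -x, -y, -W and -H values are pixels at the resolution given with -r, so a distance measured in PDF points has to be multiplied by dpi / 72 before it is used as the step.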

Answer 2

Score: 0

At first glance it looked like each of the figures in question was drawn in a separate block of content stream instructions enveloped by, but not containing, text objects. Thus, one approach to isolating them is to export every such block of instructions to a separate new page. You can then post-process these new pages, e.g. by rendering them as bitmap images using PDFBox's PDFRenderer.
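
The splitting idea itself can be illustrated without PDFBox. The sketch below is a simplification and not the implementation used in this answer: it treats a content stream as a plain list of operator names, cuts it at text objects (BT ... ET), and keeps a run only if it draws an XObject or paints more than three paths. The class and method names are hypothetical, and graphics state carry-over and q/Q nesting are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FigureBlocks {

    // Path painting operators of the PDF content stream syntax.
    static final List<String> PATH_PAINTING =
            Arrays.asList("S", "s", "F", "f", "f*", "B", "B*", "b", "b*");

    /** Returns the operator runs found outside BT ... ET that look like drawings. */
    static List<List<String>> isolate(List<String> operators) {
        List<List<String>> figures = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean inText = false;
        for (String op : operators) {
            if (op.equals("BT")) {
                // A text object starts: the run collected so far is complete.
                keepIfFigure(current, figures);
                current = new ArrayList<>();
                inText = true;
            } else if (op.equals("ET")) {
                inText = false;
            } else if (!inText) {
                current.add(op);
            }
        }
        keepIfFigure(current, figures);
        return figures;
    }

    /** Keeps a run only if it draws an XObject or paints more than three paths. */
    private static void keepIfFigure(List<String> block, List<List<String>> figures) {
        long paths = block.stream().filter(PATH_PAINTING::contains).count();
        if (block.contains("Do") || paths > 3) {
            figures.add(block);
        }
    }

    public static void main(String[] args) {
        List<String> ops = Arrays.asList(
                "BT", "Tj", "ET",
                "m", "l", "S", "m", "l", "S", "m", "l", "S", "m", "l", "S",
                "BT", "Tj", "ET",
                "m", "l", "S");
        System.out.println(isolate(ops).size() + " figure block(s) kept");
    }
}
```

In this model a lone underline or rule between two text objects (one painted path) is discarded, while a run with several painted paths is kept as a candidate figure.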

I based the code for this on the PdfContentStreamEditor (originally from this answer), as follows:

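import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
// PdfContentStreamEditor is the helper class from the referenced answer; it is not part of PDFBox.

PDDocument document = PDDocument.load(...);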

for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
        ByteArrayOutputStream commonRaw = null;
        ContentStreamWriter commonWriter = null;
        int depth = 0;

        @Override
        public void processPage(PDPage page) throws IOException {
            commonRaw = new ByteArrayOutputStream();
            try {
                commonWriter = new ContentStreamWriter(commonRaw);
                startFigurePage(page);
                super.processPage(page);
            } finally {
                endFigurePage();
                commonRaw.close();
            }
        }

        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator,
                List<COSBase> operands) throws IOException {
            String operatorString = operator.getName();
            if (operatorString.equals("BT")) {
                endFigurePage();
            }
            if (operatorString.equals("q")) {
                depth++;
            }
            writeFigure(operator, operands);
            if (operatorString.equals("Q")) {
                depth--;
            }
            if (operatorString.equals("ET")) {
                startFigurePage(getCurrentPage());
            }

            super.write(contentStreamWriter, operator, operands);
        }

        OutputStream figureRaw = null;
        ContentStreamWriter figureWriter = null;
        PDPage figurePage = null;
        int xobjectsDrawn = 0;
        int pathsPainted = 0;

        void startFigurePage(PDPage currentPage) throws IOException {
            figurePage = new PDPage(currentPage.getMediaBox());
            figurePage.setResources(currentPage.getResources());
            PDStream stream = new PDStream(document);
            figurePage.setContents(stream);
            figureWriter = new ContentStreamWriter(figureRaw = stream.createOutputStream(COSName.FLATE_DECODE));
            figureRaw.write(commonRaw.toByteArray());
            xobjectsDrawn = 0;
            pathsPainted = 0;
        }

        void endFigurePage() throws IOException {
            if (figureWriter != null) {
                figureWriter = null;
                figureRaw.close();
                figureRaw = null;
                if (xobjectsDrawn > 0 || pathsPainted > 3)
                    document.addPage(figurePage);
                figurePage = null;
            }
        }

        final List<String> PATH_PAINTING_OPERATORS = Arrays.asList("S", "s", "F", "f", "f*",
                "B", "B*", "b", "b*");

        void writeFigure(Operator operator, List<COSBase> operands) throws IOException {
            if (figureWriter != null) {
                String operatorString = operator.getName();
                boolean isXObjectDo = operatorString.equals("Do");
                boolean isPathPainting = PATH_PAINTING_OPERATORS.contains(operatorString);
                if (isXObjectDo)
                    xobjectsDrawn++;
                if (isPathPainting)
                    pathsPainted++;
                figureWriter.writeTokens(operands);
                figureWriter.writeToken(operator);
                if (depth == 0) {
                    if (!isXObjectDo) {
                        if (isPathPainting)
                            operator = Operator.getOperator("n");
                        commonWriter.writeTokens(operands);
                        commonWriter.writeToken(operator);
                    }
                }
            }
        }
    };
    editor.processPage(page);
}

document.save(new File(RESULT_FOLDER, "my-isolatedFigures.pdf"));

(IsolateFigures test testIsolateInMy)

The first figures are extracted quite well:

[Images: S30 a, S30 b, S31 a, S31 b]

Certain figures, though, turn out to contain text objects and are therefore split into partial images that lose their text content:

[Images: S32 b 1, S32 b 2, S32 b 3, S32 b 4]

huangapple
  • Posted on May 25, 2023, 22:47:54
  • Please keep this link when reposting: https://go.coder-hub.com/76333586.html