Delete an image from a PDF file using PDFBox

Question

public class DeleteImage {
    public static void removeImages(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));

        for (PDPage page : document.getPages()) {
            PDResources pdResources = page.getResources();
            pdResources.getXObjectNames().forEach(propertyName -> {
                if (!pdResources.isImageXObject(propertyName)) {
                    return;
                }
                PDXObject o;
                try {
                    o = pdResources.getXObject(propertyName);
                    if (o instanceof PDImageXObject) {
                        System.out.println("propertyName" + propertyName);
                        page.getCOSObject().removeItem(propertyName);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });

            for (COSName name : page.getResources().getPatternNames()) {
                PDAbstractPattern pattern = page.getResources().getPattern(name);
                System.out.println("have pattern");
            }

            PDFStreamParser parser = new PDFStreamParser(page);
            parser.parse();
            List<Object> tokens = parser.getTokens();
            System.out.println("original tokens size" + tokens.size());
            List<Object> newTokens = new ArrayList<Object>();

            for (int j = 0; j < tokens.size(); j++) {
                Object token = tokens.get(j);
                if (token instanceof Operator) {
                    Operator op = (Operator) token;

                    System.out.println("operation" + op.getName());
                    // find image - remove it
                    if (op.getName().equals("Do")) {
                        System.out.println("op equals Do");
                        newTokens.remove(newTokens.size() - 1);
                        continue;
                    } else if ("BI".equals(op.getName())) {
                        System.out.println("inline -- op equals BI");
                    } else {
                        System.out.println("op not equals Do");
                    }
                }
                newTokens.add(token);
            }

            PDDocument newDoc = new PDDocument();
            PDPage newPage = newDoc.importPage(page);
            newPage.setResources(page.getResources());

            System.out.println("tokens size" + newTokens.size());
            PDStream newContents = new PDStream(newDoc);
            OutputStream out = newContents.createOutputStream();
            ContentStreamWriter writer = new ContentStreamWriter(out);
            writer.writeTokens(newTokens);
            out.close();
            newPage.setContents(newContents);
        }

        document.save("RemoveImage.pdf");
        document.close();
    }

    public static void remove(String pdfFile) throws Exception {
        PDDocument document = PDDocument.load(new File(pdfFile));
        PDResources resources = null;

        for (PDPage page : document.getPages()) {
            resources = page.getResources();

            for (COSName name : resources.getXObjectNames()) {
                PDXObject xobject = resources.getXObject(name);

                if (xobject instanceof PDImageXObject) {
                    System.out.println("have image");
                    removeImages(pdfFile);
                }
            }
        }
        document.save("RemoveImage.pdf");
        document.close();
    }
}

I am attempting to delete images from a PDF using Java and PDFBox. The images are not inline, and the PDF does not have patterns or forms. The PDF file contains 2 images; the PDFDebugger tool shows them as Resources >> XObject >> IM3 and IM5. The problem is that when I open the output PDF file, the images have not been deleted.

Answer 1

Score: 1

If You Call remove...

In remove you

  • load the PDF into document,
  • iterate over the pages of document, and for each page
    • iterate over the XObject resources, and for each XObject
      • check whether it is an image XObject, and if it is
        • call removeImages, which loads the same original file again, processes it, and saves the result as "RemoveImage.pdf".
  • After all that processing, you save the unchanged document to "RemoveImage.pdf".

So in that last step you overwrite any changes removeImages may have made, and you end up with your original file in "RemoveImage.pdf"!
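
The overwrite described above can be reproduced without PDFBox. The following is only a library-free sketch: plain text files stand in for the PDF, the marker "[image]" stands in for an image, and the names LastSaveWins, process, buggyRemove and fixedRemove are made up for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LastSaveWins {
    static String read(Path p) throws IOException {
        return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
    }

    static void write(Path p, String s) throws IOException {
        Files.write(p, s.getBytes(StandardCharsets.UTF_8));
    }

    // Stands in for removeImages: loads its own copy, changes it, saves to output.
    static void process(Path input, Path output) throws IOException {
        write(output, read(input).replace("[image]", ""));
    }

    // Stands in for remove: the final save writes the unchanged copy over the result.
    static void buggyRemove(Path input, Path output) throws IOException {
        String loaded = read(input);
        if (loaded.contains("[image]")) {
            process(input, output);
        }
        write(output, loaded); // last save wins: the original content is back
    }

    // One load, one modification, one save.
    static void fixedRemove(Path input, Path output) throws IOException {
        write(output, read(input).replace("[image]", ""));
    }

    // Runs one of the two variants on temporary files and returns the saved content.
    static String demo(boolean fixed) {
        try {
            Path input = Files.createTempFile("in", ".txt");
            Path output = Files.createTempFile("out", ".txt");
            try {
                write(input, "text[image]text");
                if (fixed) {
                    fixedRemove(input, output);
                } else {
                    buggyRemove(input, output);
                }
                return read(output);
            } finally {
                Files.deleteIfExists(input);
                Files.deleteIfExists(output);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // prints text[image]text
        System.out.println(demo(true));  // prints texttext
    }
}
```

Applied to the code in the question, this suggests loading the document once, modifying that one instance, and saving it once, rather than saving from both remove and removeImages.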

If You Call removeImages Directly...

In removeImages you do make some changes, but there are certain issues:

  • Whenever you find an image XObject resource, you attempt to remove it from the page directly

    page.getCOSObject().removeItem(propertyName);

    but the image XObject resource is not a direct child of the page; it is managed by pdResources, so you should remove it from there.

  • You remove all Do instructions from the page content, not only those for image XObjects, so you probably remove more than you wanted.
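
The second point can be sketched as a filter that drops a Do instruction only when its operand names an image. This is a simplified, library-free model and not PDFBox code: operands are plain strings, the Op class is a hypothetical stand-in for PDFBox's Operator, and the set of image names is assumed to have been collected beforehand (in the question's code, pdResources.isImageXObject would be the natural check). IM3 and IM5 are the names from the question; Fm1 is a made-up non-image XObject.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DoFilterSketch {
    /** Hypothetical stand-in for a content-stream operator (PDFBox: Operator). */
    static final class Op {
        final String name;

        Op(String name) {
            this.name = name;
        }

        @Override
        public String toString() {
            return name;
        }
    }

    /**
     * Copies the tokens, dropping each "name Do" pair whose name is in imageNames.
     * Do instructions for any other XObject are kept.
     */
    static List<Object> dropImageDraws(List<Object> tokens, Set<String> imageNames) {
        List<Object> kept = new ArrayList<>();
        for (Object token : tokens) {
            if (token instanceof Op && "Do".equals(((Op) token).name) && !kept.isEmpty()) {
                Object operand = kept.get(kept.size() - 1);
                if (operand instanceof String && imageNames.contains(operand)) {
                    kept.remove(kept.size() - 1); // drop the operand (the image name)
                    continue;                     // and skip the Do operator itself
                }
            }
            kept.add(token);
        }
        return kept;
    }

    static String demo() {
        List<Object> tokens = Arrays.asList(
                new Op("q"), "IM3", new Op("Do"), new Op("Q"),
                new Op("q"), "Fm1", new Op("Do"), new Op("Q"));
        Set<String> imageNames = new HashSet<>(Arrays.asList("IM3", "IM5"));
        return dropImageDraws(tokens, imageNames).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [q, Q, q, Fm1, Do, Q]
    }
}
```

The sketch covers only the content-stream side. The first point, removing the image entries from the resources instead of from the page dictionary, is not modeled here.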

huangapple
  • Posted on August 26, 2020, 21:36:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63598888.html