Replace or remove text from PDF with PDFbox in Java

Question

I'm trying to use PDFBox 2.0 to delete a text pattern or replace it with an empty string (in my case I want to remove every "[QR]" word from all my PDFs), but I can't find anything that works for me.

I tried iText as well, with the same result: nothing works.

The "[QR]" strings in my PDF were added by editing the file after it was created; maybe that's why they don't appear as operands of the Tj operators?

My main:

replaceText(documentoPDF, "[QR]", "");

My method (I printed the Tj values and my pattern doesn't appear there):

public void replaceText(PDDocument documentoPDF, String searchString, String replacement) throws IOException {
    for (PDPage page : documentoPDF.getPages()) {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<?> tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                String pstring = "";
                int prej = 0;
                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operator and that is the string to display so lets update that operator
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            if (j == prej) {
                                pstring += string;
                            } else {
                                prej = j;
                                pstring = string;
                            }
                        }
                    }
                    System.out.println(pstring.trim());
                    if (searchString.equals(pstring.trim())) {
                        COSString cosString2 = (COSString) previous.getObject(0);
                        cosString2.setValue(replacement.getBytes());
                        int total = previous.size() - 1;
                        for (int k = total; k > 0; k--) {
                            previous.remove(k);
                        }
                    }
                }
            }
        }
        // now that the tokens are updated we will replace the page content stream.
        PDStream updatedStream = new PDStream(documentoPDF);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        out.close();
        page.setContents(updatedStream);
    }
    documentoPDF.save("resources\\resultado\\nuevo.pdf");
}
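
A side note on the Tj branch of the method above, separate from the question of why the pattern is not found: String.replaceFirst interprets its first argument as a regular expression, so "[QR]" is read as a character class matching a single Q or R rather than the literal four characters. The following is a minimal sketch of the difference using only the JDK; the class name, method names, and sample string are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralReplaceDemo {
    // Regex semantics: "[QR]" is a character class, so only the first 'Q' or 'R' is removed.
    static String regexReplace(String text, String search, String replacement) {
        return text.replaceFirst(search, replacement);
    }

    // Literal semantics: Pattern.quote escapes the brackets, so the whole token must match.
    static String literalReplace(String text, String search, String replacement) {
        return text.replaceFirst(Pattern.quote(search), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String sample = "Code [QR] here";
        System.out.println(regexReplace(sample, "[QR]", ""));   // Code [R] here
        System.out.println(literalReplace(sample, "[QR]", "")); // Code  here
    }
}
```

Quoting the search string only fixes the matching semantics; it does not help when the bytes in the content stream are not the characters being searched for.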

This is an example PDF with some "[QR]" patterns: http://www.mediafire.com/file/9w3kkc4yozwsfms/file

If someone can help, I would appreciate it.

I can upload my entire project if needed.

Thanks in advance.

Answer 1

Score: 5

As already mentioned in the comments, the reason your code doesn't work is simple: you completely ignore the encoding of the font of that text. In the content stream there actually are [( >) ( 4) ( 5) ( @) ] TJ instructions (the "spaces" before '>', '4', '5', and '@' are actually zero bytes, 0x00). Thus, the encoding apparently is some 16-bit encoding which, in addition, does not place the ASCII characters at their ASCII code values.
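
To make the encoding point concrete, here is a small JDK-only sketch (no PDFBox involved). It takes the eight bytes of that TJ array and shows that reading them one byte per character never yields "[QR]", whereas reading them as 16-bit codes and mapping each code through a font-specific table does. The table in the code is an assumption made for illustration: it simply pairs the four codes with '[', 'Q', 'R', ']' in order. In a real document the mapping comes from the font itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

class TwoByteCodeDemo {
    // Hypothetical code-to-Unicode table for the four glyphs; in a real PDF this
    // mapping is supplied by the font, not by a constant like this one.
    static final Map<Integer, String> ASSUMED_TO_UNICODE = Map.of(
            0x003E, "[", 0x0034, "Q", 0x0035, "R", 0x0040, "]");

    // Reads the bytes as big-endian 16-bit character codes and maps each code through the table.
    static String decode(byte[] raw, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + 1 < raw.length; i += 2) {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            text.append(toUnicode.getOrDefault(code, "?")); // "?" marks an unmapped code
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The four string operands of the TJ array, concatenated: 00 3E 00 34 00 35 00 40
        byte[] raw = {0x00, 0x3E, 0x00, 0x34, 0x00, 0x35, 0x00, 0x40};

        // One byte per character, which is what a plain token scan effectively compares against
        String naive = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(naive.contains("[QR]"));          // false
        System.out.println(decode(raw, ASSUMED_TO_UNICODE)); // [QR]
    }
}
```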

To properly take the font into account one has to keep track of the current font. This means parsing the whole content stream and analyzing text font setting calls, save graphics state calls, and restore graphics state calls. Then you have to retrieve the proper font object from the correct resources.
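
As a rough illustration of that bookkeeping, the sketch below walks a simplified, already tokenized content stream and records which font resource name is active at each Tj. It is a toy model under stated assumptions: tokens are plain strings, only q, Q, Tf, and Tj are handled, and the class and method names are invented. Resolving a name such as /F1 to an actual font object in the right resource dictionary is left out entirely.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class FontStateTrackerDemo {
    // Returns "fontName:operand" for every Tj in a simplified, pre-tokenized content stream.
    static List<String> fontsPerTj(List<String> tokens) {
        Deque<String> savedFonts = new ArrayDeque<>(); // fonts saved by q
        String currentFont = "(none)";                  // marker, since ArrayDeque rejects null
        List<String> result = new ArrayList<>();

        for (int i = 0; i < tokens.size(); i++) {
            switch (tokens.get(i)) {
                case "q":  // save graphics state (the current font is part of it)
                    savedFonts.push(currentFont);
                    break;
                case "Q":  // restore graphics state
                    if (!savedFonts.isEmpty()) {
                        currentFont = savedFonts.pop();
                    }
                    break;
                case "Tf": // operands: font resource name, font size
                    currentFont = tokens.get(i - 2);
                    break;
                case "Tj": // operand: the string to show
                    result.add(currentFont + ":" + tokens.get(i - 1));
                    break;
                default:   // an operand, or an operator this sketch ignores
                    break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
                "/F1", "12", "Tf",
                "q", "/F2", "10", "Tf", "(A)", "Tj", "Q",
                "(B)", "Tj");
        System.out.println(fontsPerTj(tokens)); // [/F2:(A), /F1:(B)]
    }
}
```

The second Tj is drawn with /F1 again because Q restores the font that was active when q was executed, which is exactly the state a token-by-token scan without a stack would get wrong.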

All this actually is already done by the PDFBox content parsing framework used for e.g. text extraction. Thus, we can create a content stream editor around this framework.

Actually, this also has already been done, see the PdfContentStreamEditor from this answer.

In your document, each text piece to delete is drawn by a single text drawing instruction, and each of these instructions draws nothing but the text piece to remove. We can therefore simply look at the text the current instruction draws and then decide whether to keep the instruction or not:

PDDocument document = ...;
for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
        final StringBuilder recentChars = new StringBuilder();

        @Override
        protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, Vector displacement)
                throws IOException {
            String string = font.toUnicode(code);
            if (string != null)
                recentChars.append(string);

            super.showGlyph(textRenderingMatrix, font, code, displacement);
        }

        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
            String recentText = recentChars.toString();
            recentChars.setLength(0);
            String operatorString = operator.getName();

            if (TEXT_SHOWING_OPERATORS.contains(operatorString) && "[QR]".equals(recentText))
            {
                return;
            }

            super.write(contentStreamWriter, operator, operands);
        }

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    };
    editor.processPage(page);
}
document.save("nuevo-noQrText.pdf");

(EditPageContent test testRemoveQrTextNuevo)

Depending on your PDFBox version the showGlyph method to override may have a fifth parameter; thus, please check the showGlyph signature of your PDFBox copy and adapt if this code does not work. Thanks to @DanielNorberg for the hint!
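
The keep-or-drop rule used in the write override can be looked at in isolation. The sketch below is a JDK-only stand-in, not PDFBox code: the Instruction class and the filter method are invented for illustration, and drawnText plays the role of the characters collected by showGlyph. It also shows the limitation stated above, namely that an instruction is only dropped when its entire text equals the target.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DropInstructionDemo {
    // Simplified stand-in for one content stream instruction plus the text it drew.
    static final class Instruction {
        final String operator;
        final String drawnText; // what the showGlyph calls would have accumulated

        Instruction(String operator, String drawnText) {
            this.operator = operator;
            this.drawnText = drawnText;
        }
    }

    static final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");

    // Keeps every instruction except text-showing ones whose whole text equals the target.
    static List<Instruction> filter(List<Instruction> instructions, String target) {
        List<Instruction> kept = new ArrayList<>();
        for (Instruction instruction : instructions) {
            boolean showsText = TEXT_SHOWING_OPERATORS.contains(instruction.operator);
            if (showsText && target.equals(instruction.drawnText)) {
                continue; // drop the whole instruction
            }
            kept.add(instruction);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Instruction> stream = Arrays.asList(
                new Instruction("BT", ""),
                new Instruction("TJ", "[QR]"),          // dropped: draws exactly the target
                new Instruction("Tj", "Scan [QR] now"), // kept: the target is only part of the text
                new Instruction("ET", ""));
        for (Instruction instruction : filter(stream, "[QR]")) {
            System.out.println(instruction.operator + " " + instruction.drawnText);
        }
    }
}
```

For documents where the text to remove shares an instruction with other text, dropping whole instructions is not enough and the operands themselves would have to be rewritten.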

In the result the "[QR]" texts underneath the QR codes have vanished, e.g.

[screenshot: QR code with the "[QR]" text underneath]

became

[screenshot: the same QR code without the "[QR]" text]

huangapple
  • Posted on August 26, 2020, 14:54:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63592078.html