October 15, 2020, 07:26:01

PDFBox getting wrong TextPositions in a specific pdf document

<details>
<summary>English:</summary>

**The Context**

I've been working on a program that takes a PDF, highlights some words (via a PDFBox markup annotation), and saves the new PDF.

For this I extend the [PDFTextStripper](https://pdfbox.apache.org/docs/2.0.7/javadocs/org/apache/pdfbox/text/PDFTextStripper.html) class to override the [writeString()](https://pdfbox.apache.org/docs/2.0.7/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#writeString(java.lang.String)) method and get the [TextPositions](https://pdfbox.apache.org/docs/2.0.7/javadocs/org/apache/pdfbox/text/TextPosition.html) of each word (box), so that I know exactly where the text sits in the PDF document in terms of coordinates (the TextPosition objects give me the coordinates of each word box). Based on that, I draw a [PDRectangle](https://pdfbox.apache.org/docs/2.0.3/javadocs/org/apache/pdfbox/pdmodel/common/PDRectangle.html) highlighting the word I want.

**The Problem**

It works perfectly for all the documents I've tried so far, except for one where the positions I get from the TextPositions seem to be wrong, leading to misplaced highlights.

This is the original document:  
https://pdfhost.io/v/b1Mcpoy~s_Thomson.pdf

This is the document with a highlight on the very first word box writeString() gives me with [setSortByPosition(false)](https://pdfbox.apache.org/docs/2.0.7/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setSortByPosition(boolean)), which is *MicroRNA*:  
https://pdfhost.io/v/V6INb4Xet_Thomson.pdf  
It should highlight *MicroRNA*, but it highlights a blank space above it (pink HL rectangle).

This is the document with a highlight on the very first word box writeString() gives me with [setSortByPosition(true)](https://pdfbox.apache.org/docs/2.0.7/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setSortByPosition(boolean)), which is *Original*:  
https://pdfhost.io/v/Lndh.j6ji_Thomson.pdf  
It should highlight *Original*, but it highlights a blank space at the very beginning of the PDF document (pink HL rectangle).

I suppose this PDF contains something that makes PDFBox struggle to get the right positions, or this may be a bug in PDFBox.

**Technical Specification:**

PDFBox 2.0.17  
Java 11.0.6+10, AdoptOpenJDK  
macOS Catalina 10.15.4, 16 GB, x86_64  

**Coordinates Values**

So for instance for the start and end of the MicroRNA word box, the TextPosition coordinates writeString() gives me are:  

*M letter*

    endX = 59.533783
    endY = 682.696
    maxHeight = 13.688589
    rotation = 0
    x = 35.886597
    y = 99.26935
    pageHeight = 781.96533
    pageWidth = 586.97034
    widthOfSpace = 11.9551
    font = PDType1CFont JCFHGD+AdvT108
    fontSize = 1.0
    unicode = M
    direction = -1.0

*A letter*

    endX = 146.34933
    endY = 682.696
    maxHeight = 13.688589
    rotation = 0
    x = 129.18181
    y = 99.26935
    pageHeight = 781.96533
    pageWidth = 586.97034
    widthOfSpace = 11.9551
    font = PDType1CFont JCFHGD+AdvT108
    fontSize = 1.0
    fontSizePt = 23
    unicode = A
    direction = -1.0
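
The numbers above appear consistent with `y` being measured downward from the top of a box of height `pageHeight`, while `endY` is measured upward from its bottom: 781.96533 - 99.26935 is roughly 682.696. The following is only a small arithmetic check on the reported values; the class and method names are made up for illustration and are not part of PDFBox.

```java
final class TextPositionCheck {

    /** Converts a Y value measured from the top of a box into one measured from its bottom. */
    static double toBottomUp(double boxHeight, double yFromTop) {
        return boxHeight - yFromTop;
    }

    public static void main(String[] args) {
        double pageHeight = 781.96533; // pageHeight reported for both glyphs
        double y = 99.26935;           // y reported for both glyphs
        double endY = 682.696;         // endY reported for both glyphs

        double bottomUp = toBottomUp(pageHeight, y);
        System.out.println("pageHeight - y = " + bottomUp);
        System.out.println("reported endY  = " + endY);
    }
}
```

If the two printed values agree, the reported Y values are at least self-consistent with respect to that box; it says nothing yet about where that box sits on the page.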


This results in the wrong HL annotation I shared above, while for all other PDF documents the result is very precise, and I've tested many different ones. I'm clueless here and I'm not an expert on PDF positioning. I've tried to use the PDFBox debugger tool, but I can't read its output properly. Any help would be much appreciated. Let me know if I can provide more evidence. Thanks.

**EDIT**

Note that text extraction is working just fine.

**My Code**

First I create an array of coordinates with a few values from the *TextPosition* objects of the first and last characters I want to HL:

    private void extractHLCoordinates(TextPosition firstPosition, TextPosition lastPosition, int pageNumber) {
        double firstPositionX = firstPosition.getX();
        double firstPositionY = firstPosition.getY();
        double lastPositionEndX = lastPosition.getEndX();
        double lastPositionY = lastPosition.getY();

        double height = firstPosition.getHeight();
        double width = firstPosition.getWidth();
        int rotation = firstPosition.getRotation();

        double[] wordCoordinates = {firstPositionX, firstPositionY, lastPositionEndX, lastPositionY, pageNumber,
                height, width, rotation};
        ...
    }


Now it's drawing time, based on the extracted coordinates:

    for (int pageIndex = 0; pageIndex < pdDocument.getNumberOfPages(); pageIndex++) {

        PDPage page = pdDocument.getPage(pageIndex);
        List<PDAnnotation> annotations = page.getAnnotations();
        int rotation;
        double pageHeight = page.getMediaBox().getHeight();
        double pageWidth  = page.getMediaBox().getWidth();

        // each CoordinatePoint obj holds the double array with the
        // coordinates of each word I want to HL - see the previous method
        for (CoordinatePoint coordinate : coordinates) {
            double[] wordCoordinates = coordinate.getCoordinates();
            int pageNumber = (int) wordCoordinates[4];

            // if the current coordinates are not related to the current page,
            // ignore them
            if (pageNumber == (pageIndex + 1)) {
                // getting rotation of the page: portrait, landscape...
                rotation = (int) wordCoordinates[7];
                double firstPositionX = wordCoordinates[0];
                double firstPositionY = wordCoordinates[1];
                double lastPositionEndX = wordCoordinates[2];
                double lastPositionY = wordCoordinates[3];
                double height = wordCoordinates[5];
                double minX;
                double maxX;
                double minY;
                double maxY;

                if (rotation == 90) {
                    double width = wordCoordinates[6];
                    width = (pageHeight * width) / pageWidth;
                    // defining coordinates of a rectangle
                    maxX = firstPositionY;
                    minX = firstPositionY - height;
                    minY = firstPositionX;
                    maxY = firstPositionX + width;
                } else {
                    minX = firstPositionX;
                    maxX = lastPositionEndX;
                    minY = pageHeight - firstPositionY;
                    maxY = pageHeight - lastPositionY + height;
                }

                // Finally I draw the Rectangle
                PDAnnotationTextMarkup txtMark = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);
                PDRectangle pdRectangle = new PDRectangle();
                pdRectangle.setLowerLeftX((float) minX);
                pdRectangle.setLowerLeftY((float) minY);
                pdRectangle.setUpperRightX((float) maxX);
                pdRectangle.setUpperRightY((float) (maxY + height));
                txtMark.setRectangle(pdRectangle);

                // And the QuadPoints
                float[] quads = new float[8];
                quads[0] = pdRectangle.getLowerLeftX();      // x1
                quads[1] = pdRectangle.getUpperRightY() - 2; // y1
                quads[2] = pdRectangle.getUpperRightX();     // x2
                quads[3] = quads[1];                         // y2
                quads[4] = quads[0];                         // x3
                quads[5] = pdRectangle.getLowerLeftY() - 2;  // y3
                quads[6] = quads[2];                         // x4
                quads[7] = quads[5];                         // y4
                txtMark.setQuadPoints(quads);
                ...
            }
        }
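
The quad-point layout used in the snippet above (upper-left, upper-right, lower-left, lower-right) can be isolated into a small helper, which makes the ordering easier to see and to test. This is a minimal sketch in plain Java; `QuadPointsSketch` and `fromRectangle` are hypothetical names, and the 2-unit vertical nudge from the original snippet is left out on purpose.

```java
import java.util.Arrays;

final class QuadPointsSketch {

    /**
     * Builds the 8 quad-point values for an axis-aligned rectangle, in the same
     * order as the snippet above: upper-left, upper-right, lower-left, lower-right.
     */
    static float[] fromRectangle(float llx, float lly, float urx, float ury) {
        return new float[] {
            llx, ury, // x1, y1: upper-left
            urx, ury, // x2, y2: upper-right
            llx, lly, // x3, y3: lower-left
            urx, lly  // x4, y4: lower-right
        };
    }

    public static void main(String[] args) {
        // Hypothetical rectangle, not taken from the document in question.
        float[] quads = fromRectangle(10f, 20f, 110f, 35f);
        System.out.println(Arrays.toString(quads));
    }
}
```

The resulting array could then be passed to `txtMark.setQuadPoints(...)` in place of the hand-filled `quads` array.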


</details>
# Answer 1
**Score**: 2

Your QuadPoints coordinates are computed relative to the CropBox, but they need to be relative to the MediaBox. For this document the CropBox is smaller than the MediaBox, so the highlight ends up in the wrong position. Adjust x by CropBox.LLX - MediaBox.LLX and y by MediaBox.URY - CropBox.URY and the highlight will be in the right position.  
The adjustment above works for pages with Rotate = 0. If Rotate != 0, further adjustments might be needed depending on how PDFBox returns the coordinates (I'm not very familiar with the PDFBox API).
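
As a rough illustration of that adjustment for an unrotated page, the sketch below shifts a CropBox-relative position (x from the left edge, y measured downward from the top) into MediaBox-relative coordinates, taking the x offset as the difference between the two boxes' lower-left x values. It is plain Java with hypothetical names and box values; boxes are modeled as `{llx, lly, urx, ury}` arrays rather than PDFBox `PDRectangle` objects.

```java
final class CropBoxOffset {

    /**
     * Returns {x, y} measured from the MediaBox's lower-left corner, bottom-up,
     * for a position given relative to the CropBox (y measured from the top).
     * Only intended for pages with Rotate = 0.
     */
    static double[] adjust(double x, double yFromTop, double[] media, double[] crop) {
        double mediaHeight = media[3] - media[1];
        double dx = crop[0] - media[0]; // CropBox.LLX - MediaBox.LLX
        double dy = media[3] - crop[3]; // MediaBox.URY - CropBox.URY
        return new double[] { x + dx, mediaHeight - yFromTop - dy };
    }

    public static void main(String[] args) {
        // Hypothetical boxes: the real document's boxes are not given in the question.
        double[] media = {0, 0, 612, 792};
        double[] crop  = {12.5, 5, 599.5, 787};

        double[] p = adjust(35.9, 99.3, media, crop);
        System.out.println("x = " + p[0] + ", y = " + p[1]);
    }
}
```

When the CropBox equals the MediaBox both offsets are zero, which would explain why unadjusted code works on most documents.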

**OP EDIT**

Posting here the changes I've made to my code, in case they help others.
Note that I haven't tried anything for rotation == 90 yet. I'll update here once I have that piece.

*Before*

    ...
    if (rotation == 90) {
        double width = wordCoordinates[6];
        width = (pageHeight * width) / pageWidth;
        // defining coordinates of a rectangle
        maxX = firstPositionY;
        minX = firstPositionY - height;
        minY = firstPositionX;
        maxY = firstPositionX + width;
    } else {
        minX = firstPositionX;
        maxX = lastPositionEndX;
        minY = pageHeight - firstPositionY;
        maxY = pageHeight - lastPositionY + height;
    }
    ...

*After*

    ...
    PDRectangle mediaBox = page.getMediaBox();
    PDRectangle cropBox = page.getCropBox();
    if (rotation == 90) {
        double width = wordCoordinates[6];
        width = (pageHeight * width) / pageWidth;
        // defining coordinates of a rectangle
        maxX = firstPositionY;
        minX = firstPositionY - height;
        minY = firstPositionX;
        maxY = firstPositionX + width;
    } else {
        // shift from CropBox-relative to MediaBox-relative coordinates
        minX = firstPositionX + cropBox.getLowerLeftX() - mediaBox.getLowerLeftX();
        maxX = lastPositionEndX + cropBox.getLowerLeftX() - mediaBox.getLowerLeftX();
        minY = pageHeight - firstPositionY - (mediaBox.getUpperRightY() - cropBox.getUpperRightY());
        maxY = pageHeight - lastPositionY + height - (mediaBox.getUpperRightY() - cropBox.getUpperRightY());
    }
    ...
