Apache PdfBox: Confusion about coordinates
Question
I am trying to extract some text from a PDF. For that I need to define a rectangle that contains the text.
I noticed that the coordinates seem to mean different things depending on whether I use them for text extraction or for drawing.
    package MyTest.MyTest;

    import org.apache.pdfbox.pdmodel.*;
    import org.apache.pdfbox.pdmodel.PDPageContentStream.*;
    import org.apache.pdfbox.text.*;
    import java.awt.*;
    import java.io.*;

    public class MyTest
    {
        public static void main (String [] args) throws Exception
        {
            PDDocument pd = PDDocument.load (new File ("my.pdf"));
            PDFTextStripperByArea st = new PDFTextStripperByArea ();
            PDPage pg = pd.getPage (0);
            float h = pg.getMediaBox ().getHeight ();
            float w = pg.getMediaBox ().getWidth ();
            System.out.println (h + " x " + w + " in internal units");
            h = h / 72 * 2.54f * 10;
            w = w / 72 * 2.54f * 10;
            System.out.println (h + " x " + w + " in mm");
            int X = 85;
            int Y = 175;
            int dX = 250;
            int dY = 15;
            // extract some text
            st.addRegion ("a", new Rectangle (X, Y, dX, dY));
            st.extractRegions (pg);
            String text = st.getTextForRegion ("a");
            System.out.println ("text=" + text);
            // fill a rectangle
            PDPageContentStream contents = new PDPageContentStream (pd, pg, AppendMode.APPEND, false);
            contents.setNonStrokingColor (Color.RED);
            contents.addRect (X, Y, dX, dY);
            contents.fill ();
            contents.close ();
            pd.save ("x.pdf");
        }
    }
The text I extract (the text= output in the console) is not the text that my red rectangle covers in the generated x.pdf.
Why?
To test this, use any PDF you already have. A file with a lot of text saves trial and error when aiming for a rectangle that actually contains text.
Answer 1
Score: 7
There are (at least) two issues in your approach:
Different coordinate systems
You use st.addRegion. Its JavaDoc comment tells us:
    /**
     * Add a new region to group text by.
     *
     * @param regionName The name of the region.
     * @param rect The rectangle area to retrieve the text from. The y-coordinates are java
     * coordinates (y == 0 is top), not PDF coordinates (y == 0 is bottom).
     */
    public void addRegion( String regionName, Rectangle2D rect )
(Actually, the whole text extraction apparatus of PDFBox uses its own coordinate system, and the confusion this causes has already prompted many questions on Stack Overflow.)
On the other hand, contents.addRect does not use those "java coordinates". Thus, you have to subtract the y coordinate you use in text extraction from the upper y coordinate of the crop box to get a y coordinate for addRect.
Furthermore, region rectangles have their anchor point at the top left, while regular PDF rectangles (like the one you define with contents.addRect) have it at the bottom left. Thus, you additionally have to subtract the rectangle height from the resulting y coordinate.
You may have to change the x coordinate, too. It is not mirrored, but it may be shifted: the PDFBox text extraction coordinate system uses x=0 for the left page border, which is not necessarily the case in PDF user space. Thus, you may have to add the left border x coordinate of the crop box to your text extraction x coordinate.
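The conversion described above can be sketched as a small helper. This is only an illustration under stated assumptions: RegionToPdfRect and toPdfRect are hypothetical names (not part of PDFBox), the page is assumed to be unrotated, and the crop box values are passed in as plain numbers so the sketch runs without PDFBox on the classpath.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper for illustration; not part of PDFBox.
final class RegionToPdfRect {
    private RegionToPdfRect() {}

    /**
     * Converts a text extraction region (origin at the top left of the crop box,
     * y growing downwards, anchored at its top-left corner) into a PDF user space
     * rectangle (y growing upwards, anchored at its bottom-left corner).
     * Assumes an unrotated page.
     */
    static Rectangle2D.Double toPdfRect(Rectangle2D region,
                                        double cropLowerLeftX,
                                        double cropUpperRightY) {
        // x is only shifted by the left border of the crop box
        double x = cropLowerLeftX + region.getX();
        // y is mirrored at the top border, then moved down by the height
        // because the anchor changes from top left to bottom left
        double y = cropUpperRightY - region.getY() - region.getHeight();
        return new Rectangle2D.Double(x, y, region.getWidth(), region.getHeight());
    }

    public static void main(String[] args) {
        // Region from the question on an assumed 595 x 842 crop box starting at (0, 0)
        Rectangle2D region = new Rectangle2D.Double(85, 175, 250, 15);
        Rectangle2D.Double pdf = toPdfRect(region, 0, 842);
        System.out.println(pdf.x + ", " + pdf.y + ", " + pdf.width + ", " + pdf.height);
        // prints "85.0, 652.0, 250.0, 15.0"
    }
}
```

In the code from the question, the two crop box arguments would presumably come from pg.getCropBox().getLowerLeftX() and pg.getCropBox().getUpperRightY(), and the resulting values (cast to float) would be passed to contents.addRect instead of X, Y, dX, dY.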
Possibly changed coordinate system
In the page content stream, the coordinate system may have been changed by applying a transformation to the current transformation matrix. As a result, the coordinates in the instructions you append may have a different meaning, even after the corrections outlined above.
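To illustrate the effect, the following sketch uses java.awt.geom.AffineTransform as a stand-in for the current transformation matrix. The class name CtmEffectDemo and the 0.5 scale factor are assumptions chosen purely for demonstration; a real content stream may contain any combination of scaling, rotation, and translation.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical demo class; AffineTransform stands in for the PDF
// current transformation matrix (CTM).
final class CtmEffectDemo {
    private CtmEffectDemo() {}

    /** Returns where a point given in appended instructions ends up on the page. */
    static Point2D mapThroughCtm(AffineTransform ctm, double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // Assume the existing content left a scale of 0.5 in effect
        AffineTransform leftover = AffineTransform.getScaleInstance(0.5, 0.5);
        Point2D moved = mapThroughCtm(leftover, 85, 652);
        System.out.println(moved.getX() + ", " + moved.getY());
        // prints "42.5, 326.0"

        // With a reset graphics state the matrix is the identity again
        Point2D kept = mapThroughCtm(new AffineTransform(), 85, 652);
        System.out.println(kept.getX() + ", " + kept.getY());
        // prints "85.0, 652.0"
    }
}
```

A rectangle appended at (85, 652) would therefore land at a different place on the page unless the graphics state is reset first.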
To rule out such an effect, you should use a different PDPageContentStream constructor with an additional boolean resetContext parameter:
    /**
     * Create a new PDPage content stream.
     *
     * @param document The document the page is part of.
     * @param sourcePage The page to write the contents to.
     * @param appendContent Indicates whether content will be overwritten, appended or prepended.
     * @param compress Tell if the content stream should compress the page contents.
     * @param resetContext Tell if the graphic context should be reset. This is only relevant when
     * the appendContent parameter is set to {@link AppendMode#APPEND}. You should use this when
     * appending to an existing stream, because the existing stream may have changed graphic
     * properties (e.g. scaling, rotation).
     * @throws IOException If there is an error writing to the page contents.
     */
    public PDPageContentStream(PDDocument document, PDPage sourcePage, AppendMode appendContent,
                               boolean compress, boolean resetContext) throws IOException
That is, replace
    PDPageContentStream contents = new PDPageContentStream (pd, pg, AppendMode.APPEND, false);
with
    PDPageContentStream contents = new PDPageContentStream (pd, pg, AppendMode.APPEND, false, true);