iText: difference between Word-generated PDF and CorelDraw-generated PDF
Question
I'm trying to get the location of a specific string in PDFs generated from different sources (Word, Excel, CorelDraw, ...).
I have extracted the part of the code that I think is relevant to this question:
private static iText.Kernel.Geom.Rectangle? textRect;
private static readonly string specificText = "Test";

private void BtnNewPDF_Click(object sender, RoutedEventArgs e)
{
    // Initialize the PDF document
    PdfDocument pdfDoc = new(new PdfReader(@"D:\Word-test.pdf"));
    using Document document = new(pdfDoc);
    // Get the specified page from the PDF document
    PdfPage pdfPage = pdfDoc.GetPage(1);
    // Create a PdfCanvasProcessor object
    PdfCanvasProcessor canvasProcessor = new(new MyEventListener());
    // Process the page
    canvasProcessor.ProcessPageContent(pdfPage);
    if (textRect is not null)
    {
        MessageBox.Show(textRect.GetX().ToString());
        MessageBox.Show(textRect.GetY().ToString());
    }
    document.Close();
}

class MyEventListener : IEventListener
{
    public void EventOccurred(IEventData data, EventType type)
    {
        if (type == EventType.RENDER_TEXT)
        {
            // Cast the IEventData to TextRenderInfo
            TextRenderInfo renderInfo = (TextRenderInfo)data;
            // See if the current chunk contains the text
            var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), specificText);
            // If not found, bail
            if (startPosition < 0) { return; }
            // Get the bounding box for the text
            textRect = renderInfo.GetDescentLine().GetBoundingRectangle();
        }
    }

    public ICollection<EventType> GetSupportedEvents()
    {
        return new List<EventType> { EventType.RENDER_TEXT };
    }
}
When I work with a PDF generated by CorelDraw, I get the whole string ("Test") in the event data. But when I use a PDF generated by Microsoft Word, I get chunks like "T" and "est".
Am I using the wrong procedure to get the location of the string, or can I make some changes in Microsoft Word when generating the PDF? If none of this can be done, I'll need to make some code to concatenate the letters to get searched string and get location. I don't know how to attack this problem. Can somebody tell me, in general, how can this be solved?
Answer 1
Score: 2
PDF content is drawn according to the drawing instructions in several kinds of content streams therein. These instructions may draw text in different ways: they may draw a whole text line in one instruction, they may use a separate instruction for each word, or they may even use a separate instruction for each letter.
iText forwards each string argument of a text drawing instruction to you in a separate event. Thus, your observation essentially means that CorelDraw draws larger pieces of a line at once than Word does. Most likely Word draws "T" and "est" in separate instructions to apply kerning in between.
So essentially you indeed
> need to make some code to concatenate the letters to get searched string and get location.
In that context you wonder
> how to attack this problem. Can somebody tell me, in general, how can this be solved?
iText includes example event listeners for text extraction, e.g. the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy. As you are trying to extract not only the text but also its location, the newer RegexBasedLocationExtractionStrategy might be of special interest to you.
If you cannot use the RegexBasedLocationExtractionStrategy as is, you can at least look at its code for inspiration. iText is open source after all...
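If you do end up concatenating the chunks yourself, the core idea can be sketched without iText at all. The following is only a minimal illustration, not how iText implements it: Chunk and ChunkSearch are hypothetical names, each Chunk stands in for one RENDER_TEXT event (its text plus a bounding box you would take from the TextRenderInfo), and the chunks are assumed to arrive in reading order on a single line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for one RENDER_TEXT event: the chunk text and its box.
public record Chunk(string Text, float X, float Y, float Width, float Height);

public static class ChunkSearch
{
    // Returns the union box (x, y, width, height) of all chunks that contribute
    // to the first occurrence of `needle`, or null when it is not found.
    // Assumption: chunks are in reading order and belong to one line.
    public static (float X, float Y, float Width, float Height)? Find(
        IReadOnlyList<Chunk> chunks, string needle)
    {
        if (string.IsNullOrEmpty(needle)) return null;

        var text = new StringBuilder();
        var owner = new List<int>(); // owner[i] = index of the chunk that produced character i
        for (int c = 0; c < chunks.Count; c++)
        {
            text.Append(chunks[c].Text);
            for (int k = 0; k < chunks[c].Text.Length; k++) owner.Add(c);
        }

        int start = text.ToString().IndexOf(needle, StringComparison.Ordinal);
        if (start < 0) return null;

        int firstChunk = owner[start];
        int lastChunk = owner[start + needle.Length - 1];

        // Merge the boxes of every chunk between the first and last contributor.
        float left = float.MaxValue, bottom = float.MaxValue;
        float right = float.MinValue, top = float.MinValue;
        for (int c = firstChunk; c <= lastChunk; c++)
        {
            left = Math.Min(left, chunks[c].X);
            bottom = Math.Min(bottom, chunks[c].Y);
            right = Math.Max(right, chunks[c].X + chunks[c].Width);
            top = Math.Max(top, chunks[c].Y + chunks[c].Height);
        }
        return (left, bottom, right - left, top - bottom);
    }
}
```

Note that the resulting box has chunk granularity: if the match starts or ends in the middle of a chunk, the whole chunk is included. For tighter boxes you would feed the sketch one entry per character, which iText can provide via TextRenderInfo.GetCharacterRenderInfos(). Real pages also need sorting by position and line grouping, which is what the location-based strategies mentioned above take care of.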
Answer 2
Score: 1
Based on @mkl's answer, I have put together this code, which extracts the coordinates of the required text on the first page of the PDF using RegexBasedLocationExtractionStrategy. It works the same on PDFs generated by Word, Excel, or CorelDraw.
using System.Linq;                                   // FirstOrDefault
using iText.Kernel.Pdf.Canvas.Parser;                // PdfCanvasProcessor
using iText.Kernel.Pdf.Canvas.Parser.Listener;       // RegexBasedLocationExtractionStrategy, IPdfTextLocation

// Returns { x, y, width, height } of the first match on page 1,
// or { 0, 0, 0, 0 } when the text is not found.
public static float[] ArrTextCoordinates(iText.Kernel.Pdf.PdfDocument pdfDoc, string Text)
{
    const int PageNo = 1;
    float[] TextCoordinates = { 0, 0, 0, 0 };
    RegexBasedLocationExtractionStrategy extractionStrategy = new(Text);
    PdfCanvasProcessor parser = new(extractionStrategy);
    parser.ProcessPageContent(pdfDoc.GetPage(PageNo));
    IPdfTextLocation? location = extractionStrategy.GetResultantLocations().FirstOrDefault();
    if (location != null)
    {
        var rect = location.GetRectangle();
        TextCoordinates = new float[4]
        {
            rect.GetX(),
            rect.GetY(),
            rect.GetWidth(),
            rect.GetHeight()
        };
    }
    return TextCoordinates;
}
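One caveat with the method above: as the name suggests, RegexBasedLocationExtractionStrategy treats the string it is given as a regular expression pattern, so search text that contains characters such as "." or "(" may match more (or less) than intended. A small sketch of a workaround, assuming you want a purely literal search, is to escape the text first; SearchPattern and Literal are hypothetical names used only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class SearchPattern
{
    // Hypothetical helper: turns literal search text into a regex pattern
    // that matches exactly that text and nothing else.
    public static string Literal(string text) => Regex.Escape(text);
}
```

With this in place you could call ArrTextCoordinates(pdfDoc, SearchPattern.Literal("1.5")), so that the "." no longer matches an arbitrary character.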