March 3, 2023, 20:22:27

iText difference between word generated PDF and CorelDraw generated PDF

Question

I'm trying to get the location of a specific string in PDFs generated from different sources (Word, Excel, CorelDraw, ...).
I have extracted the part of the code that I think is relevant to this question:

    private static iText.Kernel.Geom.Rectangle? textRect;
    private static readonly string specificText = "Test";
    private void BtnNewPDF_Click(object sender, RoutedEventArgs e)
    {
        //Initialize PDF document
        PdfDocument pdfDoc = new(new PdfReader(@"D:\Word-test.pdf"));
        using Document document = new(pdfDoc);
        // Get the specified page from the PDF document
        PdfPage pdfPage = pdfDoc.GetPage(1);
        // Create a PdfCanvasProcessor object
        PdfCanvasProcessor canvasProcessor = new(new MyEventListener());
        // Process the page
        canvasProcessor.ProcessPageContent(pdfPage);
        if (textRect is not null)
        {
            MessageBox.Show(textRect.GetX().ToString());
            MessageBox.Show(textRect.GetY().ToString());
        }
        document.Close();
    }
    class MyEventListener : IEventListener
    {
        public void EventOccurred(IEventData data, EventType type)
        {
            if (type == EventType.RENDER_TEXT)
            {
                // Cast the IEventData to TextRenderInfo
                TextRenderInfo renderInfo = (TextRenderInfo)data;
                //See if the current chunk contains the text
                var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), specificText);
                //If not found bail
                if (startPosition < 0) { return; }
                // Get the bounding box for the text
                textRect = renderInfo.GetDescentLine().GetBoundingRectangle();
            }
        }
        public ICollection<EventType> GetSupportedEvents()
        {
            return new List<EventType> { EventType.RENDER_TEXT };
        }
    }

When I work with a PDF generated by CorelDraw, I get the whole string ("Test") in the event data. But when I use a PDF generated by Microsoft Word, I get chunks like "T" and "est".

Am I using the wrong procedure to get the location of the string, or can I change some setting in Microsoft Word when generating the PDF? If neither is possible, I'll need to write some code to concatenate the letters to get the searched string and its location. I don't know how to attack this problem. Can somebody tell me, in general, how this can be solved?

Answer 1

Score: 2

PDF content is drawn according to the drawing instructions in several kinds of content streams therein. These instructions may draw text in different ways: they may draw a whole text line in one instruction, they may use a separate instruction for each word, and they may even use a separate instruction for each letter.

iText forwards each string argument of a text drawing instruction to you as a separate event. Thus, your observation essentially means that CorelDraw draws larger pieces of a line at once than Word does. Most likely Word draws "T" and "est" in separate instructions so that it can apply kerning in between.

So essentially you indeed

> need to make some code to concatenate the letters to get searched string and get location.

In that context you wonder

> how to attack this problem. Can somebody tell me, in general, how can this be solved?

iText includes example event listeners for text extraction, e.g. the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy. As you are trying to not only extract the text but also locations, the newer RegexBasedLocationExtractionStrategy might be of special interest to you.

If you cannot use the RegexBasedLocationExtractionStrategy as is, you can at least look at its code for inspiration. iText is open source after all...
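
If you do end up concatenating the chunks yourself, the core idea can be sketched without any iText types. The `Chunk` record and `ChunkSearch.Find` below are hypothetical names, not iText API: a `Chunk` stands for whatever your listener records per `RENDER_TEXT` event (the text from `GetText()` plus the X position and width of the chunk's bounding rectangle). The sketch assumes that the chunks arrive in reading order and sit on one line, which a PDF does not guarantee, and it only gives chunk-level precision.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the data recorded per RENDER_TEXT event:
// the chunk's text plus the X position and width of its bounding rectangle.
public record Chunk(string Text, float X, float Width);

public static class ChunkSearch
{
    // Joins the chunks in the given order, searches the joined text, and returns
    // the horizontal extent (x, width) covered by every chunk the match touches.
    // Returns null if the text is not found.
    public static (float X, float Width)? Find(IReadOnlyList<Chunk> chunks, string needle)
    {
        if (string.IsNullOrEmpty(needle)) return null;

        var joined = new StringBuilder();
        var starts = new List<int>();   // starts[i] = offset of chunk i in the joined text
        foreach (var chunk in chunks)
        {
            starts.Add(joined.Length);
            joined.Append(chunk.Text);
        }

        int hit = joined.ToString().IndexOf(needle, StringComparison.Ordinal);
        if (hit < 0) return null;
        int end = hit + needle.Length;  // exclusive end of the match

        float left = float.MaxValue, right = float.MinValue;
        for (int i = 0; i < chunks.Count; i++)
        {
            int chunkStart = starts[i];
            int chunkEnd = chunkStart + chunks[i].Text.Length;
            // Skip chunks that do not overlap the match range [hit, end).
            if (chunkEnd <= hit || chunkStart >= end) continue;
            left = Math.Min(left, chunks[i].X);
            right = Math.Max(right, chunks[i].X + chunks[i].Width);
        }
        return (left, right - left);
    }
}
```

With the chunks `new Chunk("T", 72f, 6f)` and `new Chunk("est", 77.5f, 15f)`, searching for "Test" finds a match that spans both chunks, so the result covers them together. A real listener would also have to handle chunks on different lines and missing space characters between words, which is what the extraction strategies mentioned above take care of.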

Answer 2

Score: 1

Based on @mkl's answer, I put together this code, which uses RegexBasedLocationExtractionStrategy to extract the coordinates of the required text on the first page of the PDF. It works the same on PDFs generated by Word, Excel, or CorelDraw.

    using System.Linq;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;
    using iText.Kernel.Pdf.Canvas.Parser.Listener;

    // Returns { x, y, width, height } of the first location reported on page 1,
    // or { 0, 0, 0, 0 } if the text is not found.
    public static float[] ArrTextCoordinates(PdfDocument pdfDoc, string Text)
    {
        const int PageNo = 1;
        float[] TextCoordinates = { 0, 0, 0, 0 };
        // Text is interpreted as a regular expression.
        RegexBasedLocationExtractionStrategy extractionStrategy = new(Text);
        PdfCanvasProcessor parser = new(extractionStrategy);
        parser.ProcessPageContent(pdfDoc.GetPage(PageNo));
        IPdfTextLocation? location = extractionStrategy.GetResultantLocations().FirstOrDefault();
        if (location != null)
        {
            var rect = location.GetRectangle();
            TextCoordinates = new float[4]
            {
                rect.GetX(),
                rect.GetY(),
                rect.GetWidth(),
                rect.GetHeight()
            };
        }
        return TextCoordinates;
    }
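
One thing to watch for: the string passed to RegexBasedLocationExtractionStrategy is treated as a regular expression, so search text containing characters such as `(`, `.` or `+` may not match literally. A minimal sketch of a workaround, assuming you want a plain literal search, is to escape the text first. `SearchPattern.Literal` is a hypothetical helper name; `Regex.Escape` is the standard .NET method.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchPattern
{
    // Escapes regex metacharacters so the text is matched literally
    // when it is later compiled as a regular expression.
    public static string Literal(string text) => Regex.Escape(text);
}
```

You would then call the method above as `ArrTextCoordinates(pdfDoc, SearchPattern.Literal("Total (EUR)"))`. Without escaping, the parentheses would be read as a capture group and the pattern would look for "Total EUR" instead.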
