July 17, 2023, 19:03:27


How to avoid spurious spaces in words when reading PDF using iText for .NET

Question


Using iText7 (v8.0.0) I am attempting to parse a (non-PDF/A) PDF. The code is as follows:

using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

var pdfDocument = new PdfDocument(new PdfReader("TestFile.pdf"));

var page = pdfDocument.GetPage(1);

var text = PdfTextExtractor.GetTextFromPage(page);

All is well in terms of the output except that certain words in the output text have spurious spaces in them. I understand that this is to do with the way text is rendered into the PDF. The GetTextFromPage method has an overload that takes a text extraction strategy; I tried both the default strategy implementations LocationTextExtractionStrategy and SimpleTextExtractionStrategy but neither dealt with the issue.

I am guessing that I need to define my own text extraction strategy but it isn't very obvious how to go about doing this.

(In case readers are interested, I tried the same with IronPDF and that was no better.)

Answer 1

Score: 1


Although I have since decided not to use the iText library because of its licensing costs, I did manage to fix the issue, so I thought I'd share my findings. Aside from the many helpful comments above, I got the base information I needed from the answers to this question: https://stackoverflow.com/questions/16398483/how-can-we-extract-text-from-pdf-using-itextsharp-with-spaces

Here is the easiest way to address this issue. Add the following class:

using iText.Kernel.Pdf.Canvas.Parser.Listener;

internal class MyStrategy : LocationTextExtractionStrategy
{
    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        var chunkLocation = chunk.GetLocation();
        var previousChunkLocation = previousChunk.GetLocation();
        var chunkCharSpaceWidth = chunkLocation.GetCharSpaceWidth();

        float dist = chunkLocation.DistanceFromEndOf(previousChunkLocation);
        if (dist < -chunkCharSpaceWidth || dist > chunkCharSpaceWidth / 1.5f)
            return true;
        return false;
    }
}

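I found that a divisor of 1.5f gave the best results (the default is 2.0f); your mileage may vary. A smaller divisor raises the gap threshold, so fewer gaps between chunks are treated as word boundaries.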
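To make the effect of the divisor concrete, here is a minimal sketch of the same comparison as a plain function, with no iText dependency. The class name `WordBoundaryDemo`, the method `IsWordBoundary`, and the sample numbers are all made up for illustration; real gap and space widths depend on the font and the PDF.

```csharp
using System;

internal static class WordBoundaryDemo
{
    // Same comparison as in MyStrategy, with the divisor made a parameter.
    // dist: distance from the end of the previous chunk to the start of this one.
    // charSpaceWidth: width of a space character for the current chunk.
    internal static bool IsWordBoundary(float dist, float charSpaceWidth, float divisor)
    {
        return dist < -charSpaceWidth || dist > charSpaceWidth / divisor;
    }

    internal static void Main()
    {
        const float spaceWidth = 3.0f; // hypothetical space width
        const float gap = 1.8f;        // hypothetical gap between two chunks

        // Threshold is 3.0 / 2.0 = 1.5; the gap exceeds it, so a space is inserted.
        Console.WriteLine(IsWordBoundary(gap, spaceWidth, 2.0f)); // True

        // Threshold is 3.0 / 1.5 = 2.0; the gap does not exceed it, so no space.
        Console.WriteLine(IsWordBoundary(gap, spaceWidth, 1.5f)); // False
    }
}
```

With these sample values, a gap that the 2.0f divisor treats as a word break is no longer one at 1.5f, which is how the change suppresses the spurious spaces.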

Then pass the custom strategy to the extraction call:

var text = PdfTextExtractor.GetTextFromPage(page, new MyStrategy());
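If you need more than the first page, or want to experiment with different divisors, something along these lines may help. It is only a sketch: `TunableStrategy`, `PdfTextDump`, and `ExtractAllText` are hypothetical names, and the override is the same comparison as in `MyStrategy` with the divisor passed in. A new strategy instance is created for each page because a `LocationTextExtractionStrategy` accumulates the chunks it has seen.

```csharp
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical variant of MyStrategy with the divisor as a constructor argument.
internal class TunableStrategy : LocationTextExtractionStrategy
{
    private readonly float _divisor;

    internal TunableStrategy(float divisor)
    {
        _divisor = divisor;
    }

    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        var location = chunk.GetLocation();
        float spaceWidth = location.GetCharSpaceWidth();
        float dist = location.DistanceFromEndOf(previousChunk.GetLocation());
        return dist < -spaceWidth || dist > spaceWidth / _divisor;
    }
}

internal static class PdfTextDump
{
    // Extracts the text of every page, one line break between pages.
    internal static string ExtractAllText(string path, float divisor)
    {
        var pdfDocument = new PdfDocument(new PdfReader(path));
        try
        {
            var builder = new StringBuilder();
            for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
            {
                // Fresh strategy per page so chunks from earlier pages are not reused.
                var strategy = new TunableStrategy(divisor);
                builder.AppendLine(PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy));
            }
            return builder.ToString();
        }
        finally
        {
            // Release the underlying file handle even if extraction throws.
            pdfDocument.Close();
        }
    }
}
```

Calling `PdfTextDump.ExtractAllText("TestFile.pdf", 1.5f)` would then reproduce the setting described above, and trying a few other divisors against a page you know well is a quick way to find a value that suits a particular document.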

Note that if you need to do the same thing with the SimpleTextExtractionStrategy you pretty much have to copy and paste the entire original class, as it uses a bunch of private members which you don't have access to when inheriting.
