How to avoid spurious spaces in words when reading PDF using iText for .NET

Question


Using iText7 (v8.0.0) I am attempting to parse a (non-PDF/A) PDF. The code is as follows:

using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

var pdfDocument = new PdfDocument(new PdfReader("TestFile.pdf"));

var page = pdfDocument.GetPage(1);

var text = PdfTextExtractor.GetTextFromPage(page);

All is well in terms of the output except that certain words in the output text have spurious spaces in them. I understand that this is to do with the way text is rendered into the PDF. The GetTextFromPage method has an overload that takes a text extraction strategy; I tried both the default strategy implementations LocationTextExtractionStrategy and SimpleTextExtractionStrategy but neither dealt with the issue.

I am guessing that I need to define my own text extraction strategy but it isn't very obvious how to go about doing this.

(In case readers are interested, I tried the same with IronPDF and that was no better.)

Answer 1

Score: 1


Although I have since decided not to use the iText library because of the licensing costs, I did manage to fix the issue, so I thought I'd share my findings. Aside from the many helpful comments above, I got the base information I needed from the answers to this question: https://stackoverflow.com/questions/16398483/how-can-we-extract-text-from-pdf-using-itextsharp-with-spaces.

Here is the easiest way to address this issue. Add the following class:

using iText.Kernel.Pdf.Canvas.Parser.Listener;

internal class MyStrategy : LocationTextExtractionStrategy
{
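    // Decides whether a space is inserted between two consecutive text chunks.
    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        var chunkLocation = chunk.GetLocation();
        var previousChunkLocation = previousChunk.GetLocation();
        var chunkCharSpaceWidth = chunkLocation.GetCharSpaceWidth();

        // Gap between the end of the previous chunk and the start of this one.
        float dist = chunkLocation.DistanceFromEndOf(previousChunkLocation);

        // Word boundary if the chunk starts well before the previous one ends,
        // or if the gap exceeds a fraction of the space width.
        return dist < -chunkCharSpaceWidth || dist > chunkCharSpaceWidth / 1.5f;
    }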
}

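I found the value 1.5f gave the best results (the default is 2.0f); your mileage may vary. A smaller divisor raises the gap threshold, so fewer gaps between chunks are treated as word breaks.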
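To make the effect of the divisor concrete, here is a small stand-alone sketch of the same comparison with no iText dependency. The class name `BoundaryDemo`, the helper `IsWordBoundary`, and the numbers (a space width of 3.0 and a gap of 1.8) are made up for illustration; real values depend on the font and on how the PDF was produced.

```csharp
using System;

internal static class BoundaryDemo
{
    // Same test as in MyStrategy, with the divisor passed in as a parameter.
    public static bool IsWordBoundary(float dist, float charSpaceWidth, float divisor)
    {
        return dist < -charSpaceWidth || dist > charSpaceWidth / divisor;
    }

    private static void Main()
    {
        const float charSpaceWidth = 3.0f; // hypothetical width of a space character
        const float gap = 1.8f;            // hypothetical gap between two chunks of one word

        // Divisor 2.0: threshold is 3.0 / 2.0 = 1.5, and 1.8 > 1.5, so a space is inserted.
        Console.WriteLine(IsWordBoundary(gap, charSpaceWidth, 2.0f)); // True

        // Divisor 1.5: threshold is 3.0 / 1.5 = 2.0, and 1.8 > 2.0 fails, so no space.
        Console.WriteLine(IsWordBoundary(gap, charSpaceWidth, 1.5f)); // False
    }
}
```

Going too far in this direction has the opposite cost: genuine word gaps narrower than the threshold would be merged, so the value is worth tuning against the actual document.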

Then pass the custom strategy to the extraction call:

var text = PdfTextExtractor.GetTextFromPage(page, new MyStrategy());
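If you expect to experiment with the threshold, one possible variation is to take the divisor as a constructor argument and run the extraction over every page. This is only a sketch under a few assumptions: `TunableStrategy` is a hypothetical name, the file name is the one from the question, and a fresh strategy instance is created per page because `LocationTextExtractionStrategy` accumulates the chunks it receives.

```csharp
using System;
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical variant of MyStrategy with the divisor exposed for tuning.
internal class TunableStrategy : LocationTextExtractionStrategy
{
    private readonly float _divisor;

    public TunableStrategy(float divisor = 1.5f)
    {
        _divisor = divisor;
    }

    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        var location = chunk.GetLocation();
        float spaceWidth = location.GetCharSpaceWidth();
        float dist = location.DistanceFromEndOf(previousChunk.GetLocation());

        // Same comparison as MyStrategy, with the divisor configurable.
        return dist < -spaceWidth || dist > spaceWidth / _divisor;
    }
}

internal static class Program
{
    private static void Main()
    {
        // Disposing the document also closes the underlying reader.
        using var pdfDocument = new PdfDocument(new PdfReader("TestFile.pdf"));

        var text = new StringBuilder();
        for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
        {
            // New strategy per page so text from earlier pages is not repeated.
            var strategy = new TunableStrategy(1.5f);
            text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy));
        }

        Console.WriteLine(text);
    }
}
```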

Note that if you need to do the same thing with the SimpleTextExtractionStrategy you pretty much have to copy and paste the entire original class, as it uses a bunch of private members which you don't have access to when inheriting.

huangapple
  • Posted on July 17, 2023, 19:03:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76703810.html