How to avoid spurious spaces in words when reading PDF using iText for .NET
Question
Using iText7 (v8.0.0) I am attempting to parse a (non-PDF/A) PDF. The code is as follows:
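using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

// "using var" disposes the document (and its reader) when it goes out of scope.
using var pdfDocument = new PdfDocument(new PdfReader("TestFile.pdf"));
var page = pdfDocument.GetPage(1);
// Extract the text of page 1 with the default strategy.
var text = PdfTextExtractor.GetTextFromPage(page);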
The output is fine except that certain words contain spurious spaces. I understand that this is to do with the way text is rendered into the PDF. The GetTextFromPage method has an overload that takes a text extraction strategy; I tried both of the default strategy implementations, LocationTextExtractionStrategy and SimpleTextExtractionStrategy, but neither dealt with the issue.
I am guessing that I need to define my own text extraction strategy but it isn't very obvious how to go about doing this.
(In case readers are interested, I tried the same with IronPDF and that was no better.)
Answer 1
Score: 1
Although I subsequently decided not to use the iText library because of the licensing costs, I did manage to fix the issue, so I thought I'd share my findings. Aside from the many helpful comments above, I got the base information I needed from answers to this question: https://stackoverflow.com/questions/16398483/how-can-we-extract-text-from-pdf-using-itextsharp-with-spaces.
Here is the easiest way to address this issue. Add the following class:
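using iText.Kernel.Pdf.Canvas.Parser.Listener;

internal class MyStrategy : LocationTextExtractionStrategy
{
    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        var chunkLocation = chunk.GetLocation();
        var previousChunkLocation = previousChunk.GetLocation();
        // Width of a space character for the current chunk.
        var chunkCharSpaceWidth = chunkLocation.GetCharSpaceWidth();
        // Gap between the end of the previous chunk and the start of this one.
        float dist = chunkLocation.DistanceFromEndOf(previousChunkLocation);

        // Treat the chunk as starting a new word if it begins more than one space
        // width before the previous chunk ends, or if the gap exceeds the space
        // width divided by 1.5.
        return dist < -chunkCharSpaceWidth || dist > chunkCharSpaceWidth / 1.5f;
    }
}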
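I found that the value 1.5f gave the best results (the default is 2.0f); your mileage may vary. A smaller divisor raises the gap threshold, so fewer gaps between chunks are treated as word boundaries.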
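To see why the divisor matters, the boundary test can be tried in isolation, without iText. The following is only a minimal sketch: IsWordBoundary and its divisor parameter are hypothetical names introduced for illustration, and the sample gap and space width are made-up numbers, not values measured from a real PDF.

```csharp
using System;

// Hypothetical stand-alone copy of the comparison used in MyStrategy.
// dist: gap from the end of the previous chunk to the start of the current one.
// charSpaceWidth: width of a space character for the current chunk.
// divisor: 2.0f is the default quoted in the answer, 1.5f the value used in MyStrategy.
static bool IsWordBoundary(float dist, float charSpaceWidth, float divisor)
{
    return dist < -charSpaceWidth || dist > charSpaceWidth / divisor;
}

// An assumed gap of 0.6 space widths, e.g. loose letter spacing inside a word.
float gap = 0.6f;
float spaceWidth = 1.0f;

Console.WriteLine(IsWordBoundary(gap, spaceWidth, 2.0f)); // True: 0.6 > 0.5, so a space is inserted
Console.WriteLine(IsWordBoundary(gap, spaceWidth, 1.5f)); // False: 0.6 < 0.667, so no space is inserted
```

With these numbers the same gap produces a spurious space under a divisor of 2.0 but not under 1.5, which is the effect the tuned value relies on. Going much lower risks merging words that really are separate, so the value is worth checking against your own documents.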
The custom strategy is then passed to the extraction call:
var text = PdfTextExtractor.GetTextFromPage(page, new MyStrategy());
Note that if you need to do the same thing with SimpleTextExtractionStrategy, you pretty much have to copy and paste the entire original class, as it uses a bunch of private members which you don't have access to when inheriting.