Why are the contents of my pages getting mixed up when reading PDF?

Question


I am using iText7 to read the text from a PDF file. This works fine for the first page. After that, the contents of the pages somehow get mixed up: at page 3 of the document I get lines that contain content from both page 1 and page 3. The text of page 2 shows exactly the same lines as page 1 (although in reality the two pages are completely different).

  • Page 1, real: ~36 lines, result 36 lines -> GREAT
  • Page 2, real: >50 lines, result 36 lines (==Page 1)
  • Page 3, real: ~16 lines, result 47 lines (adds and mixes with lines of page 1)

https://www.dropbox.com/s/63gy5cg1othy6ci/Dividenden_Microsoft.pdf?dl=0

For reading the document I use the following code:

using System;
using System.Collections.Generic;
using System.Linq;

namespace StockMarket
{
    class PdfReader
    {
        /// <summary>
        /// Reads PDF file by a given path.
        /// </summary>
        /// <param name="path">The path to the file</param>
        /// <param name="pageCount">The number of pages to read (0=all, 1 by default) </param>
        /// <returns></returns>
        public static DocumentTree PdfToText(string path, int pageCount=1 )
        {
            var pages = new DocumentTree();
            using (iText.Kernel.Pdf.PdfReader reader = new iText.Kernel.Pdf.PdfReader(path))
            {
                using (iText.Kernel.Pdf.PdfDocument pdfDocument = new iText.Kernel.Pdf.PdfDocument(reader))
                {
                    var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();

                    // set up pages to read
                    int pagesToRead = 1;
                    if (pageCount > 0)
                    {
                        pagesToRead = pageCount;
                    }
                    if (pagesToRead > pdfDocument.GetNumberOfPages() || pageCount==0)
                    {
                        pagesToRead = pdfDocument.GetNumberOfPages();
                    }

                    // for each page to read...
                    for (int i = 1; i <= pagesToRead; ++i)
                    {
                        // get the page and save it
                        var page = pdfDocument.GetPage(i);
                        var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy);
                        pages.Add(txt);
                    }
                    pdfDocument.Close();
                    reader.Close();
                }
            }
            return pages;
        }

    }
    
    /// <summary>
    /// A class representing parts of a PDF document.
    /// </summary>
    class DocumentTree
    {
        /// <summary>
        /// Constructor
        /// </summary>
        public DocumentTree()
        {
            Pages = new List<DocumentPage>();
        }

        private List<DocumentPage> _pages;
        /// <summary>
        /// The pages of the document
        /// </summary>
        public List<DocumentPage> Pages
        {
            get { return _pages; }
            set { _pages = value; }
        }

        /// <summary>
        /// Adds a <see cref="DocumentPage"/> to the document.
        /// </summary>
        /// <param name="page">The text of the <see cref="DocumentPage"/>.</param>
        public void Add(string page)
        {
            Pages.Add(new DocumentPage(page));
        }
    }

    /// <summary>
    /// A class representing a single page of a document
    /// </summary>
    class DocumentPage
    {
        /// <summary>
        /// Constructor
        /// </summary>
        /// <param name="pageContent">The pages content as text</param>
        public DocumentPage(string pageContent)
        {
            // set the content to the input
            CompletePage = pageContent;

            // split the content by lines
            var splitter = new string[] { "\n" };
            foreach (var line in CompletePage.Split(splitter, StringSplitOptions.None))
            {
                // add lines to the page if the content is not empty
                if (!string.IsNullOrWhiteSpace(line))
                {                    
                    _lines.Add(new Line(line));
                }
            }

        }

        private List<Line> _lines = new List<Line>();
        /// <summary>
        /// The lines of text of the <see cref="DocumentPage"/>
        /// </summary>
        public List<Line> Lines
        {
            get
            {
                return _lines;
            }            
        }

        /// <summary>
        /// The text of the complete <see cref="DocumentPage"/>.
        /// </summary>
        private string CompletePage;
    }

    /// <summary>
    /// A class representing a single line of text
    /// </summary>
    class Line
    {
        /// <summary>
        /// Constructor
        /// </summary>
        public Line(string lineContent)
        {
            CompleteLine = lineContent;
        }

        /// <summary>
        /// The words of the <see cref="Line"/>.
        /// </summary>
        public List<string> Words
        {
            get
            {
                return CompleteLine.Split(" ".ToArray()).Where((word)=> { return !string.IsNullOrWhiteSpace(word); }).ToList();
            }
        }

        /// <summary>
        /// The complete text of the <see cref="Line"/>.
        /// </summary>
        private string CompleteLine;

        public override string ToString()
        {
            return CompleteLine;
        }
    }
}

The page tree is a simple tree of pages: each page consists of lines (the page text split by "\n") and each line consists of words (the line split by " "). However, the txt variable in the loop already contains the mixed-up content, so my tree is not what causes the issue.

Thanks for your help.

Answer 1

Score: 2


Some parsing event listeners, in particular most text extraction strategies, are not meant to be reused across multiple pages. Instead, you should create a new instance for each page.

As a rule of thumb, any listener that collects information while a page is parsed and afterwards lets you access that data (as text extraction strategies let you access the collected page text) most likely has to be instantiated separately for each page, unless you want the data from all pages to accumulate.
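
The accumulation effect can be reproduced without iText. The following is only a toy model, not iText's implementation: ToyLocationStrategy, ToyExtractor and Demo are hypothetical names, and a "page" is reduced to an array of (y position, text) chunks. It assumes a listener that keeps every chunk it has ever seen and groups chunks with the same y position into one line, which is enough to show lines from different pages merging.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for a text extraction strategy (hypothetical, not iText code):
// it stores every chunk it receives and never clears them, so its state
// survives from one page to the next.
class ToyLocationStrategy
{
    private readonly List<(int Y, string Text)> _chunks = new List<(int Y, string Text)>();

    public void EventOccurred(int y, string text) => _chunks.Add((y, text));

    // Chunks are ordered top to bottom; chunks sharing a y value form one line.
    public string GetResultantText() =>
        string.Join("\n", _chunks
            .GroupBy(c => c.Y)
            .OrderByDescending(g => g.Key)
            .Select(g => string.Join(" ", g.Select(c => c.Text))));
}

static class ToyExtractor
{
    // Feeds all chunks of one page to the listener and returns the listener's text.
    public static string GetTextFromPage((int Y, string Text)[] page, ToyLocationStrategy strategy)
    {
        foreach (var chunk in page)
        {
            strategy.EventOccurred(chunk.Y, chunk.Text);
        }
        return strategy.GetResultantText();
    }
}

static class Demo
{
    // One listener for all pages: later results also contain earlier pages.
    public static List<string> ExtractShared((int Y, string Text)[][] pages)
    {
        var strategy = new ToyLocationStrategy();
        return pages.Select(p => ToyExtractor.GetTextFromPage(p, strategy)).ToList();
    }

    // A new listener per page: each result contains only that page.
    public static List<string> ExtractFresh((int Y, string Text)[][] pages)
    {
        return pages.Select(p => ToyExtractor.GetTextFromPage(p, new ToyLocationStrategy())).ToList();
    }

    static void Main()
    {
        var pages = new[]
        {
            new[] { (700, "Page1-Title"), (680, "Page1-Body") },
            new[] { (700, "Page2-Title"), (650, "Page2-Body") }
        };

        Console.WriteLine("Shared listener, page 2:");
        Console.WriteLine(ExtractShared(pages)[1]);

        Console.WriteLine("Fresh listener, page 2:");
        Console.WriteLine(ExtractFresh(pages)[1]);
    }
}
```

With the shared listener, the text returned for the second page starts with a line that combines both titles, because both chunks sit at y = 700. With a fresh listener per page, only the second page's own lines come back.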

Thus, in your code move the strategy instantiation

var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();

into the for loop:

// for each page to read...
for (int i = 1; i <= pagesToRead; ++i)
{
    var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();
    // get the page and save it
    var page = pdfDocument.GetPage(i);
    var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy);
    pages.Add(txt);
}

Alternatively, you can shorten the loop to:

// for each page to read...
for (int i = 1; i <= pagesToRead; ++i)
{
    // get the page and save it
    var page = pdfDocument.GetPage(i);
    var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page);
    pages.Add(txt);
}

This PdfTextExtractor.GetTextFromPage overload creates a new LocationTextExtractionStrategy instance each time internally.

Posted by huangapple on January 7, 2020 at 00:09:45
Source: https://go.coder-hub.com/59615341.html