January 7, 2020 00:09:45

Why are the contents of my pages getting mixed up when reading a PDF?

Question

I am using iText7 to read the text from a PDF file. This works fine for the first page. After that, the contents of the pages somehow get mixed up: on page 3 of the document I get lines that contain content from both page 1 and page 3. The text of page 2 shows exactly the same lines as page 1, although in reality the two pages are completely different.

Page 1, real: ~36 lines, result 36 lines -> GREAT
Page 2, real: >50 lines, result 36 lines (==Page 1)
Page 3, real: ~16 lines, result 47 lines (adds and mixes with lines of page 1)

https://www.dropbox.com/s/63gy5cg1othy6ci/Dividenden_Microsoft.pdf?dl=0

For reading the document I use the following code:

using System;
using System.Collections.Generic;
using System.Linq;

namespace StockMarket
{
    class PdfReader
    {
        /// <summary>
        /// Reads PDF file by a given path.
        /// </summary>
        /// <param name="path">The path to the file</param>
        /// <param name="pageCount">The number of pages to read (0=all, 1 by default)</param>
        /// <returns></returns>
        public static DocumentTree PdfToText(string path, int pageCount=1 )
        {
            var pages = new DocumentTree();
            using (iText.Kernel.Pdf.PdfReader reader = new iText.Kernel.Pdf.PdfReader(path))
            {
                using (iText.Kernel.Pdf.PdfDocument pdfDocument = new iText.Kernel.Pdf.PdfDocument(reader))
                {
                    var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();

                    // set up pages to read
                    int pagesToRead = 1;
                    if (pageCount > 0)
                    {
                        pagesToRead = pageCount;
                    }
                    if (pagesToRead > pdfDocument.GetNumberOfPages() || pageCount==0)
                    {
                        pagesToRead = pdfDocument.GetNumberOfPages();
                    }

                    // for each page to read...
                    for (int i = 1; i <= pagesToRead; ++i)
                    {
                        // get the page and save it
                        var page = pdfDocument.GetPage(i);
                        var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy);
                        pages.Add(txt);
                    }
                    pdfDocument.Close();
                    reader.Close();
                }
            }
            return pages;
        }

    }
    
    /// <summary>
    /// A class representing parts of a PDF document.
    /// </summary>
    class DocumentTree
    {
        /// <summary>
        /// Constructor
        /// </summary>
        public DocumentTree()
        {
            Pages = new List<DocumentPage>();
        }

        private List<DocumentPage> _pages;
        /// <summary>
        /// The pages of the document
        /// </summary>
        public List<DocumentPage> Pages
        {
            get { return _pages; }
            set { _pages = value; }
        }

        /// <summary>
        /// Adds a <see cref="DocumentPage"/> to the document.
        /// </summary>
        /// <param name="page">The text of the <see cref="DocumentPage"/>.</param>
        public void Add(string page)
        {
            Pages.Add(new DocumentPage(page));
        }
    }

    /// <summary>
    /// A class representing a single page of a document
    /// </summary>
    class DocumentPage
    {
        /// <summary>
        /// Constructor
        /// </summary>
        /// <param name="pageContent">The pages content as text</param>
        public DocumentPage(string pageContent)
        {
            // set the content to the input
            CompletePage = pageContent;

            // split the content by lines
            var splitter = new string[] { "\n" };
            foreach (var line in CompletePage.Split(splitter, StringSplitOptions.None))
            {
                // add lines to the page if the content is not empty
                if (!string.IsNullOrWhiteSpace(line))
                {                    
                    _lines.Add(new Line(line));
                }
            }

        }

        private List<Line> _lines = new List<Line>();
        /// <summary>
        /// The lines of text of the <see cref="DocumentPage"/>
        /// </summary>
        public List<Line> Lines
        {
            get
            {
                return _lines;
            }            
        }

        /// <summary>
        /// The text of the complete <see cref="DocumentPage"/>.
        /// </summary>
        private string CompletePage;
    }

    /// <summary>
    /// A class representing a single line of text
    /// </summary>
    class Line
    {
        /// <summary>
        /// Constructor
        /// </summary>
        public Line(string lineContent)
        {
            CompleteLine = lineContent;
        }

        /// <summary>
        /// The words of the <see cref="Line"/>.
        /// </summary>
        public List<string> Words
        {
            get
            {
                return CompleteLine.Split(" ".ToArray()).Where((word)=> { return !string.IsNullOrWhiteSpace(word); }).ToList();
            }
        }

        /// <summary>
        /// The complete text of the <see cref="Line"/>.
        /// </summary>
        private string CompleteLine;

        public override string ToString()
        {
            return CompleteLine;
        }
    }
}

The page tree is a simple tree: pages consist of lines (the page text split by "\n"), and lines consist of words (the line split by " "). However, the txt variable in the loop already contains the mixed-up content, so my tree is not what causes the issue.

Thanks for your help.

Answer 1

Score: 2

Some parsing event listeners, in particular most text extraction strategies, are not meant to be reused across multiple pages. Instead, you should create a new instance for each page.

As a rule of thumb, any listener that collects information while a page is parsed and afterwards lets you access that data (just as text extraction strategies let you access the collected page text) most likely must be instantiated separately for each page if you don't want the data from all pages to accumulate.
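
The accumulation effect can be illustrated without iText at all. The following is only a minimal sketch: CollectingListener, ParsePage and CountLines are hypothetical stand-ins invented for this illustration, not iText classes or methods, and the real LocationTextExtractionStrategy additionally sorts chunks by position, so the exact symptoms in a real document can differ from plain appending.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a text extraction strategy (NOT an iText class):
// it keeps every chunk it is handed and never clears its buffer.
class CollectingListener
{
    private readonly StringBuilder _collected = new StringBuilder();

    public void Collect(string chunk)
    {
        _collected.Append(chunk).Append('\n');
    }

    public string GetText()
    {
        return _collected.ToString();
    }
}

static class Demo
{
    // Stand-in for parsing one page: feeds the page's chunks to the listener
    // and returns everything the listener has collected so far.
    public static string ParsePage(string[] pageChunks, CollectingListener listener)
    {
        foreach (var chunk in pageChunks)
        {
            listener.Collect(chunk);
        }
        return listener.GetText();
    }

    // Counts the non-empty lines in a text.
    public static int CountLines(string text)
    {
        return text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    static void Main()
    {
        var pages = new[]
        {
            new[] { "page 1, line A", "page 1, line B" },
            new[] { "page 2, line A" },
        };

        // One shared listener: the second result still contains page 1's lines.
        var shared = new CollectingListener();
        foreach (var page in pages)
        {
            Console.WriteLine(CountLines(ParsePage(page, shared))); // prints 2, then 3
        }

        // A fresh listener per page: each result contains only that page.
        foreach (var page in pages)
        {
            Console.WriteLine(CountLines(ParsePage(page, new CollectingListener()))); // prints 2, then 1
        }
    }
}
```

With the shared listener, the "page 2" result reports three lines although the page has only one, which is the same kind of cross-page contamination described in the question.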

Thus, in your code move the strategy instantiation

var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();

into the for loop:

// for each page to read...
for (int i = 1; i <= pagesToRead; ++i)
{
    var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();
    // get the page and save it
    var page = pdfDocument.GetPage(i);
    var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy);
    pages.Add(txt);
}

Alternatively, you can shorten the loop to:

// for each page to read...
for (int i = 1; i <= pagesToRead; ++i)
{
    // get the page and save it
    var page = pdfDocument.GetPage(i);
    var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page);
    pages.Add(txt);
}

This PdfTextExtractor.GetTextFromPage overload internally creates a new LocationTextExtractionStrategy instance on each call.
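
Putting the pieces together, a corrected reading method might look roughly like the sketch below. It uses only the iText calls that already appear in the question. The class name PdfTextReader and the helper ResolvePagesToRead are hypothetical names introduced here, and the method returns a plain List<string> instead of the question's DocumentTree so that the snippet stands alone (it still assumes the iText 7 package is referenced).

```csharp
using System;
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

// Hypothetical class name, chosen so it does not clash with iText's PdfReader.
static class PdfTextReader
{
    // Mirrors the page-count rules from the question:
    // 0 means all pages, values below 0 fall back to 1,
    // and the result never exceeds the number of pages in the document.
    public static int ResolvePagesToRead(int pageCount, int totalPages)
    {
        if (pageCount == 0)
        {
            return totalPages;
        }
        int requested = pageCount > 0 ? pageCount : 1;
        return Math.Min(requested, totalPages);
    }

    // Returns the text of each page that was read, one list entry per page.
    public static List<string> ReadPages(string path, int pageCount = 1)
    {
        var texts = new List<string>();

        // The using blocks dispose the reader and the document when done.
        using (var reader = new PdfReader(path))
        using (var pdfDocument = new PdfDocument(reader))
        {
            int pagesToRead = ResolvePagesToRead(pageCount, pdfDocument.GetNumberOfPages());

            for (int i = 1; i <= pagesToRead; ++i)
            {
                // No strategy argument: a fresh strategy is used for every page.
                var page = pdfDocument.GetPage(i);
                texts.Add(PdfTextExtractor.GetTextFromPage(page));
            }
        }
        return texts;
    }
}
```

Each returned string could then be passed to DocumentTree.Add exactly as in the original code.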
