Question

Hi, I am trying to read a PDF document along with its line numbers.

The image above shows the PDF. Each line in the PDF file has a line number. I want to save the lines to the database like this:

Line Number  Content
1            It is agreed on the date shown in Box 2 between the party named in Box 3 as

When I read the document, everything comes back as text. How can I identify the line numbers and the content separately? Numbers may also appear in the content, so simply looking for digits does not help identify the line number. Is there any way to identify the line numbers? Any help would be appreciated. Thanks.

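Answer 1

Score: 0

Which library are you using to read the data from the PDF? I made an example using iTextSharp that you can refer to.

While reading the data line by line, use a regular expression to find the number at the end of each line, then remove it from the line to get the content: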

// Required at the top of the controller file:
// using System.Text.RegularExpressions;
// using iTextSharp.text.pdf;
// using iTextSharp.text.pdf.parser;

public IActionResult Index()
{
    PdfReader reader = new PdfReader(@"C:\Users\Administrator\Desktop\Test.pdf");
    try
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // Extract the page text; each '\n'-separated entry is one line.
            string text = PdfTextExtractor.GetTextFromPage(reader, page, new LocationTextExtractionStrategy());

            foreach (string rawLine in text.Split('\n'))
            {
                // Assumes the line number is the last token on each extracted line.
                // Group 1 is the content, group 2 is the trailing number.
                Match match = Regex.Match(rawLine, @"^(.*?)\s*([0-9]+)\s*$");
                if (!match.Success || !int.TryParse(match.Groups[2].Value, out int lineNumber))
                {
                    continue;
                }

                // Create a new entity for each line so every line becomes its own row.
                _context.PdfModel.Add(new PdfModel
                {
                    Line = lineNumber,
                    Content = match.Groups[1].Value.Trim()
                });
            }
        }

        // Save once after all lines have been added.
        _context.SaveChanges();
    }
    finally
    {
        reader.Close();
    }
    return View();
}
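Because digits can also appear inside the content, it may help to isolate the parsing step so it can be checked without a PDF or a database. The sketch below is not part of the original answer: `LineParser`, `TrySplit`, and `ParseSequential` are hypothetical names. It assumes, as the code above does, that the line number is the last token on each extracted line. It also assumes the numbering runs 1, 2, 3, ... without gaps, and accepts a trailing number as a line number only when it continues that sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineParser
{
    // Content (lazy), optional whitespace, then a run of digits at the very end of the line.
    private static readonly Regex TrailingNumber = new Regex(@"^(.*?)\s*([0-9]+)\s*$");

    // Splits "some content 12" into lineNumber = 12 and content = "some content".
    // Returns false when the line does not end in a number that fits in an int.
    public static bool TrySplit(string rawLine, out int lineNumber, out string content)
    {
        lineNumber = 0;
        content = rawLine.Trim();

        Match match = TrailingNumber.Match(rawLine);
        if (!match.Success || !int.TryParse(match.Groups[2].Value, out lineNumber))
        {
            return false;
        }

        content = match.Groups[1].Value.Trim();
        return true;
    }

    // Keeps only lines whose trailing number continues the sequence 1, 2, 3, ...
    // This is an assumption about the numbering; adjust it if numbering restarts on each page.
    public static List<KeyValuePair<int, string>> ParseSequential(IEnumerable<string> rawLines)
    {
        var rows = new List<KeyValuePair<int, string>>();
        int expected = 1;
        foreach (string rawLine in rawLines)
        {
            if (TrySplit(rawLine, out int number, out string content) && number == expected)
            {
                rows.Add(new KeyValuePair<int, string>(number, content));
                expected++;
            }
        }
        return rows;
    }
}
```

Lines that fail the sequence check are skipped in this sketch. Depending on the document, it may be preferable to append them to the previous row's content instead.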

Test result:

[Image: my PDF]

[Image: my database]
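The answer does not show the `PdfModel` type behind those database rows. A minimal sketch, assuming an Entity Framework style entity: only `Line` and `Content` appear in the code above, and the `Id` key is an assumption added for illustration.

```csharp
// Hypothetical entity matching the two columns the question asks for.
public class PdfModel
{
    // Assumed surrogate key; not shown in the original answer.
    public int Id { get; set; }

    // Line number read from the PDF.
    public int Line { get; set; }

    // Text of the line with the line number removed.
    public string Content { get; set; }
}
```

The controller code refers to `_context.PdfModel`, so the context class presumably exposes a set of these entities under that name.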

