How to read a PDF with line numbers in C#?

Question

Hi, I am trying to read a PDF document along with its line numbers.

[Image: screenshot of the PDF document, with a line number on each line]

The image above shows the PDF. Each line in the PDF file has a line number. I want to save the lines in a database like this:

Line Number  Content
1            It is agreed on the date shown in Box 2 between the party named in Box 3 as

When I read the document, everything comes back as plain text. How can I identify the line numbers and the content separately? Numbers may also appear in the content, so simply looking for digits does not identify the line number. Is there any way to identify the line numbers? Any help would be appreciated. Thanks.

Answer 1

Score: 0

Which library are you using to read the data from the PDF? I made an example using iTextSharp that you can refer to.

While reading the text line by line, use a regular expression to capture the last number on each line (the line number), then remove it and tidy the remaining text with TrimEnd():

using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public IActionResult Index()
{
    PdfReader reader = new PdfReader(@"C:\Users\Administrator\Desktop\Test.pdf");
    try
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            string text = PdfTextExtractor.GetTextFromPage(reader, page, new LocationTextExtractionStrategy());

            foreach (string rawLine in text.Split('\n'))
            {
                string line = rawLine.Trim();

                // Only accept a number that sits at the very end of the line,
                // so numbers inside the content (e.g. "Box 2") are ignored.
                Match match = Regex.Match(line, @"([0-9]+)$");
                if (!match.Success)
                {
                    continue;
                }

                // Create a new entity for every line; re-adding one shared
                // instance does not produce one row per line.
                _context.PdfModel.Add(new PdfModel
                {
                    Line = int.Parse(match.Groups[1].Value),
                    // Cut off the number, then drop the whitespace left before it.
                    Content = line.Substring(0, match.Index).TrimEnd()
                });
            }
        }

        // Save once after all lines have been added.
        _context.SaveChanges();
    }
    finally
    {
        reader.Close();
    }
    return View();
}
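
The parsing step can be tried on its own, without a PDF or a database. The following is a minimal sketch, assuming every extracted line ends with its line number; `LineNumberParser` and `TrySplit` are hypothetical names introduced here for illustration and are not part of iTextSharp.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of iTextSharp): splits one extracted text line
// into its content and the line number assumed to sit at the end of the line.
public static class LineNumberParser
{
    // A run of digits anchored at the end of the line.
    private static readonly Regex TrailingNumber = new Regex(@"[0-9]+$");

    public static bool TrySplit(string rawLine, out int lineNumber, out string content)
    {
        lineNumber = 0;
        content = string.Empty;
        if (rawLine == null)
        {
            return false;
        }

        string line = rawLine.Trim();
        Match match = TrailingNumber.Match(line);

        // int.TryParse also rejects digit runs that do not fit in an int.
        if (!match.Success || !int.TryParse(match.Value, out lineNumber))
        {
            return false;
        }

        // Everything before the number is content; drop the separating whitespace.
        content = line.Substring(0, match.Index).TrimEnd();
        return true;
    }
}
```

For the line "the party named in Box 3 as 1", this gives line number 1 and leaves the 3 inside the content untouched. A line such as "dated 5 May" is rejected because its last token is not a number. The assumption cuts both ways: a content line that happens to end in a number but carries no line number would be misread, so the position of the number in the extracted text is worth checking against the actual PDF.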

Test result:

My PDF:

[Image: the test PDF]

My database:

[Image: the resulting database rows]

huangapple
  • Posted on 2023-06-05 19:45:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76406085.html