How to read a PDF with line numbers in C#?
Question
Hi, I am trying to read a PDF document along with its line numbers.
In the image above I have shown the PDF. In the PDF file, each line has a line number. I want to save it in the database like this:
Line Number    Content
1              It is agreed on the date shown in Box 2 between the party named in Box 3 as
When I read the document, everything comes back as text, so how can I identify the line numbers and the content separately? Numbers may also appear in the content, so simply looking for digits does not identify the line number. Is there any way to identify the line numbers? Any help would be appreciated. Thanks.
Answer 1
Score: 0
Which library are you using to read the data from the PDF? I made an example using iTextSharp that you can refer to.
While reading the text line by line, use a regular expression to find the line number at the end of each line, and use TrimEnd() to remove it from the content:
    // Requires: using System.Text.RegularExpressions;
    //           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    public IActionResult Index()
    {
        PdfReader reader = new PdfReader(@"C:\Users\Administrator\Desktop\Test.pdf");
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, page, new LocationTextExtractionStrategy());
                foreach (string rawLine in text.Split('\n'))
                {
                    // Assumes the line number is the last token on the extracted line.
                    Match match = Regex.Match(rawLine, @"([0-9]+)\s*$");
                    if (!match.Success || !int.TryParse(match.Groups[1].Value, out int lineNumber))
                    {
                        continue; // no trailing number on this line
                    }

                    // Keep everything before the number and trim the whitespace left behind.
                    string content = rawLine.Substring(0, match.Groups[1].Index).TrimEnd();

                    // Add a new entity per line so that each line becomes its own row.
                    _context.PdfModel.Add(new PdfModel { Line = lineNumber, Content = content });
                }
            }
        }
        finally
        {
            reader.Close(); // release the file handle
        }

        // Save once after all lines have been queued.
        _context.SaveChanges();
        return View();
    }
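The question also points out that digits can appear inside the content, so a trailing number is not always a line number. One possible refinement, sketched below, is to accept a trailing number only when it equals the next expected line number. This assumes the line number is the last token on each extracted line and that numbering is consecutive; `NumberedLineParser`, `Parse`, and `firstNumber` are hypothetical names used for illustration and are not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberedLineParser
{
    // Assumption: the line number is the final run of digits on the line,
    // optionally followed by whitespace (for example a trailing '\r').
    private static readonly Regex TrailingNumber =
        new Regex(@"^(?<content>.*?)\s*(?<number>[0-9]+)\s*$");

    // Returns (line number, content) pairs for the lines whose trailing number
    // matches the expected consecutive sequence; other lines are skipped.
    public static List<(int Number, string Content)> Parse(
        IEnumerable<string> lines, int firstNumber = 1)
    {
        var result = new List<(int Number, string Content)>();
        int expected = firstNumber;

        foreach (string raw in lines)
        {
            Match m = TrailingNumber.Match(raw);

            // A trailing number counts as a line number only if it is the
            // next one in sequence; this filters out page numbers and
            // content that merely happens to end in digits.
            if (m.Success
                && int.TryParse(m.Groups["number"].Value, out int n)
                && n == expected)
            {
                result.Add((n, m.Groups["content"].Value.Trim()));
                expected++;
            }
        }

        return result;
    }
}
```

In the action above, the lines from `text.Split('\n')` could be passed to `Parse` and the returned pairs mapped to `PdfModel` rows. If the numbering restarts on every page, calling `Parse` once per page would be the simplest adjustment.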
Test result: [images: my PDF; my database]