Is it possible to get line numbers while extracting text from a PDF document?
Question
Is there any way to get text line by line from a PDF document, or to get line numbers, using any library and language?
I am already able to extract text page by page from a PDF document in Java using these three libraries: pdfbox, itext, and aspose-pdf.
Answer 1
Score: 1
Using PyMuPDF, this is the simplest way:

    import fitz  # PyMuPDF

    doc = fitz.open("input.pdf")
    for page in doc:
        i = 0  # line counter, restarted on every page
        blocks = page.get_text("blocks", sort=True)  # text organized in paragraphs
        for block in blocks:
            # block[4] holds the paragraph text; split it into its lines
            for line in block[4].splitlines():
                # note: page.number is 0-based, and so is i
                print(f"Page {page.number}, line {i}: '{line}'")
                i += 1

> Every block is a tuple of 4 boundary box coordinates, followed by the string comprising the text of the paragraph.
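When an extractor returns only positioned words rather than ready-made lines, line numbers can be derived by grouping words whose vertical positions nearly coincide. The following is a minimal sketch and not part of the original answer: the function name `group_words_into_lines`, the `tolerance` parameter, and the sample coordinates are hypothetical. It assumes each word is a tuple starting with `(x0, y0, x1, y1, text)` and that y grows downward, which is the layout PyMuPDF uses for `page.get_text("words")`.

```python
def group_words_into_lines(words, tolerance=3.0):
    """Group positioned words into numbered lines.

    words: iterable of tuples whose first five fields are
           (x0, y0, x1, y1, text); extra trailing fields are ignored.
    tolerance: maximum difference in y0 (in points) for two words to be
               treated as part of the same line. Hypothetical default.
    Returns a list of (line_number, line_text), numbered from 1 at the top.
    """
    # Sort top-to-bottom first, then left-to-right.
    ordered = sorted(words, key=lambda w: (w[1], w[0]))

    lines = []  # each entry: [reference_y, [(x0, text), ...]]
    for w in ordered:
        x0, y0, text = w[0], w[1], w[4]
        if lines and abs(y0 - lines[-1][0]) <= tolerance:
            lines[-1][1].append((x0, text))   # same line as the previous word
        else:
            lines.append([y0, [(x0, text)]])  # start a new line

    result = []
    for number, (_, items) in enumerate(lines, start=1):
        items.sort()  # restore left-to-right order inside the line
        result.append((number, " ".join(text for _, text in items)))
    return result


# Made-up sample data: two visual lines with slightly uneven baselines.
sample_words = [
    (72, 100.0, 100, 112.0, "Hello"),
    (105, 100.5, 140, 112.5, "world"),
    (72, 120.0, 110, 132.0, "second"),
    (115, 119.8, 140, 131.8, "line"),
]

for number, text in group_words_into_lines(sample_words):
    print(f"line {number}: {text}")
```

The tolerance is a judgment call: too small and a line with superscripts splits in two, too large and tightly spaced lines merge. Rotated text is not handled by this sketch.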
Answer 2
Score: 1
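PDF has no concept of line numbers, since text can be placed anywhere on the page, at any angle.
So which line is line 1 is simply a human perception: for most readers, line 1 is the topmost line on the page. For the PDF, however, that line can be the tenth one it writes or the last one, since the Cartesian coordinate system it uses runs from the bottom of the page to the top.
Anyway, to assign numbers to the lines of this PDF page from a Windows command prompt (find /v /n prints every line that does not contain the given string, prefixed with its line number, so a string that never occurs numbers every line):

    pdftotext -layout -f 1 -l 1 -enc UTF-8 "C:\Downloads\SO 76437736 LineNumbers.pdf" - | find /v /n "never2Bfound"
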
[1]Is it possible to get line no while extracting text from pdf doc?
[2]
[3]Asked today Modified today Viewed 42 times
[4]
[5] Is there an any way to get the text line by line from pdf document or get line no using any
[6] library and language. I'm able to get text from pdf document page by page using these 3 lib
[7]
[8] 2 pdfbox, itext, aspose-pdf in java.
[9]
[10] python java itext pdfbox
[11]
[12] Share Edit Follow Close Flag asked 20 hours ago
[13] Seriously
[14] 135 5
[15]
[16] the simples pdftotext output is pdftotext -layout which will usually give you lines one by one. now
[17] the problem with your question is what do you mean ? since you say your able to already get text. PDF
[18] does not use Line numbers, they are a human requirement only for input to a PDF. see
[19] stackoverflow.com/a/72778117/10802527 ÔÇô K J 6 hours ago
[20]
[21]1 Answer Sorted by:
[22] Reset to default
[23]
[24] Date created (oldest first)
[25]
[26] Using PyMuPDF, this is the simplest way:
[27]
[28]1 import fitz # PyMuPDF
[29]
[30] doc = fitz.open("input.pdf")
[31]
[32] for page in doc:
[33] i = 0
[34] blocks = page.get_text("blocks", sort=True) # text organized in paragraphs
[35] for block in blocks:
[36] for line in block[4].splitlines():
[37] print(f"Page {page.number}, line {i}: '{line}'")
[38] i += 1
[39]
[40] Every block is a tuple of 4 boundary box coordinates, followed by the string
[41] comprising the text of the paragraph.
[42]
[43] Share Edit Follow Flag edited 16 hours ago answered 19 hours ago
[44] Jorj McKie
[45] 1,897 1 13 16
[46]♀
To save the output to a file, just append a redirection: > SO-Q76437736-Page1.txt
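The same numbering can be reproduced without the Windows `find` command. The following is a rough sketch, not part of the original answer: the helper names `number_lines` and `pdf_page_lines` are hypothetical, and `pdf_page_lines` assumes the `pdftotext` executable is installed and on the PATH. It passes the same options as the command above.

```python
import subprocess


def number_lines(text):
    """Prefix every line, blank ones included, with [n], counting from 1."""
    lines = text.replace("\r\n", "\n").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing newline does not start another line
    return [f"[{n}]{line}" for n, line in enumerate(lines, start=1)]


def pdf_page_lines(pdf_path, page=1):
    """Extract one page with pdftotext -layout and number its lines.

    Assumes pdftotext is available on the PATH; "-" sends text to stdout.
    """
    completed = subprocess.run(
        ["pdftotext", "-layout", "-f", str(page), "-l", str(page),
         "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return number_lines(completed.stdout.decode("utf-8", errors="replace"))


# Demonstration on a plain string, so no PDF or external tool is needed.
for numbered in number_lines("first line\n\nthird line\n"):
    print(numbered)
```

As with the command-line version, these numbers only reflect the order in which `pdftotext -layout` emits text; they are not stored in the PDF itself.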