Is it possible to get line numbers while extracting text from a PDF document?

Question

Is there any way to get the text line by line from a PDF document, or to get line numbers, using any library and language? I'm able to get text from a PDF document page by page using these 3 libraries in Java: pdfbox, itext, and aspose-pdf.

Answer 1

Score: 1


Using PyMuPDF, this is the simplest way:

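import fitz  # PyMuPDF

doc = fitz.open("input.pdf")

for page in doc:
    i = 0  # line counter, restarts on every page (0-based, like page.number)
    blocks = page.get_text("blocks", sort=True)  # text organized in paragraphs
    for block in blocks:
        if block[6] != 0:  # block type 0 is text; skip image blocks
            continue
        for line in block[4].splitlines():
            print(f"Page {page.number}, line {i}: '{line}'")
            i += 1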

> Every block is a tuple that starts with the 4 bounding-box coordinates, followed by the string comprising the text of the paragraph, then the block number and the block type.
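The numbering loop above can also be factored into a small helper that works on the block tuples alone, which makes it easy to try without a PDF at hand. This is only a sketch: `numbered_lines` and `sample_blocks` are hypothetical names, the sample tuples are hand-made stand-ins for what `page.get_text("blocks", sort=True)` returns, and the assumed tuple layout is `(x0, y0, x1, y1, text, block_no, block_type)`. Unlike the snippet above, it counts from 1.

```python
from typing import Iterable, Iterator, Tuple

# Assumed block layout: (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def numbered_lines(blocks: Iterable[Block], start: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, text) for every text line in the given blocks."""
    n = start
    for block in blocks:
        if block[6] != 0:  # assumed: type 0 is text, anything else is skipped
            continue
        for line in block[4].splitlines():
            yield n, line
            n += 1


# Hand-made sample data standing in for one page's blocks (hypothetical values).
sample_blocks = [
    (72.0, 72.0, 300.0, 100.0, "First line\nSecond line\n", 0, 0),
    (72.0, 110.0, 300.0, 200.0, "<image placeholder>", 1, 1),
    (72.0, 210.0, 300.0, 230.0, "Third line\n", 2, 0),
]

for number, text in numbered_lines(sample_blocks):
    print(f"line {number}: {text!r}")
```

With a real document, the same helper could be fed `page.get_text("blocks", sort=True)` for each page. Whether the counter restarts per page or runs across the whole document is a design choice left to the caller via `start`.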

Answer 2

Score: 1


PDF has no concept of line numbers, since text can be placed at any angle.

So which line is line 1 is simply a human perception: for most people, line 1 is the topmost line on a page. For a PDF, however, that line can be the tenth one it writes, or the last one, since the Cartesian coordinate system it uses runs from the bottom of the page to the top.
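To make the point about write order versus visual order concrete, here is a minimal sketch of how a reader might turn positioned text fragments into "human" lines. It is a simplified heuristic, not what pdftotext or any particular library does internally. The `Fragment` tuple, the `visual_lines` function, the tolerance value, and the sample coordinates are all hypothetical, and rotated text is ignored.

```python
from typing import List, Tuple

# (x, y, text): y is measured upward from the bottom of the page.
Fragment = Tuple[float, float, str]


def visual_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines, ordered top to bottom, then left to right."""
    # A larger y means higher on the page, so sort by descending y first.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    lines: List[List[Fragment]] = []
    for frag in ordered:
        # Same line if the baseline is close to the first fragment of the current line.
        if lines and abs(lines[-1][0][1] - frag[1]) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within one line, read fragments from left to right.
    return [" ".join(f[2] for f in sorted(line, key=lambda f: f[0])) for line in lines]


# The file may write the footer first and the heading later.
written_order = [
    (72.0, 50.0, "Page footer"),
    (200.0, 700.5, "World"),
    (72.0, 700.0, "Hello"),
    (72.0, 680.0, "Second line"),
]

for number, text in enumerate(visual_lines(written_order), start=1):
    print(f"[{number}]{text}")
```

The line numbers here exist only after sorting by position, which is why two tools can disagree about which line is "line 1" on the same page.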

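Anyway, to assign numbers to the lines of this PDF page (find /v /n prints, with its line number, every line that does not contain the given string):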

pdftotext -layout -f 1 -l 1 -enc UTF-8 "C:\Downloads\SO 76437736 LineNumbers.pdf" - |find /v /n "never2Bfound"

[1]Is it possible to get line no while extracting text from pdf doc?
[2]
[3]Asked today Modified today Viewed 42 times
[4]
[5]           Is there an any way to get the text line by line from pdf document or get line no using any
[6]           library and language. I'm able to get text from pdf document page by page using these 3 lib
[7]
[8]  2 pdfbox, itext, aspose-pdf in java.
[9]
[10]                python java itext pdfbox
[11]
[12]   Share Edit Follow Close Flag                              asked 20 hours ago
[13]                                                                      Seriously
[14]                                                                       135 5
[15]
[16]          the simples pdftotext output is pdftotext -layout which will usually give you lines one by one. now
[17]          the problem with your question is what do you mean ? since you say your able to already get text. PDF
[18]          does not use Line numbers, they are a human requirement only for input to a PDF. see
[19]          stackoverflow.com/a/72778117/10802527 ÔÇô K J 6 hours ago
[20]
[21]1 Answer                                                     Sorted by:
[22]                                                             Reset to default
[23]
[24]                                                               Date created (oldest first)
[25]
[26]   Using PyMuPDF, this is the simplest way:
[27]
[28]1         import fitz # PyMuPDF
[29]
[30]          doc = fitz.open("input.pdf")
[31]
[32]          for page in doc:
[33]                i = 0
[34]                blocks = page.get_text("blocks", sort=True) # text organized in paragraphs
[35]                for block in blocks:
[36]                       for line in block[4].splitlines():
[37]                             print(f"Page {page.number}, line {i}: '{line}'")
[38]                             i += 1
[39]
[40]          Every block is a tuple of 4 boundary box coordinates, followed by the string
[41]          comprising the text of the paragraph.
[42]
[43]   Share Edit Follow Flag               edited 16 hours ago  answered 19 hours ago
[44]                                                                      Jorj McKie
[45]                                                                       1,897 1 13 16
[46]♀

To save the output to a file, just append a redirection: > SO-Q76437736-Page1.txt
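On systems without the Windows `find` command, the same `[n]` prefixing can be done in a few lines of Python. This is a sketch under assumptions: `number_lines` and `extract_page` are hypothetical helper names, `pdftotext` is assumed to be on the PATH, and the flags are the ones used in the command above. Only `number_lines` is exercised in the demo, so it runs without a PDF.

```python
import subprocess
from typing import List


def number_lines(text: str) -> List[str]:
    """Prefix every line with [n], counting from 1, similar to `find /v /n`."""
    # Note: str.splitlines() also treats a form feed as a line break.
    return [f"[{n}]{line}" for n, line in enumerate(text.splitlines(), start=1)]


def extract_page(pdf_path: str, page: int) -> str:
    """Run pdftotext on a single page and return its layout-preserving text."""
    result = subprocess.run(
        ["pdftotext", "-layout", "-f", str(page), "-l", str(page),
         "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


# Demo on an in-memory string instead of a real PDF.
sample_text = "Title\n\nBody text\n"
for numbered in number_lines(sample_text):
    print(numbered)
```

For a real file, something like `number_lines(extract_page("input.pdf", 1))` would give the numbered lines of page 1, and writing them to a file replaces the shell redirection.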

huangapple
  • Posted on June 9, 2023, 14:28:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76437736.html