Question

Is there any way to get the text of a PDF document line by line, or to get line numbers, using any library and language? I can already extract text page by page in Java using three libraries: pdfbox, itext, and aspose-pdf.

Answer 1

Score: 1

Using PyMuPDF, this is the simplest way:

import fitz  # PyMuPDF
doc = fitz.open("input.pdf")

for page in doc:
    i = 0  # line counter, restarted on every page
    blocks = page.get_text("blocks", sort=True)  # text organized in paragraphs
    for block in blocks:
        for line in block[4].splitlines():  # block[4] holds the paragraph text
            print(f"Page {page.number}, line {i}: '{line}'")  # both numbers are 0-based
            i += 1

Every block is a tuple that starts with the 4 bounding-box coordinates, followed by the string comprising the text of the paragraph.
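As a rough illustration of the tuple layout described above, the sketch below numbers lines from block tuples without needing a PDF at hand. The sample data and the helper name `number_block_lines` are hypothetical; the tuples only imitate the shape (x0, y0, x1, y1, text, ...) that the answer describes, and the 1-based numbering is a choice made here for readability.

```python
from typing import Iterable, List, Tuple

BBox = Tuple[float, ...]


def number_block_lines(blocks: Iterable[tuple], start: int = 1) -> List[Tuple[int, BBox, str]]:
    """Return (line_number, block_bbox, line_text) for every text line in the blocks.

    Each block is assumed to look like (x0, y0, x1, y1, text, ...),
    with the blocks already given in reading order.
    """
    numbered = []
    line_no = start
    for block in blocks:
        bbox = tuple(block[:4])             # the 4 bounding-box coordinates
        for line in block[4].splitlines():  # index 4 holds the paragraph text
            numbered.append((line_no, bbox, line))
            line_no += 1
    return numbered


# Made-up sample data for a single page (not taken from a real PDF).
sample_blocks = [
    (72.0, 72.0, 300.0, 100.0, "First paragraph, line one\nFirst paragraph, line two\n"),
    (72.0, 120.0, 300.0, 134.0, "Second paragraph\n"),
]

for line_no, bbox, text in number_block_lines(sample_blocks):
    print(f"line {line_no} (block at {bbox}): {text!r}")
```

Keeping the bounding box next to each line makes it possible to tell afterwards where on the page a numbered line came from.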

Answer 2

Score: 1

PDF has no concept of line numbers, since text can be placed at any angle.

So which line counts as line 1 is simply a human perception: for most readers, line 1 is the topmost line on a page. For a PDF, however, that line can be the tenth one it writes or the last one, since the Cartesian coordinate system it uses runs from the bottom of the page to the top.
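To make the coordinate point concrete, here is a minimal sketch that assigns human-style line numbers by ordering text fragments on their y coordinate, assuming the origin sits at the bottom of the page as described above. The `Fragment` type, the sample coordinates, and the `tolerance` parameter are all hypothetical; rotated text or multi-column layouts would need more care than this.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Fragment:
    x: float   # horizontal position, measured from the left edge
    y: float   # vertical position, measured from the BOTTOM of the page
    text: str


def assign_line_numbers(fragments: List[Fragment], tolerance: float = 2.0) -> List[Tuple[int, str]]:
    """Group fragments into visual lines, topmost first, and number them from 1."""
    # Largest y first, because y grows upwards from the bottom of the page.
    ordered = sorted(fragments, key=lambda f: -f.y)
    lines: List[List[Fragment]] = []
    for frag in ordered:
        # Fragments whose y is close to the first fragment of the
        # current line are treated as part of the same visual line.
        if lines and abs(lines[-1][0].y - frag.y) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    result = []
    for number, line in enumerate(lines, start=1):
        line.sort(key=lambda f: f.x)  # left to right within a line
        result.append((number, " ".join(f.text for f in line)))
    return result


# Made-up fragments, listed in the order a PDF might have written them.
written_order = [
    Fragment(72, 40, "Footer written first"),
    Fragment(130, 700, "line"),
    Fragment(72, 720, "Heading"),
    Fragment(72, 700.5, "Body"),
]

for number, text in assign_line_numbers(written_order):
    print(number, text)
```

The order in which the fragments were written plays no role here; only their positions decide which line is called line 1.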

Anyway, one way to assign line numbers to the text of this PDF page is:

pdftotext -layout -f 1 -l 1 -enc UTF-8 "C:\Downloads\SO 76437736 LineNumbers.pdf" - | find /v /n "never2Bfound"

[1]Is it possible to get line no while extracting text from pdf doc?
[2]
[3]Asked today Modified today Viewed 42 times
[4]
[5]           Is there an any way to get the text line by line from pdf document or get line no using any
[6]           library and language. I'm able to get text from pdf document page by page using these 3 lib
[7]
[8]  2 pdfbox, itext, aspose-pdf in java.
[9]
[10]                python java itext pdfbox
[11]
[12]   Share Edit Follow Close Flag                              asked 20 hours ago
[13]                                                                      Seriously
[14]                                                                       135 5
[15]
[16]          the simples pdftotext output is pdftotext -layout which will usually give you lines one by one. now
[17]          the problem with your question is what do you mean ? since you say your able to already get text. PDF
[18]          does not use Line numbers, they are a human requirement only for input to a PDF. see
[19]          stackoverflow.com/a/72778117/10802527 – K J 6 hours ago
[20]
[21]1 Answer                                                     Sorted by:
[22]                                                             Reset to default
[23]
[24]                                                               Date created (oldest first)
[25]
[26]   Using PyMuPDF, this is the simplest way:
[27]
[28]1         import fitz # PyMuPDF
[29]
[30]          doc = fitz.open("input.pdf")
[31]
[32]          for page in doc:
[33]                i = 0
[34]                blocks = page.get_text("blocks", sort=True) # text organized in paragraphs
[35]                for block in blocks:
[36]                       for line in block[4].splitlines():
[37]                             print(f"Page {page.number}, line {i}: '{line}'")
[38]                             i += 1
[39]
[40]          Every block is a tuple of 4 boundary box coordinates, followed by the string
[41]          comprising the text of the paragraph.
[42]
[43]   Share Edit Follow Flag               edited 16 hours ago  answered 19 hours ago
[44]                                                                      Jorj McKie
[45]                                                                       1,897 1 13 16
[46]♀

To save the result to a file, just append a redirect such as >SO-Q76437736-Page1.txt to the command.
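For readers not on Windows, the numbering that `find /v /n` performs can be approximated in a few lines of Python. This is only a sketch and not part of the original answer: the helper name `number_lines` and the sample text are made up, and the text is assumed to have been produced beforehand (for instance by saving the `pdftotext -layout` output).

```python
from typing import List


def number_lines(text: str) -> List[str]:
    """Prefix every line, blank ones included, with [n], counting from 1."""
    # Split on "\n" only, so that other control characters (such as the
    # form feed that ends a page) stay inside their line.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece left by a trailing newline
    return [f"[{n}]{line}" for n, line in enumerate(lines, start=1)]


# Made-up sample text standing in for extracted page text.
sample = "Title\n\nFirst body line\nSecond body line\n"

for row in number_lines(sample):
    print(row)
```

Blank lines receive a number too, which matches the bracketed listing shown above.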
