July 17, 2023, 20:56:15

How to create a searchable, OCR'd PDF from PNG but use JPEG as pictures

Question

I am in the process of digitizing my paper documents. To do this, I start by scanning them to PNGs.

I already figured out how to use tesseract to OCR a batch of images and create a searchable PDF from that (using this answer).

Since I have the PNGs around anyway, I'd like to run OCR on those. To save disk space, though, I'd like the final PDF to contain JPEGs.

Is there a way to achieve that? I am running Debian.

Answer 1

Score: 1

It turns out this scenario is very common for tesseract. However, the usual framing is that visually unchanged images should go into the PDF, while images post-processed to suit the OCR are fed into tesseract.

The issue is addressed in this comment. However, that explanation is rather abstract. I ended up doing the following:

Run tesseract: ls *.png | tesseract -l $LANG -c textonly_pdf=1 - textonly pdf
Convert PNGs to JPG: fd . -e png --exec convert -quality 95 {} {.}.jpg
Create an image-only PDF: img2pdf *.jpg -o images.pdf
Overlay the PDFs: pdftk textonly.pdf multibackground images.pdf output result.pdf (learned from this answer)
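
The four steps above can be collected into one shell function. This is only a sketch under a few assumptions: tesseract, ImageMagick's convert, img2pdf and pdftk are installed, and the names `make_searchable_pdf`, `OCR_LANG` and `JPEG_QUALITY` are invented here for illustration. It uses a separate `OCR_LANG` variable because `$LANG` normally holds the system locale (for example `en_US.UTF-8`), which is not a tesseract language code; the answer presumably uses `$LANG` as a placeholder. It also loops with plain `sh` instead of `fd`, so the JPEGs are passed to img2pdf in exactly the same order as the PNGs were passed to tesseract.

```shell
#!/bin/sh
# Sketch only: OCR the PNG scans, but embed JPEG copies as the visible pages.
# Needs tesseract, ImageMagick's convert, img2pdf and pdftk on PATH.
# OCR_LANG, JPEG_QUALITY and make_searchable_pdf are names invented for this sketch.

OCR_LANG="${OCR_LANG:-eng}"         # tesseract language code (assumption: English)
JPEG_QUALITY="${JPEG_QUALITY:-95}"  # same quality value as in the answer

# make_searchable_pdf DIR: process DIR/*.png and write DIR/result.pdf.
# The body runs in a subshell, so the caller's working directory is untouched.
make_searchable_pdf() (
    cd "${1:-.}" || exit 1
    set -- *.png
    [ -e "$1" ] || { echo "no PNG files in $PWD" >&2; exit 1; }

    # 1. OCR: tesseract reads the page list from stdin and writes textonly.pdf,
    #    which contains only the invisible text layer.
    printf '%s\n' "$@" |
        tesseract -l "$OCR_LANG" -c textonly_pdf=1 - textonly pdf || exit 1

    # 2. Convert every PNG to a JPEG, and replace the positional parameters
    #    with the JPEG names so the page order stays identical.
    count=$#
    for png in "$@"; do
        jpg="${png%.png}.jpg"
        convert -quality "$JPEG_QUALITY" "$png" "$jpg" || exit 1
        set -- "$@" "$jpg"
    done
    shift "$count"

    # 3. Image-only PDF from the JPEGs (img2pdf embeds them without re-encoding).
    img2pdf "$@" -o images.pdf || exit 1

    # 4. Put page N of images.pdf behind page N of textonly.pdf.
    pdftk textonly.pdf multibackground images.pdf output result.pdf || exit 1
)
```

After sourcing the file, something like `OCR_LANG=deu make_searchable_pdf ~/scans` would leave `result.pdf` next to the scans. The intermediate files (`textonly.pdf`, `images.pdf` and the JPEGs) are kept so each step can be inspected; they can be deleted afterwards.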
