How to create a searchable, OCR'd PDF from PNG but use JPEG as pictures
Question
I am in the process of digitizing my paper documents. To do so, I start by scanning them to PNGs.
I already figured out how to use tesseract to OCR a batch of images and create a searchable PDF from the result (using this answer).
Since I have the PNGs around anyway, I'd like to run OCR on those. To save disk space, though, I'd like the final PDF to contain JPEGs.
Is there a way to achieve that? I am running Debian.
Answer 1
Score: 1
It turns out this scenario is a very common one for tesseract. However, the usual framing is that visually unchanged images should go into the PDF, while images post-processed to suit OCR are fed into tesseract.
The issue is addressed in this comment. However, that explanation is rather abstract. I ended up doing the following:
- Run tesseract:
ls *.png | tesseract -l $LANG -c textonly_pdf=1 - textonly pdf
- Convert PNGs to JPG:
fd . -e png --exec convert -quality 95 {} {.}.jpg
- Create an image-only PDF:
img2pdf *.jpg -o images.pdf
- Overlay the PDFs:
pdftk textonly.pdf multibackground images.pdf output result.pdf
(learned from this answer)
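
The four steps above can be combined into one small script. The following is only a sketch under a few assumptions: the function names `run` and `make_searchable_pdf`, the `DRY_RUN` switch, and the intermediate file names are illustrative and not part of any of the tools, and a plain loop with `convert` stands in for the `fd` call. It also takes the OCR language as an explicit argument, because the shell's own `$LANG` variable normally holds a locale such as `en_US.UTF-8` rather than a tesseract language code.

```shell
#!/bin/sh
# Sketch: build a searchable PDF whose page images are JPEGs.
# Helper names, DRY_RUN and the intermediate file names are illustrative.

# run CMD ARGS...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# make_searchable_pdf OCR_LANG OUTPUT.pdf
# Processes every *.png in the current directory, in glob (sorted) order.
make_searchable_pdf() {
    ocr_lang=$1
    out=$2

    # Bail out early if the glob matched nothing.
    set -- *.png
    if [ ! -e "$1" ]; then
        echo "no PNG files found" >&2
        return 1
    fi

    # 1. OCR the PNGs into a PDF holding only the invisible text layer.
    printf '%s\n' "$@" | run tesseract -l "$ocr_lang" -c textonly_pdf=1 - textonly pdf

    # 2. Convert each PNG to a JPEG, collecting the JPEG names in page order.
    set --
    for png in *.png; do
        jpg="${png%.png}.jpg"
        run convert "$png" -quality 95 "$jpg"
        set -- "$@" "$jpg"
    done

    # 3. Build an image-only PDF from exactly those JPEGs.
    run img2pdf "$@" -o images.pdf

    # 4. Put each image page behind the matching text-only page.
    run pdftk textonly.pdf multibackground images.pdf output "$out"
}
```

After sourcing the file, something like `cd scans && make_searchable_pdf eng result.pdf` would run the whole pipeline, and setting `DRY_RUN=1` beforehand only prints the commands. Passing the JPEG names explicitly to `img2pdf`, instead of `*.jpg`, keeps unrelated JPEGs in the directory out of the result and keeps the page order identical to the one tesseract saw.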