How to create a searchable, OCR'd PDF from PNG but use JPEG as pictures

Question

I am in the process of digitizing my paper documents. To do this, I start by scanning them to PNGs.

I already figured out how to use tesseract to OCR a batch of images and create a searchable PDF from that (using this answer).

Since I have the PNGs around anyway, I'd like to run OCR on them. To save disk space, though, I'd like the final PDF to use JPEGs for the page images.

Is there a way to achieve that? I am running Debian.

Answer 1

Score: 1

It turns out this scenario is very common for tesseract. However, the usual framing is that visually unchanged images should go into the PDF, while images post-processed to suit the OCR are fed into tesseract.

The issue is addressed in this comment. However, that explanation is rather abstract. I ended up doing the following:

  1. Run tesseract: ls *.png | tesseract -l $LANG -c textonly_pdf=1 - textonly pdf
  2. Convert PNGs to JPG: fd . -e png --exec convert -quality 95 {} {.}.jpg
  3. Create an image-only PDF: img2pdf *.jpg -o images.pdf
  4. Overlay the PDFs: pdftk textonly.pdf multibackground images.pdf output result.pdf (learned from this answer)
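
The four steps above can be strung together in one small script. This is only a sketch under a few assumptions: the helper names (run, ocr_text_layer, make_searchable_pdf) and the DRY_RUN switch are made up for illustration, a plain for loop stands in for the fd call, and the language is passed as a Tesseract language code such as eng or deu. The $LANG in step 1 is presumably a placeholder for such a code; on most systems the LANG environment variable holds a locale like en_US.UTF-8, which tesseract would probably not accept.

```shell
#!/bin/sh
# Hypothetical helper names (run, ocr_text_layer, make_searchable_pdf) and the
# DRY_RUN switch are illustrative, not part of the original answer.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: feed the PNG file names to tesseract on stdin and write textonly.pdf,
# which contains only the invisible text layer.
ocr_text_layer() {
    printf '%s\n' *.png | tesseract -l "$1" -c textonly_pdf=1 - textonly pdf
}

# Usage: make_searchable_pdf DIR [TESSERACT_LANG]
make_searchable_pdf() {
    dir=$1
    ocr_lang=${2:-eng}   # assumed default; use the code of your language pack
    (
        set -e
        cd "$dir"
        run ocr_text_layer "$ocr_lang"

        # Step 2: convert each PNG to JPEG, remembering the names in page order.
        set --
        for png in *.png; do
            jpg="${png%.png}.jpg"
            run convert -quality 95 "$png" "$jpg"
            set -- "$@" "$jpg"
        done

        # Step 3: image-only PDF built from the JPEGs.
        run img2pdf "$@" -o images.pdf

        # Step 4: put each image page behind the matching text page.
        run pdftk textonly.pdf multibackground images.pdf output result.pdf
    )
}
```

After sourcing the file, something like make_searchable_pdf ~/scans deu should leave result.pdf in the scan directory; setting DRY_RUN=1 first only prints the commands. The OCR list and the JPEG list both come from the same *.png glob, so the text pages and image pages should line up in the same order.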

huangapple
  • Published on July 17, 2023, 20:56:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76704698.html