Why is Tesseract returning the wrong number here?

Question

I'm using Tesseract to scan dates that have been cropped out of an image. A date will look e.g. like this:

[image: example of a cropped date]

For almost all dates this works perfectly. Occasionally, though, it fails in a strange way, usually involving 3s and 5s. In the example above, it reads the last 6 digits as 0435000, which is 7 digits: a 5 has been added out of nowhere (as suspected, right after a 3). Another common mistake is reading "2300" as 3500. As far as I can remember, only the last portion (the hours and minutes) ever trips it up, never the first part of the date, and I have been running this many times a day for several months. Does anyone know what causes this, and what I can do to make the results more consistent?

For what it's worth, I'm not running Tesseract with any special options, just tesseract date.jpg date.

Answer 1

Score: 1

Image quality is very important, and you should convert the image to black and white before running OCR:

    import subprocess

    import cv2
    import pytesseract

    # Image manipulation with ImageMagick
    # Commands: https://imagemagick.org/script/convert.php
    mag_img = r'D:\Programme\ImageMagic\magick.exe'
    con_bw = r"D:\Programme\ImageMagic\convert.exe"
    in_file = r'Qkqxn.jpg'
    out_file = r'Qkqxn_inv.jpg'

    # Play with the black-and-white threshold and contrast for better results
    process = subprocess.run([con_bw, in_file, "-resize", "70%", "-threshold", "35%", out_file])

    # Text processing
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    img = cv2.imread(out_file)

    # Parameters: see the Tesseract documentation
    # --psm 7 treats the image as a single text line; the whitelist limits output to digits and "_"
    custom_config = r'--psm 7 --oem 3 -c tessedit_char_whitelist=_0123456789'
    tex = pytesseract.image_to_string(img, config=custom_config)
    print(tex)

    # Save the recognized text
    with open("numbers.txt", 'w') as f:
        f.write(tex)

    # Show the preprocessed image until a key is pressed
    cv2.imshow('image', img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
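
Since the question runs the plain command line (tesseract date.jpg date), the same page segmentation mode and whitelist can probably be passed there directly, without pytesseract. The following is only a sketch under assumptions: the helper names build_tesseract_command and run_ocr are made up for illustration, the tesseract binary is assumed to be on PATH, and the whitelist is assumed to be digits only. Whether the whitelist is honored depends on the Tesseract version and engine in use.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, whitelist="0123456789"):
    """Return the argv list for a single-line, whitelist-restricted Tesseract run.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    return [
        "tesseract",              # assumes the binary is on PATH
        str(image_path),
        str(output_base),         # Tesseract appends ".txt" to this base name
        "--psm", "7",             # treat the image as a single text line
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image_path, output_base):
    """Run Tesseract and return the recognized text with whitespace stripped."""
    cmd = build_tesseract_command(image_path, output_base)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(f"{output_base}.txt").read_text(encoding="utf-8").strip()
```

Calling run_ocr("date.jpg", "date") would then correspond to the question's command with the two extra options appended.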

Output:
[image]
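
Both misreads described in the question are detectable after the fact: 0435000 has seven digits instead of six, and 3500 is not a possible hour and minute. A small sanity check could flag such results so the image can be re-processed with different preprocessing settings. This is a minimal sketch that assumes the time portion is six digits in HHMMSS order; the function name is_plausible_time is hypothetical. It cannot catch a misread that still happens to form a valid time.

```python
import re

# Exactly six ASCII digits, assumed to be HHMMSS
_TIME_RE = re.compile(r"[0-9]{6}")


def is_plausible_time(digits):
    """Return True if `digits` is exactly six digits forming a valid HHMMSS."""
    if not _TIME_RE.fullmatch(digits):
        return False
    hours = int(digits[:2])
    minutes = int(digits[2:4])
    seconds = int(digits[4:])
    return hours < 24 and minutes < 60 and seconds < 60


if __name__ == "__main__":
    for candidate in ("043500", "0435000", "350000"):
        print(candidate, is_plausible_time(candidate))
```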

huangapple
  • Posted on April 4, 2023, 14:18:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75926059.html