Why is Tesseract returning the wrong number here?
Question
I'm using Tesseract to scan dates that have been cropped out of an image. A date looks like this, for example:
For almost all dates this works perfectly fine. Occasionally, however, it trips up in a strange way, usually involving 3s and 5s. In the case above, it reads the last 6 digits as 0435000, which is 7 digits: a 5 has been added out of nowhere (presumably right after the 3). Another common mistake is reading "2300" as 3500. As far as I can remember, it has always been only the last portion (the hours and minutes) that trips it up, never the first part of the date (I've been running it many times a day for several months). Does anyone know what is causing this, and what I can do to make it more consistent?
For what it's worth, I'm not running Tesseract with any special options, just tesseract date.jpg date.
Answer 1
Score: 1
Image quality is very important; you should convert the image to black and white before running OCR:
import subprocess
import cv2
import pytesseract

# Image manipulation with ImageMagick
# Command reference: https://imagemagick.org/script/convert.php
mag_img = r'D:\Programme\ImageMagic\magick.exe'  # not used below
con_bw = r'D:\Programme\ImageMagic\convert.exe'
in_file = r'Qkqxn.jpg'
out_file = r'Qkqxn_inv.jpg'

# Play with the resize and threshold values for better results
subprocess.run([con_bw, in_file, "-resize", "70%", "-threshold", "35%", out_file], check=True)

# Text processing
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread(out_file)

# For the parameters, see the Tesseract documentation:
# --psm 7 treats the image as a single line of text,
# and the whitelist restricts the output to digits and underscores
custom_config = r'--psm 7 --oem 3 -c tessedit_char_whitelist=_0123456789'
tex = pytesseract.image_to_string(img, config=custom_config)
print(tex)

# Save the recognized text
with open("numbers.txt", 'w') as f:
    f.write(tex)

# Show the preprocessed image that was passed to Tesseract
cv2.imshow('image', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
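The same settings can also be passed to the plain tesseract command used in the question, without pytesseract. The following is a minimal sketch, not a tested fix for the 3/5 confusion: the helper names build_tesseract_cmd and run_ocr are made up for illustration, the whitelist is assumed to be digits only (add an underscore or other separators if the dates contain them), and tesseract is assumed to be on the PATH.

```python
import subprocess


def build_tesseract_cmd(image, out_base, psm=7, oem=3, whitelist="0123456789"):
    """Build the argument list for a single-line, whitelist-restricted run.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "tesseract", image, out_base,
        "--psm", str(psm),   # 7 = treat the image as a single text line
        "--oem", str(oem),   # 3 = default engine, based on what is available
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image, out_base):
    """Run Tesseract and return the text it wrote to <out_base>.txt."""
    subprocess.run(build_tesseract_cmd(image, out_base), check=True)
    with open(out_base + ".txt", encoding="utf-8") as f:
        return f.read().strip()


if __name__ == "__main__":
    # Print the command instead of running it, so this works without Tesseract
    print(" ".join(build_tesseract_cmd("date.jpg", "date")))
```

Calling run_ocr("date.jpg", "date") would then replace the bare tesseract date.jpg date call; running it on the thresholded image rather than the original is probably the more important change.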