Why is Tesseract returning the wrong number here?

Question

I'm using Tesseract to read dates that have been cropped out of an image. A typical date looks like this:

[Image: example of a cropped date]

For almost all dates this works perfectly. Occasionally, though, it fails in a strange way, usually involving 3s and 5s. In the case above, it reads the last 6 digits as 0435000, which is 7 digits: a 5 has been added out of nowhere (suspiciously, right after a 3). Another common mistake is reading "2300" as 3500. As far as I can remember, only the last portion (the hours and minutes) ever trips it up, never the first part of the date, and I have been running it many times a day for several months. Does anyone know what is causing this, and what I can do to make it more consistent?

For what it's worth, I'm not passing Tesseract any special options; the command is just tesseract date.jpg date.
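
Both failures described above produce values that cannot be a valid time: 0435000 has one digit too many, and 3500 has an impossible hour. The following is a minimal sketch of a sanity check that could flag such readings for a retry. It does not fix the OCR itself. The function name check_time_digits and the assumption that the time portion is HHMMSS (or HHMM) are hypothetical, chosen only for illustration.

```python
import re
from typing import Optional, Tuple


def check_time_digits(ocr_text: str, with_seconds: bool = True) -> Optional[Tuple[int, ...]]:
    """Return (hours, minutes[, seconds]) if the OCR text is a plausible time, else None.

    Assumes the time portion is written as HHMMSS (or HHMM when
    with_seconds is False). This layout is an assumption, not taken
    from the original post.
    """
    digits = ocr_text.strip()
    expected_len = 6 if with_seconds else 4

    # Reject readings with the wrong number of digits or stray characters.
    if not re.fullmatch(r"[0-9]{%d}" % expected_len, digits):
        return None

    # Split into two-digit fields and range-check each one.
    parts = tuple(int(digits[i:i + 2]) for i in range(0, expected_len, 2))
    if parts[0] > 23 or any(p > 59 for p in parts[1:]):
        return None
    return parts
```

A reading that returns None could be re-run with different preprocessing or logged for manual review, so that a bad value is not silently accepted.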

Answer 1

Score: 1

Image quality is very important, and you should convert the image to black and white before running OCR:

import subprocess
import cv2
import pytesseract

# Image manipulation with ImageMagick
# Commands: https://imagemagick.org/script/convert.php
mag_img = r'D:\Programme\ImageMagic\magick.exe'  # not used below
con_bw = r"D:\Programme\ImageMagic\convert.exe"

in_file = r'Qkqxn.jpg'
out_file = r'Qkqxn_inv.jpg'

# Resize and binarize the image; tune the resize and threshold values for better results
subprocess.run([con_bw, in_file, "-resize", "70%", "-threshold", "35%", out_file], check=True)

# Text processing
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread(out_file)
if img is None:
    raise FileNotFoundError(out_file)

# --psm 7: treat the image as a single text line; --oem 3: default OCR engine mode
# The whitelist restricts recognition to digits and the underscore (see the Tesseract docs)
custom_config = r'--psm 7 --oem 3 -c tessedit_char_whitelist=_0123456789'

tex = pytesseract.image_to_string(img, config=custom_config)
print(tex)

# Save the recognized text
with open("numbers.txt", 'w') as f:
    f.write(tex)

# Show the preprocessed image until a key is pressed
cv2.imshow('image', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Output:
[Image: screenshot of the output]
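
Since the question calls the tesseract command directly, the same single-line and digits-only options can also be passed on the command line without pytesseract. The following is a sketch under stated assumptions: the helper names build_tesseract_command and run_tesseract are hypothetical, the tesseract executable is assumed to be on PATH, and the image is assumed to be preprocessed already. As far as I know, the whitelist may be ignored by the LSTM engine in some Tesseract 4.0 builds, so it is worth checking the behaviour on your version.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, whitelist: str = "0123456789") -> List[str]:
    """Build a tesseract CLI call that reads one text line of whitelisted characters.

    "stdout" as the output base makes tesseract print the result
    instead of writing a .txt file.
    """
    return [
        "tesseract", image_path, "stdout",
        "--psm", "7",                                  # single text line
        "-c", f"tessedit_char_whitelist={whitelist}",  # restrict the character set
    ]


def run_tesseract(image_path: str) -> str:
    """Run the command and return the recognized text (assumes tesseract is on PATH)."""
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Example call (requires tesseract and an existing image file):
# print(run_tesseract("date.jpg"))
```

Keeping the command construction separate from the call that runs it makes it easy to print and inspect the exact arguments when a reading looks wrong.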

huangapple
  • Posted on April 4, 2023, 14:18:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75926059.html