April 4, 2023 14:18:27

Why is Tesseract returning the wrong number here?

Question

I'm using Tesseract to scan dates that have been cropped out of an image. A date will look e.g. like this:

For almost all dates this works perfectly fine. However, sometimes it trips up in a strange manner, usually involving 3s and 5s. In the case above, it reads the last 6 digits as 0435000, which is of course 7 digits: a 5 has been added out of nowhere (apparently right after the 3). Another common mistake is reading "2300" as 3500. As far as I can remember, it has always been only the last portion (the hours and minutes) that trips it up, never the first part of the date (I've been running it many times daily for several months). Does anyone know what is causing this? And what can I do to make it more consistent?

For what it's worth I'm not running Tesseract with any special commands, just tesseract date.jpg date.

Answer 1

Score: 1

Image quality is very important, and you should feed Tesseract a black-and-white image:

import subprocess
import cv2
import pytesseract

# Image manipulation
# Commands: https://imagemagick.org/script/convert.php
con_bw = r"D:\Programme\ImageMagic\convert.exe"
in_file = r'Qkqxn.jpg'
out_file = r'Qkqxn_inv.jpg'
# Play with black and white and contrast for better results
subprocess.run([con_bw, in_file, "-resize", "70%", "-threshold", "35%", out_file], check=True)

# Text processing
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread(out_file)
# Parameters: see the Tesseract documentation.
# --psm 7 treats the image as a single text line; the whitelist limits output to digits and "_".
custom_config = r'--psm 7 --oem 3 -c tessedit_char_whitelist=_0123456789'
tex = pytesseract.image_to_string(img, config=custom_config)
print(tex)
with open("numbers.txt", 'w') as f:
    f.write(tex)
# Show the preprocessed image that was passed to Tesseract
cv2.imshow('image', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Output:
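Separately from the preprocessing above, it may help to reject implausible reads so that a bad result can at least be detected and retried with different threshold settings. The following is only a minimal sketch: the function name `is_plausible_time` is hypothetical, and it assumes the time portion is six digits in HHMMSS form, which the question suggests but does not state outright.

```python
import re


def is_plausible_time(digits: str) -> bool:
    """Return True if `digits` looks like a valid HHMMSS time (assumed format)."""
    text = digits.strip()
    # Exactly six ASCII digits are expected; anything else is a misread.
    if not re.fullmatch(r"[0-9]{6}", text):
        return False
    hours, minutes, seconds = int(text[0:2]), int(text[2:4]), int(text[4:6])
    return hours <= 23 and minutes <= 59 and seconds <= 59


# Samples based on the misreads described in the question
for sample in ["043000", "0435000", "350000"]:
    print(sample, is_plausible_time(sample))
```

A check like this would flag both failure modes from the question: the seven-digit read 0435000 fails the length test, and a read starting with 35 fails the hour range test. It cannot catch misreads that still form a valid time.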
