Question

I have a bunch of pre-processed tables that look similar to this one:

After playing for a while with the parameters, I have found that this command gives me decent results:

tesseract my_img.png out -c tessedit_char_whitelist="0123456789.E%-" --psm 6

Unfortunately, the results are not good enough for my needs. Note how some columns are not separated in the output, and some minus signs are missing.

What can I do to improve the results?

0.015 1.0010.623 0.09911.850.0272 0.1% 4.0 0.03%
0.020 1.0030.304 0.3211404-0.2144 0.0% 4.0 0.02%
0.030 1.0080.370 0.26214.040.1718 0.1% 3.0 0.06%
0.040 1.0170.393 0.23814.150.1412 0.2% 0.5 0.10%
0.050 1.0300.408 0.22813.76-0.1346 0.4% 0.5 0.17%
0.060 1.9031.408 0.08518.32-0.0988 15.2% 40. 7.47%
0.080 1.7390.609 0.23516.120.2033 2.2% 35. 1.23%
0.1001.6480.242-0.00619.35 0.0590 0.4% 0.5 0.17%
0.150 1.4330.076 0.62913.32-0.3336 1.5% 2 0.75%
0.2001.4880.148 0.47913.91-0.2602 2.2% 0.5 0.96%
0.3001.664-0.303 0.31614.000.2044 2.8% 0.5 1.25%
0.400-1.883.-0.408 0.24213.70-0.1576 3.0% 0.5 1.40%
0.5002.022-0.516 0.18613.77.-0.1282 3.6% 0.5 1.60%
0.6001.9750.625 0.13413.80-0.0948 3.0% 0.5 1.38%
0.8002.0540.709 0.10113.64-0.0763 2.8% 0.5 1.34%
1.00 2.0250.790 0.07414.55-0.0629 2.6% 0.5 1.28%
1.50 1.8990.912 0.03313.360.0360 1.2% 5 0.72%
2.00 1.7950.889 0.049-13.34-0.0585 2.5% 0.5 1.35%
3.00 1.6250.866 0.06813.44-0.0887 6.3% 0.5 2.67%
4.00 1.4900.854 0.08113.71-0.1057 8.0% 0.5 3.34%
5.00 1.6160.713 0.14514.15--0.1708 7.7% 0.5 4.29%
6.00 1.4820.828 0.10014.26-0.1177 11.6% 0.5 4.23%
8.00 1.4660.820 0.11614.21-0.1362 9.0% 0.5 3.85%
10.00 1.433-0.938 0.08714.14-01117 8.2% 0.5 3.54%
15.00 14151.120 0.06013.92-0.0949 7.0 0.5 3.26%

Answer 1

Score: 3

I solved the problem using OpenCV and pytesseract. My solution was inspired by this answer.

1. The key to this procedure is well pre-processed images. In the image from the original question, you can see a small spot of black pixels just before the last column on the right. Such spots must be cleared out! Since I only had a few tables with that problem, I used GIMP.
2. Detect the columns in the table by applying a dilation step to the inverted gray image. By choosing an appropriate number of iterations, the columns take shape, and it is also possible to spot the minus signs. With cv2.boundingRect(cnt) it is possible to crop out the individual columns and save them to disk.
3. Apply pytesseract to each column, with the same options as in the original question.
4. To detect the minus signs: I was lucky that only the 4th and 6th columns contained minus signs, and my tables had exactly 25 rows (therefore, height of each row = height of the image / 25). So, take those columns and crop each one to a width of about 40px (a guess, based on trial and error). The crop should now contain a few white rectangles where the minus signs are located. Detect the contours of these rectangles and compute the centroid of each contour. The y-coordinate of the centroid gives the row in which the minus sign is located. Apply corrections (where needed) to the OCR results.
5. Combine the different columns into a CSV file.

EDIT: with this procedure I got about 98.5% accuracy.
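
For the minus-sign correction and the final CSV step, a minimal sketch follows. It assumes the centroid y-coordinates have already been computed from the contours (for example with cv2.moments), that every row has the same height, and that the geometric detection is trusted over whatever sign the OCR produced. All function names here are hypothetical.

```python
import csv
import io


def rows_with_minus(centroid_ys, image_height, n_rows=25):
    """Map centroid y-coordinates to zero-based row indices.

    Assumes equally tall rows: row height = image height / n_rows.
    """
    row_height = image_height / n_rows
    # Clamp to the last row in case a centroid sits on the bottom edge.
    return {min(int(y // row_height), n_rows - 1) for y in centroid_ys}


def apply_minus(values, minus_rows):
    """Force a single leading minus on the rows where one was detected.

    Any sign the OCR produced is dropped first, so stray or doubled
    minus signs are removed as well.
    """
    fixed = []
    for row, value in enumerate(values):
        value = value.lstrip("-")
        fixed.append("-" + value if row in minus_rows else value)
    return fixed


def columns_to_csv(columns):
    """Join per-column lists of strings into CSV text, one table row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # zip(*columns) turns the list of columns into a sequence of rows.
    writer.writerows(zip(*columns))
    return buffer.getvalue()
```

Note that zip stops at the shortest column, so a column in which the OCR dropped a row would silently truncate the table. Checking that every column has the expected number of rows before writing is a cheap safeguard.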
