OCR - tesseract - Extract numbers in tabular data
Question
I have a batch of pre-processed table images that all look similar to this one (image not shown):
After experimenting with the parameters for a while, I found that this command gives decent results:
tesseract my_img.png out -c tessedit_char_whitelist="0123456789.E%-" --psm 6
Unfortunately, that is not good enough for my needs. Note that in the output below some columns are not separated and some minus signs are missing.
What can I do to improve the results?
0.015 1.0010.623 0.09911.850.0272 0.1% 4.0 0.03%
0.020 1.0030.304 0.3211404-0.2144 0.0% 4.0 0.02%
0.030 1.0080.370 0.26214.040.1718 0.1% 3.0 0.06%
0.040 1.0170.393 0.23814.150.1412 0.2% 0.5 0.10%
0.050 1.0300.408 0.22813.76-0.1346 0.4% 0.5 0.17%
0.060 1.9031.408 0.08518.32-0.0988 15.2% 40. 7.47%
0.080 1.7390.609 0.23516.120.2033 2.2% 35. 1.23%
0.1001.6480.242-0.00619.35 0.0590 0.4% 0.5 0.17%
0.150 1.4330.076 0.62913.32-0.3336 1.5% 2 0.75%
0.2001.4880.148 0.47913.91-0.2602 2.2% 0.5 0.96%
0.3001.664-0.303 0.31614.000.2044 2.8% 0.5 1.25%
0.400-1.883.-0.408 0.24213.70-0.1576 3.0% 0.5 1.40%
0.5002.022-0.516 0.18613.77.-0.1282 3.6% 0.5 1.60%
0.6001.9750.625 0.13413.80-0.0948 3.0% 0.5 1.38%
0.8002.0540.709 0.10113.64-0.0763 2.8% 0.5 1.34%
1.00 2.0250.790 0.07414.55-0.0629 2.6% 0.5 1.28%
1.50 1.8990.912 0.03313.360.0360 1.2% 5 0.72%
2.00 1.7950.889 0.049-13.34-0.0585 2.5% 0.5 1.35%
3.00 1.6250.866 0.06813.44-0.0887 6.3% 0.5 2.67%
4.00 1.4900.854 0.08113.71-0.1057 8.0% 0.5 3.34%
5.00 1.6160.713 0.14514.15--0.1708 7.7% 0.5 4.29%
6.00 1.4820.828 0.10014.26-0.1177 11.6% 0.5 4.23%
8.00 1.4660.820 0.11614.21-0.1362 9.0% 0.5 3.85%
10.00 1.433-0.938 0.08714.14-01117 8.2% 0.5 3.54%
15.00 14151.120 0.06013.92-0.0949 7.0 0.5 3.26%
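One rough way to see how many rows are affected is to count the whitespace-separated fields on each OCR line. The sketch below assumes the table has 9 columns, which is a guess from reading the merged values above and is not stated in the question. `EXPECTED_COLUMNS` and `find_merged_rows` are hypothetical names introduced only for illustration.

```python
EXPECTED_COLUMNS = 9  # assumption: guessed from the sample, not stated in the question


def find_merged_rows(ocr_text, expected=EXPECTED_COLUMNS):
    """Return (line_number, field_count) for every non-empty line with too few fields."""
    short_rows = []
    for number, line in enumerate(ocr_text.splitlines(), start=1):
        fields = line.split()
        if fields and len(fields) < expected:
            short_rows.append((number, len(fields)))
    return short_rows


# Two lines copied from the OCR output above.
sample = """0.015 1.0010.623 0.09911.850.0272 0.1% 4.0 0.03%
0.1001.6480.242-0.00619.35 0.0590 0.4% 0.5 0.17%"""
print(find_merged_rows(sample))  # [(1, 6), (2, 5)]
```

A check like this only finds fused columns. It cannot detect a missing minus sign, because the field count stays the same.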
Answer 1
Score: 3
I solved the problem by using opencv and pytesseract. My solution was inspired by this answer.
- The key to this procedure is having well pre-processed images. In the image from the original question, you can see a small spot of black pixels just before the last column on the right. Such spots must be cleaned out! Since only a few of my tables had that problem, I removed them by hand in GIMP.
- Detect the columns in the table by applying a dilation step to the inverted grayscale image. With an appropriate number of iterations, the columns take shape, and the minus signs can also be spotted.
- With cv2.boundingRect(cnt) it is possible to crop out the individual columns and save them to disk.
- Apply pytesseract to each column, with the same options as in the original question.
- To detect the minus signs: I was lucky that only the 4th and 6th columns contained minus signs, and that my tables had exactly 25 rows (so the height of each row = height of the image / 25). Take those columns and crop each one to a width of about 40 px (a guess, based on trial and error). The crop should now contain a few white rectangles where the minus signs are located. Detect the contours of these rectangles and compute the centroid of each contour. The y-coordinate of the centroid gives the row in which the minus sign is located. Apply corrections (where needed) to the OCR results.
- Combine the columns into a CSV file.
EDIT: with this procedure I got about 98.5% accuracy.
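The minus-sign correction and the final CSV step from the procedure can be sketched in plain Python. The example assumes the centroid y-coordinates have already been obtained (for instance from cv2.moments on each contour), that all rows have equal height as described above, and that the contour-based detection is trusted over whatever sign the OCR produced. The function names and the toy values are hypothetical.

```python
import csv
import io


def row_from_centroid(cy, image_height, n_rows=25):
    """Map a centroid's y-coordinate to a 0-based row index, assuming equal row heights."""
    row_height = image_height / n_rows
    return min(int(cy // row_height), n_rows - 1)


def apply_minus_signs(values, centroid_ys, image_height):
    """Set the sign of each value from the detected minus-sign rows.

    Assumption: the detection is trusted over the OCR'd sign, so a stray '-'
    on a row without a detected minus sign is removed.
    """
    negative_rows = {row_from_centroid(cy, image_height, len(values)) for cy in centroid_ys}
    return [
        "-" + v.lstrip("-") if i in negative_rows else v.lstrip("-")
        for i, v in enumerate(values)
    ]


def columns_to_csv(columns):
    """Turn per-column value lists into a CSV string, one table row per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(zip(*columns))
    return buffer.getvalue()


# Toy data: a 4-row column from a 100 px tall crop, minus signs found at y=30 and y=90.
col_a = ["0.015", "0.020", "0.030", "0.040"]
col_b = apply_minus_signs(["0.0272", "0.2144", "0.1718", "-0.1412"], [30.0, 90.0], 100)
print(columns_to_csv([col_a, col_b]))
```

With 25 rows per table, as in the answer, `row_from_centroid` can be called with its default `n_rows`. The clamp to the last row only guards against a centroid that lands exactly on the bottom edge.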