OCR - tesseract - Extract numbers in tabular data

Question

I have a bunch of pre-processed tables that look similar to this one:

[Image: example of a pre-processed table]

After playing for a while with the parameters, I have found that this command gives me decent results:

tesseract my_img.png out -c tessedit_char_whitelist="0123456789.E%-" --psm 6

Unfortunately, this is not good enough for my needs. Note how some columns are not separated in the output, and some minus signs are missing.

What can I do to improve the results?

0.015 1.0010.623 0.09911.850.0272 0.1% 4.0 0.03%
0.020 1.0030.304 0.3211404-0.2144 0.0% 4.0 0.02%
0.030 1.0080.370 0.26214.040.1718 0.1% 3.0 0.06%
0.040 1.0170.393 0.23814.150.1412 0.2% 0.5 0.10%
0.050 1.0300.408 0.22813.76-0.1346 0.4% 0.5 0.17%
0.060 1.9031.408 0.08518.32-0.0988 15.2% 40. 7.47%
0.080 1.7390.609 0.23516.120.2033 2.2% 35. 1.23%
0.1001.6480.242-0.00619.35 0.0590 0.4% 0.5 0.17%
0.150 1.4330.076 0.62913.32-0.3336 1.5% 2 0.75%
0.2001.4880.148 0.47913.91-0.2602 2.2% 0.5 0.96%
0.3001.664-0.303 0.31614.000.2044 2.8% 0.5 1.25%
0.400-1.883.-0.408 0.24213.70-0.1576 3.0% 0.5 1.40%
0.5002.022-0.516 0.18613.77.-0.1282 3.6% 0.5 1.60%
0.6001.9750.625 0.13413.80-0.0948 3.0% 0.5 1.38%
0.8002.0540.709 0.10113.64-0.0763 2.8% 0.5 1.34%
1.00 2.0250.790 0.07414.55-0.0629 2.6% 0.5 1.28%
1.50 1.8990.912 0.03313.360.0360 1.2% 5 0.72%
2.00 1.7950.889 0.049-13.34-0.0585 2.5% 0.5 1.35%
3.00 1.6250.866 0.06813.44-0.0887 6.3% 0.5 2.67%
4.00 1.4900.854 0.08113.71-0.1057 8.0% 0.5 3.34%
5.00 1.6160.713 0.14514.15--0.1708 7.7% 0.5 4.29%
6.00 1.4820.828 0.10014.26-0.1177 11.6% 0.5 4.23%
8.00 1.4660.820 0.11614.21-0.1362 9.0% 0.5 3.85%
10.00 1.433-0.938 0.08714.14-01117 8.2% 0.5 3.54%
15.00 14151.120 0.06013.92-0.0949 7.0 0.5 3.26%

Answer 1

Score: 3

I solved the problem using OpenCV and pytesseract. My solution was inspired by this answer.

  1. The key to this procedure is to have well pre-processed images. In the image from the original question, you can see a small spot of black pixels just before the last column on the right. Such spots must be cleared out! Since I only had a few tables with that problem, I used GIMP.

  2. Detect the columns in the table by applying a dilation step to the inverted gray image. With an appropriate number of iterations, the columns take shape, and it is also possible to spot the minus signs.

  3. With cv2.boundingRect(cnt), crop out the individual columns and save them to disk.

  4. Apply pytesseract to each column, with the same options as in the original question.

  5. To detect the minus signs: I was lucky that only the 4th and 6th columns contained minus signs, and my tables had exactly 25 rows (so the height of each row = height of the image / 25). Take those columns and crop each one to a width of about 40 px (a guess based on trial and error). The crop should now contain a few white rectangles where the minus signs are located. Detect the contours of these rectangles and compute the centroid of each contour. The y-coordinate of the centroid gives the number of the row in which the minus sign is located. Apply corrections (where needed) to the OCR results.

  6. Combine the different columns into a CSV file.

EDIT: with this procedure I got about 98.5% accuracy.
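
The minus-sign correction and the final CSV merge come down to a little arithmetic, sketched here in plain Python. The function names are hypothetical. The centroid y-coordinates are assumed to have been computed already (for example from cv2.moments as m01 / m00 for each contour). Treating the detected blobs as authoritative, so that any minus sign Tesseract did read is dropped and re-added, is also an assumption of this sketch.

```python
import csv
import io


def row_from_centroid(cy, image_height, n_rows=25):
    """Map a centroid y-coordinate (pixels) to a 0-based row index."""
    row_height = image_height / n_rows
    # Clamp so a centroid on the bottom edge still maps to the last row.
    return min(int(cy // row_height), n_rows - 1)


def apply_minus_signs(values, centroid_ys, image_height, n_rows=25):
    """Prefix '-' to the OCR strings of rows where a minus blob was found."""
    minus_rows = {row_from_centroid(cy, image_height, n_rows) for cy in centroid_ys}
    fixed = []
    for i, text in enumerate(values):
        text = text.lstrip("-")  # drop signs OCR did catch, avoiding '--'
        fixed.append("-" + text if i in minus_rows else text)
    return fixed


def columns_to_csv(columns):
    """Turn equally long columns (lists of strings) into CSV text."""
    buffer = io.StringIO()
    # zip(*columns) transposes the columns into table rows.
    csv.writer(buffer, lineterminator="\n").writerows(zip(*columns))
    return buffer.getvalue()
```

For a made-up three-row column that is 75 px high, apply_minus_signs(["0.0272", "0.2144", "-0.1346"], [37.5, 60.0], 75, n_rows=3) returns ["0.0272", "-0.2144", "-0.1346"]: the blobs at y = 37.5 and y = 60.0 fall in rows 1 and 2, so those two rows get a sign.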

Posted by huangapple on 2020-01-06 18:48:48
Link: https://go.coder-hub.com/59610734.html