May 25, 2023, 18:03:26

How to eliminate words of a particular height from an image in Tesseract OCR?

Question

I want to delete the letters marked with the red box. Those letters come out as junk in the OCR output, so I want to remove those words from the image to get a clean result. Please help me remove the letters/words of that height from the image using any image processing, Tesseract, or OpenCV technique.

Answer 1

Score: 1

We can find the contours, compute the bounding rectangle of each contour, and fill the "short" contours with black (using the OpenCV package).

For better results:

Apply thresholding before calling cv2.findContours.
Fill the bounding rectangle of each contour with a small margin.

Code sample:

import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # May be required on Windows

img = cv2.imread('words.png', cv2.IMREAD_GRAYSCALE)  # Read the image as grayscale

# Apply thresholding (cv2.THRESH_OTSU picks the threshold automatically).
# This stage is needed because not all pixels are exactly 0 or 255.
thresh = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU)[1]

# [-2] selects the contour list in both OpenCV 3.x and 4.x
contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
letter_h_thresh = 30  # Letters shorter than this are considered "junk"
for c in contours:
    x, y, w, h = cv2.boundingRect(c)  # Compute the bounding rectangle
    if h < letter_h_thresh:
        # Fill the bounding rectangle with zeros, with a 2-pixel margin.
        # max(..., 0) keeps a box at the image border from producing a negative slice start.
        img[max(y-2, 0):y+h+2, max(x-2, 0):x+w+2] = 0

# Pass the preprocessed image to pytesseract
text = pytesseract.image_to_string(img, config="--psm 6")
print("Text found: " + text)  # Text found: SCHENGEN

cv2.imwrite('img.png', img)  # Save the preprocessed image for inspection

[Image: input image words.png, with some of the red markings removed]

[Image: output image img.png, used as input to pytesseract]
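A margin of 2 pixels means the start index `y-2` or `x-2` can be negative for a box at the top or left edge of the image. Left unclamped, NumPy reads a negative slice start as an offset from the end of the axis, so the fill may select nothing at all. Below is a minimal sketch of a clamped fill on a tiny synthetic array; `fill_box` and its parameters are hypothetical names introduced here, not part of OpenCV or pytesseract.

```python
import numpy as np

def fill_box(img, x, y, w, h, margin=2, value=0):
    """Fill the rectangle (x, y, w, h), grown by `margin`, clamped to the image."""
    height, width = img.shape[:2]
    y0, y1 = max(y - margin, 0), min(y + h + margin, height)
    x0, x1 = max(x - margin, 0), min(x + w + margin, width)
    img[y0:y1, x0:x1] = value
    return img

# A 6x6 all-white array with a 2x2 "junk" box in the top-left corner.
img = np.full((6, 6), 255, dtype=np.uint8)

naive = img.copy()
naive[0 - 2:0 + 2 + 2, 0 - 2:0 + 2 + 2] = 0  # slice -2:4 is empty on a 6-pixel axis

clamped = fill_box(img.copy(), x=0, y=0, w=2, h=2)

print(int((naive == 0).sum()))    # 0: nothing was filled
print(int((clamped == 0).sum()))  # 16: rows 0..3 and columns 0..3
```

The upper bounds need no special handling for correctness, since NumPy truncates a slice end that runs past the array; the `min` calls are only there for symmetry.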

If some letters are split into separate parts (such as the dot of the letter i), we can handle that with morphological operations, or use a different approach.

Use pytesseract.image_to_data to split the image into text boxes, then look for short contours inside each word box:

import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # May be required on Windows

img = cv2.imread('words.png', cv2.IMREAD_GRAYSCALE)  # Read the image as grayscale

thresh = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU)[1]  # Needed because not all pixels are exactly 0 or 255

letter_h_thresh = 30  # Letters shorter than this are considered "junk"
d = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT, config="--psm 6")
n_boxes = len(d['level'])
for i in range(n_boxes):
    if d['word_num'][i] > 0:  # Word-level entries only
        (x0, y0, w0, h0) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        # Word box with a 2-pixel margin; clamp the start so it cannot go negative
        ys, xs = max(y0-2, 0), max(x0-2, 0)
        roi = thresh[ys:y0+h0+2, xs:x0+w0+2]
        img_roi = img[ys:y0+h0+2, xs:x0+w0+2]  # View into img: writing to it modifies img
        # [-2] selects the contour list in both OpenCV 3.x and 4.x
        contours = cv2.findContours(roi, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)  # Bounding rectangle, relative to roi
            if h < letter_h_thresh:
                img_roi[y:y+h, x:x+w] = 0  # Fill the bounding rectangle with zeros (in the img_roi view)

text = pytesseract.image_to_string(img, config="--psm 6")
print("Text found: " + text)  # Text found: SCHENGEN

cv2.imwrite('img.png', img)  # Save the preprocessed image for inspection
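A coarser variant, sketched here under the assumption that Tesseract puts the junk text into its own word boxes, is to skip the contour step and filter on the word-box height reported by `image_to_data`. The helper name `short_word_boxes` and the sample values are invented for illustration; only the dictionary keys follow the layout that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns.

```python
def short_word_boxes(data, min_height):
    """Return (left, top, width, height, text) for word boxes shorter than min_height."""
    boxes = []
    for i in range(len(data["level"])):
        # word_num > 0 marks word-level entries; page/block/line entries have 0
        if data["word_num"][i] > 0 and data["height"][i] < min_height:
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i], data["text"][i]))
    return boxes

# Hypothetical sample data (values made up, not taken from a real OCR run)
sample = {
    "level":    [1, 5, 5],
    "word_num": [0, 1, 2],
    "left":     [0, 10, 200],
    "top":      [0, 12, 40],
    "width":    [300, 180, 60],
    "height":   [100, 48, 14],
    "text":     ["", "SCHENGEN", "xk"],
}

print(short_word_boxes(sample, 30))  # [(200, 40, 60, 14, 'xk')]
```

Each returned box could then be zero-filled in the image before the final OCR pass. This only helps when the short text is segmented separately from the tall text; when both share one word box, the per-contour approach above is the safer choice.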
