How to eliminate words of a particular height from an image in Tesseract OCR?
Question
I want to delete the letters marked with red boxes. Those letters come out as junk in the OCR output, so I want to remove those words from the image to get a clean result. Please help me remove the letters/words of that height from the image using any image processing, Tesseract, or OpenCV technique.
Answer 1
Score: 1
We can find the contours, compute the bounding rectangle of each contour, and fill the "short" contours with black (using the OpenCV package).
For better results:
- Apply thresholding before calling cv2.findContours.
- Fill the bounding rectangle of each contour with a small margin.
Code sample:
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # May be required when using Windows

img = cv2.imread('words.png', cv2.IMREAD_GRAYSCALE)  # Read image in grayscale format

# Apply thresholding (use `cv2.THRESH_OTSU` for automatic thresholding).
# We need this stage because not all pixels are exactly 0 or 255.
thresh = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU)[1]

# Find contours ([-2] selects the contour list in both OpenCV 3.x and 4.x)
contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]

letter_h_thresh = 30  # Letters with a smaller height are considered "junk"

for c in contours:
    x, y, w, h = cv2.boundingRect(c)  # Compute the bounding rectangle
    if h < letter_h_thresh:
        # Fill the bounding rectangle with zeros (with a 2-pixel margin).
        # max() keeps the slice start from going negative near the image border.
        img[max(y-2, 0):y+h+2, max(x-2, 0):x+w+2] = 0

# Pass the preprocessed image to pytesseract
text = pytesseract.image_to_string(img, config="--psm 6")
print("Text found: " + text)  # Text found: SCHENGEN

cv2.imwrite('img.png', img)  # Save img for testing
Input image words.png (removed some of your red markings):
Output image img.png (used as input to pytesseract):
If some letters are split into separate parts (like the dot and stem of the letter i), we can handle it with morphological operations, or use a different approach.
A different approach: use pytesseract.image_to_data to split the image into text boxes, then look for short contours inside each box:
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # May be required when using Windows

img = cv2.imread('words.png', cv2.IMREAD_GRAYSCALE)  # Read image in grayscale format

# We need this stage because not all pixels are exactly 0 or 255.
thresh = cv2.threshold(img, 0, 255, cv2.THRESH_OTSU)[1]

letter_h_thresh = 30  # Letters with a smaller height are considered "junk"

d = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT, config="--psm 6")
n_boxes = len(d['level'])

for i in range(n_boxes):
    if d['word_num'][i] > 0:
        (x0, y0, w0, h0) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        #cv2.rectangle(img, (x0, y0), (x0 + w0, y0 + h0), 128, 2)
        # Clamp the 2-pixel margin so the slice start never goes negative
        top, left = max(y0-2, 0), max(x0-2, 0)
        roi = thresh[top:y0+h0+2, left:x0+w0+2]
        img_roi = img[top:y0+h0+2, left:x0+w0+2]  # Slice in img (a view, so writes modify img)
        # Find contours in roi ([-2] works in both OpenCV 3.x and 4.x)
        contours = cv2.findContours(roi, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)  # Compute the bounding rectangle
            if h < letter_h_thresh:
                img_roi[y:y+h, x:x+w] = 0  # Fill the bounding rectangle with zeros (in img_roi slice)

text = pytesseract.image_to_string(img, config="--psm 6")
print("Text found: " + text)  # Text found: SCHENGEN

cv2.imwrite('img.png', img)  # Save img for testing
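The value letter_h_thresh = 30 in both samples is tuned by hand for this particular image. If the images vary in scale, one possible alternative is to derive the threshold from the measured heights. The sketch below is an assumption, not something from the answer: estimate_height_threshold and its ratio parameter are hypothetical, and it only makes sense when the text you want to keep is the tallest text in the image.

```python
def estimate_height_threshold(boxes, ratio=0.6):
    """Return a height cutoff as a fraction of the tallest box.

    boxes: iterable of (x, y, w, h) tuples, e.g. from cv2.boundingRect.
    ratio: hypothetical tuning knob; boxes shorter than
           ratio * tallest height would be treated as junk.
    """
    heights = [h for (_x, _y, _w, h) in boxes]
    if not heights:
        return 0  # No boxes: nothing can be classified as junk
    return int(ratio * max(heights))

# Example with made-up bounding boxes: two tall letters, two short junk marks
boxes = [(0, 0, 20, 42), (25, 0, 20, 40), (50, 30, 8, 12), (60, 32, 8, 10)]
letter_h_thresh = estimate_height_threshold(boxes)
junk = [b for b in boxes if b[3] < letter_h_thresh]
```

In the first sample, the boxes could be collected with [cv2.boundingRect(c) for c in contours] before the loop. A single very tall noise component (for example a vertical border line) would inflate the cutoff, so the result is worth checking on a few real images.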