How can I remove rectangular contour and extract text with OpenCV for pytesseract on an image?

Question

I want to extract the text from this image. I tried removing the rectangular contours, so I started by detecting the horizontal and vertical lines that form the boxes. However, some character pixels were mistakenly identified as vertical lines.

My goal is to obtain a clean image without the rectangular boxes, containing only the lines of text, so that I can then apply pytesseract for text extraction.

Can you help with any suggestions to remove the rectangular boxes?

Thank you!

import cv2
import matplotlib.pyplot as plt

# Load the image with OpenCV (BGR channel order, which COLOR_BGR2GRAY expects)
image = cv2.imread("sample.png")
result = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Inverted Otsu threshold: text and box lines become white on a black background
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
remove_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(remove_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]  # handle both OpenCV 3 and 4 return formats
for c in cnts:
    cv2.drawContours(result, [c], -1, (255, 255, 255), 5)  # paint detected lines white
plt.imshow(result)

[Image: result after removing horizontal lines]

# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
remove_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(remove_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]  # handle both OpenCV 3 and 4 return formats
for c in cnts:
    cv2.drawContours(result, [c], -1, (255, 255, 255), 5)  # paint detected lines white

plt.imshow(result)

[Image: result after removing horizontal and vertical lines]

Answer 1

Score: 1


You can try to find connected components in the image and filter out those that are too wide or tall.
For example:

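import cv2
import numpy as np

im = cv2.imread('0AASU.png', cv2.IMREAD_GRAYSCALE)
# Invert so that text and box lines are white (foreground) on black
im_monochrome = cv2.threshold(im, 127, 255, cv2.THRESH_BINARY_INV)[1]
_, labels, stats, _ = cv2.connectedComponentsWithStats(im_monochrome)
# Select components wider or taller than 150 px: the boxes and, normally,
# the background (label 0), whose bounding box spans the whole image
idx = np.nonzero((stats[:, cv2.CC_STAT_WIDTH] > 150) | (stats[:, cv2.CC_STAT_HEIGHT] > 150))[0]
# Paint the selected components white; the remaining (text) components stay black
result = 255 * np.uint8(np.isin(labels, idx))
cv2.imwrite('result.png', result)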

Posted by huangapple on June 5, 2023, 10:44:29
When reposting, please keep the link to this article: https://go.coder-hub.com/76403242.html