从PDF文件中提取印地文文本

huangapple go评论59阅读模式
英文:

Extract hindi Text from a PDF file

问题

我正在处理从PDF文件中提取信息(使用印地语)并将其转换成数据框的任务。

我尝试了许多方法,查看了很多文章和Stack Overflow上的答案。我尝试了不同的库,如easy OCR、paddle OCR等,但未能获得正确的输出。

以下是我尝试的一些内容:

  1. https://stackoverflow.com/questions/67816185/how-to-improve-hindi-text-extraction
  2. https://stackoverflow.com/questions/28677544/how-do-i-display-the-contours-of-an-image-using-opencv-python
  3. https://amannair723.medium.com/pdf-to-excel-using-advance-python-nlp-and-computer-vision-aka-document-ai-23cc0fb56549

似乎我无法获取确切的轮廓来创建边界框。下面您可以看到我得到的输出图像。

从PDF文件中提取印地文文本

我所需要的只是将这些信息转换成一个数据框,其中的列应该是:-
नाम:
पति का नाम / पिता का नाम:
मकान संख्याः等等。

以下是我用于获取数据的代码:

import cv2
import pytesseract
import numpy as np
from pytesseract import Output

image = cv2.imread('pages_new/page3.jpg')
img = image.copy()
mask = np.zeros(image.shape, dtype=np.uint8)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# 使用轮廓区域和宽高比进行ROI筛选
contours = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
contours = contours[0] if len(contours) == 2 else contours[1]
for c in contours:
    area = cv2.contourArea(c)
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.05 * peri, True)
    x, y, w, h = cv2.boundingRect(approx)
    aspect_ratio = w / float(h)
    if area > 10000 and aspect_ratio > 0.5:
        mask[y:y+h, x:x+w] = image[y:y+h, x:x+w]

h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img)
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)

d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['text'])
for i in range(n_boxes):
    if int(d['conf'][i]) > 60:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        img2 = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

# 使用Pytesseract执行OCR
data = pytesseract.image_to_string(mask, lang='Devanagari', config='--psm 6')
print(data)

此外,有人可以确认页面上的信息是否会更改,我们是否需要为所有文档编写不同的代码,还是可以为所有文档获得通用脚本?

英文:

I am working on a task to extract some information (in HINDI) from a pdf file and convert it into a data frame.

I have tried many things and followed many articles, and answers on stack overflow as well. I tried different libraries like easy OCR, paddle OCR, and others but was unable to get the correct output.

Here is the link for the document. Link

Things I have tried:

  1. https://stackoverflow.com/questions/67816185/how-to-improve-hindi-text-extraction
  2. https://stackoverflow.com/questions/28677544/how-do-i-display-the-contours-of-an-image-using-opencv-python
  3. https://amannair723.medium.com/pdf-to-excel-using-advance-python-nlp-and-computer-vision-aka-document-ai-23cc0fb56549

It seems that I am unable to get the exact contours to create the bounding box. Below you can see the image of the output I am getting.

从PDF文件中提取印地文文本

All I need is to convert this information to a data frame where the columns would be:-
नाम:
पति का नाम / पिता का नाम:
मकान संख्याः an so on.

Below is the code I am using to get data:-

import cv2
import pytesseract
import numpy as np
from pytesseract import Output
image = cv2.imread('pages_new/page3.jpg')
img = image.copy()
mask = np.zeros(image.shape, dtype=np.uint8)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
# Filter for ROI using contour area and aspect ratio
countour = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
countour = countour[0] if len(countour) == 2 else countour[1]
for c in countour:
    area = cv2.contourArea(c)
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.05 * peri, True)
    x,y,w,h = cv2.boundingRect(approx)
    aspect_ratio = w / float(h)
    if area > 10000 and aspect_ratio > .5:
        mask[y:y+h, x:x+w] = image[y:y+h, x:x+w]
h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img) 
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['text'])
for i in range(n_boxes):
    if int(d['conf'][i]) > 60:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        img2 = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# cv2.imshow('img', img)
# cv2.imshow('img2', img2)
# Perfrom OCR with Pytesseract
data = pytesseract.image_to_string(mask, lang='Devanagari', config='--psm 6')
print(data)
# cv2.imshow('thresh', thresh)
# cv2.imshow('mask', mask)

Also, Could anyone please confirm if the information on the page changes, Do we have to write a different code for all the documents or we can get a generic script for all the docs?

答案1

得分: 0

以下是您要翻译的部分:

"I'm also working on the same concept, extracting voters' info from the voters list downloaded from the CEO website. I'm doing for Karnataka, in Kanada. I'm using freely available tools online to convert the PDF to image. Then using Matlab to split the image of one full page into each voter's box.

I did so because none of the available converters perform well with full page with so many bounding boxes. Also, the most important data to extract, is the voter's number in the list. This is not clear enough at least in our PDFs to be extracted as English digits.

So, I calculate the number based on the files row and column. For the voters in separate files, I manually enter the data.

So, the concept is to:

  1. Manually convert one pdf file into multiple images manually using any freely available website.

  2. Convert the images black and white so that you can split into boxes very easily without any computer vision tools. (Based on lines).

  3. Save each of the extracted box as image file for further reference or send it to OCR library.

  4. Epic number, name, house number and father, guardian name can be extracted easily. But the main number required (in my case, not sure about yours), the number in the list, needs to be calculated and manually verified as it cannot be extracted by any of the available libraries.

Here is the library, Tesseract that I'm using to convert Kanada text in the image. This is opensource library and can extract Hindi too.

So instead of re-inventing the wheel, you can do some manual processing and then use this tool which will give better text extraction which will save a lot of time.

I will soon upload my code files into a pubic github repo and share it here. Using the files, you can convert pages to images of each user. This you can use as per Ur requirements. The code is in Matlab.

https://github.com/tesseract-ocr/tesseract

[PS: This may, for Karnataka elections, I'll be publishing my android/iOS apps in the stores. The apps will have this voters list in English. Easily searchable using names, house numbers or the voters' EPIC numbers. For last Karnataka assembly elections, I had published only android app in google play store. It got removed due to privacy policy issue. Need to clearly mention that the info is available in freely public domain.]"

英文:

I'm also working on the same concept, extracting voters' info from the voters list downloaded from the CEO website. I'm doing for Karnataka, in Kanada. I'm using freely available tools online to convert the PDF to image. Then using Matlab to split the image of one full page into each voter's box.

I did so because none of the available converters perform well with full page with so many bounding boxes. Also, the most important data to extract, is the voter's number in the list. This is not clear enough at least in our PDFs to be extracted as English digits.

So, I calculate the number based on the files row and column. For the voters in separate files, I manually enter the data.

So, the concept is to:

  1. Manually convert one pdf file into multiple images manually using any freely available website.

  2. Convert the images black and white so that you can split into boxes very easily without any computer vision tools. (Based on lines).

  3. Save each of the extracted box as image file for further reference or send it to OCR library.

  4. Epic number, name, house number and father, guardian name can be extracted easily. But the main number required (in my case, not sure about yours), the number in the list, needs to be calculated and manually verified as it cannot be extracted by any of the available libraries.

Here is the library, Tesseract that I'm using to convert Kanada text in the image. This is opensource library and can extract Hindi too.

So instead of re-inventing the wheel, you can do some manual processing and then use this tool which will give better text extraction which will save a lot of time.

I will soon upload my code files into a pubic github repo and share it here. Using the files, you can convert pages to images of each user. This you can use as per Ur requirements. The code is in Matlab.

https://github.com/tesseract-ocr/tesseract

[PS: This may, for Karnataka elections, I'll be publishing my android/iOS apps in the stores. The apps will have this voters list in English. Easily searchable using names, house numbers or the voters' EPIC numbers. For last Karnataka assembly elections, I had published only android app in google play store. It got removed due to privacy policy issue. Need to clearly mention that the info is available in freely public domain.]

答案2

得分: 0

以下是代码部分的中文翻译:

# 使用以下代码获取数据:
import cv2
import numpy as np
import pdf2image
import pytesseract

# 从PDF中提取第3页并保持适当的质量
page_3 = np.array(pdf2image.convert_from_path('ROLL_Download.aspx.pdf',
                                              first_page=3, last_page=3,
                                              dpi=300, grayscale=True)[0])

# 反向二值化以进行轮廓查找
thr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1]

cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000]
no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED)
data = []
rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables], key=lambda r: (r[1], r[0]))
for i_r, (x, y, w, h) in enumerate(rects, start=1):

    cnts = cv2.findContours(page_3[y+1:y+h-1, x+1:x+w-1], cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]

    inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: (r[1], r[0]))

    print('\n提取表格内文本 {}\n'.format(i_r))
    for (xx, yy, ww, hh) in inner_rects:
        # 根据完整图像设置当前坐标
        xx += x
        yy += y

        # 获取当前单元格
        cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2]

        # 在数字周围进行泛洪填充
        ys, xs = np.min(np.argwhere(cell == 0), axis=0)
        print("xs 和 ys 的值:{}{}".format(xs, ys))
        temp = cv2.floodFill(cell.copy(), None, (xs, ys), 255)[1]
        mask = cv2.morphologyEx(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(), cv2.MORPH_DILATE, np.full((ww, hh), 255))

    cnts = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda b: b[0])

    # 从当前单元格的每个部分提取文本
    for x_b, y_b, w_b, h_b in boxes:
        text = pytesseract.image_to_string(
            temp[y_b:y_b+h_b, x_b:x_b+w_b],
            config='--psm 6',
            lang='Devanagari')
        print('x: {}, y: {}, 文本:\n{}'.format(xx, yy, text))
        data.append(text)

请注意,上述内容是代码的翻译,不包括代码之外的任何其他内容。

英文:

I am able to get the data using the below code:-

import cv2
import numpy as np
import pdf2image
import pytesseract
# Extract page 3 from PDF in proper quality
page_3 = np.array(pdf2image.convert_from_path('ROLL_Download.aspx.pdf',
first_page=3, last_page=3,
dpi=300, grayscale=True)[0])
# Inverse binarize for contour finding
thr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1]
cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000]
no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED)
data = []
rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables], key=lambda r: (r[1], r[0]))
for i_r, (x, y, w, h) in enumerate(rects, start=1):
cnts = cv2.findContours(page_3[y+1:y+h-1, x+1:x+w-1], cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]    
inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: (r[1], r[0]))
print('\nExtract texts inside table {}\n'.format(i_r))
for (xx, yy, ww, hh) in inner_rects:
# Set current coordinates w.r.t. full image
xx += x
yy += y
# Get current cell
cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2]
# Floodfill rectangles around numbers
ys, xs = np.min(np.argwhere(cell == 0), axis=0)
print("The value for xs:{} and ys:{}".format(xs, ys))
temp = cv2.floodFill(cell.copy(), None, (xs, ys), 255)[1]
mask = cv2.morphologyEx(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(), cv2.MORPH_DILATE, np.full((ww, hh),255))
cnts = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda b: b[0])
# Extract texts from each part of the current cell
for x_b, y_b, w_b, h_b in boxes:
#             print("The value for i_b is:",i_b)
text = pytesseract.image_to_string(
temp[y_b:y_b+h_b, x_b:x_b+w_b],
config='--psm 6',
lang='Devanagari')
#text = text.replace('\f', '')
print('x: {}, y: {}, text:\n{}'.format(xx, yy, text))
data.append(text)

Below is the output:-

Extract texts inside table 1
The value for xs:13 and ys:0
x: 103, y: 209, text:
1 WEZ1761006
नाम : भीमसेन
पिता का नाम : बच्चू सिंह
मकान संख्या: देव नगर Photo is
आयु : 33 लिंग : पुरुष Available
The value for xs:13 and ys:0
x: 857, y: 209, text:
2 WEZ1391713
नाम : पूजा कुमारी
पिता का नाम : विपिन सोनी
मकान संख्याः वार्ड नं1 Photo is
आयु : 23 लिग : स्त्री Available
The value for xs:13 and ys:0
x: 1610, y: 209, text:
3 WEZ1781897
नाम : सोनू
पति का नाम : राजू
मकान संख्याः वार्ड नं2 Photo is
आयु : 3 लिग : स्त्री Available
Extract texts inside table 2
The value for xs:13 and ys:41
x: 103, y: 507, text:
#174 WEZ1735174
नाम : रागिणी कुमारी कामत
पिता का नाम : संतोष कामत
मकान संख्याः 31 Photo is
आयु : 19 लिग : स्त्री Available
The value for xs:13 and ys:41
x: 857, y: 507, text:
5 WEZ1766005
नाम : पर्तीक सिंग चिडे
माता का नाम : कुलविंदर कौर
मकान संख्याः देव नगर ,वार्ड नं. 2 Photo is
आयु : 20 लिग : पुरुष Available
The value for xs:13 and ys:41
x: 1610, y: 507, text:
[|] WEZ1755230
नाम : रीता देवी
पति का नाम : प्रेम यादव
मकान संख्या: हाऊस नं. 05 Photo is
आयु : ॐ लिग : स्त्री Available
Extract texts inside table 3
The value for xs:13 and ys:0
x: 103, y: 807, text:
7 WEZ1758721
नाम : विश्व जीत वर्मा
पिता का नाम : राम चद्र
मकान संख्या: हाऊस नं. 10, वार्ड नं. 2 Photo is
आयु : 25 लिंग : पुरुष Available
The value for xs:13 and ys:0
x: 857, y: 807, text:
। | WEZ1758739
नाम : हिम्मत वर्मा
पिता का नाम : राम चद्र
मकान संख्या: हाऊस नं. 10, वार्ड नं. 2 Photo is
आयु : 23 लिंग : पुरुष Available
The value for xs:13 and ys:0
x: 1610, y: 807, text:
[१ WEZ1427087
नाम : सोनू यादव
पिता का नाम : ददन यादव
मकान संख्या: हाऊस नं. 228 Photo is
आयु : 23 लिंग : पुरुष Available

huangapple
  • 本文由 发表于 2023年3月31日 15:58:03
  • 转载请务必保留本文链接:https://go.coder-hub.com/75896133.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定