July 17, 2023 15:26:35


Recognize text in a PDF file rotated at any angle

Question


I have a simple program (code from the docTR library documentation) that recognizes text in a PDF file. If the text is perfectly aligned, recognition works without problems, but if the document is rotated to the right or left, recognition starts to fail.

The documents I receive are not necessarily rotated by exactly 90, 180, or 270 degrees. Crooked scans can arrive rotated at any angle (as in the pictures above).

I am looking for a way to straighten the table/text (or the whole page) in my PDF so that the text is easy to recognize, as in the picture below.

Perhaps similar solutions already exist, but I have not found them yet. I would be grateful if you could point me to existing solutions or help me write my own.

from doctr.io import DocumentFile
from doctr.models import ocr_predictor
ocr = ocr_predictor(pretrained=True)
doc = DocumentFile.from_pdf("my/path.pdf")
result = ocr(doc)
result.show(doc)

Answer 1

Score: 2


These are my thoughts on the proposed problem:

If you are scanning the tables from paper, then the document contains an image, even if it is in PDF format.
You know that you need to rotate the document for docTR to read it. From what I read in the docTR repository, you could also convert the PDF to an image and have docTR process it as an image.
Why convert the PDF to an image? Because I think the next two steps are easier if the file is an image:
First, you need to know the angle (in degrees or radians, different for each file) by which to rotate the image. To find it, scan the image for "long straight lines" (the table borders) and get their angles. You will get many angles and you only need one, so you might have to get a bit creative there. For example, you could run the last step several times with docTR at different angles and rate each result by the amount of data extracted.
Once you have your angle (or angles), rotate the image file by the angle you calculated.
Last step: use docTR to scan the rotated image.

I know this is not a copy-paste snippet. Hopefully you find an easier way to get there, but this would be my approach if nothing easier worked.
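
One way to get from "many angles" to a single one can be sketched as follows. This is only a minimal illustration, not code from docTR or OpenCV: the function name estimate_skew_degrees is hypothetical, and it assumes the line angles have already been measured in degrees (for example from a Hough transform). It relies on the assumption that table borders are either horizontal or vertical, so angles that differ by 90 degrees describe the same skew.

```python
from statistics import median


def estimate_skew_degrees(thetas_deg):
    """Collapse many line angles (in degrees) into one skew estimate in [-45, 45).

    Hypothetical helper for illustration; the input is assumed to be a list
    of line angles measured elsewhere, e.g. by a Hough transform.
    """
    if not thetas_deg:
        raise ValueError("no line angles given")
    # Horizontal and vertical borders differ by 90 degrees but share the same
    # skew. Folding into [-45, 45) also keeps 1 and 89 close together
    # (they become 1 and -1) instead of far apart.
    folded = [((t + 45.0) % 90.0) - 45.0 for t in thetas_deg]
    # The median is not thrown off by a few stray lines (stamps, noise).
    return median(folded)


# Mostly lines near 2-4 degrees (and their perpendiculars), plus one outlier.
angles = [3.0, 93.0, 2.0, 92.0, 4.0, 47.0]
print(estimate_skew_degrees(angles))
```

Whether the page then has to be rotated by this value or by its negative depends on how the angles were measured, so it is worth checking the sign once on a test image.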

Answer 2

Score: 1


You can use OCRmyPDF, which is a very good OCR tool:

ocrmypdf --rotate-pages input_scanned.pdf output.pdf

The --rotate-pages flag fixes pages that are misrotated. OCRmyPDF also has a --deskew flag for pages that are slightly crooked. I have no experience with docTR.

Answer 3

Score: 1


You can use a detection model that has been trained on rotated documents and pass the option assume_straight_pages accordingly:

from doctr.models import detection_predictor
predictor = detection_predictor('db_resnet50_rotation', pretrained=True, assume_straight_pages=False, preserve_aspect_ratio=True)

Here is the official documentation.

Answer 4

Score: 1


Before using docTR for your task, you can use Tesseract OCR to rotate your PDF image according to the alignment of the text. The source code and a detailed implementation are provided here:
https://pyimagesearch.com/2022/01/31/correcting-text-orientation-with-tesseract-and-python/

Your flow might look as follows:

Read pdf and get image.
Send image to tesseract ocr for realignment.
Send the response to DocTR for character recognition.

Answer 5

Score: 1


Step 1: Convert pdf to image.

Step 2: Read image with opencv

import math
import cv2
import matplotlib.pyplot as plt
import numpy as np
img = cv2.imread("test.png", 0)  # 0 = read as grayscale

Step 3: Preprocess if needed. See thresholding etc. (not done here because it is not needed for your example image)

Step 4: Use Canny edge detection and Hough lines

dst = cv2.Canny(img, 50, 200, None, 3) #see Canny docs
lines = cv2.HoughLines(dst, 1, np.pi / 180, 150, None, 0, 0) # See docs

Step 5: Convert all your lines angles to degrees and find some best fit.

if lines is None:
    raise SystemExit("No lines found; try a lower Hough threshold")
deg_lines = [round(np.rad2deg(i[0][1])) % 90 for i in lines]
# lines is in the format [[rho, theta]], with theta in radians
# We also mod by 90 because the lines on the page should be orthogonal, i.e. 90 degrees apart
# deg_lines now contains the angles, in degrees, of all lines found in the image
candidates_angle = round(np.mean(deg_lines))  # or use the median/mode
# candidates_angle now contains, to the nearest degree, the current orientation angle of your doc. Rotate it to the correct angle and you should be good.
cdst = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)  # colour copy, just to visualize your lines
for rho, theta in lines[:, 0]:
    a = math.cos(theta)
    b = math.sin(theta)
    x0 = a * rho
    y0 = b * rho
    # Two points far apart on the line, so it spans the whole image
    pt1 = (int(x0 + 1000 * (-b)), int(y0 + 1000 * a))
    pt2 = (int(x0 - 1000 * (-b)), int(y0 - 1000 * a))
    cv2.line(cdst, pt1, pt2, (0, 0, 255), 3, cv2.LINE_AA)
plt.imshow(cv2.cvtColor(cdst, cv2.COLOR_BGR2RGB))  # matplotlib expects RGB
plt.show()

Step 6: Rotate your image (see the cv2 or PIL libraries for how to rotate an image) and run your code on it. Your code already works on straight pages, so it should work here.

Additional docs:

>https://docs.opencv.org/3.4/d9/db0/tutorial_hough_lines.html

>https://docs.opencv.org/3.4/da/d22/tutorial_py_canny.html

>https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html

Please let me know if you have any additional questions.
