Can xdmp:pdf-convert perform optical character recognition (OCR)?

Question

Looking at the options in the documentation for xdmp:pdf-convert, it seems like MarkLogic can perform OCR but my testing of it has not been successful. The ignore-text option in the documentation reads:

> Enable/disable extraction of text from images. Documents consisting of scanned pages can only have text extracted if this parameter is set to true; however, diagrams with embedded text labels may be less palatable. For page-by-page conversion, the problem with reflowing of text and graphical elements within a diagram giving poor results is not such a problem, and the value of false will probably be the better choice.

However, in my tests with PDFs containing scanned pages, no text is being extracted. I have even tried creating my own PDF with a screenshot of lorem ipsum text. MarkLogic correctly extracts the images into their own files, but the resulting XHTML only contains a reference to the image. Has anyone had success using xdmp:pdf-convert to perform OCR, or have you had to use another tool for the OCR? In the end we would like to make the scanned PDFs searchable and available for parsing and transformation.

Sample XHTML created from my simple PDF:

<body class="font-0">
    <span class="pageStart" id="pgs0001">
    </span>
    <p
        style="text-align: left; line-height: 15.6pt; text-indent: 0pt; margin-left: 0pt; margin-right: 0pt; padding-left: 58.91pt; padding-top: 0; padding-bottom: 0; padding-right: 0; z-index: 100;">
        <span class="textStyle0">
            <a name="t0" id="t0">
            </a>
            This is text I typed into the PDF
        </span>
    </p>
    <div
        style="width: 768.00pt; height: 336.00pt; clip: rect(0pt, 768.00pt, 336.00pt, 0pt); margin-left: 49.09pt; margin-top: 0; margin-bottom: 0; margin-right: 0; padding: 0 0 0 0; z-index: 00;">
        <img src="testOcrPdf_pdf_parts/0001_00.jpg" width="768.00pt" height="336.00pt" border="0"
            alt="testOcrPdf_pdf_parts/0001_00.jpg(1587x695)">
        </img>
    </div>
    <span class="pageEnd" id="pge0001">
    </span>
</body>

Answer 1

Score: 1

MarkLogic does not provide OCR on PDF (or other) documents. You'll need to use something external.
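
Because the conversion only emits image references for scanned content, one way to approach the external step is to collect those references from the generated XHTML and hand each image to an OCR tool of your choice. The following is a minimal sketch in Python, assuming the XHTML has already been exported from MarkLogic as a string. The names `ImageOnlyFinder` and `ocr_candidates` are hypothetical, and the OCR call itself is left out because it depends on the tool you pick.

```python
from html.parser import HTMLParser


class ImageOnlyFinder(HTMLParser):
    """Collect img/@src values and visible text from converted XHTML."""

    def __init__(self):
        super().__init__()
        self.images = []  # every img src seen, in document order
        self.text = []    # non-empty text nodes, whitespace-trimmed

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "img" also matches "IMG".
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def ocr_candidates(xhtml):
    """Return (image paths to send to OCR, text already extracted)."""
    finder = ImageOnlyFinder()
    finder.feed(xhtml)
    finder.close()
    return finder.images, " ".join(finder.text)


# Trimmed-down version of the sample output from the question.
sample = """<body class="font-0">
  <p><span class="textStyle0">This is text I typed into the PDF</span></p>
  <div><img src="testOcrPdf_pdf_parts/0001_00.jpg" alt="scan"></img></div>
</body>"""

images, text = ocr_candidates(sample)
print(images)  # ['testOcrPdf_pdf_parts/0001_00.jpg']
print(text)    # This is text I typed into the PDF
```

Each path in `images` points at a file that xdmp:pdf-convert wrote alongside the XHTML. The text returned by the external OCR tool could then be stored next to the converted document so that the scanned content becomes searchable.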

Posted by huangapple on 2023-08-09 00:06:20
Source: https://go.coder-hub.com/76861356.html