Question

Looking at the options in the documentation for xdmp:pdf-convert, it seems that MarkLogic can perform OCR, but my tests have not been successful. The ignore-text option in the documentation reads:

> Enable/disable extraction of text from images. Documents consisting of scanned pages can only have text extracted if this parameter is set to true; however, diagrams with embedded text labels may be less palatable. For page-by-page conversion, the problem with reflowing of text and graphical elements within a diagram giving poor results is not such a problem, and the value of false will probably be the better choice.

However, in my tests with PDFs containing scanned pages, no text is extracted. I even created my own PDF containing a screenshot of lorem ipsum text. MarkLogic correctly extracts the images into their own files, but the resulting XHTML contains only a reference to each image. Has anyone had success using xdmp:pdf-convert to perform OCR, or have you had to use another tool? Ultimately, we would like to make the scanned PDFs searchable and available for parsing and transformation.

Sample XHTML created from my simple PDF:

<body class="font-0">
    <span class="pageStart" id="pgs0001">
    </span>
    <p
        style="text-align: left; line-height: 15.6pt; text-indent: 0pt; margin-left: 0pt; margin-right: 0pt; padding-left: 58.91pt; padding-top: 0; padding-bottom: 0; padding-right: 0; z-index: 100;">
        <span class="textStyle0">
            <a name="t0" id="t0">
            </a>
            This is text I typed into the PDF
        </span>
    </p>
    <div
        style="width: 768.00pt; height: 336.00pt; clip: rect(0pt, 768.00pt, 336.00pt, 0pt); margin-left: 49.09pt; margin-top: 0; margin-bottom: 0; margin-right: 0; padding: 0 0 0 0; z-index: 00;">
        <img src="testOcrPdf_pdf_parts/0001_00.jpg" width="768.00pt" height="336.00pt" border="0"
            alt="testOcrPdf_pdf_parts/0001_00.jpg(1587x695)">
        </img>
    </div>
    <span class="pageEnd" id="pge0001">
    </span>
</body>
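
Because the converted XHTML carries only references to the extracted images, one way to see what would still need OCR is to list those references. The following is a minimal Python sketch using only the standard library. The helper name image_sources is hypothetical, and the approach assumes the XHTML has already been read out of the database as a string.

```python
from html.parser import HTMLParser
from typing import List


class _ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> element it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(xhtml: str) -> List[str]:
    """Return the image paths referenced by the converted XHTML, in document order."""
    parser = _ImgCollector()
    parser.feed(xhtml)
    parser.close()
    return parser.sources
```

For the sample above this would yield the single path testOcrPdf_pdf_parts/0001_00.jpg, which is the file an OCR tool would need to process.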

Answer 1

Score: 1

MarkLogic does not provide OCR on PDF (or other) documents. You'll need to use something external.
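
The answer does not name a tool. As one possible illustration, the sketch below assumes the open-source OCRmyPDF command-line tool (ocrmypdf) is installed and adds a text layer to a scanned PDF before it is loaded into MarkLogic. OCRmyPDF is not part of MarkLogic, and the function names and file names here are hypothetical.

```python
import shutil
import subprocess
from typing import List


def build_ocr_command(src_pdf: str, dst_pdf: str, language: str = "eng") -> List[str]:
    """Build the ocrmypdf invocation for one PDF."""
    # --skip-text leaves pages that already have a text layer untouched;
    # -l selects the OCR language.
    return ["ocrmypdf", "--skip-text", "-l", language, src_pdf, dst_pdf]


def add_text_layer(src_pdf: str, dst_pdf: str, language: str = "eng") -> None:
    """Run OCR on src_pdf and write a searchable copy to dst_pdf."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(build_ocr_command(src_pdf, dst_pdf, language), check=True)
```

A call such as add_text_layer("scan.pdf", "scan_ocr.pdf") would produce a PDF with an embedded text layer. Since xdmp:pdf-convert did extract the typed text in the sample above, the OCR'd copy should convert like any other text-bearing PDF, though that is worth verifying on real scans.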
