January 4, 2020, 01:27:52


iText 7 need to skip reading page header elements

Question


I am using an EventHandler to create the page header for my PDF. The content of the header is added to a Table before being added to the Canvas. As part of 508 compliance, I need to exclude the header content from being read out loud. How do I accomplish this?

public class TEirHeaderEventHandler : IEventHandler 
{
    public void HandleEvent(Event e)
    {
        PdfDocumentEvent docEvent = (PdfDocumentEvent)e;
        PdfDocument pdf = docEvent.GetDocument();
        PdfPage page = docEvent.GetPage();
        PdfCanvas headerPdfCanvas = new PdfCanvas(page.NewContentStreamBefore(), page.GetResources(), pdf);
        Rectangle headerRect = new Rectangle(60, 725, 495, 96);
        Canvas headerCanvas = new Canvas(headerPdfCanvas, pdf, headerRect);
        //creating content for header
        CreateHeaderContent(headerCanvas);
        headerCanvas.Close();
    }
    private void CreateHeaderContent(Canvas canvas)
    {
        //Create header content
        Table table = new Table(UnitValue.CreatePercentArray(new float[] { 60, 25, 15 } ));
        table.SetWidth(UnitValue.CreatePercentValue(100));
        Cell cell1 = new Cell().Add(new Paragraph("Establishment Inspection Report").SetBold().SetTextAlignment(TextAlignment.LEFT));
        cell1.SetBorder(Border.NO_BORDER);
        table.AddCell(cell1);
        Cell cell2 = new Cell().Add(new Paragraph("FEI Number:").SetBold().SetTextAlignment(TextAlignment.RIGHT));
        cell2.SetBorder(Border.NO_BORDER);
        table.AddCell(cell2);
        Cell cell3 = new Cell().Add(new Paragraph(_feiNum).SetBold().SetTextAlignment(TextAlignment.RIGHT));
        cell3.SetBorder(Border.NO_BORDER);
        table.AddCell(cell3);
        canvas.Add(table);
    }
}
public static void CreatePdf()
{
    using (MemoryStream writeStream = new MemoryStream())
    using (FileStream inputHtmlStream = File.OpenRead(inputHtmlFile))
    {
        PdfDocument pdf = new PdfDocument(new PdfWriter(writeStream));
        pdf.SetTagged();
        iTextDocument document = new iTextDocument(pdf);           
        TEirHeaderEventHandler teirEvent = new TEirHeaderEventHandler();
        pdf.AddEventHandler(PdfDocumentEvent.START_PAGE, teirEvent);
        //Convert html to pdf
        HtmlConverter.ConvertToDocument(inputHtmlStream, pdf, properties);
        document.Close();
        byte[] bytes = TEirReorderingPages(writeStream, numOfPages);
        File.WriteAllBytes(outputPdfFile, bytes);
    }
}

Note that I have set the document to be tagged, but I still get the "Reading Untagged Document" screen when I open the file. When I activate the Read Out Loud feature, all of the content is read, including the header. Any input or suggestions would be appreciated. Thank you in advance for your help.

Answer 1

Score: 2


General

The approach suggested by Alexey Subach is generally correct. You mark the content as an artifact to differentiate it from real content.

element.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);

This marks the content as an artifact in the content stream and excludes the element from the structure tree.
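
The snippet above uses the Java naming of the iText API; the .NET port used in the question exposes the same calls in PascalCase. As a minimal sketch, assuming iText 7.1 or later for .NET (where `StandardRoles` and `iText.Layout.Tagging.IAccessibleElement` are available), the following writes a tagged PDF with one real paragraph and one paragraph marked as an artifact. The `AsArtifact` helper, the class name, and the output file name are hypothetical and exist only for illustration.

```csharp
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Tagging;
using iText.Layout;
using iText.Layout.Element;
using iText.Layout.Tagging;

public static class ArtifactExample
{
    // Hypothetical helper (not part of iText): sets the ARTIFACT role on any
    // accessible layout element and returns the same element for chaining.
    public static T AsArtifact<T>(T element) where T : IAccessibleElement
    {
        element.GetAccessibilityProperties().SetRole(StandardRoles.ARTIFACT);
        return element;
    }

    public static void Main()
    {
        // Output file name is an assumption for this sketch.
        PdfDocument pdf = new PdfDocument(new PdfWriter("artifact-example.pdf"));
        pdf.SetTagged();
        Document document = new Document(pdf);

        // Added with its default role, so it is tagged as real content.
        document.Add(new Paragraph("Real content that should be read."));

        // Marked as an artifact before it is added to the document.
        document.Add(AsArtifact(new Paragraph("Decorative text that should be skipped.")));

        // Closing the Document also closes the underlying PdfDocument.
        document.Close();
    }
}
```

Because the helper is constrained to `IAccessibleElement`, the same call should also work for a `Table` or any other layout element that implements that interface.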

Your case

However, your specific case is more nuanced.

For a well tagged PDF document, the proper way to read it out loud is to process the structure tree, which is a data structure that represents the logical reading order of the (semantic) elements of the document, such as paragraphs, tables and lists.

Because of the way you are creating the header content, it is not automatically tagged: a Canvas instance that is created from a PdfCanvas instance has autotagging disabled by default. So the table in the header is not marked in the content stream and it is not included in the structure tree. Marking it explicitly as an artifact, with the approach described above in General, should not make a significant difference because it was not in the structure tree to begin with.

If you enable autotagging by adding headerCanvas.enableAutoTagging(page), you will notice that the table does appear in the structure tree.

If you then add table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT), the table is excluded from the structure tree again.

Summary: looking at the structure tree, there's no difference between your original code and the approach described under General.
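
The two steps described above (enabling autotagging on the header canvas, then marking the table as an artifact) can be put together in the header handler. This is a sketch under the same assumptions as the question's code: iText 7.1.x for .NET, where the three-argument `Canvas` constructor and `IEventHandler` are available. The constructor that supplies `_feiNum` is an assumption, since the original does not show where that field comes from. The two lines marked `ADDED` are the only functional changes to the handler.

```csharp
using iText.Kernel.Events;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas;
using iText.Kernel.Pdf.Tagging;
using iText.Layout;
using iText.Layout.Borders;
using iText.Layout.Element;
using iText.Layout.Properties;

public class TEirHeaderEventHandler : IEventHandler
{
    private readonly string _feiNum;

    // Assumed constructor: the original code does not show how _feiNum is set.
    public TEirHeaderEventHandler(string feiNum)
    {
        _feiNum = feiNum;
    }

    public void HandleEvent(Event e)
    {
        PdfDocumentEvent docEvent = (PdfDocumentEvent)e;
        PdfDocument pdf = docEvent.GetDocument();
        PdfPage page = docEvent.GetPage();
        PdfCanvas headerPdfCanvas = new PdfCanvas(page.NewContentStreamBefore(), page.GetResources(), pdf);
        Rectangle headerRect = new Rectangle(60, 725, 495, 96);
        Canvas headerCanvas = new Canvas(headerPdfCanvas, pdf, headerRect);

        // ADDED: a Canvas built from a PdfCanvas has autotagging disabled by default.
        headerCanvas.EnableAutoTagging(page);

        CreateHeaderContent(headerCanvas);
        headerCanvas.Close();
    }

    private void CreateHeaderContent(Canvas canvas)
    {
        Table table = new Table(UnitValue.CreatePercentArray(new float[] { 60, 25, 15 }));
        table.SetWidth(UnitValue.CreatePercentValue(100));

        // ADDED: mark the whole header table as an artifact.
        table.GetAccessibilityProperties().SetRole(StandardRoles.ARTIFACT);

        Cell cell1 = new Cell().Add(new Paragraph("Establishment Inspection Report").SetBold().SetTextAlignment(TextAlignment.LEFT));
        cell1.SetBorder(Border.NO_BORDER);
        table.AddCell(cell1);
        Cell cell2 = new Cell().Add(new Paragraph("FEI Number:").SetBold().SetTextAlignment(TextAlignment.RIGHT));
        cell2.SetBorder(Border.NO_BORDER);
        table.AddCell(cell2);
        Cell cell3 = new Cell().Add(new Paragraph(_feiNum).SetBold().SetTextAlignment(TextAlignment.RIGHT));
        cell3.SetBorder(Border.NO_BORDER);
        table.AddCell(cell3);
        canvas.Add(table);
    }
}
```

With both lines in place, the header table is explicitly marked as an artifact in the content stream and stays out of the structure tree, which is the same end state, from the structure tree's point of view, as the original code.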

Adobe reading order / accessibility settings

From your description, I think you are using Adobe Acrobat or Reader for the read out loud functionality. Under Preferences > Reading > Reading Order Options, you can configure how the content should be processed for the read out loud feature:

From https://helpx.adobe.com/reader/using/accessibility-features.html:

Infer Reading Order From Document (Recommended): Interprets the reading order of untagged documents by using an advanced method of structure inference layout analysis.
Left-To-Right, Top-To-Bottom Reading Order: Delivers the text according to its placement on the page, reading from left to right and then top to bottom. This method is faster than Infer Reading Order From Document. This method analyzes text only; form fields are ignored and tables aren’t recognized as such.
Override The Reading Order In Tagged Documents: Uses the reading order specified in the Reading preferences instead of what the tag structure of the document specifies. Use this preference only when you encounter problems in poorly tagged PDFs.

In my tests, the only way I could get Adobe Reader to read out loud the header content created with your original code was to select Left-To-Right, Top-To-Bottom Reading Order and enable Override The Reading Order In Tagged Documents. In that case, it basically ignores the tagging and just processes the content according to its location on the page.

With Override The Reading Order In Tagged Documents disabled, the header content is not read, either with your original code or with explicit artifacts.

Conclusion

Although it's a good idea to always tag artifacts as such, so they can be properly differentiated from real content, in this case I believe the behaviour you're experiencing is more related to application configuration than to file structure.

Answer 2

Score: 1

Headers and footers are typically pagination artifacts and should be marked as such in the following way:

table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);

This will exclude the table from being read. Please note that you can mark any element implementing the IAccessibleElement interface as an artifact.

