How would I handle RTF hyperlinks using Apache Tika in XSLT?

Question

This question is a follow-up to: What are some methods to converting RTF text nodes in XML using XSLT 2 / Saxon HE 11.3?.

After implementing the answered solution, I ran the code against a large dataset. During the processing of all that data, an item in source RTF caused the application to error.

The error:

Error on line 11 column 92 of urn:from-string:  SXXP0003   Error reported by XML parser: The element type "a" must be terminated by the matching end-tag "</a>".: The element type "a" must be terminated by the matching end-tag "</a>".

I took a look at the source xml, which contained several RTF HYPERLINK codes. Source:

<SPECORMETHOD>{\rtf1\ansi\deff0\uc1\ansicpg1252\deftab720{\fonttbl{\f0\fnil\fcharset1 Arial;}{\f1\fnil\fcharset1 Times New Roman;}{\f2\fnil\fcharset1 WingDings;}}{\colortbl\red0\green0\blue0;\red255\green0\blue0;\red0\green128\blue0;\red0\green0\blue255;\red255\green255\blue0;\red255\green0\blue255;\red128\green0\blue128;\red128\green0\blue0;\red0\green255\blue0;\red0\green255\blue255;\red0\green128\blue128;\red0\green0\blue128;\red255\green255\blue255;\red192\green192\blue192;\red128\green128\blue128;\red0\green0\blue0;\red128\green128\blue0;}\wpprheadfoot1\paperw12240\paperh15840\margl720\margr720\margt720\margb720\headery720\footery720\endnhere\sectdefaultcl{\*\generator WPTools_5.17;}{\stylesheet{\s1\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20 Normal;}{\s2\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20 Default Paragraph Font;}{\s3\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20\cf3\ul\sbasedon2 Hyperlink;}}{\pard\plain\plain\f1\fs36\par\pard\plain\plain\f1\fs36\par\plain\f1\fs28\tab 10\'94Flour Tortilla\par\plain\f1\fs28\tab Caesar \f1\b\i DIP\f1\i0 : {\field{\*\fldinst{HYPERLINK "..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc"}}{\*\fldtitle{..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc}}{\fldrslt{\f1\cf3\cs103\ul\cs3 Dip, Caesar.doc\plain\f1\fs28\b}}}\par\plain\f1\fs28\tab Ripped Romaine\par\plain\f1\fs28\tab Blackened Salmon julienne\par\plain\f1\fs28\tab Shaved Red Onion\par\plain\f1\fs28\tab Julienne Tomato\par\plain\f1\fs28\tab Grated Parmesan\par\plain\f1\fs28\tab Blackening spice: {\field{\*\fldinst{HYPERLINK "..\\\\..\\\\SPICE\\\\Blackening Spice.doc"}}{\*\fldtitle{..\\\\..\\\\SPICE\\\\Blackening Spice.doc}}{\fldrslt{\f1\cf3\cs103\ul\cs3 Blackening Spice.doc\plain\f1\fs28}}}\par\pard\plain\plain\f1\fs28\par\plain\f1\fs28 Method\par\plain\f1\fs28 Procedure Text \par\pard\plain\plain\f1\fs36\par}}</SPECORMETHOD>

For my purposes, the URL is not going to be a functional component, but for the sake of utility of this RTF conversion project, what might be needed to have the hyperlink codes work correctly, or to output them as text for reference? One way I can handle this is in the XSLT by intercepting the element, looking for the HYPERLINK code and replacing it with regular text.

The desired output for a hyperlink from this example would be (text only):

CAESAR DIP: ..\..\SAUCES\Dips\Dip, Caesar.doc
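
One way to sketch the "replace the HYPERLINK code with regular text" idea is to rewrite each field group in the RTF string before it reaches Tika. The class below is a hypothetical helper (`RtfHyperlinkFlattener` and `flatten` are names invented here, not part of Tika or Saxon). It assumes the fields always have the flat shape shown in the sample (a `\fldinst` group, an optional `\fldtitle` group, then a `\fldrslt` group, with no further nested braces), and that each run of four backslashes in the HYPERLINK target stands for one path separator. It has not been tested against Tika, and it discards any formatting codes inside `\fldrslt`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: replaces RTF HYPERLINK fields with their target as plain RTF text. */
final class RtfHyperlinkFlattener {

    // Matches {\field{\*\fldinst{HYPERLINK "target"}}{\*\fldtitle{...}}{\fldrslt{...}}}
    // Assumption: no nested braces inside the title or result groups.
    private static final Pattern FIELD = Pattern.compile(
        Pattern.quote("{\\field{\\*\\fldinst{HYPERLINK \"") + "([^\"]*)" + Pattern.quote("\"}}")
        + "(?:" + Pattern.quote("{\\*\\fldtitle{") + "[^{}]*" + Pattern.quote("}}") + ")?"
        + Pattern.quote("{\\fldrslt{") + "[^{}]*" + Pattern.quote("}}}"));

    static String flatten(String rtf) {
        Matcher m = FIELD.matcher(rtf);
        StringBuffer out = new StringBuffer(); // StringBuffer keeps this Java 8 compatible
        while (m.find()) {
            // Four backslashes in the field instruction become two, which is
            // the RTF escape for a single literal backslash in body text.
            String target = m.group(1).replace("\\\\\\\\", "\\\\");
            m.appendReplacement(out, Matcher.quoteReplacement(target));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The result is still RTF, so it would be passed on to the existing conversion in place of the original string; any field that does not match the assumed shape is left untouched.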

The only modification to the original code was in XSLT to do a check for an empty element when processing the <SPECORMETHOD>.

<xsl:choose>
	<xsl:when test="string-length(SPECORMETHOD) &gt; 0">
		<rtf-as-xhtml>
			<xsl:sequence select="tika:parse-rtf(SPECORMETHOD[string-length(.) &gt; 0])"/>
		</rtf-as-xhtml>
	</xsl:when>
	<xsl:otherwise>
		<xsl:value-of select="'[EMPTY]'"/>
	</xsl:otherwise>
</xsl:choose>

I've built this project in Eclipse 2022-12 (4.26.0). It's a Maven project using Apache Tika 2.7.0, and Saxon HE 11.3, using Java SE 1.8. Special thanks to Martin H.

Answer 1

Score: 1

I have run your sample rtf through Tika and the supposed XHTML output is unfortunately not well-formed:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.rtf.RTFParser" />
<meta name="Content-Type" content="application/rtf" />
<title></title>
</head>
<body><p />
<p />
<p>	10”Flour Tortilla</p>
<p>	Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>
<p><b />	Ripped Romaine</p>
<p>	Blackened Salmon julienne</p>
<p>	Shaved Red Onion</p>
<p>	Julienne Tomato</p>
<p>	Grated Parmesan</p>
<p>	Blackening spice: <a href="..\\..\\SPICE\\Blackening Spice.doc">Blackening Spice.doc</a></p>
<p />
<p>Method</p>
<p>Procedure Text </p>
<p />
<p />
</body></html>

So the error is in the fragment <p> Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>.

I don't know for sure whether the input is somehow not proper RTF, but it looks more like a bug in the Tika parser and ToXMLContentHandler.
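
The well-formedness failure can be reproduced without Tika or Saxon by handing the fragment to the JDK's own XML parser. This is only a minimal sketch; `WellFormedCheck` and `isWellFormed` are names made up for illustration, and the `href` values are shortened to `x`.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: reports whether a string parses as well-formed XML. */
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder(); // javax.xml.parsers, not Saxon's
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Misnested: </b> arrives while <a> is still open, as in the Tika output.
        String bad = "<p>Caesar <b><i>DIP</i>: <a href=\"x\">Dip, Caesar.doc</b><b /></b></p>";
        String good = "<p>Blackening spice: <a href=\"x\">Blackening Spice.doc</a></p>";
        System.out.println(isWellFormed(bad));  // false
        System.out.println(isWellFormed(good)); // true
    }
}
```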

I have raised the potential issue https://issues.apache.org/jira/browse/TIKA-3972

In the end, with the help of the Saxonica team (thanks to Michael Kay and Norm Walsh), I found a (probably) better way to use Saxon with the Tika parser: instead of serializing with Tika's ToXMLContentHandler and feeding its toString() result to Saxon's DocumentBuilder, you can pass a Saxon BuildingContentHandler directly to Tika's parser and get an XdmNode:

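import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import net.sf.saxon.s9api.BuildingContentHandler;
import net.sf.saxon.s9api.DocumentBuilder;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.SaxonApiException;
import net.sf.saxon.s9api.XdmNode;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.xml.sax.SAXException;

public static XdmNode parseRtfToHTML2(String rtf, Processor processor) throws IOException, SAXException, TikaException, URISyntaxException, SaxonApiException {
    DocumentBuilder docBuilder = processor.newDocumentBuilder();
    // Saxon builds its tree from Tika's SAX events, so no serialized XHTML string is re-parsed.
    BuildingContentHandler handler = docBuilder.newBuildingContentHandler();

    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = new ByteArrayInputStream(rtf.getBytes(StandardCharsets.UTF_8))) {
        parser.parse(stream, handler, metadata);
        return handler.getDocumentNode();
    } catch (SaxonApiException e) {
        throw new RuntimeException(e);
    }
}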

Using that approach, at least in a short test, no error is thrown for the hyperlink RTF example, see the updated project https://github.com/martin-honnen/SaxonTikaRtfTest1 for the code in more context.

Posted by huangapple on 2023-02-14 03:28:41. Source: https://go.coder-hub.com/75440415.html