Splitting mixed content nodes on a particular regex match with XSLT 3

Question

My simplified input looks like this:

<stuff>
    <p>CAPITALWORD is part of <i>mixed</i> content.</p>
    <p>ANOTHER is <i>here</i> but it's not the only one. SOMEWORDS are <i>mixed up</i> in the <i>same</i>
        paragraph. SOMETIMES even <i>multiple times.</i></p>
</stuff>

Now, my goal is to split paragraphs on each full-caps word. I thought I would go for grouping text starting with at least two capital letters like this:

<xsl:output method="xml" indent="true"></xsl:output>
<xsl:mode on-no-match="shallow-copy"/>

<xsl:template match="p">
  <xsl:for-each-group select="node()" group-starting-with="text()[matches(., '[A-Z]{2,}')]">
    <xsl:element name="p">
      <xsl:apply-templates select="current-group()"/>
    </xsl:element>
  </xsl:for-each-group>
</xsl:template>

but this won't work because I'm dealing with mixed content rather than strings only. So I get this:

<stuff>
   <p>CAPITALWORD is part of <i>mixed</i> content.</p>
   <p>ANOTHER is <i>here</i>
   </p>
   <p> but it's not the only one. SOMEWORDS are <i>mixed up</i> in the <i>same</i>
   </p>
   <p>
        paragraph. SOMETIMES even <i>multiple times.</i>
   </p>
</stuff>

instead of the desired output:

<stuff>
    <p>CAPITALWORD is part of <i>mixed</i> content. </p>
    <p>ANOTHER is <i>here</i> but it's not the only one. </p>
    <p>SOMEWORDS are <i>mixed up</i> in the <i>same</i> paragraph. </p>
    <p>SOMETIMES even <i>multiple times.</i></p>
</stuff>

I will be most grateful for tips on how to achieve the desired output.

Answer 1

Score: 1

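One approach is a two-step transformation: the first step uses analyze-string on text nodes to wrap each capitalized word in an element, and the second step can then easily use group-starting-with on those wrapper elements: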

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:fn="http://www.w3.org/2005/xpath-functions"
  exclude-result-prefixes="#all"
  expand-text="yes">

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="p">
    <xsl:variable name="capitalized-marked-up" as="node()*">
      <xsl:apply-templates mode="markup-capitalized"/>
    </xsl:variable>
    <xsl:for-each-group select="$capitalized-marked-up" group-starting-with="capitalized-word">
      <p>
        <xsl:apply-templates select="current-group()"/>
      </p>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:template match="capitalized-word">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:mode name="markup-capitalized" on-no-match="shallow-copy"/>

  <xsl:template mode="markup-capitalized" match="text()">
    <xsl:apply-templates select="analyze-string(., '\p{Lu}{2,}')" mode="wrap"/>
  </xsl:template>

  <xsl:template mode="wrap" match="fn:match">
    <capitalized-word>{.}</capitalized-word>
  </xsl:template>

  <xsl:output indent="yes"/>

</xsl:stylesheet>

Answer 2

Score: 1

There are basically two approaches to this. One is to convert all the information to a node structure by adding markup to the text, and then process it as a tree of nodes using grouping mechanisms and the like. That's what @MartinHonnen has done. The other is to convert all the information to text, for example by converting <i>italic</i> to {italic}, and then process it using regular expressions (typically xsl:analyze-string), finally converting {italic} back to <i>italic</i> as a post-processing step.

I would usually use the first technique, but if the only markup that occurs within the mixed content is a single element type (i) with no attributes, then you could consider the second.

An xsl:for-each-group is never going to split text up into fragments, which is what you seem to be attempting to do.
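
For illustration, the second (text-based) approach might look roughly like the sketch below. It is not taken from either answer and rests on assumptions: i is the only element inside p and carries no attributes, the text contains no literal curly braces, and no run of uppercase letters occurs inside an i element (otherwise a split would land inside a {…} span and the conversion back would break). The variable name flat is made up for this example.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:fn="http://www.w3.org/2005/xpath-functions"
  exclude-result-prefixes="#all">

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="p">
    <!-- Step 1: flatten the mixed content to one string, writing <i>x</i> as {x}.
         Assumption: text nodes and i elements are the only children of interest. -->
    <xsl:variable name="flat" as="xs:string"
      select="string-join((text() | i) ! (if (self::i) then '{' || . || '}' else string(.)))"/>

    <!-- Step 2: fn:analyze-string returns fn:match and fn:non-match children;
         start a new group at every run of two or more uppercase letters. -->
    <xsl:for-each-group select="analyze-string($flat, '\p{Lu}{2,}')/*"
                        group-starting-with="fn:match">
      <p>
        <!-- Step 3: turn {x} back into <i>x</i>. The regex attribute is an
             attribute value template, so literal braces are doubled. -->
        <xsl:analyze-string select="string-join(current-group())" regex="\{{([^}}]*)\}}">
          <xsl:matching-substring>
            <i><xsl:value-of select="regex-group(1)"/></i>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <xsl:value-of select="."/>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </p>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:output indent="yes"/>

</xsl:stylesheet>
```

Whitespace from the source, such as the line break inside the second paragraph, is carried through unchanged. If richer inline markup than a bare i is possible, the node-based approach is likely the safer choice.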

huangapple
  • Posted on May 14, 2023 at 15:32:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76246337.html