Grouping adjacent nodes and processing mixed content in XSLT3

Question

Given this (simplified) XML:

<?xml version="1.0" encoding="UTF-8"?>
<text>
    <p>TOKEN1 some other text.</p>
    <p>TOKEN2 }</p>
    <p>TOKEN3    } combo text <i>and potentially something else</i>.</p>
    <p>TOKEN4 }</p>
    <p>TOKEN5 some other text.</p>
    <p>TOKEN6 some other text.</p>
    <p>TOKEN7 }</p>
    <p>TOKEN8 }</p>
    <p>TOKEN9    } some other <b>combo</b> text.</p>
    <p>TOKEN10 }</p>
    <p>TOKEN11 some <i>other</i> text.</p>
    <p>TOKEN12 x.</p>
    <p>TOKEN13 y.</p>
    <p>TOKEN14 z.</p>
</text>

my goal is to arrive at:

<?xml version="1.0" encoding="UTF-8"?>
<text>
   <p>TOKEN1 some other text.</p>
   <p>TOKEN2 } TOKEN3 } TOKEN4 } combo text <i>and potentially something else</i>.</p>
   <p>TOKEN5 some other text.</p>
   <p>TOKEN6 some other text.</p>
   <p>TOKEN7 } TOKEN8 } TOKEN9 } TOKEN10 } some other <b>combo</b> text.</p>
   <p>TOKEN11 some <i>other</i> text.</p>
   <p>TOKEN12 x.</p>
   <p>TOKEN13 y.</p>
   <p>TOKEN14 z.</p>
</text>

In other words, I would like to merge adjacent paragraphs that have a curly bracket in them by:

  1. merging the text content up to and including the curly bracket; followed by:
  2. anything that might follow the curly bracket

The mixed content after the curly bracket will occur in only one of the paragraphs that need to be merged, but neither the number of paragraphs to be merged nor the position of the paragraph that has mixed content after the bracket can be known in advance.

The following XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" expand-text="true"    version="3.0">
    
    <xsl:output method="xml" indent="true"></xsl:output>
    <xsl:mode on-no-match="shallow-copy"/>
     
    <xsl:template match="text">
        <xsl:copy>
            <xsl:for-each-group select="p" group-adjacent="exists(text()[matches(., '\}')])">
                <xsl:choose>
                    <xsl:when test="exists(text()[matches(., '\}')])">
                        <xsl:copy>
                            <xsl:for-each select="current-group()">
                                <xsl:variable name="text" select="normalize-space(text()[1])"/>
                                <xsl:copy-of select="substring-before($text, '}')"/>
                                <xsl:text>}} </xsl:text>
                            </xsl:for-each>
                        </xsl:copy>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:copy>
                            <xsl:apply-templates/>
                        </xsl:copy>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:for-each-group>
        </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>

will get me as far as:

<?xml version="1.0" encoding="UTF-8"?>
<text>
   <p>TOKEN1 some other text.</p>
   <p>TOKEN2 } TOKEN3 } TOKEN4 } </p>
   <p>TOKEN5 some other text.</p>
   <p>TOKEN7 } TOKEN8 } TOKEN9 } TOKEN10 } </p>
   <p>TOKEN11 some <i>other</i> text.</p>
</text>

but there are two problems with it:

  • this only takes care of Point 1 above; and
  • I'm missing some paragraphs in the output (those containing TOKEN6, TOKEN12, TOKEN13 and TOKEN14). I don't understand why this happens, and why it doesn't happen to paragraphs containing TOKEN1 and TOKEN5.

I'll be most grateful for your help.

Answer 1

Score: 1

I think that, after grouping, you need to wrap your tokens (the text up to and including the }) in an element (e.g. token). You can then simply process the token wrappers first and, after that, the rest of the grouped nodes that are not tokens:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="#all"
  expand-text="yes">

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output indent="yes"/>

  <xsl:template match="text">
    <xsl:copy>
      <!-- adjacent p elements are grouped by whether they contain a } -->
      <xsl:for-each-group select="p" group-adjacent="contains(., '}')">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <!-- one merged p for the whole group -->
            <xsl:copy>
              <xsl:variable name="splitted" as="node()*">
                <xsl:apply-templates select="current-group()/node()" mode="split"/>
              </xsl:variable>
              <!-- tokens first, then everything else in document order -->
              <xsl:apply-templates select="$splitted[self::token]/text(), $splitted[not(self::token)]"/>
            </xsl:copy>
          </xsl:when>
          <xsl:otherwise>
            <!-- paragraphs without a } are copied one by one -->
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <xsl:mode name="split" on-no-match="shallow-copy"/>

  <!-- split a text node into the part up to the } and the remainder -->
  <xsl:template match="text()[contains(., '}')]" mode="split">
    <xsl:apply-templates select="analyze-string(., '.*\}')" mode="wrap"/>
  </xsl:template>

  <!-- the matched part (up to and including the }) becomes a token element -->
  <xsl:template match="*:match" mode="wrap">
    <token>{.}</token>
  </xsl:template>

</xsl:stylesheet>
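
To make the intermediate step easier to follow, here is a rough sketch of what the splitted variable should hold for the first group of the sample input (the paragraphs with TOKEN2, TOKEN3 and TOKEN4). It is an illustration derived from the sample data, not actual processor output, and the line breaks are added only for readability:

```xml
<!-- sequence of nodes in $splitted for the TOKEN2..TOKEN4 group -->
<token>TOKEN2 }</token>
<token>TOKEN3    }</token> combo text <i>and potentially something else</i>.
<token>TOKEN4 }</token>
```

Selecting the token elements first and the non-token nodes afterwards moves all three tokens in front of the mixed content. Without any white space handling, the token texts are output directly next to each other.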

If you need to normalize white space when outputting the tokens, replace <xsl:apply-templates select="$splitted[self::token]/text(), $splitted[not(self::token)]"/> with, for example:

          <xsl:value-of select="$splitted[self::token]/normalize-space()" separator=" "/>
          <xsl:apply-templates select="$splitted[not(self::token)]"/>
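
As for the missing paragraphs in the attempt from the question: inside xsl:for-each-group the context item is the first item of the current group. The paragraphs with TOKEN5 and TOKEN6 form one group, and those with TOKEN11 to TOKEN14 form another, so an xsl:copy in the xsl:otherwise branch only copies the first p of each group. TOKEN1 and TOKEN5 survive because they happen to be the first item of their groups. A minimal sketch of the corrected branch, assuming the rest of the question's stylesheet is unchanged:

```xml
<xsl:otherwise>
    <!-- process every paragraph of the group, not only the first one -->
    <xsl:apply-templates select="current-group()"/>
</xsl:otherwise>
```

With the shallow-copy mode declared in the stylesheet, each p in the group is then copied together with its mixed content.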

  • Posted on May 21, 2023, 16:55:55
  • Source: https://go.coder-hub.com/76299051.html