Grouping adjacent nodes and processing mixed content in XSLT3
Question
Given this (simplified) XML:
<?xml version="1.0" encoding="UTF-8"?>
<text>
<p>TOKEN1 some other text.</p>
<p>TOKEN2 }</p>
<p>TOKEN3 } combo text <i>and potentially something else</i>.</p>
<p>TOKEN4 }</p>
<p>TOKEN5 some other text.</p>
<p>TOKEN6 some other text.</p>
<p>TOKEN7 }</p>
<p>TOKEN8 }</p>
<p>TOKEN9 } some other <b>combo</b> text.</p>
<p>TOKEN10 }</p>
<p>TOKEN11 some <i>other</i> text.</p>
<p>TOKEN12 x.</p>
<p>TOKEN13 y.</p>
<p>TOKEN14 z.</p>
</text>
my goal is to arrive at:
<?xml version="1.0" encoding="UTF-8"?>
<text>
<p>TOKEN1 some other text.</p>
<p>TOKEN2 } TOKEN3 } TOKEN4 } combo text <i>and potentially something else</i>.</p>
<p>TOKEN5 some other text.</p>
<p>TOKEN6 some other text.</p>
<p>TOKEN7 } TOKEN8 } TOKEN9 } TOKEN10 } some other <b>combo</b> text.</p>
<p>TOKEN11 some <i>other</i> text.</p>
<p>TOKEN12 x.</p>
<p>TOKEN13 y.</p>
<p>TOKEN14 z.</p>
</text>
In other words, I would like to merge adjacent paragraphs that have a curly bracket in them by:
1. merging the text content up to and including the curly bracket; followed by:
2. anything that might follow the curly bracket
The mixed content after the curly bracket will occur in only one of the paragraphs that need to be merged, but neither the number of paragraphs to be merged nor the position of the paragraph that has mixed content after the bracket is known in advance.
The following XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs" expand-text="true" version="3.0">
<xsl:output method="xml" indent="true"></xsl:output>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="text">
<xsl:copy>
<xsl:for-each-group select="p" group-adjacent="exists(text()[matches(., '\}')])">
<xsl:choose>
<xsl:when test="exists(text()[matches(., '\}')])">
<xsl:copy>
<xsl:for-each select="current-group()">
<xsl:variable name="text" select="normalize-space(text()[1])"/>
<xsl:copy-of select="substring-before($text, '}')"/>
<xsl:text>}} </xsl:text>
</xsl:for-each>
</xsl:copy>
</xsl:when>
<xsl:otherwise>
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
will get me as far as:
<?xml version="1.0" encoding="UTF-8"?>
<text>
<p>TOKEN1 some other text.</p>
<p>TOKEN2 } TOKEN3 } TOKEN4 } </p>
<p>TOKEN5 some other text.</p>
<p>TOKEN7 } TOKEN8 } TOKEN9 } TOKEN10 } </p>
<p>TOKEN11 some <i>other</i> text.</p>
</text>
but there are two problems with it:
- this only takes care of Point 1 above; and
- I'm missing some paragraphs in the output (those containing TOKEN6, TOKEN12, TOKEN13 and TOKEN14). I don't understand why this happens, and why it doesn't happen to paragraphs containing TOKEN1 and TOKEN5.
I'll be most grateful for your help.
Answer 1
Score: 1
I think that, after grouping, you need to wrap your tokens (the text up to and including the }) in an element (e.g. token). Then you can simply process the token wrappers first and, after that, the rest of the grouped nodes that are not token elements:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
expand-text="yes">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output indent="yes"/>
<xsl:template match="text">
<xsl:copy>
<xsl:for-each-group select="p" group-adjacent="contains(., '}')">
<xsl:choose>
<xsl:when test="current-grouping-key()">
<xsl:copy>
<xsl:variable name="splitted" as="node()*">
<xsl:apply-templates select="current-group()/node()" mode="split"/>
</xsl:variable>
<xsl:apply-templates select="$splitted[self::token]/text(), $splitted[not(self::token)]"/>
</xsl:copy>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="current-group()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
<xsl:mode name="split" on-no-match="shallow-copy"/>
<xsl:template match="text()[contains(., '}')]" mode="split">
<xsl:apply-templates select="analyze-string(., '.*\}')" mode="wrap"/>
</xsl:template>
<xsl:template match="*:match" mode="wrap">
<token>{.}</token>
</xsl:template>
</xsl:stylesheet>
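To see what the split template hands over to the wrap mode, it may help to look at the raw result of analyze-string on one of the sample text nodes. The following is only a minimal sketch for inspection, not part of the solution itself. The named entry point xsl:initial-template is an assumption about how you run it (for instance with Saxon's -it option).

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">

  <!-- Assumed entry point: run without a source document -->
  <xsl:template name="xsl:initial-template">
    <!-- Same regex as in the split template; the input string is taken
         from the TOKEN3 paragraph of the sample -->
    <xsl:copy-of select="analyze-string('TOKEN3 } combo text ', '.*\}')"/>
  </xsl:template>

</xsl:stylesheet>
```

The function returns an analyze-string-result element in the namespace http://www.w3.org/2005/xpath-functions, here with a match child holding TOKEN3 } and a non-match child holding the remaining text. That is why the wrap template matches *:match, while the non-matching remainder falls through the built-in rules and ends up as a plain text node.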
If you need some whitespace normalization when outputting the tokens, replace <xsl:apply-templates select="$splitted[self::token]/text(), $splitted[not(self::token)]"/>
with e.g.
<xsl:value-of select="$splitted[self::token]/normalize-space()" separator=" "/>
<xsl:apply-templates select="$splitted[not(self::token)]"/>
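As for the paragraphs missing from your own attempt: inside xsl:for-each-group the context item is the first item of the current group, so an xsl:copy in the otherwise branch copies only that first p and silently drops the rest of the group. TOKEN1 is alone in its group and TOKEN5 is the first of its group, which is why they survive while TOKEN6 and TOKEN12 to TOKEN14 do not. A minimal sketch of the difference, with a deliberately simplified when branch that is only a placeholder and not the merging logic shown above:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="text">
    <xsl:copy>
      <xsl:for-each-group select="p" group-adjacent="contains(., '}')">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <!-- Placeholder only: copies the first p of the group and
                 pours the children of all grouped p elements into it -->
            <xsl:copy>
              <xsl:apply-templates select="current-group()/node()"/>
            </xsl:copy>
          </xsl:when>
          <xsl:otherwise>
            <!-- Process every p of the group, not just the context item
                 (which is only the first p of the group) -->
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

With the otherwise branch written this way, each paragraph without a curly bracket should be passed through the shallow-copy mode on its own, so none of them is lost.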