Splitting mixed content nodes on a particular regex match with XSLT 3
Question
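Your goal is to split paragraphs on all-caps words, but your XSLT template does not seem to handle mixed content. As a starting point, you could try the following:
<xsl:output method="xml" indent="true"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="p">
  <xsl:for-each-group select="node()" group-adjacent="boolean(self::text()[matches(., '[A-Z]{2,}')])">
    <xsl:choose>
      <xsl:when test="current-grouping-key()">
        <xsl:element name="p">
          <xsl:apply-templates select="current-group()"/>
        </xsl:element>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates select="current-group()"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each-group>
</xsl:template>
This template uses group-adjacent to group the child nodes of each p by whether they are text nodes containing at least two consecutive capital letters. Each group of such text nodes is wrapped in a new <p> element; all other nodes are copied through without a wrapper.
Because grouping never splits a text node, a capitalized word in the middle of a text node does not start a new paragraph, and inline elements such as <i> end up outside the <p> wrappers. On its own, this is therefore unlikely to produce the output you expect.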
Original question:
My simplified input looks like this:
<stuff>
  <p>CAPITALWORD is part of <i>mixed</i> content.</p>
  <p>ANOTHER is <i>here</i> but it's not the only one. SOMEWORDS are <i>mixed up</i> in the <i>same</i>
paragraph. SOMETIMES even <i>multiple times.</i></p>
</stuff>
Now, my goal is to split paragraphs on each full-caps word. I thought I would go for grouping text starting with at least two capital letters like this:
<xsl:output method="xml" indent="true"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="p">
  <xsl:for-each-group select="node()" group-starting-with="text()[matches(., '[A-Z]{2,}')]">
    <xsl:element name="p">
      <xsl:apply-templates select="current-group()"/>
    </xsl:element>
  </xsl:for-each-group>
</xsl:template>
but this won't work because I'm dealing with mixed content rather than strings only. So I get this:
<stuff>
<p>CAPITALWORD is part of <i>mixed</i> content.</p>
<p>ANOTHER is <i>here</i>
</p>
<p> but it's not the only one. SOMEWORDS are <i>mixed up</i> in the <i>same</i>
</p>
<p>
paragraph. SOMETIMES even <i>multiple times.</i>
</p>
</stuff>
instead of the desired output:
<stuff>
<p>CAPITALWORD is part of <i>mixed</i> content. </p>
<p>ANOTHER is <i>here</i> but it's not the only one. </p>
<p>SOMEWORDS are <i>mixed up</i> in the <i>same</i> paragraph. </p>
<p>SOMETIMES even <i>multiple times.</i></p>
</stuff>
I will be most grateful for tips on how to achieve the desired output.
Answer 1
Score: 1
One approach is a two-step transformation: the first step uses analyze-string on text nodes to wrap each capitalized word in an element, and the second step can then easily use group-starting-with on those wrapper elements:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:fn="http://www.w3.org/2005/xpath-functions"
  exclude-result-prefixes="#all"
  expand-text="yes">

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="p">
    <xsl:variable name="capitalized-marked-up" as="node()*">
      <xsl:apply-templates mode="markup-capitalized"/>
    </xsl:variable>
    <xsl:for-each-group select="$capitalized-marked-up" group-starting-with="capitalized-word">
      <p>
        <xsl:apply-templates select="current-group()"/>
      </p>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:template match="capitalized-word">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:mode name="markup-capitalized" on-no-match="shallow-copy"/>

  <xsl:template mode="markup-capitalized" match="text()">
    <xsl:apply-templates select="analyze-string(., '\p{Lu}{2,}')" mode="wrap"/>
  </xsl:template>

  <xsl:template mode="wrap" match="fn:match">
    <capitalized-word>{.}</capitalized-word>
  </xsl:template>

  <xsl:output indent="yes"/>

</xsl:stylesheet>
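To make the two steps more concrete: for the first paragraph of the sample input, the intermediate value of $capitalized-marked-up should look roughly like the fragment below. This is a sketch traced by hand from the stylesheet, not output captured from a processor.

```xml
<!-- Intermediate node sequence held in $capitalized-marked-up for
     <p>CAPITALWORD is part of <i>mixed</i> content.</p>
     (hand-traced illustration; the wrapper element is temporary) -->
<capitalized-word>CAPITALWORD</capitalized-word> is part of <i>mixed</i> content.
```

The capitalized-word wrapper exists only inside that variable; the match="capitalized-word" template unwraps it again in the second step, so it never reaches the result. Note that group-starting-with always opens a group at the first node, so a paragraph that does not begin with a capitalized word would still get a leading <p> for the nodes before the first match.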
Answer 2
Score: 1
There are basically two approaches to this. One is to convert all the information to a node structure by adding markup to the text, and then process it as a tree of nodes using grouping mechanisms and the like. That's what @MartinHonnen has done. The other is to convert all the information to text, for example by converting <i>italic</i> to {italic}, and then process it using regular expressions (typically xsl:analyze-string), finally converting {italic} back to <i>italic</i> as a post-processing step.
I would usually use the first technique, but if the only markup that occurs within the mixed content is a single element type (i) with no attributes, then you could consider the second.
An xsl:for-each-group is never going to split text up into fragments, which is what you seem to be attempting to do.
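The second, text-based approach could be sketched along the following lines. This is only an illustration under several assumptions that are not in the original answer: the only child elements of p are i elements with plain text content, the text contains no literal { or } characters, no capitalized run occurs inside an i element, and the private-use character U+E000 (chosen here as a temporary split marker) never appears in the input.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="#all">

  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:output indent="yes"/>

  <xsl:template match="p">
    <!-- Step 1: flatten the mixed content to one string,
         writing each <i>x</i> child as {x} (assumed convention) -->
    <xsl:variable name="flat" as="xs:string"
      select="string-join(node() ! (if (self::i) then '{' || . || '}' else string(.)))"/>

    <!-- Step 2: put a marker (U+E000, an assumption) in front of every run of
         two or more uppercase letters, split on it, drop blank pieces -->
    <xsl:for-each select="tokenize(replace($flat, '(\p{Lu}{2,})', '&#xE000;$1'), '&#xE000;')[normalize-space()]">
      <p>
        <!-- Step 3: turn {x} back into <i>x</i>; braces are doubled
             because the regex attribute is an attribute value template -->
        <xsl:analyze-string select="." regex="\{{([^\}}]*)\}}">
          <xsl:matching-substring>
            <i><xsl:value-of select="regex-group(1)"/></i>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <xsl:value-of select="."/>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </p>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

If any of those assumptions does not hold for the real data (nested markup, attributes, braces in the text), the node-based technique is likely the safer choice, which matches the preference stated above.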