May 14, 2023, 15:32:37

Splitting mixed content nodes on particular regex match with xslt 3

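Question

The goal is to split paragraphs at each word written in all capitals, but the XSLT template in the question does not handle mixed content. One approach to try is the following: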

<xsl:output method="xml" indent="true"/>
<xsl:mode on-no-match="shallow-copy"/>

<xsl:template match="p">
  <xsl:for-each-group select="node()" group-adjacent="boolean(self::text()[matches(., '[A-Z]{2,}')])">
    <xsl:choose>
      <xsl:when test="current-grouping-key()">
        <xsl:element name="p">
          <xsl:apply-templates select="current-group()"/>
        </xsl:element>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates select="current-group()"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each-group>
</xsl:template>

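This template uses group-adjacent to group neighbouring child nodes by whether they are text nodes containing at least two consecutive capital letters. Each run of such text nodes is wrapped in a new p element; all other nodes are copied through without a wrapper.

Because grouping operates on whole nodes, it cannot split a text node at a capitalized word, so on its own this is unlikely to produce the desired output.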

My simplified input looks like this:

<stuff>
    <p>CAPITALWORD is part of <i>mixed</i> content.</p>
    <p>ANOTHER is <i>here</i> but it's not the only one. SOMEWORDS are <i>mixed up</i> in the <i>same</i>
        paragraph. SOMETIMES even <i>multiple times.</i></p>
</stuff>

Now, my goal is to split paragraphs on each full-caps word. I thought I would go for grouping text starting with at least two capital letters like this:

<xsl:output method="xml" indent="true"></xsl:output>
<xsl:mode on-no-match="shallow-copy"/>

<xsl:template match="p">
  <xsl:for-each-group select="node()" group-starting-with="text()[matches(., '[A-Z]{2,}')]">
    <xsl:element name="p">
      <xsl:apply-templates select="current-group()"/>
    </xsl:element>
  </xsl:for-each-group>
</xsl:template>

but this won't work because I'm dealing with mixed content rather than strings only. So I get this:

<stuff>
   <p>CAPITALWORD is part of <i>mixed</i> content.</p>
   <p>ANOTHER is <i>here</i>
   </p>
   <p> but it's not the only one. SOMEWORDS are <i>mixed up</i> in the <i>same</i>
   </p>
   <p>
        paragraph. SOMETIMES even <i>multiple times.</i>
   </p>
</stuff>

instead of the desired output:

<stuff>
    <p>CAPITALWORD is part of <i>mixed</i> content. </p>
    <p>ANOTHER is <i>here</i> but it's not the only one. </p>
    <p>SOMEWORDS are <i>mixed up</i> in the <i>same</i> paragraph. </p>
    <p>SOMETIMES even <i>multiple times.</i></p>
</stuff>

I will be most grateful for tips on how to achieve the desired output.

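Answer 1

Score: 1

One approach is a two-step transformation: the first step uses analyze-string on text nodes to wrap each capitalized word in an element, and the second step can then easily use group-starting-with on those wrapper elements: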

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:fn="http://www.w3.org/2005/xpath-functions"
  exclude-result-prefixes="#all"
  expand-text="yes">

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="p">
    <xsl:variable name="capitalized-marked-up" as="node()*">
      <xsl:apply-templates mode="markup-capitalized"/>
    </xsl:variable>
    <xsl:for-each-group select="$capitalized-marked-up" group-starting-with="capitalized-word">
      <p>
        <xsl:apply-templates select="current-group()"/>
      </p>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:template match="capitalized-word">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:mode name="markup-capitalized" on-no-match="shallow-copy"/>

  <xsl:template mode="markup-capitalized" match="text()">
    <xsl:apply-templates select="analyze-string(., '\p{Lu}{2,}')" mode="wrap"/>
  </xsl:template>

  <xsl:template mode="wrap" match="fn:match">
    <capitalized-word>{.}</capitalized-word>
  </xsl:template>

  <xsl:output indent="yes"/>

</xsl:stylesheet>
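
To make the first step more concrete, here is a small sketch of what the variable $capitalized-marked-up would hold. The input paragraph is made up for illustration and is not taken from the question:

```xml
<!-- Hypothetical input paragraph -->
<p>ONE is <i>a</i> b. TWO is c.</p>

<!-- Value of $capitalized-marked-up after the markup-capitalized pass:
     a sequence of sibling nodes, not a single element -->
<capitalized-word>ONE</capitalized-word> is <i>a</i> b. <capitalized-word>TWO</capitalized-word> is c.
```

With group-starting-with="capitalized-word", a new group begins at each wrapper element, and the template matching capitalized-word unwraps it again when the group is output. Any text that comes before the first capitalized word forms a group of its own, because the first group always starts at the first item of the sequence.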

Answer 2

Score: 1


There are basically two approaches to this. One is to convert all the information to a node structure by adding markup to the text, and then process it as a tree of nodes using grouping mechanisms and the like. That's what @MartinHonnen has done. The other is to convert all the information to text, for example by converting <i>italic</i> to {italic}, and then process it using regular expressions (typically xsl:analyze-string), finally converting {italic} back to <i>italic</i> as a post-processing step.

I would usually use the first technique, but if the only markup that occurs within the mixed content is a single element type (i) with no attributes, then you could consider the second.

An xsl:for-each-group is never going to split text up into fragments, which is what you seem to be attempting to do.
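
The second, text-based approach could look roughly like the sketch below. It is not from the original answers and rests on several assumptions: p contains only text and i elements without attributes, no all-caps word occurs inside an i, and the three private-use characters chosen as markers never appear in the input. The variable names (open, close, split) are made up for this example, and private-use characters stand in for the {italic} braces so that they cannot clash with literal braces or with attribute value template syntax.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="#all">

  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Assumption: these private-use characters never occur in the input -->
  <xsl:variable name="open" as="xs:string" select="'&#xE001;'"/>
  <xsl:variable name="close" as="xs:string" select="'&#xE002;'"/>
  <xsl:variable name="split" as="xs:string" select="'&#xE000;'"/>

  <xsl:template match="p">
    <!-- Step 1: flatten the mixed content into one string, marking each i element -->
    <xsl:variable name="flat" as="xs:string"
      select="string-join((text() | i) ! (if (self::i) then $open || . || $close else string(.)), '')"/>

    <!-- Step 2: put a split marker in front of every run of two or more uppercase letters -->
    <xsl:variable name="marked" as="xs:string"
      select="replace($flat, '(\p{Lu}{2,})', $split || '$1')"/>

    <!-- Step 3: one p per non-blank piece, turning the markers back into i elements -->
    <xsl:for-each select="tokenize($marked, $split)[normalize-space()]">
      <p>
        <xsl:analyze-string select="." regex="{$open}([^{$close}]*){$close}">
          <xsl:matching-substring>
            <i><xsl:value-of select="regex-group(1)"/></i>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <xsl:value-of select="."/>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </p>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Whitespace inside the pieces, including line breaks from the source, is kept as it is, so a normalize-space() call may be wanted depending on the output required. If an all-caps word can occur inside an i element, the split marker would land between the open and close markers and the round trip would break, which is one reason the node-based approach is usually the safer choice.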
