Posted: 2023-06-26 19:21:39

lxml does not parse text after a comment

Question

I want to wrap tag.text in CDATA:

from lxml import etree

parser = etree.XMLParser()
# parser = etree.XMLParser(remove_comments=True)
tree = etree.parse("./data.xml", parser)
root = tree.getroot()

for tag in root.findall("tag"):
    tag.text = etree.CDATA(tag.text)

tree.write("./result.xml",
           encoding="utf-8",
           xml_declaration=True)

But when the element contains comments, tag.text only includes the text before the first comment:

<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>

And I get this (tag.text = some data):

<?xml version='1.0' encoding='UTF-8'?>
<root>
  <tag><![CDATA[
    some data
    ]]><!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>

How can I fix this?

Answer 1

Score: 1

Consider using saxonche and XSLT 3.0:

from saxonche import PySaxonProcessor

with PySaxonProcessor(license=False) as saxon_proc:
    xslt30_processor = saxon_proc.new_xslt30_processor()

    # apply the stylesheet to the input file and write the result to disk
    xslt30_processor.transform_to_file(
        source_file='sample1.xml',
        stylesheet_file='serialize-wrap-in-cdata1.xsl',
        output_file='result-sample1.xml')

The XSLT 3.0 stylesheet could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    expand-text="yes"
    version="3.0">

  <xsl:param name="cdata-tag-names" as="xs:string*" static="yes" select="'tag'"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" _cdata-section-elements="{$cdata-tag-names}"/>

  <xsl:template _match="{$cdata-tag-names => string-join(' | ')}">
    <xsl:copy>{serialize(node())}</xsl:copy>
  </xsl:template>

</xsl:stylesheet>

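sample1.xml is your input:

<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>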

Public Gist with the files: https://gist.github.com/martin-honnen/61b91233fd73369d55f392ad4a0cee0b.

Example fiddle using SaxonC HE is at this link.

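Answer 2

Score: 0

If you want to concatenate all of the text within the <tag> elements, you can call str.join on the result of the element's itertext method. This joins all of the text, including whitespace, before it is passed to CDATA.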

for tag in root.findall("tag"):
    tag.text = etree.CDATA(''.join(tag.itertext()))

In your example, the comments are treated as child nodes of the <tag> element. The itertext method also iterates over their tail text.
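The text/tail model described here can be inspected directly. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without third-party packages; this is a substitution made for illustration, and the stdlib parser only keeps comments when the tree builder is created with insert_comments=True (Python 3.8+), whereas lxml keeps them by default. The names XML, tails, and joined are illustrative only.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <tag>
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>"""

# Keep comments in the tree; the stdlib parser discards them otherwise.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(XML, parser=parser)
tag = root.find("tag")

# .text stops at the first child node, which here is the first comment.
print(repr(tag.text))

# Each comment is a child node; the text that follows it is stored in .tail.
tails = [child.tail for child in tag]
for child in tag:
    print(child.tag is ET.Comment, repr(child.text), repr(child.tail))

# Joining .text with every tail gives all character data without the comments.
joined = (tag.text or "") + "".join(tail or "" for tail in tails)
print(repr(joined))
```

Note that assigning the joined text to tag.text does not remove the comment nodes. Unless the children are removed as well, they and their tails are still serialized after the CDATA section, so the trailing text appears twice.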

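Answer 3

Score: 0

I found a tricky way to process the text, comments, and tails together:

tmp = etree.tostring(tag, with_tail=False).decode()
# strip the outer <tag ...> start tag and </tag> end tag, keeping only the inner markup
tmp = tmp[tmp.index(">") + 1:tmp.rindex("<")]
# note: clear() also drops the element's attributes and tail
tag.clear()
tag.text = etree.CDATA(tmp)

If someone knows a more correct or more elegant way to do this (for example, something like tag.all), please share it.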

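Answer 4

Score: 0

Iterate over the tag element to collect its text, the text representation of each comment node (without its tail), and each tail (which includes the indentation). Then remove each child and populate the tag element with the CDATA-wrapped text.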

from lxml import etree

parser = etree.XMLParser()
tree = etree.parse("tmp.xml", parser)
root = tree.getroot()

for s in root.findall("tag"):
    # text before the first child; may be None
    t = s.text or ""
    # iterate over a copy, because children are removed inside the loop
    for ele in list(s):
        # markup of the child itself, without its tail
        t += etree.tostring(ele, with_tail=False).decode("utf8")
        # text that follows the child; may be None
        t += ele.tail or ""
        # in lxml, removing the child also removes its tail
        s.remove(ele)
    s.text = etree.CDATA(t)

print(etree.tostring(tree, with_tail=True).decode("utf8"))

Result

<root>
  <tag><![CDATA[
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  ]]></tag>
</root>
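One consequence of this result is worth checking: once the comments are inside the CDATA section, they are no longer comment nodes but plain character data. The sketch below re-parses the result shown above, pasted in as a string named RESULT (an illustrative name), with the standard library's xml.etree.ElementTree; the stdlib is used here only so the check runs without lxml installed.

```python
import xml.etree.ElementTree as ET

# The serialized result from above, pasted in as a string.
RESULT = """<root>
  <tag><![CDATA[
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  ]]></tag>
</root>"""

root = ET.fromstring(RESULT)
tag = root.find("tag")

# A CDATA section is parsed as ordinary character data, so the element
# has no child nodes and the former comments are part of its text.
child_count = len(tag)
comment_markup_count = tag.text.count("<!-- some data2 -->")

print(child_count)           # 0
print(comment_markup_count)  # 2
```

Any consumer of the output therefore sees the comment markup as literal text, which is the intended effect of wrapping it in CDATA.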

Answer 5

Score: 0

xml.etree.ElementTree has ET.iterparse(), which reports parse events, including comments:

import xml.etree.ElementTree as ET
from io import StringIO

xml_file = f"""<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data 1
    <!-- some data2 -->
    <!-- some data3 -->
    some data 4
  </tag>
</root>
"""

f = StringIO(xml_file)

for event, elem in ET.iterparse(f, events=('start', 'comment')):
    if event == 'start' and elem.tag == 'tag':
        print('Text start', elem.text)
    # a 'comment' event is reported for every comment node
    if event == 'comment':
        print("Comment", elem.text)

Output:

Text start 
    some data 1
    
    
    some data 4
  
Comment  some data2 
Comment  some data3

And here is the lxml adaptation:

from lxml import etree
from io import BytesIO

xml_file = f"""<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data 1
    <!-- some data2 -->
    <!-- some data3 -->
    some data 4
  </tag>
</root>
"""

f = BytesIO(xml_file.encode('utf-8'))

def cdata(text):
    tex = ' '.join(text)
    root = etree.Element('root')
    tag = etree.SubElement(root, 'tag')
    tag.text = etree.CDATA(tex)
    etree.dump(root)

tex = []
for event, elem in etree.iterparse(f, events=('start', 'comment')):
    if event == 'start' and elem.tag == 'tag':
        # text before the first comment; may be None
        tex.append((elem.text or '').strip())
    if event == 'comment':
        com = f"<!--{elem.text}-->"
        tex.append(com)
        # text that follows the comment; may be None
        tex.append((elem.tail or '').strip())

cdata(tex)

Output:

<root>
  <tag><![CDATA[some data 1 <!-- some data2 -->. <!-- some data3 -->  some data 4]]></tag>
</root>
