lxml does not parse the text after a comment

Question

I want to wrap tag.text into CDATA:

from lxml import etree

parser = etree.XMLParser()
# parser = etree.XMLParser(remove_comments=True)
tree = etree.parse("./data.xml", parser)
root = tree.getroot()

for tag in root.findall("tag"):
    tag.text = etree.CDATA(tag.text)

tree.write("./result.xml",
           encoding="utf-8",
           xml_declaration=True)

But when I parse tag.text with comments inside it, it only parses text before comments:

<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>

And I get this (tag.text = some data):

<?xml version='1.0' encoding='UTF-8'?>
<root>
  <tag><![CDATA[
    some data
    ]]><!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>

How can I fix this?

Answer 1

Score: 1

Consider using saxonche and XSLT 3.0:

from saxonche import PySaxonProcessor

with PySaxonProcessor(license=False) as saxon_proc:
    xslt30_processor = saxon_proc.new_xslt30_processor()

    # run the stylesheet against the input file and write the result to disk
    xslt30_processor.transform_to_file(
        source_file='sample1.xml',
        stylesheet_file='serialize-wrap-in-cdata1.xsl',
        output_file='result-sample1.xml')

The XSLT 3.0 stylesheet is, for example:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    expand-text="yes"
    version="3.0">

  <xsl:param name="cdata-tag-names" as="xs:string*" static="yes" select="'tag'"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" _cdata-section-elements="{$cdata-tag-names}"/>

  <xsl:template _match="{$cdata-tag-names => string-join(' | ')}">
    <xsl:copy>{serialize(node())}</xsl:copy>
  </xsl:template>

</xsl:stylesheet>

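sample1.xml is your input:

<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>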

Public Gist with the files: https://gist.github.com/martin-honnen/61b91233fd73369d55f392ad4a0cee0b.

Example fiddle using SaxonC HE is at this link.

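Answer 2

Score: 0

If you want to concatenate all of the text within the <tag> elements, you can call str.join on the result of the element's itertext() method. This joins all of the text, including whitespace, before it is passed to etree.CDATA.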

for tag in root.findall("tag"):
    tag.text = etree.CDATA(''.join(tag.itertext()))

In your example, the comments are treated as child nodes of the <tag> element. The itertext() method also iterates over their tail text.
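
To make the text/tail split concrete, here is a minimal sketch that inspects the sample document from the question. The variable names are only illustrative; the point is that tag.text stops at the first comment and the remaining text hangs off the comments as .tail.

```python
from lxml import etree

XML = b"""<root>
  <tag>
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>"""

root = etree.fromstring(XML)
tag = root.find("tag")

# tag.text holds only the text before the first child node.
text_before_first_comment = tag.text

# Comments are child nodes; their tag is the etree.Comment factory function.
comment_children = [child for child in tag if child.tag is etree.Comment]

# The text that follows each comment is stored on that comment as .tail.
tails = [child.tail for child in comment_children]

print(repr(text_before_first_comment))
print(tails)
```

This is why assigning etree.CDATA(tag.text) wraps only the first "some data": the second one lives in the tail of the last comment.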

Answer 3

Score: 0

I found a tricky way to parse and modify the text, comments, and tails together:

tmp = etree.tostring(tag).decode()
# here you need to remove the <tag> start and end tags from the tmp string
tag.clear()
tag.text = etree.CDATA(tmp)

If someone knows a more correct or more elegant way to do this (for example, something like tag.all), please share it.
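
One possible way to avoid cutting the outer <tag> out of the string by hand is to serialize only the child nodes and prepend the element's own text. The sketch below assumes a hypothetical helper name, wrap_inner_xml_in_cdata, and a tiny inline document; it is not an lxml API. Note that it does not escape special characters uniformly (element.text is taken as-is, while serialized children and their tails are escaped by tostring), so treat it as a starting point rather than a finished solution.

```python
from lxml import etree


def wrap_inner_xml_in_cdata(element):
    """Replace the element's content with one CDATA section holding its inner XML.

    Hypothetical helper for illustration only.
    """
    # text before the first child (None if the element starts with a child)
    parts = [element.text or ""]
    for child in element:
        # tostring() includes the child's tail text by default (with_tail=True)
        parts.append(etree.tostring(child, encoding="unicode"))
    # remove the children; iterate over a copy because the element is modified
    for child in list(element):
        element.remove(child)
    element.text = etree.CDATA("".join(parts))


root = etree.fromstring(b"<root><tag>a<!-- c -->b</tag></root>")
for tag in root.findall("tag"):
    wrap_inner_xml_in_cdata(tag)

result = etree.tostring(root)
print(result.decode("utf-8"))
```

Unlike tag.clear(), removing only the children keeps the element's attributes and its own tail (the indentation after the closing tag).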

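Answer 4

Score: 0

Iterate over the tag element to collect its text, plus the text representation of each comment element (without tail text), plus any tail text (which includes the indentation). Then remove each child and populate the tag element with the CDATA-wrapped text.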

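from lxml import etree

parser = etree.XMLParser()
tree = etree.parse("tmp.xml", parser)
root = tree.getroot()

for s in root.findall("tag"):
    # text before the first child; None when the element starts with a child
    t = s.text or ""
    # iterate over a copy because children are removed inside the loop
    for ele in list(s):
        # serialized child (here: the comment markup) without its tail
        t += etree.tostring(ele, with_tail=False).decode("utf8")
        # text that follows the child, including indentation
        t += ele.tail or ""
        # remove the child now that it is part of the collected text
        s.remove(ele)
    s.text = etree.CDATA(t)

print(etree.tostring(tree, with_tail=True).decode("utf8"))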

Result

<root>
  <tag><![CDATA[
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  ]]></tag>
</root>

Answer 5

Score: 0

xml.etree.ElementTree provides ET.iterparse(), which reports parse events, including comments:

import xml.etree.ElementTree as ET
from io import StringIO

xml_file = """<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data 1
    <!-- some data2 -->
    <!-- some data3 -->
    some data 4
  </tag>
</root>
"""

f = StringIO(xml_file)

# the 'comment' event requires Python 3.8 or later
for event, elem in ET.iterparse(f, events=('start', 'comment')):
    if elem.tag == 'tag' and event == 'start':
        print('Text start', elem.text)
    # comment nodes use the ET.Comment factory function as their tag
    if elem.tag is ET.Comment:
        print("Comment", elem.text)

Output:

Text start 
    some data 1
    
    
    some data 4
  
Comment  some data2 
Comment  some data3 

And here is the lxml adaptation:

from lxml import etree
from io import BytesIO

xml_file = """<?xml version="1.0" encoding="utf-8" ?>
<root>
  <tag>
    some data 1
    <!-- some data2 -->
    <!-- some data3 -->
    some data 4
  </tag>
</root>
"""

f = BytesIO(xml_file.encode('utf-8'))

def cdata(text):
    tex = ' '.join(text)
    root = etree.Element('root')
    tag = etree.SubElement(root, 'tag')
    tag.text = etree.CDATA(tex)
    etree.dump(root)

tex = []
for event, elem in etree.iterparse(f, events=('start', 'comment')):
    if elem.tag == 'tag' and event == 'start':
        # text or tail can be None, so fall back to an empty string
        tex.append((elem.text or '').strip())
    # comment nodes use the etree.Comment factory function as their tag
    if elem.tag is etree.Comment:
        com = f"<!--{elem.text}-->"
        tex.append(com)
        tex.append((elem.tail or '').strip())

cdata(tex)

Output:

<root>
  <tag><![CDATA[some data 1 <!-- some data2 -->. <!-- some data3 -->  some data 4]]></tag>
</root>
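
For the standard-library variant, an alternative to watching iterparse events is to ask the tree builder to keep comments in the tree, so they show up as child nodes with tails, just as in lxml. The sketch below assumes Python 3.8 or later (where TreeBuilder accepts insert_comments) and only rebuilds the inner text as a string; xml.etree.ElementTree has no CDATA support, so the wrapping itself would still need lxml.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <tag>
    some data 1
    <!-- some data2 -->
    <!-- some data3 -->
    some data 4
  </tag>
</root>"""

# insert_comments=True keeps comment nodes in the tree instead of dropping them
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(XML, parser=parser)
tag = root.find("tag")

# rebuild the inner content: leading text, then each comment plus its tail
pieces = [tag.text or ""]
for child in tag:
    if child.tag is ET.Comment:
        pieces.append(f"<!--{child.text}-->")
    pieces.append(child.tail or "")
inner = "".join(pieces)

print(inner)
```

Because the original whitespace is kept in text and tail, the rebuilt string preserves the indentation of the source instead of joining stripped fragments with spaces.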

huangapple
  • Published on June 26, 2023, 19:21:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76556193.html