lxml does not parse text after a comment
Question
I want to wrap tag.text into CDATA:
from lxml import etree
parser = etree.XMLParser()
# parser = etree.XMLParser(remove_comments=True)
tree = etree.parse("./data.xml", parser)
root = tree.getroot()
for tag in root.findall("tag"):
    tag.text = etree.CDATA(tag.text)
tree.write("./result.xml",
           encoding="utf-8",
           xml_declaration=True)
But when the <tag> content contains comments, tag.text holds only the text before the first comment:
<?xml version="1.0" encoding="utf-8" ?>
<root>
<tag>
some data
<!-- some data2 -->
<!-- some data2 -->
some data
</tag>
</root>
And I get this (tag.text = some data):
<?xml version='1.0' encoding='UTF-8'?>
<root>
<tag><![CDATA[
some data
]]><!-- some data2 -->
<!-- some data2 -->
some data
</tag>
</root>
How can I fix it?
Answer 1
Score: 1
Consider using saxonche and XSLT 3.0:
from saxonche import PySaxonProcessor
with PySaxonProcessor(license=False) as saxon_proc:
    xslt30_processor = saxon_proc.new_xslt30_processor()
    xslt30_processor.transform_to_file(
        source_file='sample1.xml',
        stylesheet_file='serialize-wrap-in-cdata1.xsl',
        output_file='result-sample1.xml')
The XSLT 3.0 stylesheet is, for example:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
expand-text="yes"
version="3.0">
<xsl:param name="cdata-tag-names" as="xs:string*" static="yes" select="'tag'"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="xml" _cdata-section-elements="{$cdata-tag-names}"/>
<xsl:template _match="{$cdata-tag-names => string-join(' | ')}">
<xsl:copy>{serialize(node())}</xsl:copy>
</xsl:template>
</xsl:stylesheet>
sample1.xml is your input:
<?xml version="1.0" encoding="utf-8" ?>
<root>
<tag>
some data
<!-- some data2 -->
<!-- some data2 -->
some data
</tag>
</root>
Public Gist with the files: https://gist.github.com/martin-honnen/61b91233fd73369d55f392ad4a0cee0b.
Example fiddle using SaxonC HE is at this link.
Answer 2
Score: 0
If you want to concatenate all of the text within the <tag> elements, you can call str.join on the result of the element's itertext method. This joins all of the text, including whitespace, before passing it to etree.CDATA.
for tag in root.findall("tag"):
    tag.text = etree.CDATA(''.join(tag.itertext()))
In your example, the comments are child elements of the <tag> element, so the text after each comment is stored as that comment's tail. The itertext method iterates over the tail text as well.
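The text/tail layout described above can be inspected directly. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml, on the assumption that Python 3.8+ is available (insert_comments is needed there to keep comments in the tree, whereas lxml keeps them by default). The sample data and variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <tag>
    some data 1
    <!-- some data2 -->
    <!-- some data3 -->
    some data 4
  </tag>
</root>"""

# insert_comments=True keeps comment nodes in the tree (Python 3.8+).
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(XML, parser=parser)
tag = root.find("tag")

# .text stops at the first child node, which here is a comment.
leading_text = tag.text.strip()

# Each comment is a child whose .tail holds the text that follows it.
comments = [child for child in tag if child.tag is ET.Comment]
tails = [(child.tail or "").strip() for child in comments]

print(repr(leading_text))
print([c.text for c in comments])
print(tails)
```

This is why wrapping only tag.text in CDATA leaves the comments and the trailing text outside the CDATA section: they belong to the child nodes, not to tag.text.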
Answer 3
Score: 0
I found a tricky way to handle the text, comments, and tails together:
tmp = etree.tostring(tag).decode()
# here you need to remove the <tag> wrapper from the tmp string
tag.clear()
tag.text = etree.CDATA(tmp)
If someone knows a more correct or elegant way to do this (for example, something like tag.all), please share it.
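One way to avoid stripping the <tag> wrapper from the string by hand is to serialize the children instead of the element itself. This is a sketch of a different technique from the snippet above, not a built-in lxml feature: wrap_children_in_cdata is a hypothetical helper name, and the inline sample stands in for data.xml.

```python
from lxml import etree

XML = b"""<root>
  <tag>
    some data
    <!-- some data2 -->
    <!-- some data2 -->
    some data
  </tag>
</root>"""


def wrap_children_in_cdata(elem):
    """Hypothetical helper: replace elem's content with one CDATA section."""
    # Leading text, then every child serialized together with its tail
    # (etree.tostring includes the tail by default).
    inner = (elem.text or "") + "".join(
        etree.tostring(child, encoding="unicode") for child in elem
    )
    # Drop the children; lxml's remove() takes each child's tail with it.
    for child in list(elem):
        elem.remove(child)
    elem.text = etree.CDATA(inner)


root = etree.fromstring(XML)
for tag in root.findall("tag"):
    wrap_children_in_cdata(tag)

result = etree.tostring(root, encoding="unicode")
print(result)
```

Because the wrapper element is never serialized, there is nothing to cut out of the string afterwards, and the element keeps its own attributes and tail.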
Answer 4
Score: 0
Iterate over the tag element's children to collect its text, plus the text representation of each comment (without its tail), plus any tail text (which includes indentation). Then remove each child and populate the tag element with the CDATA-wrapped text.
from lxml import etree
parser = etree.XMLParser()
tree = etree.parse("tmp.xml", parser)
root = tree.getroot()
for s in root.findall("tag"):
    # text and tail are None when empty, so fall back to ""
    t = s.text or ""
    for ele in s.iterchildren():
        t += etree.tostring(ele, with_tail=False).decode("utf8")
        t += ele.tail or ""
        # remove the child; its tail is removed along with it
        ele.getparent().remove(ele)
    s.text = etree.CDATA(t)
    # print(etree.tostring(s).decode("utf8"))
print(etree.tostring(tree, with_tail=True).decode("utf8"))
Result
<root>
<tag><![CDATA[
some data
<!-- some data2 -->
<!-- some data2 -->
some data
]]></tag>
</root>
Answer 5
Score: 0
xml.etree.ElementTree has ET.iterparse(), which can report events, including comments:
import xml.etree.ElementTree as ET
from io import StringIO
xml_file = """<?xml version="1.0" encoding="utf-8" ?>
<root>
<tag>
some data 1
<!-- some data2 -->
<!-- some data3 -->
some data 4
</tag>
</root>
"""
f = StringIO(xml_file)
for event, elem in ET.iterparse(f, events=('start', 'comment')):
    if event == 'start' and elem.tag == 'tag':
        print('Text start', elem.text)
    # comment nodes use the ET.Comment factory function as their tag
    if elem.tag is ET.Comment:
        print("Comment", elem.text)
Output:
Text start
some data 1
some data 4
Comment some data2
Comment some data3
And here is the lxml adaptation:
from lxml import etree
from io import BytesIO
xml_file = """<?xml version="1.0" encoding="utf-8" ?>
<root>
<tag>
some data 1
<!-- some data2 -->
<!-- some data3 -->
some data 4
</tag>
</root>
"""
f = BytesIO(xml_file.encode('utf-8'))
def cdata(text):
    # join the collected pieces and emit them as a single CDATA section
    tex = ' '.join(text)
    root = etree.Element('root')
    tag = etree.SubElement(root, 'tag')
    tag.text = etree.CDATA(tex)
    etree.dump(root)
tex = []
for event, elem in etree.iterparse(f, events=('start', 'comment')):
    if event == 'start' and elem.tag == 'tag':
        tex.append(elem.text.strip())
    # comment nodes use the etree.Comment factory function as their tag
    if elem.tag is etree.Comment:
        com = f"<!--{elem.text}-->"
        tex.append(com)
        tex.append((elem.tail or "").strip())
cdata(tex)
Output:
<root>
<tag><![CDATA[some data 1 <!-- some data2 -->. <!-- some data3 --> some data 4]]></tag>
</root>