Parsing XML with embedded node in text to DataFrame
Question
I have an XML like this:
<root>
  <epig>
    string1
    <tit>string2</tit>
    string3
  </epig>
</root>
I'm trying to build a data frame with the following:
dftext = pd.read_xml("filename.xml", xpath='root/epig')
which returns a data frame with a column epig containing string1 and a column tit containing string2, but string3 is missing from the data frame. This is the current output:
| epig | tit |
|---|---|
| string1 | string2 |
The data frame output should instead be:
| epig | tit |
|---|---|
| string1+string3 | string2 |
Where's my error?
Answer 1
Score: 1
In XML terms, there are three nodes under the <epig> element: two text nodes and the <tit> element node. To retrieve the second text node with Python's etree library, you have to use the .tail attribute on the tit element. In Pandas, read_xml (a convenience method designed to parse flat XML, not every XML shape) only parses the first text node, since it does not iterate across multiple text nodes.
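To make the .text / .tail distinction concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree on the XML from the question. The variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Same document as in the question, inlined as a string for the example.
xml_doc = """<root>
<epig>
string1
<tit>string2</tit>
string3
</epig>
</root>"""

root = ET.fromstring(xml_doc)
epig = root.find("epig")
tit = epig.find("tit")

leading = epig.text.strip()   # text before the first child of <epig>
inner = tit.text.strip()      # text inside <tit>
trailing = tit.tail.strip()   # text after </tit>, stored on the tit element

print(leading, inner, trailing)
```

The trailing text belongs to the tit element rather than to epig, which is why reading only the text of epig never sees string3.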
For this special use case of multiple text nodes, consider restructuring the XML with XSLT, a special-purpose language designed to transform XML files. read_xml supports it through the stylesheet argument, which requires the default lxml parser (not the etree parser).
XSLT (save as .xsl, a special .xml file)
The stylesheet below concatenates both text nodes into a new <epig> child element that becomes a sibling of <tit>, with both placed under a new parent <item>, which is the node used in xpath.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/root">
    <xsl:copy>
      <xsl:apply-templates select="epig"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="epig">
    <item>
      <epig>
        <xsl:value-of select="normalize-space(concat(text()[1], text()[2]))"/>
      </epig>
      <xsl:copy-of select="tit"/>
    </item>
  </xsl:template>
</xsl:stylesheet>
Python
The call below parses all <item> nodes of the flattened XSLT output.
dftext = pd.read_xml("filename.xml", xpath=".//item", stylesheet="style.xsl")
dftext
#               epig      tit
# 0  string1 string3  string2
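If lxml or XSLT is not an option, the same flattening could be sketched in plain Python by joining .text and every child's .tail. This is an alternative to the XSLT approach above, not part of it; epig_rows is a hypothetical helper name, and it assumes each child tag appears at most once per <epig>.

```python
import xml.etree.ElementTree as ET

def epig_rows(xml_text):
    """Hypothetical helper: return one dict per <epig> element.

    The "epig" value joins the element's own text with the tail text of
    each child; every child element becomes its own key.
    """
    rows = []
    for epig in ET.fromstring(xml_text).iter("epig"):
        parts = [epig.text or ""]
        children = {}
        for child in epig:
            children[child.tag] = (child.text or "").strip()
            parts.append(child.tail or "")
        # Collapse runs of whitespace, similar to normalize-space() in XSLT.
        combined = " ".join(" ".join(parts).split())
        rows.append({"epig": combined, **children})
    return rows

sample = """<root>
<epig>
string1
<tit>string2</tit>
string3
</epig>
</root>"""

rows = epig_rows(sample)
print(rows)
```

Passing rows to pd.DataFrame(rows) should then give the epig and tit columns without going through read_xml.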