Parsing XML with embedded node in text to DataFrame

Question

I have an XML like this:

<root>
    <epig>
    string1
    <tit>string2</tit>
    string3
    </epig>
</root>

I'm trying to build a data frame with the following:

dftext = pd.read_xml("filename.xml", xpath='root/epig')

This returns a data frame with a column epig containing string1 and a column tit containing string2, but string3 is missing from the data frame. This is the current output:

epig tit
string1 string2

The data frame output instead should be:

epig tit
string1+string3 string2

Where's my error?


Answer 1

Score: 1


In XML terms, the <epig> element has three child nodes: two text nodes and the <tit> element. To retrieve the second text node with Python's etree library, you have to use the .tail attribute of the tit element. In pandas, read_xml (a convenience method designed for flat XML, not every XML shape) parses only the first text node because it does not iterate across multiple text nodes.
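As a rough illustration of the .text and .tail distinction described above, the sketch below parses the question's XML with the standard-library xml.etree.ElementTree module. The variable names (xml_doc, first_text, trailing_text) are made up for this example and do not come from the original post.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, inlined for a self-contained demo.
xml_doc = """<root>
    <epig>
    string1
    <tit>string2</tit>
    string3
    </epig>
</root>"""

root = ET.fromstring(xml_doc)
epig = root.find("epig")
tit = epig.find("tit")

# .text holds the text between <epig> and its first child element.
first_text = (epig.text or "").strip()
# .text of <tit> is the content of the child element itself.
tit_text = (tit.text or "").strip()
# .tail holds the text that follows </tit>, up to the next tag.
trailing_text = (tit.tail or "").strip()

print(first_text, tit_text, trailing_text)
```

The second text node is stored on the tit element rather than on epig, which is why a parser that reads only each element's .text never sees string3.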

For this special case of multiple text nodes, consider restructuring the XML with XSLT, a special-purpose language designed to transform XML files. read_xml supports it through the stylesheet argument, which requires the default lxml parser (not the etree parser).

XSLT (save as .xsl, a special .xml file)

The stylesheet below concatenates both text nodes into a new <epig> child element that becomes a sibling of <tit>, with both placed under a new parent <item> that is used in the xpath.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="/root">
     <xsl:copy>
       <xsl:apply-templates select="epig"/>
     </xsl:copy>
    </xsl:template>

    <xsl:template match="epig">
     <item>
       <epig>
         <xsl:value-of select="normalize-space(concat(text()[1], text()[2]))"/>
       </epig>
       <xsl:copy-of select="tit"/>
     </item>
    </xsl:template>
</xsl:stylesheet>


Python

The call below parses all <item> nodes of the flattened XSLT output.

import pandas as pd

dftext = pd.read_xml("filename.xml", xpath=".//item", stylesheet="style.xsl")

dftext
#               epig      tit
# 0  string1 string3  string2
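
If XSLT is not an option, the same flattening could be sketched in plain Python with the standard-library etree module. This is an alternative under stated assumptions, not the approach from the answer: the helper name epig_rows is hypothetical, and it assumes every text node directly under <epig> should be joined with single spaces, mirroring what normalize-space does in the stylesheet.

```python
import xml.etree.ElementTree as ET


def epig_rows(xml_text):
    """Return one dict per <epig>, joining all text nodes directly under it.

    Hypothetical helper for illustration; not part of pandas or etree.
    """
    rows = []
    for epig in ET.fromstring(xml_text).iter("epig"):
        # Leading text lives in epig.text; later text nodes live in the
        # .tail of each child element.
        pieces = [epig.text or ""]
        pieces += [child.tail or "" for child in epig]
        tit = epig.find("tit")
        rows.append({
            # split() + join collapses runs of whitespace, like normalize-space.
            "epig": " ".join(" ".join(pieces).split()),
            "tit": (tit.text or "").strip() if tit is not None else None,
        })
    return rows


xml_doc = """<root>
    <epig>
    string1
    <tit>string2</tit>
    string3
    </epig>
</root>"""

rows = epig_rows(xml_doc)
print(rows)
```

Passing rows to pd.DataFrame(rows) should then give the same two columns, epig and tit, as the XSLT route.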

huangapple
  • Posted on February 19, 2023, 01:47:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75495243.html