2023-02-19 01:47:26

Parsing XML with embedded node in text to DataFrame

Question

I have an XML like this:

<root>
    <epig>
    string1
    <tit>string2</tit>
    string3
    </epig>
</root>

I'm trying to build a data frame with the following:

dftext = pd.read_xml("filename.xml", xpath='root/epig')

which returns a data frame with a column epig containing string1 and a column tit containing string2, but string3 is missing from the data frame. This is the current output:

epig	tit
string1	string2

The data frame output should instead be:

epig	tit
string1+string3	string2

Where's my error?

Answer 1

Score: 1

In XML terms, there are three nodes under the <epig> element: two text nodes and the <tit> element node. To retrieve the second text node with Python's etree library, you have to use the .tail attribute of the tit element. In Pandas, read_xml (a convenience method designed to parse flat XML, not all XML types) only parses the first text node, since it does not iterate across multiple text nodes.
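To see where string3 ends up, here is a minimal sketch using the standard library's xml.etree.ElementTree on the XML from the question. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# The XML from the question, inlined as a string for the demo.
xml_doc = """<root>
    <epig>
    string1
    <tit>string2</tit>
    string3
    </epig>
</root>"""

root = ET.fromstring(xml_doc)
epig = root.find("epig")
tit = epig.find("tit")

# Text that precedes the first child is stored on the parent as .text.
first_text = epig.text.strip()
# The child's own content.
tit_text = tit.text.strip()
# Text that follows </tit> is stored on the child as .tail, not on <epig>.
trailing_text = tit.tail.strip()

print(first_text, tit_text, trailing_text)
# prints: string1 string2 string3
```

So string3 is not lost by the parser; it simply hangs off the tit element as its tail rather than belonging to epig's text.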

For this special case of multiple text nodes, consider restructuring the XML with XSLT, a special-purpose language designed to transform XML files. read_xml supports it through the stylesheet argument, which requires the default lxml parser (not the etree parser).

XSLT (save as an .xsl file, which is a special kind of .xml file)

The stylesheet below concatenates both text nodes into a new <epig> child element, which becomes a sibling of <tit>. Both sit under a new parent <item>, which is the node used in xpath.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="/root">
     <xsl:copy>
       <xsl:apply-templates select="epig"/>
     </xsl:copy>
    </xsl:template>

    <xsl:template match="epig">
     <item>
       <epig>
         <xsl:value-of select="normalize-space(concat(text()[1], text()[2]))"/>
       </epig>
       <xsl:copy-of select="tit"/>
     </item>
    </xsl:template>
</xsl:stylesheet>

Python

The call below parses all <item> nodes of the flattened XSLT output.

dftext = pd.read_xml("filename.xml", xpath=".//item", stylesheet="style.xsl")

dftext
#               epig      tit
# 0  string1 string3  string2
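If XSLT is not an option, a different route is to collect the rows by hand with the etree library. The sketch below is an alternative to the stylesheet, not part of the read_xml approach. The helper name epig_records is hypothetical, and the code assumes every child of <epig> should become its own column.

```python
import xml.etree.ElementTree as ET

def epig_records(xml_text):
    """Return one dict per <epig>, joining its own text with its children's tails."""
    records = []
    for epig in ET.fromstring(xml_text).iter("epig"):
        # Text owned by <epig>: its leading .text plus the .tail of each child.
        pieces = [epig.text or ""] + [child.tail or "" for child in epig]
        # Collapse all runs of whitespace, like normalize-space() in the XSLT.
        record = {"epig": " ".join(" ".join(pieces).split())}
        # Each child element becomes a column keyed by its tag name.
        for child in epig:
            record[child.tag] = (child.text or "").strip()
        records.append(record)
    return records

# The XML from the question, inlined as a string for the demo.
xml_doc = """<root>
    <epig>
    string1
    <tit>string2</tit>
    string3
    </epig>
</root>"""

records = epig_records(xml_doc)
print(records)
# prints: [{'epig': 'string1 string3', 'tit': 'string2'}]
```

Passing the resulting list to pd.DataFrame(records) should give the same two columns, epig and tit, as the XSLT version.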
