2023年6月5日 17:58:47go评论114阅读模式

英文:

Python xml to dataframe tag in tags

问题

我有以下的xml文件：

&lt;pdv_liste&gt;
&lt;pdv id=&quot;10&quot; latitude=&quot;46&quot; longitude=&quot;52&quot; cp=&quot;01000&quot; pop=&quot;R&quot;&gt;
  &lt;city&gt;LA&lt;/city&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-02T09:37:03&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-03T09:54:58&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-06T12:33:57&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-09T08:59:53&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-10T10:38:39&quot; value=&quot;1258&quot;/&gt;
&lt;/pdv&gt;
&lt;pdv id=&quot;2&quot; latitude=&quot;46&quot; longitude=&quot;53&quot; cp=&quot;01000&quot; pop=&quot;R&quot;&gt;
  &lt;city&gt;NY&lt;/city&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-03T09:38:59&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-06T11:19:09&quot; value=&quot;1258&quot;/&gt;
&lt;/pdv&gt;
&lt;/pdv_liste&gt;

我想要获取以下的数据框：

id latitude longitude city name   maj                 value
10 46       52        LA   diesel 2017-01-02T09:37:03 1258
10 46       52        LA   diesel 2017-01-03T09:54:58 1258
10 46       52        LA   diesel 2017-01-06T12:33:57 1258
10 46       52        LA   diesel 2017-01-09T08:59:53 1258
10 46       52        LA   diesel 2017-01-10T10:38:39 1258
2  46       53        NY   diesel 2017-01-03T09:38:59 1258
2  46       53        NY   diesel 2017-01-06T11:19:09 1258

我尝试了以下的代码：

df = pd.read_xml(&quot;myfile.xml&quot;, xpath=&quot;//price&quot;)

但我得到了以下我不想要的数据框：

name   id maj                 value
diesel 1  2017-01-02T09:37:03 1258
diesel 1  2017-01-03T09:54:58 1258
diesel 1  2017-01-06T12:33:57 1258
diesel 1  2017-01-09T08:59:53 1258
diesel 1  2017-01-10T10:38:39 1258
diesel 1  2017-01-03T09:38:59 1258
diesel 1  2017-01-06T11:19:09 1258

有多个pdv。我应该如何更改我的代码？

英文:

I have the following xml file:

&lt;pdv_liste&gt;
&lt;pdv id=&quot;10&quot; latitude=&quot;46&quot; longitude=&quot;52&quot; cp=&quot;01000&quot; pop=&quot;R&quot;&gt;
  &lt;city&gt;LA&lt;/city&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-02T09:37:03&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-03T09:54:58&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-06T12:33:57&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-09T08:59:53&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-10T10:38:39&quot; value=&quot;1258&quot;/&gt;
&lt;/pdv&gt;
&lt;pdv id=&quot;2&quot; latitude=&quot;46&quot; longitude=&quot;53&quot; cp=&quot;01000&quot; pop=&quot;R&quot;&gt;
  &lt;city&gt;NY&lt;/city&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-03T09:38:59&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-06T11:19:09&quot; value=&quot;1258&quot;/&gt;
&lt;/pdv&gt;
&lt;/pdv_liste&gt;

I want to obtain the following dataframe:

id latitude longitude city name   maj                 value
10 46       52        LA   diesel 2017-01-02T09:37:03 1258
10 46       52        LA   diesel 2017-01-03T09:54:58 1258
10 46       52        LA   diesel 2017-01-06T12:33:57 1258
10 46       52        LA   diesel 2017-01-09T08:59:53 1258
10 46       52        LA   diesel 2017-01-10T10:38:39 1258
2  46       53        NY   diesel 2017-01-03T09:38:59 1258
2  46       53        NY   diesel 2017-01-06T11:19:09 1258

I tried the following code:

df = pd.read_xml(&quot;myfile.xml&quot;, xpath=&quot;//price&quot;)

But I obtain the following dataframe that I do not want:

name   id maj                 value
diesel 1  2017-01-02T09:37:03 1258
diesel 1  2017-01-03T09:54:58 1258
diesel 1  2017-01-06T12:33:57 1258
diesel 1  2017-01-09T08:59:53 1258
diesel 1  2017-01-10T10:38:39 1258
diesel 1  2017-01-03T09:38:59 1258
diesel 1  2017-01-06T11:19:09 1258

There are multiple pdv. How I should change my code?

答案1

得分: 1

以下是使用[lxml]的一个选项：

from lxml import etree
tree = etree.parse("myfile.xml")
df = pd.DataFrame(
    [
        {
            "id": pdv.attrib["id"],
            "latitude": pdv.attrib["latitude"],
            "longitude": pdv.attrib["longitude"],
            "city": pdv.findtext("city"),
            "name": price.attrib["name"],
            "maj": price.attrib["maj"],
            "value": price.attrib["value"]
        }
        for pdv in tree.getroot().findall(".//pdv")
        for price in pdv.findall("price")
    ]
)

输出：

print(df)
   id latitude longitude city    name                  maj value
0  10       46        52   LA  diesel  2017-01-02T09:37:03  1258
1  10       46        52   LA  diesel  2017-01-03T09:54:58  1258
2  10       46        52   LA  diesel  2017-01-06T12:33:57  1258
3  10       46        52   LA  diesel  2017-01-09T08:59:53  1258
4  10       46        52   LA  diesel  2017-01-10T10:38:39  1258
5   2       46        53   NY  diesel  2017-01-03T09:38:59  1258
6   2       46        53   NY  diesel  2017-01-06T11:19:09  1258

如果你想使用[excel]来解析/保存电子表格以使用read_excel：

打开一个新的工作簿，选择Open/Browse/myfile.xml，然后点击OK：

Python 将 XML 转换为数据框中的标签中的标签。

之后，再次点击OK以便Excel尝试推断模式：

Python 将 XML 转换为数据框中的标签中的标签。

最后，将创建一个新的电子表格/表格：

Python 将 XML 转换为数据框中的标签中的标签。

英文:

Here is one option with [tag:lxml] :

from lxml import etree
tree = etree.parse(&quot;myfile.xml&quot;)
df = pd.DataFrame(
    [
        {
            &quot;id&quot;: pdv.attrib[&quot;id&quot;],
            &quot;latitude&quot;: pdv.attrib[&quot;latitude&quot;],
            &quot;longitude&quot;: pdv.attrib[&quot;longitude&quot;],
            &quot;city&quot;: pdv.findtext(&quot;city&quot;),
            &quot;name&quot;: price.attrib[&quot;name&quot;],
            &quot;maj&quot;: price.attrib[&quot;maj&quot;],
            &quot;value&quot;: price.attrib[&quot;value&quot;]
        }
        for pdv in tree.getroot().findall(&quot;.//pdv&quot;)
        for price in pdv.findall(&quot;price&quot;)
    ]
)

Output :

print(df)
   id latitude longitude city    name                  maj value
0  10       46        52   LA  diesel  2017-01-02T09:37:03  1258
1  10       46        52   LA  diesel  2017-01-03T09:54:58  1258
2  10       46        52   LA  diesel  2017-01-06T12:33:57  1258
3  10       46        52   LA  diesel  2017-01-09T08:59:53  1258
4  10       46        52   LA  diesel  2017-01-10T10:38:39  1258
5   2       46        53   NY  diesel  2017-01-03T09:38:59  1258
6   2       46        53   NY  diesel  2017-01-06T11:19:09  1258

If you are tempted to use [tag:excel] to parse/save a spreadsheet in order to use read_excel :

Open a new workbook, choose Open/Browse/myfile.xml, then click OK :

After that, confirm with another OK so Excel tries to infer the schema :

And finally, a new spreadsheet/table will be created :

答案2

得分: 1

根据您提供的代码和输出，以下是翻译好的部分：

import pandas as pd
x_s = '''<?xml version="1.0" encoding="UTF-8"?>
<pdv_liste>
<pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
  <city>LA</city>
  <price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
  <city>NY</city>
  <price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
</pdv>
</pdv_liste>'''
df_style = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
    <xsl:template match="/pdv_liste">
        <xsl:variable name="var1_initial" select="."/>
        <result>
            <xsl:for-each select="pdv">
                <xsl:variable name="var2_level" select="."/>
                <xsl:for-each select="price">
                    <xsl:variable name="var3_level" select="."/>
                    <ROW>
                        <id>
                            <xsl:value-of select="$var2_level/@id"/>
                        </id>
                        <latitude>
                            <xsl:value-of select="$var2_level/@latitude"/>
                        </latitude>
                        <longitude>
                            <xsl:value-of select="$var2_level/@longitude"/>
                        </longitude>
                        <name>
                            <xsl:value-of select="$var3_level/@name"/>
                        </name>
                        <maj>
                            <xsl:value-of select="$var3_level/@maj"/>
                        </maj>
                        <value>
                            <xsl:value-of select="$var3_level/@value"/>
                        </value>
                        <city>
                            <xsl:value-of select="$var2_level"/>
                        </city>
                    </ROW>
                </xsl:for-each>
            </xsl:for-each>
        </result>
    </xsl:template>
</xsl:stylesheet>'''
df = pd.read_xml(x_s, stylesheet=df_style)
df

作为结果，我得到了：

   id  latitude  longitude    name                  maj  value city
0  10        46         52  diesel  2017-01-02T09:37:03   1258   LA
1  10        46         52  diesel  2017-01-03T09:54:58   1258   LA
2  10        46         52  diesel  2017-01-06T12:33:57   1258   LA
3  10        46         52  diesel  2017-01-09T08:59:53   1258   LA
4  10        46         52  diesel  2017-01-10T10:38:39   1258   LA
5   2        46         53  diesel  2017-01-03T09:38:59   1258   NY
6   2        46         53  diesel  2017-01-06T11:19:09   1258   NY

英文:

pandas documentation:

> for more complex XML documents, stylesheet allows you to temporarily
> redesign original document with XSLT (a special purpose language) for
> a flatter version for migration to a DataFrame.

So one of solutions is to denormalize your xml, using xslt transformation stylesheet. Here the code how to use it with pandas:

import pandas as pd
x_s = &#39;&#39;&#39;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;pdv_liste&gt;
&lt;pdv id=&quot;10&quot; latitude=&quot;46&quot; longitude=&quot;52&quot; cp=&quot;01000&quot; pop=&quot;R&quot;&gt;
  &lt;city&gt;LA&lt;/city&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-02T09:37:03&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-03T09:54:58&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-06T12:33:57&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-09T08:59:53&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-10T10:38:39&quot; value=&quot;1258&quot;/&gt;
&lt;/pdv&gt;
&lt;pdv id=&quot;2&quot; latitude=&quot;46&quot; longitude=&quot;53&quot; cp=&quot;01000&quot; pop=&quot;R&quot;&gt;
  &lt;city&gt;NY&lt;/city&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-03T09:38:59&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-06T11:19:09&quot; value=&quot;1258&quot;/&gt;
&lt;/pdv&gt;
&lt;/pdv_liste&gt;&#39;&#39;&#39;
df_style = &#39;&#39;&#39;&lt;xsl:stylesheet version=&quot;1.0&quot; xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot; xmlns:xs=&quot;http://www.w3.org/2001/XMLSchema&quot; exclude-result-prefixes=&quot;xs&quot;&gt;
    &lt;xsl:output method=&quot;xml&quot; encoding=&quot;UTF-8&quot; indent=&quot;yes&quot;/&gt;
    &lt;xsl:template match=&quot;/pdv_liste&quot;&gt;
        &lt;xsl:variable name=&quot;var1_initial&quot; select=&quot;.&quot;/&gt;
        &lt;result&gt;
            &lt;xsl:for-each select=&quot;pdv&quot;&gt;
                &lt;xsl:variable name=&quot;var2_level&quot; select=&quot;.&quot;/&gt;
                &lt;xsl:for-each select=&quot;price&quot;&gt;
                    &lt;xsl:variable name=&quot;var3_level&quot; select=&quot;.&quot;/&gt;
                    &lt;ROW&gt;
                        &lt;id&gt;
                            &lt;xsl:value-of select=&quot;$var2_level/@id&quot;/&gt;
                        &lt;/id&gt;
                        &lt;latitude&gt;
                            &lt;xsl:value-of select=&quot;$var2_level/@latitude&quot;/&gt;
                        &lt;/latitude&gt;
                        &lt;longitude&gt;
                            &lt;xsl:value-of select=&quot;$var2_level/@longitude&quot;/&gt;
                        &lt;/longitude&gt;
                        &lt;name&gt;
                            &lt;xsl:value-of select=&quot;$var3_level/@name&quot;/&gt;
                        &lt;/name&gt;
                        &lt;maj&gt;
                            &lt;xsl:value-of select=&quot;$var3_level/@maj&quot;/&gt;
                        &lt;/maj&gt;
                        &lt;value&gt;
                            &lt;xsl:value-of select=&quot;$var3_level/@value&quot;/&gt;
                        &lt;/value&gt;
                        &lt;city&gt;
                            &lt;xsl:value-of select=&quot;$var2_level&quot;/&gt;
                        &lt;/city&gt;
                    &lt;/ROW&gt;
                &lt;/xsl:for-each&gt;
            &lt;/xsl:for-each&gt;
        &lt;/result&gt;
    &lt;/xsl:template&gt;
&lt;/xsl:stylesheet&gt;&#39;&#39;&#39;
df = pd.read_xml(x_s, stylesheet=df_style)
df

As result I got:

   id  latitude  longitude    name                  maj  value city
0  10        46         52  diesel  2017-01-02T09:37:03   1258   LA
1  10        46         52  diesel  2017-01-03T09:54:58   1258   LA
2  10        46         52  diesel  2017-01-06T12:33:57   1258   LA
3  10        46         52  diesel  2017-01-09T08:59:53   1258   LA
4  10        46         52  diesel  2017-01-10T10:38:39   1258   LA
5   2        46         53  diesel  2017-01-03T09:38:59   1258   NY
6   2        46         53  diesel  2017-01-06T11:19:09   1258   NY

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Python 将 XML 转换为数据框中的标签中的标签。

问题

答案1

答案2

PyTables 在 macOS M1 上安装时与 Python 3.11 失败。

Python映射调试

无法在Ubuntu 20.04 Docker镜像中安装Poetry，因为出现了setuptools错误。

Python numpy: Add elements of a numpy array of arrays to elements of another array of arrays initialized to at the specified positions

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

发表评论