Python 将 XML 转换为数据框中的标签中的标签。

huangapple go评论79阅读模式
英文:

Python xml to dataframe tag in tags

问题

我有以下的xml文件:

<pdv_liste>
<pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
  <city>LA</city>
  <price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
  <city>NY</city>
  <price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
</pdv>
</pdv_liste>

我想要获取以下的数据框:

id latitude longitude city name   maj                 value
10 46       52        LA   diesel 2017-01-02T09:37:03 1258
10 46       52        LA   diesel 2017-01-03T09:54:58 1258
10 46       52        LA   diesel 2017-01-06T12:33:57 1258
10 46       52        LA   diesel 2017-01-09T08:59:53 1258
10 46       52        LA   diesel 2017-01-10T10:38:39 1258
2  46       53        NY   diesel 2017-01-03T09:38:59 1258
2  46       53        NY   diesel 2017-01-06T11:19:09 1258

我尝试了以下的代码:

df = pd.read_xml("myfile.xml", xpath="//price") 

但我得到了以下我不想要的数据框:

name   id maj                 value
diesel 1  2017-01-02T09:37:03 1258
diesel 1  2017-01-03T09:54:58 1258
diesel 1  2017-01-06T12:33:57 1258
diesel 1  2017-01-09T08:59:53 1258
diesel 1  2017-01-10T10:38:39 1258
diesel 1  2017-01-03T09:38:59 1258
diesel 1  2017-01-06T11:19:09 1258

有多个pdv。我应该如何更改我的代码?

英文:

I have the following xml file:

<pdv_liste>
<pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
  <city>LA</city>
  <price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
  <city>NY</city>
  <price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
</pdv>
</pdv_liste>

I want to obtain the following dataframe:

id latitude longitude city name   maj                 value
10 46       52        LA   diesel 2017-01-02T09:37:03 1258
10 46       52        LA   diesel 2017-01-03T09:54:58 1258
10 46       52        LA   diesel 2017-01-06T12:33:57 1258
10 46       52        LA   diesel 2017-01-09T08:59:53 1258
10 46       52        LA   diesel 2017-01-10T10:38:39 1258
2  46       53        NY   diesel 2017-01-03T09:38:59 1258
2  46       53        NY   diesel 2017-01-06T11:19:09 1258

I tried the following code:

df = pd.read_xml("myfile.xml", xpath="//price") 

But I obtain the following dataframe that I do not want:

name   id maj                 value
diesel 1  2017-01-02T09:37:03 1258
diesel 1  2017-01-03T09:54:58 1258
diesel 1  2017-01-06T12:33:57 1258
diesel 1  2017-01-09T08:59:53 1258
diesel 1  2017-01-10T10:38:39 1258
diesel 1  2017-01-03T09:38:59 1258
diesel 1  2017-01-06T11:19:09 1258

There are multiple pdv. How I should change my code?

答案1

得分: 1

以下是使用[lxml]的一个选项:

from lxml import etree

tree = etree.parse("myfile.xml")

df = pd.DataFrame(
    [
        {
            "id": pdv.attrib["id"],
            "latitude": pdv.attrib["latitude"],
            "longitude": pdv.attrib["longitude"],
            "city": pdv.findtext("city"),
            "name": price.attrib["name"],
            "maj": price.attrib["maj"],
            "value": price.attrib["value"]
        }
        for pdv in tree.getroot().findall(".//pdv")
        for price in pdv.findall("price")
    ]
)

输出:

print(df)

   id latitude longitude city    name                  maj value
0  10       46        52   LA  diesel  2017-01-02T09:37:03  1258
1  10       46        52   LA  diesel  2017-01-03T09:54:58  1258
2  10       46        52   LA  diesel  2017-01-06T12:33:57  1258
3  10       46        52   LA  diesel  2017-01-09T08:59:53  1258
4  10       46        52   LA  diesel  2017-01-10T10:38:39  1258
5   2       46        53   NY  diesel  2017-01-03T09:38:59  1258
6   2       46        53   NY  diesel  2017-01-06T11:19:09  1258

如果你想使用[excel]来解析/保存电子表格以使用read_excel

打开一个新的工作簿,选择Open/Browse/myfile.xml,然后点击OK

Python 将 XML 转换为数据框中的标签中的标签。

之后,再次点击OK以便Excel尝试推断模式:

Python 将 XML 转换为数据框中的标签中的标签。

最后,将创建一个新的电子表格/表格:

Python 将 XML 转换为数据框中的标签中的标签。

英文:

Here is one option with [tag:lxml] :

from lxml import etree

tree = etree.parse("myfile.xml")

df = pd.DataFrame(
    [
        {
            "id": pdv.attrib["id"],
            "latitude": pdv.attrib["latitude"],
            "longitude": pdv.attrib["longitude"],
            "city": pdv.findtext("city"),
            "name": price.attrib["name"],
            "maj": price.attrib["maj"],
            "value": price.attrib["value"]
        }
        for pdv in tree.getroot().findall(".//pdv")
        for price in pdv.findall("price")
    ]
)

Output :

print(df)

   id latitude longitude city    name                  maj value
0  10       46        52   LA  diesel  2017-01-02T09:37:03  1258
1  10       46        52   LA  diesel  2017-01-03T09:54:58  1258
2  10       46        52   LA  diesel  2017-01-06T12:33:57  1258
3  10       46        52   LA  diesel  2017-01-09T08:59:53  1258
4  10       46        52   LA  diesel  2017-01-10T10:38:39  1258
5   2       46        53   NY  diesel  2017-01-03T09:38:59  1258
6   2       46        53   NY  diesel  2017-01-06T11:19:09  1258

If you are tempted to use [tag:excel] to parse/save a spreadsheet in order to use read_excel :

Open a new workbook, choose Open/Browse/myfile.xml, then click OK :

Python 将 XML 转换为数据框中的标签中的标签。

After that, confirm with another OK so Excel tries to infer the schema :

Python 将 XML 转换为数据框中的标签中的标签。

And finally, a new spreadsheet/table will be created :

Python 将 XML 转换为数据框中的标签中的标签。

答案2

得分: 1

根据您提供的代码和输出,以下是翻译好的部分:

import pandas as pd

x_s = '''<?xml version="1.0" encoding="UTF-8"?>
<pdv_liste>
<pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
  <city>LA</city>
  <price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
  <city>NY</city>
  <price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
  <price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
</pdv>
</pdv_liste>'''

df_style = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
    <xsl:template match="/pdv_liste">
        <xsl:variable name="var1_initial" select="."/>
        <result>
            <xsl:for-each select="pdv">
                <xsl:variable name="var2_level" select="."/>
                <xsl:for-each select="price">
                    <xsl:variable name="var3_level" select="."/>
                    <ROW>
                        <id>
                            <xsl:value-of select="$var2_level/@id"/>
                        </id>
                        <latitude>
                            <xsl:value-of select="$var2_level/@latitude"/>
                        </latitude>
                        <longitude>
                            <xsl:value-of select="$var2_level/@longitude"/>
                        </longitude>
                        <name>
                            <xsl:value-of select="$var3_level/@name"/>
                        </name>
                        <maj>
                            <xsl:value-of select="$var3_level/@maj"/>
                        </maj>
                        <value>
                            <xsl:value-of select="$var3_level/@value"/>
                        </value>
                        <city>
                            <xsl:value-of select="$var2_level"/>
                        </city>
                    </ROW>
                </xsl:for-each>
            </xsl:for-each>
        </result>
    </xsl:template>
</xsl:stylesheet>'''

df = pd.read_xml(x_s, stylesheet=df_style)
df

作为结果,我得到了:

   id  latitude  longitude    name                  maj  value city
0  10        46         52  diesel  2017-01-02T09:37:03   1258   LA
1  10        46         52  diesel  2017-01-03T09:54:58   1258   LA
2  10        46         52  diesel  2017-01-06T12:33:57   1258   LA
3  10        46         52  diesel  2017-01-09T08:59:53   1258   LA
4  10        46         52  diesel  2017-01-10T10:38:39   1258   LA
5   2        46         53  diesel  2017-01-03T09:38:59   1258   NY
6   2        46         53  diesel  2017-01-06T11:19:09   1258   NY
英文:

pandas documentation:

> for more complex XML documents, stylesheet allows you to temporarily
> redesign original document with XSLT (a special purpose language) for
> a flatter version for migration to a DataFrame.

So one of solutions is to denormalize your xml, using xslt transformation stylesheet. Here the code how to use it with pandas:

import pandas as pd

x_s = &#39;&#39;&#39;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;pdv_liste&gt;
&lt;pdv id=&quot;10&quot; latitude=&quot;46&quot; longitude=&quot;52&quot; cp=&quot;01000&quot; pop=&quot;R&quot;&gt;
  &lt;city&gt;LA&lt;/city&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-02T09:37:03&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-03T09:54:58&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-06T12:33:57&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-09T08:59:53&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-10T10:38:39&quot; value=&quot;1258&quot;/&gt;
&lt;/pdv&gt;
&lt;pdv id=&quot;2&quot; latitude=&quot;46&quot; longitude=&quot;53&quot; cp=&quot;01000&quot; pop=&quot;R&quot;&gt;
  &lt;city&gt;NY&lt;/city&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-03T09:38:59&quot; value=&quot;1258&quot;/&gt;
  &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-06T11:19:09&quot; value=&quot;1258&quot;/&gt;
&lt;/pdv&gt;
&lt;/pdv_liste&gt;&#39;&#39;&#39;

df_style = &#39;&#39;&#39;&lt;xsl:stylesheet version=&quot;1.0&quot; xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot; xmlns:xs=&quot;http://www.w3.org/2001/XMLSchema&quot; exclude-result-prefixes=&quot;xs&quot;&gt;
    &lt;xsl:output method=&quot;xml&quot; encoding=&quot;UTF-8&quot; indent=&quot;yes&quot;/&gt;
    &lt;xsl:template match=&quot;/pdv_liste&quot;&gt;
        &lt;xsl:variable name=&quot;var1_initial&quot; select=&quot;.&quot;/&gt;
        &lt;result&gt;
            &lt;xsl:for-each select=&quot;pdv&quot;&gt;
                &lt;xsl:variable name=&quot;var2_level&quot; select=&quot;.&quot;/&gt;
                &lt;xsl:for-each select=&quot;price&quot;&gt;
                    &lt;xsl:variable name=&quot;var3_level&quot; select=&quot;.&quot;/&gt;
                    &lt;ROW&gt;
                        &lt;id&gt;
                            &lt;xsl:value-of select=&quot;$var2_level/@id&quot;/&gt;
                        &lt;/id&gt;
                        &lt;latitude&gt;
                            &lt;xsl:value-of select=&quot;$var2_level/@latitude&quot;/&gt;
                        &lt;/latitude&gt;
                        &lt;longitude&gt;
                            &lt;xsl:value-of select=&quot;$var2_level/@longitude&quot;/&gt;
                        &lt;/longitude&gt;
                        &lt;name&gt;
                            &lt;xsl:value-of select=&quot;$var3_level/@name&quot;/&gt;
                        &lt;/name&gt;
                        &lt;maj&gt;
                            &lt;xsl:value-of select=&quot;$var3_level/@maj&quot;/&gt;
                        &lt;/maj&gt;
                        &lt;value&gt;
                            &lt;xsl:value-of select=&quot;$var3_level/@value&quot;/&gt;
                        &lt;/value&gt;
                        &lt;city&gt;
                            &lt;xsl:value-of select=&quot;$var2_level&quot;/&gt;
                        &lt;/city&gt;
                    &lt;/ROW&gt;
                &lt;/xsl:for-each&gt;
            &lt;/xsl:for-each&gt;
        &lt;/result&gt;
    &lt;/xsl:template&gt;
&lt;/xsl:stylesheet&gt;&#39;&#39;&#39;

df = pd.read_xml(x_s, stylesheet=df_style)
df

As result I got:

   id  latitude  longitude    name                  maj  value city
0  10        46         52  diesel  2017-01-02T09:37:03   1258   LA
1  10        46         52  diesel  2017-01-03T09:54:58   1258   LA
2  10        46         52  diesel  2017-01-06T12:33:57   1258   LA
3  10        46         52  diesel  2017-01-09T08:59:53   1258   LA
4  10        46         52  diesel  2017-01-10T10:38:39   1258   LA
5   2        46         53  diesel  2017-01-03T09:38:59   1258   NY
6   2        46         53  diesel  2017-01-06T11:19:09   1258   NY

huangapple
  • 本文由 发表于 2023年6月5日 17:58:47
  • 转载请务必保留本文链接:https://go.coder-hub.com/76405289.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定