英文:
Python xml to dataframe tag in tags
问题
我有以下的xml文件:
<pdv_liste>
<pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
<city>LA</city>
<price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
<price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
<price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
<price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
<price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
<city>NY</city>
<price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
<price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
</pdv>
</pdv_liste>
我想要获取以下的数据框:
id latitude longitude city name maj value
10 46 52 LA diesel 2017-01-02T09:37:03 1258
10 46 52 LA diesel 2017-01-03T09:54:58 1258
10 46 52 LA diesel 2017-01-06T12:33:57 1258
10 46 52 LA diesel 2017-01-09T08:59:53 1258
10 46 52 LA diesel 2017-01-10T10:38:39 1258
2 46 53 NY diesel 2017-01-03T09:38:59 1258
2 46 53 NY diesel 2017-01-06T11:19:09 1258
我尝试了以下的代码:
df = pd.read_xml("myfile.xml", xpath="//price")
但我得到了以下我不想要的数据框:
name id maj value
diesel 1 2017-01-02T09:37:03 1258
diesel 1 2017-01-03T09:54:58 1258
diesel 1 2017-01-06T12:33:57 1258
diesel 1 2017-01-09T08:59:53 1258
diesel 1 2017-01-10T10:38:39 1258
diesel 1 2017-01-03T09:38:59 1258
diesel 1 2017-01-06T11:19:09 1258
有多个pdv
。我应该如何更改我的代码?
英文:
I have the following xml file:
<pdv_liste>
<pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
<city>LA</city>
<price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
<price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
<price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
<price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
<price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
<city>NY</city>
<price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
<price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
</pdv>
</pdv_liste>
I want to obtain the following dataframe:
id latitude longitude city name maj value
10 46 52 LA diesel 2017-01-02T09:37:03 1258
10 46 52 LA diesel 2017-01-03T09:54:58 1258
10 46 52 LA diesel 2017-01-06T12:33:57 1258
10 46 52 LA diesel 2017-01-09T08:59:53 1258
10 46 52 LA diesel 2017-01-10T10:38:39 1258
2 46 53 NY diesel 2017-01-03T09:38:59 1258
2 46 53 NY diesel 2017-01-06T11:19:09 1258
I tried the following code:
df = pd.read_xml("myfile.xml", xpath="//price")
But I obtain the following dataframe that I do not want:
name id maj value
diesel 1 2017-01-02T09:37:03 1258
diesel 1 2017-01-03T09:54:58 1258
diesel 1 2017-01-06T12:33:57 1258
diesel 1 2017-01-09T08:59:53 1258
diesel 1 2017-01-10T10:38:39 1258
diesel 1 2017-01-03T09:38:59 1258
diesel 1 2017-01-06T11:19:09 1258
There are multiple pdv. How I should change my code?
答案1
得分: 1
以下是使用[lxml]的一个选项:
from lxml import etree
tree = etree.parse("myfile.xml")
df = pd.DataFrame(
[
{
"id": pdv.attrib["id"],
"latitude": pdv.attrib["latitude"],
"longitude": pdv.attrib["longitude"],
"city": pdv.findtext("city"),
"name": price.attrib["name"],
"maj": price.attrib["maj"],
"value": price.attrib["value"]
}
for pdv in tree.getroot().findall(".//pdv")
for price in pdv.findall("price")
]
)
输出:
print(df)
id latitude longitude city name maj value
0 10 46 52 LA diesel 2017-01-02T09:37:03 1258
1 10 46 52 LA diesel 2017-01-03T09:54:58 1258
2 10 46 52 LA diesel 2017-01-06T12:33:57 1258
3 10 46 52 LA diesel 2017-01-09T08:59:53 1258
4 10 46 52 LA diesel 2017-01-10T10:38:39 1258
5 2 46 53 NY diesel 2017-01-03T09:38:59 1258
6 2 46 53 NY diesel 2017-01-06T11:19:09 1258
如果你想使用[excel]来解析/保存电子表格以使用read_excel
:
打开一个新的工作簿,选择Open/Browse/myfile.xml
,然后点击OK
:
之后,再次点击OK
以便Excel尝试推断模式:
最后,将创建一个新的电子表格/表格:
英文:
Here is one option with [tag:lxml] :
from lxml import etree
tree = etree.parse("myfile.xml")
df = pd.DataFrame(
[
{
"id": pdv.attrib["id"],
"latitude": pdv.attrib["latitude"],
"longitude": pdv.attrib["longitude"],
"city": pdv.findtext("city"),
"name": price.attrib["name"],
"maj": price.attrib["maj"],
"value": price.attrib["value"]
}
for pdv in tree.getroot().findall(".//pdv")
for price in pdv.findall("price")
]
)
Output :
print(df)
id latitude longitude city name maj value
0 10 46 52 LA diesel 2017-01-02T09:37:03 1258
1 10 46 52 LA diesel 2017-01-03T09:54:58 1258
2 10 46 52 LA diesel 2017-01-06T12:33:57 1258
3 10 46 52 LA diesel 2017-01-09T08:59:53 1258
4 10 46 52 LA diesel 2017-01-10T10:38:39 1258
5 2 46 53 NY diesel 2017-01-03T09:38:59 1258
6 2 46 53 NY diesel 2017-01-06T11:19:09 1258
If you are tempted to use [tag:excel] to parse/save a spreadsheet in order to use read_excel
:
Open a new workbook, choose Open/Browse/myfile.xml
, then click OK
:
After that, confirm with another OK
so Excel tries to infer the schema :
And finally, a new spreadsheet/table will be created :
答案2
得分: 1
根据您提供的代码和输出,以下是翻译好的部分:
import pandas as pd
x_s = '''<?xml version="1.0" encoding="UTF-8"?>
<pdv_liste>
<pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
<city>LA</city>
<price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
<price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
<price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
<price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
<price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
<city>NY</city>
<price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
<price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
</pdv>
</pdv_liste>'''
df_style = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="/pdv_liste">
<xsl:variable name="var1_initial" select="."/>
<result>
<xsl:for-each select="pdv">
<xsl:variable name="var2_level" select="."/>
<xsl:for-each select="price">
<xsl:variable name="var3_level" select="."/>
<ROW>
<id>
<xsl:value-of select="$var2_level/@id"/>
</id>
<latitude>
<xsl:value-of select="$var2_level/@latitude"/>
</latitude>
<longitude>
<xsl:value-of select="$var2_level/@longitude"/>
</longitude>
<name>
<xsl:value-of select="$var3_level/@name"/>
</name>
<maj>
<xsl:value-of select="$var3_level/@maj"/>
</maj>
<value>
<xsl:value-of select="$var3_level/@value"/>
</value>
<city>
<xsl:value-of select="$var2_level"/>
</city>
</ROW>
</xsl:for-each>
</xsl:for-each>
</result>
</xsl:template>
</xsl:stylesheet>'''
df = pd.read_xml(x_s, stylesheet=df_style)
df
作为结果,我得到了:
id latitude longitude name maj value city
0 10 46 52 diesel 2017-01-02T09:37:03 1258 LA
1 10 46 52 diesel 2017-01-03T09:54:58 1258 LA
2 10 46 52 diesel 2017-01-06T12:33:57 1258 LA
3 10 46 52 diesel 2017-01-09T08:59:53 1258 LA
4 10 46 52 diesel 2017-01-10T10:38:39 1258 LA
5 2 46 53 diesel 2017-01-03T09:38:59 1258 NY
6 2 46 53 diesel 2017-01-06T11:19:09 1258 NY
英文:
pandas documentation:
> for more complex XML documents, stylesheet allows you to temporarily
> redesign original document with XSLT (a special purpose language) for
> a flatter version for migration to a DataFrame.
So one of solutions is to denormalize your xml, using xslt transformation stylesheet. Here the code how to use it with pandas:
import pandas as pd
x_s = '''<?xml version="1.0" encoding="UTF-8"?>
<pdv_liste>
<pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
<city>LA</city>
<price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
<price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
<price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
<price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
<price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
</pdv>
<pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
<city>NY</city>
<price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
<price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
</pdv>
</pdv_liste>'''
df_style = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="/pdv_liste">
<xsl:variable name="var1_initial" select="."/>
<result>
<xsl:for-each select="pdv">
<xsl:variable name="var2_level" select="."/>
<xsl:for-each select="price">
<xsl:variable name="var3_level" select="."/>
<ROW>
<id>
<xsl:value-of select="$var2_level/@id"/>
</id>
<latitude>
<xsl:value-of select="$var2_level/@latitude"/>
</latitude>
<longitude>
<xsl:value-of select="$var2_level/@longitude"/>
</longitude>
<name>
<xsl:value-of select="$var3_level/@name"/>
</name>
<maj>
<xsl:value-of select="$var3_level/@maj"/>
</maj>
<value>
<xsl:value-of select="$var3_level/@value"/>
</value>
<city>
<xsl:value-of select="$var2_level"/>
</city>
</ROW>
</xsl:for-each>
</xsl:for-each>
</result>
</xsl:template>
</xsl:stylesheet>'''
df = pd.read_xml(x_s, stylesheet=df_style)
df
As result I got:
id latitude longitude name maj value city
0 10 46 52 diesel 2017-01-02T09:37:03 1258 LA
1 10 46 52 diesel 2017-01-03T09:54:58 1258 LA
2 10 46 52 diesel 2017-01-06T12:33:57 1258 LA
3 10 46 52 diesel 2017-01-09T08:59:53 1258 LA
4 10 46 52 diesel 2017-01-10T10:38:39 1258 LA
5 2 46 53 diesel 2017-01-03T09:38:59 1258 NY
6 2 46 53 diesel 2017-01-06T11:19:09 1258 NY
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论