Python 将 XML 转换为数据框中的标签中的标签。

huangapple go评论101阅读模式
英文:

Python xml to dataframe tag in tags

问题

我有以下的xml文件:

  1. <pdv_liste>
  2. <pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
  3. <city>LA</city>
  4. <price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
  5. <price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
  6. <price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
  7. <price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
  8. <price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
  9. </pdv>
  10. <pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
  11. <city>NY</city>
  12. <price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
  13. <price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
  14. </pdv>
  15. </pdv_liste>

我想要获取以下的数据框:

  1. id latitude longitude city name maj value
  2. 10 46 52 LA diesel 2017-01-02T09:37:03 1258
  3. 10 46 52 LA diesel 2017-01-03T09:54:58 1258
  4. 10 46 52 LA diesel 2017-01-06T12:33:57 1258
  5. 10 46 52 LA diesel 2017-01-09T08:59:53 1258
  6. 10 46 52 LA diesel 2017-01-10T10:38:39 1258
  7. 2 46 53 NY diesel 2017-01-03T09:38:59 1258
  8. 2 46 53 NY diesel 2017-01-06T11:19:09 1258

我尝试了以下的代码:

  1. df = pd.read_xml("myfile.xml", xpath="//price")

但我得到了以下我不想要的数据框:

  1. name id maj value
  2. diesel 1 2017-01-02T09:37:03 1258
  3. diesel 1 2017-01-03T09:54:58 1258
  4. diesel 1 2017-01-06T12:33:57 1258
  5. diesel 1 2017-01-09T08:59:53 1258
  6. diesel 1 2017-01-10T10:38:39 1258
  7. diesel 1 2017-01-03T09:38:59 1258
  8. diesel 1 2017-01-06T11:19:09 1258

有多个pdv。我应该如何更改我的代码?

英文:

I have the following xml file:

  1. <pdv_liste>
  2. <pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
  3. <city>LA</city>
  4. <price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
  5. <price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
  6. <price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
  7. <price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
  8. <price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
  9. </pdv>
  10. <pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
  11. <city>NY</city>
  12. <price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
  13. <price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
  14. </pdv>
  15. </pdv_liste>

I want to obtain the following dataframe:

  1. id latitude longitude city name maj value
  2. 10 46 52 LA diesel 2017-01-02T09:37:03 1258
  3. 10 46 52 LA diesel 2017-01-03T09:54:58 1258
  4. 10 46 52 LA diesel 2017-01-06T12:33:57 1258
  5. 10 46 52 LA diesel 2017-01-09T08:59:53 1258
  6. 10 46 52 LA diesel 2017-01-10T10:38:39 1258
  7. 2 46 53 NY diesel 2017-01-03T09:38:59 1258
  8. 2 46 53 NY diesel 2017-01-06T11:19:09 1258

I tried the following code:

  1. df = pd.read_xml("myfile.xml", xpath="//price")

But I obtain the following dataframe that I do not want:

  1. name id maj value
  2. diesel 1 2017-01-02T09:37:03 1258
  3. diesel 1 2017-01-03T09:54:58 1258
  4. diesel 1 2017-01-06T12:33:57 1258
  5. diesel 1 2017-01-09T08:59:53 1258
  6. diesel 1 2017-01-10T10:38:39 1258
  7. diesel 1 2017-01-03T09:38:59 1258
  8. diesel 1 2017-01-06T11:19:09 1258

There are multiple pdv. How I should change my code?

答案1

得分: 1

以下是使用[lxml]的一个选项:

  1. from lxml import etree
  2. tree = etree.parse("myfile.xml")
  3. df = pd.DataFrame(
  4. [
  5. {
  6. "id": pdv.attrib["id"],
  7. "latitude": pdv.attrib["latitude"],
  8. "longitude": pdv.attrib["longitude"],
  9. "city": pdv.findtext("city"),
  10. "name": price.attrib["name"],
  11. "maj": price.attrib["maj"],
  12. "value": price.attrib["value"]
  13. }
  14. for pdv in tree.getroot().findall(".//pdv")
  15. for price in pdv.findall("price")
  16. ]
  17. )

输出:

  1. print(df)
  2. id latitude longitude city name maj value
  3. 0 10 46 52 LA diesel 2017-01-02T09:37:03 1258
  4. 1 10 46 52 LA diesel 2017-01-03T09:54:58 1258
  5. 2 10 46 52 LA diesel 2017-01-06T12:33:57 1258
  6. 3 10 46 52 LA diesel 2017-01-09T08:59:53 1258
  7. 4 10 46 52 LA diesel 2017-01-10T10:38:39 1258
  8. 5 2 46 53 NY diesel 2017-01-03T09:38:59 1258
  9. 6 2 46 53 NY diesel 2017-01-06T11:19:09 1258

如果你想使用[excel]来解析/保存电子表格以使用read_excel

打开一个新的工作簿,选择Open/Browse/myfile.xml,然后点击OK

Python 将 XML 转换为数据框中的标签中的标签。

之后,再次点击OK以便Excel尝试推断模式:

Python 将 XML 转换为数据框中的标签中的标签。

最后,将创建一个新的电子表格/表格:

Python 将 XML 转换为数据框中的标签中的标签。

英文:

Here is one option with [tag:lxml] :

  1. from lxml import etree
  2. tree = etree.parse("myfile.xml")
  3. df = pd.DataFrame(
  4. [
  5. {
  6. "id": pdv.attrib["id"],
  7. "latitude": pdv.attrib["latitude"],
  8. "longitude": pdv.attrib["longitude"],
  9. "city": pdv.findtext("city"),
  10. "name": price.attrib["name"],
  11. "maj": price.attrib["maj"],
  12. "value": price.attrib["value"]
  13. }
  14. for pdv in tree.getroot().findall(".//pdv")
  15. for price in pdv.findall("price")
  16. ]
  17. )

Output :

  1. print(df)
  2. id latitude longitude city name maj value
  3. 0 10 46 52 LA diesel 2017-01-02T09:37:03 1258
  4. 1 10 46 52 LA diesel 2017-01-03T09:54:58 1258
  5. 2 10 46 52 LA diesel 2017-01-06T12:33:57 1258
  6. 3 10 46 52 LA diesel 2017-01-09T08:59:53 1258
  7. 4 10 46 52 LA diesel 2017-01-10T10:38:39 1258
  8. 5 2 46 53 NY diesel 2017-01-03T09:38:59 1258
  9. 6 2 46 53 NY diesel 2017-01-06T11:19:09 1258

If you are tempted to use [tag:excel] to parse/save a spreadsheet in order to use read_excel :

Open a new workbook, choose Open/Browse/myfile.xml, then click OK :

Python 将 XML 转换为数据框中的标签中的标签。

After that, confirm with another OK so Excel tries to infer the schema :

Python 将 XML 转换为数据框中的标签中的标签。

And finally, a new spreadsheet/table will be created :

Python 将 XML 转换为数据框中的标签中的标签。

答案2

得分: 1

根据您提供的代码和输出,以下是翻译好的部分:

  1. import pandas as pd
  2. x_s = '''<?xml version="1.0" encoding="UTF-8"?>
  3. <pdv_liste>
  4. <pdv id="10" latitude="46" longitude="52" cp="01000" pop="R">
  5. <city>LA</city>
  6. <price name="diesel" id="1" maj="2017-01-02T09:37:03" value="1258"/>
  7. <price name="diesel" id="1" maj="2017-01-03T09:54:58" value="1258"/>
  8. <price name="diesel" id="1" maj="2017-01-06T12:33:57" value="1258"/>
  9. <price name="diesel" id="1" maj="2017-01-09T08:59:53" value="1258"/>
  10. <price name="diesel" id="1" maj="2017-01-10T10:38:39" value="1258"/>
  11. </pdv>
  12. <pdv id="2" latitude="46" longitude="53" cp="01000" pop="R">
  13. <city>NY</city>
  14. <price name="diesel" id="1" maj="2017-01-03T09:38:59" value="1258"/>
  15. <price name="diesel" id="1" maj="2017-01-06T11:19:09" value="1258"/>
  16. </pdv>
  17. </pdv_liste>'''
  18. df_style = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
  19. <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  20. <xsl:template match="/pdv_liste">
  21. <xsl:variable name="var1_initial" select="."/>
  22. <result>
  23. <xsl:for-each select="pdv">
  24. <xsl:variable name="var2_level" select="."/>
  25. <xsl:for-each select="price">
  26. <xsl:variable name="var3_level" select="."/>
  27. <ROW>
  28. <id>
  29. <xsl:value-of select="$var2_level/@id"/>
  30. </id>
  31. <latitude>
  32. <xsl:value-of select="$var2_level/@latitude"/>
  33. </latitude>
  34. <longitude>
  35. <xsl:value-of select="$var2_level/@longitude"/>
  36. </longitude>
  37. <name>
  38. <xsl:value-of select="$var3_level/@name"/>
  39. </name>
  40. <maj>
  41. <xsl:value-of select="$var3_level/@maj"/>
  42. </maj>
  43. <value>
  44. <xsl:value-of select="$var3_level/@value"/>
  45. </value>
  46. <city>
  47. <xsl:value-of select="$var2_level"/>
  48. </city>
  49. </ROW>
  50. </xsl:for-each>
  51. </xsl:for-each>
  52. </result>
  53. </xsl:template>
  54. </xsl:stylesheet>'''
  55. df = pd.read_xml(x_s, stylesheet=df_style)
  56. df

作为结果,我得到了:

  1. id latitude longitude name maj value city
  2. 0 10 46 52 diesel 2017-01-02T09:37:03 1258 LA
  3. 1 10 46 52 diesel 2017-01-03T09:54:58 1258 LA
  4. 2 10 46 52 diesel 2017-01-06T12:33:57 1258 LA
  5. 3 10 46 52 diesel 2017-01-09T08:59:53 1258 LA
  6. 4 10 46 52 diesel 2017-01-10T10:38:39 1258 LA
  7. 5 2 46 53 diesel 2017-01-03T09:38:59 1258 NY
  8. 6 2 46 53 diesel 2017-01-06T11:19:09 1258 NY
英文:

pandas documentation:

> for more complex XML documents, stylesheet allows you to temporarily
> redesign original document with XSLT (a special purpose language) for
> a flatter version for migration to a DataFrame.

So one of solutions is to denormalize your xml, using xslt transformation stylesheet. Here the code how to use it with pandas:

  1. import pandas as pd
  2. x_s = &#39;&#39;&#39;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
  3. &lt;pdv_liste&gt;
  4. &lt;pdv id=&quot;10&quot; latitude=&quot;46&quot; longitude=&quot;52&quot; cp=&quot;01000&quot; pop=&quot;R&quot;&gt;
  5. &lt;city&gt;LA&lt;/city&gt;
  6. &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-02T09:37:03&quot; value=&quot;1258&quot;/&gt;
  7. &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-03T09:54:58&quot; value=&quot;1258&quot;/&gt;
  8. &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-06T12:33:57&quot; value=&quot;1258&quot;/&gt;
  9. &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-09T08:59:53&quot; value=&quot;1258&quot;/&gt;
  10. &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-10T10:38:39&quot; value=&quot;1258&quot;/&gt;
  11. &lt;/pdv&gt;
  12. &lt;pdv id=&quot;2&quot; latitude=&quot;46&quot; longitude=&quot;53&quot; cp=&quot;01000&quot; pop=&quot;R&quot;&gt;
  13. &lt;city&gt;NY&lt;/city&gt;
  14. &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-03T09:38:59&quot; value=&quot;1258&quot;/&gt;
  15. &lt;price name=&quot;diesel&quot; id=&quot;1&quot; maj=&quot;2017-01-06T11:19:09&quot; value=&quot;1258&quot;/&gt;
  16. &lt;/pdv&gt;
  17. &lt;/pdv_liste&gt;&#39;&#39;&#39;
  18. df_style = &#39;&#39;&#39;&lt;xsl:stylesheet version=&quot;1.0&quot; xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot; xmlns:xs=&quot;http://www.w3.org/2001/XMLSchema&quot; exclude-result-prefixes=&quot;xs&quot;&gt;
  19. &lt;xsl:output method=&quot;xml&quot; encoding=&quot;UTF-8&quot; indent=&quot;yes&quot;/&gt;
  20. &lt;xsl:template match=&quot;/pdv_liste&quot;&gt;
  21. &lt;xsl:variable name=&quot;var1_initial&quot; select=&quot;.&quot;/&gt;
  22. &lt;result&gt;
  23. &lt;xsl:for-each select=&quot;pdv&quot;&gt;
  24. &lt;xsl:variable name=&quot;var2_level&quot; select=&quot;.&quot;/&gt;
  25. &lt;xsl:for-each select=&quot;price&quot;&gt;
  26. &lt;xsl:variable name=&quot;var3_level&quot; select=&quot;.&quot;/&gt;
  27. &lt;ROW&gt;
  28. &lt;id&gt;
  29. &lt;xsl:value-of select=&quot;$var2_level/@id&quot;/&gt;
  30. &lt;/id&gt;
  31. &lt;latitude&gt;
  32. &lt;xsl:value-of select=&quot;$var2_level/@latitude&quot;/&gt;
  33. &lt;/latitude&gt;
  34. &lt;longitude&gt;
  35. &lt;xsl:value-of select=&quot;$var2_level/@longitude&quot;/&gt;
  36. &lt;/longitude&gt;
  37. &lt;name&gt;
  38. &lt;xsl:value-of select=&quot;$var3_level/@name&quot;/&gt;
  39. &lt;/name&gt;
  40. &lt;maj&gt;
  41. &lt;xsl:value-of select=&quot;$var3_level/@maj&quot;/&gt;
  42. &lt;/maj&gt;
  43. &lt;value&gt;
  44. &lt;xsl:value-of select=&quot;$var3_level/@value&quot;/&gt;
  45. &lt;/value&gt;
  46. &lt;city&gt;
  47. &lt;xsl:value-of select=&quot;$var2_level&quot;/&gt;
  48. &lt;/city&gt;
  49. &lt;/ROW&gt;
  50. &lt;/xsl:for-each&gt;
  51. &lt;/xsl:for-each&gt;
  52. &lt;/result&gt;
  53. &lt;/xsl:template&gt;
  54. &lt;/xsl:stylesheet&gt;&#39;&#39;&#39;
  55. df = pd.read_xml(x_s, stylesheet=df_style)
  56. df

As result I got:

  1. id latitude longitude name maj value city
  2. 0 10 46 52 diesel 2017-01-02T09:37:03 1258 LA
  3. 1 10 46 52 diesel 2017-01-03T09:54:58 1258 LA
  4. 2 10 46 52 diesel 2017-01-06T12:33:57 1258 LA
  5. 3 10 46 52 diesel 2017-01-09T08:59:53 1258 LA
  6. 4 10 46 52 diesel 2017-01-10T10:38:39 1258 LA
  7. 5 2 46 53 diesel 2017-01-03T09:38:59 1258 NY
  8. 6 2 46 53 diesel 2017-01-06T11:19:09 1258 NY

huangapple
  • 本文由 发表于 2023年6月5日 17:58:47
  • 转载请务必保留本文链接:https://go.coder-hub.com/76405289.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定