Reading multiple XML in a single file

Question

I need to collect multiple pieces of short XML from a variety of places, all with similar format, in Python. The catch is that they will be presented to me all in one file:

    <infodump>
    ...
    </infodump>
    <infodump>
    ...
    </infodump>

and so on. I can find many examples of parsing XML spread across multiple files (iterating over all the files in a directory, for example), and examples of merging into a single file (with tools such as xmlmerge), but I have yet to find anything on parsing multiple XML documents in the same file.

I tried the usual approach:

    tree = ET.parse("./id.xml")
    root = tree.getroot()

but it chokes on the second XML in the file:

    Traceback (most recent call last):
      File "C:\Users\dlevey\PycharmProjects\reconcile\main.py", line 27, in <module>
        main()
      File "C:\Users\dlevey\PycharmProjects\reconcile\main.py", line 5, in main
        tree = ET.parse("./id.xml")
               ^^^^^^^^^^^^^^^^^^^^^
      File "C:\Program Files\Python311\Lib\xml\etree\ElementTree.py", line 1218, in parse
        tree.parse(source, parser)
      File "C:\Program Files\Python311\Lib\xml\etree\ElementTree.py", line 580, in parse
        self._root = parser._parse_whole(source)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    xml.etree.ElementTree.ParseError: junk after document element: line 25, column 0

I'd really rather not split them into multiple files again, or request that they all be sent as individual files: some of them arrive from the source as a single file, and I'll be processing thousands of XML samples at once.

Answer 1

Score: 1

As hinted in the comments, your input is not well-formed XML (which is a different notion from valid XML) because it doesn't have a single root element.
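To make the root-element point concrete, here is a minimal sketch that reproduces the failure with in-memory strings. The helper name `is_well_formed` and the sample strings are made up for illustration; only `ET.fromstring` and `ET.ParseError` come from the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: one document with a single root,
# and one "file" containing two top-level elements.
one_root = "<infodump>a</infodump>"
two_roots = "<infodump>a</infodump>\n<infodump>b</infodump>"


def is_well_formed(text):
    """Return True if ElementTree can parse text as one XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Raised, for example, when a second top-level element follows the first.
        return False
    return True


print(is_well_formed(one_root))
print(is_well_formed(two_roots))
```

The second string fails for the same reason as the traceback in the question: the parser finishes the first `infodump` element and then finds more content after the document element.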

The easiest way to solve this is to wrap the input in a root element before parsing.

Here's a simple example that wraps the input XML fragments in an input element:

    import xml.etree.ElementTree as ET

    # Read the whole file, which contains several top-level <infodump> elements.
    with open("id.xml", "r") as input_file:
        content = input_file.read()

    # Wrap the fragments in a single root element so the text is well-formed.
    well_formed_input = f"<input>{content}</input>"
    root = ET.fromstring(well_formed_input)
    for elem in root.findall("infodump"):
        print(elem)

This prints the following output:

    <Element 'infodump' at 0x0000020B5FFB8400>
    <Element 'infodump' at 0x0000020B5FFB84A0>
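Since the question mentions thousands of samples per file, the same wrapping idea can also be applied incrementally, without first building one large string. The following is only a sketch under some assumptions: it uses the standard library's `ET.XMLPullParser`, the function name `iter_fragments` and the sample data are hypothetical, and it assumes the file contains no XML declarations (`<?xml ...?>`) between fragments, since a declaration inside the wrapper would itself be a parse error.

```python
import io
import xml.etree.ElementTree as ET


def iter_fragments(stream, chunk_size=64 * 1024):
    """Yield each top-level element from a text stream of concatenated XML fragments."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed("<input>")  # synthetic root element, opened before any file data
    depth = 0

    def drain():
        nonlocal depth
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                # depth == 1 means a direct child of the synthetic root just closed.
                if depth == 1:
                    yield elem

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        yield from drain()

    parser.feed("</input>")  # close the synthetic root
    parser.close()
    yield from drain()


# Hypothetical sample standing in for the contents of id.xml.
sample = io.StringIO("<infodump><id>1</id></infodump>\n<infodump><id>2</id></infodump>\n")
for fragment in iter_fragments(sample, chunk_size=16):
    print(fragment.tag, fragment.findtext("id"))
```

With a real file, the stream would be something like `open("id.xml", "r")`. Note that the yielded elements stay attached to the synthetic root, so for very large inputs it may be worth calling `fragment.clear()` once each one has been processed.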

Answer 2

Score: 0

As an alternative to Daniel's suggestion, you can use an HTML parser, which does not require a single root element:

    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print("Start tag:", tag)

        def handle_endtag(self, tag):
            print("End tag :", tag)

        def handle_data(self, data):
            print("Some data :", data)

    parser = MyHTMLParser()
    parser.feed("""<infodump>
    ...
    </infodump>
    <infodump>
    ...
    </infodump>""")

Output:

    Start tag: infodump
    Some data :
    ...
    End tag : infodump
    Some data :
    Start tag: infodump
    Some data :
    ...
    End tag : infodump
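The handlers above only print events. To actually collect something per fragment, the parser can keep a little state. The sketch below gathers the text content of each `infodump` element into a list; the class name `InfodumpCollector` and the sample input are hypothetical. Keep in mind that `HTMLParser` applies HTML rules rather than XML rules (for example, it lowercases tag names), so this is a lenient approach rather than a real XML parse.

```python
from html.parser import HTMLParser


class InfodumpCollector(HTMLParser):
    """Collect the text inside each <infodump> element (illustrative only)."""

    def __init__(self, tag="infodump"):
        super().__init__()
        self.tag = tag
        self.records = []    # one string per completed fragment
        self._buffer = None  # list of text pieces while inside a fragment

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._buffer = []

    def handle_data(self, data):
        # Ignore text that appears between fragments.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            self.records.append("".join(self._buffer).strip())
            self._buffer = None


collector = InfodumpCollector()
collector.feed("<infodump>first</infodump>\n<infodump>second</infodump>")
collector.close()
print(collector.records)
```

Text from any child elements inside a fragment is concatenated into the same record, which may or may not be what you want; for structured access to children, the wrapping approach from Answer 1 is likely the better fit.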

Posted by huangapple on June 22, 2023 at 19:40:16
Source: https://go.coder-hub.com/76531507.html