Reading multiple XML in a single file

Question

In Python, I need to collect many short pieces of XML from a variety of places, all with a similar format. The catch is that they will all be presented to me in one file:

<infodump>
   ...
</infodump>

<infodump>
   ...
</infodump>

and so on. I can find many examples of parsing XML spread across multiple files (iterating over all the files in a directory, for example), and examples of merging XML into a single file (with tools such as xmlmerge), but I have yet to find anything on parsing multiple XML documents that sit in the same file.

I tried the usual approach:

import xml.etree.ElementTree as ET

tree = ET.parse("./id.xml")
root = tree.getroot()

but it chokes on the second XML in the file:

Traceback (most recent call last):
    File "C:\Users\dlevey\PycharmProjects\reconcile\main.py", line 27, in <module>
      main()
    File "C:\Users\dlevey\PycharmProjects\reconcile\main.py", line 5, in main
      tree = ET.parse("./id.xml")
           ^^^^^^^^^^^^^^^^^^^^^
    File "C:\Program Files\Python311\Lib\xml\etree\ElementTree.py", line 1218, in parse
      tree.parse(source, parser)
    File "C:\Program Files\Python311\Lib\xml\etree\ElementTree.py", line 580, in parse
      self._root = parser._parse_whole(source)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    xml.etree.ElementTree.ParseError: junk after document element: line 25, column 0
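
The error does not depend on the file itself. A minimal sketch, assuming two made-up fragments with a hypothetical id child element, reproduces the same kind of failure from a string:

```python
import xml.etree.ElementTree as ET

# Two hypothetical fragments back to back, with no enclosing root element.
sample = "<infodump><id>1</id></infodump>\n\n<infodump><id>2</id></infodump>\n"

try:
    ET.fromstring(sample)
    error_message = None
except ET.ParseError as err:
    # The parser stops where the second top-level element begins.
    error_message = str(err)

print(error_message)
```

The message reported here starts with "junk after document element", as in the traceback above; the line and column depend on the input.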

I'd really rather not split them into multiple files again, or ask that they all be sent as individual files: some of them already arrive from the source as a single file, and I'll be processing thousands of XML samples at once.

Answer 1

Score: 1

As hinted in the comments, your input is not well-formed XML because it doesn't have a single root element. (Well-formedness is a syntax requirement and is separate from validity, which means conforming to a DTD or schema.)

The easiest way to solve this is to wrap the input in a root element before parsing.

Here's a simple example that wraps the input XML fragments in an input element:

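import xml.etree.ElementTree as ET

# Read the whole file; it holds several top-level <infodump> fragments.
with open("id.xml", "r") as input_file:
    content = input_file.read()

# Wrap the fragments in one synthetic root so the text becomes well-formed.
well_formed_input = f"<input>{content}</input>"

root = ET.fromstring(well_formed_input)

# Each original fragment is now a direct child of <input>.
for elem in root.findall("infodump"):
    print(elem)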

This prints the following output:

<Element 'infodump' at 0x0000020B5FFB8400>
<Element 'infodump' at 0x0000020B5FFB84A0>
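
If reading the whole file into memory is a concern (the question mentions thousands of samples), the same wrapping idea can be applied incrementally with ET.XMLPullParser. This is a sketch under assumptions, not a tested recipe for your data: it assumes the fragments carry no XML declaration or DOCTYPE of their own and that infodump elements are never nested. The helper name iter_fragments and the id child element are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_fragments(stream, tag="infodump", chunk_size=64 * 1024):
    """Yield each <tag> element from a text stream of concatenated fragments."""
    parser = ET.XMLPullParser(events=("end",))
    parser.feed("<input>")  # synthetic root; the name is arbitrary
    # Feed the stream piece by piece instead of reading it all at once.
    for chunk in iter(lambda: stream.read(chunk_size), ""):
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield elem
                elem.clear()  # drop the children once the caller has moved on
    parser.feed("</input>")
    parser.close()
    # Drain any events the parser was still holding back.
    for _event, elem in parser.read_events():
        if elem.tag == tag:
            yield elem


# Hypothetical data standing in for open("id.xml", "r").
sample = io.StringIO(
    "<infodump><id>1</id></infodump>\n\n<infodump><id>2</id></infodump>\n"
)
ids = [elem.findtext("id") for elem in iter_fragments(sample, chunk_size=16)]
print(ids)
```

Because each element is cleared as soon as the loop advances, extract what you need inside the loop rather than collecting the elements in a list.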

Answer 2

Score: 0

As an alternative to Daniel's suggestion, you can use an HTML parser, which does not require a single root element:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)

    def handle_endtag(self, tag):
        print("End tag :", tag)

    def handle_data(self, data):
        print("Some data  :", data)

parser = MyHTMLParser()
parser.feed("""<infodump>
   ...
</infodump>

<infodump>
   ...
</infodump>""")

Output:

Start tag: infodump
Some data  : 
   ...

End tag : infodump
Some data  : 


Start tag: infodump
Some data  : 
   ...

End tag : infodump
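
To do more than print events, the handlers can collect each fragment into a record. The sketch below is only illustrative: the class name InfodumpCollector and the name and id child elements are hypothetical, and it assumes each child holds plain text. Note that html.parser lowercases tag names and does not check well-formedness, so it is more forgiving than an XML parser but also less strict about bad input.

```python
from html.parser import HTMLParser


class InfodumpCollector(HTMLParser):
    """Collect the text of each child element of every <infodump> fragment."""

    def __init__(self):
        super().__init__()
        self.records = []      # one dict per <infodump>
        self._current = None   # dict being filled, or None outside a fragment
        self._field = None     # child tag whose text is being read

    def handle_starttag(self, tag, attrs):
        if tag == "infodump":
            self._current = {}
        elif self._current is not None:
            self._field = tag

    def handle_endtag(self, tag):
        if tag == "infodump" and self._current is not None:
            self.records.append(self._current)
            self._current = None
        self._field = None

    def handle_data(self, data):
        # Ignore text outside a child element, such as blank lines.
        if self._current is not None and self._field is not None:
            self._current[self._field] = self._current.get(self._field, "") + data


parser = InfodumpCollector()
parser.feed(
    "<infodump><Name>a</Name><id>1</id></infodump>\n\n"
    "<infodump><Name>b</Name><id>2</id></infodump>"
)
parser.close()
print(parser.records)
```

The Name tags in the sample come back as name, which shows the lowercasing; if tag case matters in your data, the wrapping approach with an XML parser is the safer choice.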

huangapple
  • Posted on June 22, 2023 at 19:40:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76531507.html