2023-06-22 19:40:16

Reading multiple XML in a single file

Question

I need to collect multiple pieces of short XML from a variety of places, all with similar format, in Python. The catch is that they will be presented to me all in one file:

<infodump>
   ...
</infodump>
<infodump>
   ...
</infodump>

and so on. I can find many examples of parsing XML spread across multiple files (iterating over all the files in a directory, for example), and examples of merging XML into a single file (with tools such as xmlmerge), but I have yet to find anything on parsing multiple XML documents in the same file.

I tried the usual approach:

tree = ET.parse("./id.xml")
root = tree.getroot()

but it chokes on the second XML in the file:

Traceback (most recent call last):
    File "C:\Users\dlevey\PycharmProjects\reconcile\main.py", line 27, in <module>
      main()
    File "C:\Users\dlevey\PycharmProjects\reconcile\main.py", line 5, in main
      tree = ET.parse("./id.xml")
           ^^^^^^^^^^^^^^^^^^^^^
    File "C:\Program Files\Python311\Lib\xml\etree\ElementTree.py", line 1218, in parse
      tree.parse(source, parser)
    File "C:\Program Files\Python311\Lib\xml\etree\ElementTree.py", line 580, in parse
      self._root = parser._parse_whole(source)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    xml.etree.ElementTree.ParseError: junk after document element: line 25, column 0

I'd really rather not split them back into multiple files, or request that they all be sent as individual files, since some of them arrive from the source as a single file and I'll be processing thousands of XML samples at once.

Answer 1

Score: 1

As hinted in the comments, your input is not well-formed XML (well-formedness is a separate requirement from validity) because it doesn't have a single root element.

The easiest way to solve this is to wrap the input in a root element before parsing.

Here's a simple example that wraps the input XML fragments in an input element:

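import xml.etree.ElementTree as ET

# Read the whole file; it holds several sibling <infodump> documents.
with open("id.xml", "r") as input_file:
    content = input_file.read()

# Wrap the fragments in a single root element so the text is well-formed.
well_formed_input = f"<input>{content}</input>"
root = ET.fromstring(well_formed_input)

# Each original document is now a direct child of the wrapper element.
for elem in root.findall("infodump"):
    print(elem)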

This prints the following output:

<Element 'infodump' at 0x0000020B5FFB8400>
<Element 'infodump' at 0x0000020B5FFB84A0>
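
Since the question mentions thousands of samples per file, reading everything into one string may not always be comfortable. The same wrapping idea can probably be applied incrementally with ET.XMLPullParser. This is only a sketch under assumptions: iter_fragments, the wrapper element name, and chunk_size are made-up names for illustration, and it assumes the fragments carry no XML declarations and that infodump elements are not nested inside each other.

```python
import io
import xml.etree.ElementTree as ET


def iter_fragments(stream, tag="infodump", chunk_size=64 * 1024):
    """Yield each <tag> element from a text stream of sibling XML fragments.

    Hypothetical helper: a synthetic <wrapper> root is fed around the real
    data so the parser sees one well-formed document.
    """
    parser = ET.XMLPullParser(events=("end",))
    parser.feed("<wrapper>")  # synthetic opening root tag
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        # Hand out every fragment whose closing tag has been parsed so far.
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield elem
    parser.feed("</wrapper>")  # synthetic closing root tag
    parser.close()  # flush the parser; pending events remain readable
    for _event, elem in parser.read_events():
        if elem.tag == tag:
            yield elem


# Usage with an in-memory stand-in for open("id.xml", "r"); the <id> child
# is an invented placeholder for the elided fragment content.
sample = io.StringIO(
    "<infodump><id>1</id></infodump>\n<infodump><id>2</id></infodump>\n"
)
ids = [elem.findtext("id") for elem in iter_fragments(sample, chunk_size=7)]
print(ids)
```

Note that the parsed elements stay attached to the wrapper root, so memory still grows with the file unless each element is cleared after it has been processed.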

Answer 2

Score: 0

As an alternative to Daniel's suggestion, you can use an HTML parser:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called for every opening tag, e.g. <infodump>.
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)

    # Called for every closing tag, e.g. </infodump>.
    def handle_endtag(self, tag):
        print("End tag :", tag)

    # Called for the text between tags, including whitespace.
    def handle_data(self, data):
        print("Some data  :", data)

parser = MyHTMLParser()
parser.feed("""<infodump>
   ...
</infodump>
<infodump>
   ...
</infodump>""")

Output:

Start tag: infodump
Some data  : 
   ...
End tag : infodump
Some data  : 
Start tag: infodump
Some data  : 
   ...
End tag : infodump
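
To get from printed events to usable records, the handlers can accumulate state instead of printing. The following is only a rough sketch: InfodumpCollector and the id and name child tags are invented for illustration (the question elides the real content with "..."), and it assumes each infodump holds flat, non-nested children. Keep in mind that HTMLParser lowercases tag names and does not check well-formedness, so it is more forgiving than an XML parser.

```python
from html.parser import HTMLParser


class InfodumpCollector(HTMLParser):
    """Collect each <infodump> as a dict of child tag -> text (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.records = []      # one dict per completed <infodump>
        self._current = None   # record being built, or None outside a fragment
        self._field = None     # child tag whose text is being read

    def handle_starttag(self, tag, attrs):
        if tag == "infodump":
            self._current = {}
        elif self._current is not None:
            self._field = tag

    def handle_endtag(self, tag):
        if tag == "infodump" and self._current is not None:
            self.records.append(self._current)
            self._current = None
        elif tag == self._field:
            self._field = None

    def handle_data(self, data):
        # Ignore whitespace between fragments and between child elements.
        if self._current is not None and self._field is not None:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data


collector = InfodumpCollector()
collector.feed(
    "<infodump><id>1</id><name>a</name></infodump>\n"
    "<infodump><id>2</id></infodump>"
)
collector.close()
print(collector.records)
```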
