Reading multiple XML in a single file
Question
I need to collect multiple pieces of short XML from a variety of places, all with similar format, in Python. The catch is that they will be presented to me all in one file:
<infodump>
...
</infodump>
<infodump>
...
</infodump>
and so on. I can find many examples of parsing XML spread across multiple files (iterating over all the files in a directory, for example), and examples of merging into a single file (with tools such as xmlmerge), but I have yet to find anything on parsing multiple XML documents in the same file.
I tried the usual approach:
tree = ET.parse("./id.xml")
root = tree.getroot()
but it chokes on the second XML in the file:
Traceback (most recent call last):
  File "C:\Users\dlevey\PycharmProjects\reconcile\main.py", line 27, in <module>
    main()
  File "C:\Users\dlevey\PycharmProjects\reconcile\main.py", line 5, in main
    tree = ET.parse("./id.xml")
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\xml\etree\ElementTree.py", line 1218, in parse
    tree.parse(source, parser)
  File "C:\Program Files\Python311\Lib\xml\etree\ElementTree.py", line 580, in parse
    self._root = parser._parse_whole(source)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
xml.etree.ElementTree.ParseError: junk after document element: line 25, column 0
I'd really rather not split them back into multiple files, or request that they all be sent as individual files, because some of them arrive from the source as a single file, and I'll be processing thousands of XML samples at once.
Answer 1
Score: 1
As hinted in the comments, your input is not well-formed (see here for details on well-formed vs. valid XML) because it doesn't have a single root element.
The easiest way to solve this is to wrap the input in a root element before parsing.
Here's a simple example that wraps the input XML fragments in an <input> element:
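import xml.etree.ElementTree as ET

# Read the whole file; it holds several sibling <infodump> fragments.
with open("id.xml", "r") as input_file:
    content = input_file.read()

# Wrap the fragments in a single root element so the text is well-formed.
well_formed_input = f"<input>{content}</input>"
root = ET.fromstring(well_formed_input)

# Each original fragment is now a direct child of <input>.
for elem in root.findall("infodump"):
    print(elem)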
This prints the following output:
<Element 'infodump' at 0x0000020B5FFB8400>
<Element 'infodump' at 0x0000020B5FFB84A0>
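If the file is large, the same wrapping idea can be applied incrementally instead of building one big string. The following is only a sketch: the helper name iter_fragments, the chunk size, and the sample data are made up for illustration, and it assumes the fragments are not nested inside each other and do not carry their own <?xml ...?> declarations (a declaration in the middle of the wrapped text would still be a parse error).

```python
import io
import xml.etree.ElementTree as ET


def iter_fragments(stream, tag="infodump", chunk_size=65536):
    """Yield every <tag> element from a text stream of sibling XML fragments.

    Hypothetical helper: feeds a synthetic <input> root around the stream
    so the pull parser sees one well-formed document.
    """
    parser = ET.XMLPullParser(events=("end",))
    parser.feed("<input>")  # synthetic opening root tag
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        # Hand out any fragments that were completed by this chunk.
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield elem
    parser.feed("</input>")  # synthetic closing root tag
    parser.close()
    # Events still queued when the parser is closed can be read afterwards.
    for _event, elem in parser.read_events():
        if elem.tag == tag:
            yield elem


# Made-up sample data standing in for the contents of id.xml.
sample = io.StringIO(
    "<infodump><id>1</id></infodump>\n<infodump><id>2</id></infodump>\n"
)
ids = [elem.findtext("id") for elem in iter_fragments(sample)]
print(ids)
```

With a real file, the stream would be the object returned by open("id.xml", "r"). Note that the parser still keeps the elements attached to the synthetic root, so this avoids the large intermediate string but not the in-memory tree.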
Answer 2
Score: 0
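As an alternative to Daniel's suggestion, you can use an HTML parser, which does not require a single root element:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called for every opening tag, e.g. <infodump>.
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)

    # Called for every closing tag, e.g. </infodump>.
    def handle_endtag(self, tag):
        print("End tag :", tag)

    # Called for the text between tags, including whitespace.
    def handle_data(self, data):
        print("Some data :", data)

parser = MyHTMLParser()
parser.feed("""<infodump>
...
</infodump>
<infodump>
...
</infodump>""")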
Output:
Start tag: infodump
Some data :
...
End tag : infodump
Some data :
Start tag: infodump
Some data :
...
End tag : infodump
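To actually collect the fragments rather than print events, the handlers can accumulate text per element. This is a rough sketch, not part of the original answer: the class name InfodumpCollector and the sample strings are invented for illustration. It also shows one caveat of this approach: html.parser is not XML-aware and lowercases tag names, so <InfoDump> and <infodump> are treated as the same tag.

```python
from html.parser import HTMLParser


class InfodumpCollector(HTMLParser):
    """Hypothetical helper: collect the text of each <infodump> fragment."""

    def __init__(self):
        super().__init__()
        self.fragments = []  # one string per completed fragment
        self._buffer = None  # None while outside an <infodump>

    def handle_starttag(self, tag, attrs):
        if tag == "infodump":
            self._buffer = []

    def handle_data(self, data):
        # Ignore text (such as newlines) between fragments.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "infodump" and self._buffer is not None:
            self.fragments.append("".join(self._buffer).strip())
            self._buffer = None


collector = InfodumpCollector()
# Tag names arrive lowercased, so the mixed-case fragment is matched too.
collector.feed("<infodump>first</infodump>\n<InfoDump>second</InfoDump>")
collector.close()
print(collector.fragments)
```

If the case of tag names or strict XML rules matter for your data, the root-wrapping approach with ElementTree is probably the safer choice.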