Poorly documented SOAP XML endpoint - overview of elements and attributes
Question
I'm new to this and am working (in Python) with an XML list that spans multiple pages. I have no prior knowledge of the XML structure: which elements, attributes, or nested elements exist. The API is poorly documented, and the little documentation there is says almost nothing about which elements and attributes an instance might have.
For example, I don't even know whether there is an ISBN at all. I did not find one in the first 100 pages of the list, but that does not mean that no entity has an ISBN.
So I need to know whether an ISBN exists at all, which means checking whether any book on those 1000+ pages has an 'ISBN' element. If one does, I'll extend the script to fetch that element/value when present.
If none of the books has it, I won't bother. Second, since I don't know how the XML is structured, I don't know whether the ISBN will be an attribute, as below:
<book isbn="9780747532699">
  <title>Harry Potter and the Philosopher's Stone</title>
  <author>J.K. Rowling</author>
  <publicationYear>1997</publicationYear>
  <genre>Fantasy</genre>
  <publisher>Bloomsbury Publishing</publisher>
  <language>English</language>
  <price>19.99</price>
</book>
rather than an element as follows:
<book>
  <isbn>9780747532699</isbn>
  <title>Harry Potter and the Philosopher's Stone</title>
  <author>J.K. Rowling</author>
  <publicationYear>1997</publicationYear>
  <genre>Fantasy</genre>
  <publisher>Bloomsbury Publishing</publisher>
  <language>English</language>
  <price>19.99</price>
</book>
This applies to all elements, and for some of them I have no idea whether they exist at all.
Additionally, some elements will most likely be nested. For multilingual abstracts, I noticed they are indeed nested in a container element 'abstracts'. See below:
<Collection>
  <poetry>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <publicationYear>1925</publicationYear>
  </poetry>
  <book>
    <title>Pride and Prejudice</title>
    <author>Jane Austen</author>
    <publicationYear>1813</publicationYear>
  </book>
  <novel>
    <title>1984</title>
    <author>George Orwell</author>
    <publicationYear>1949</publicationYear>
    <abstracts>
      <abstract it="Italian">Un romanzo distopico che descrive un futuro dominato da un regime autoritario.</abstract>
    </abstracts>
  </novel>
  <journal>
    <title>Some journal</title>
    <editor>George Handsome</editor>
    <publicationYear>1949</publicationYear>
  </journal>
</Collection>
So I need to find out which attributes/elements occur in the list of XML pages and where. I was hoping there would be a way to:
- query the endpoint,
- iterate over the 1000+ pages,
- build some kind of nested structure of elements, children, and attributes so I can clearly see what can be found (if present) and how it is stored,
- write the scripts to fetch the actual elements/attributes based on the deduced structure.
Questions:
- Should I keep exploring BeautifulSoup and ElementTree? Any pointers if so?
- What other solutions/recommendations are there to explore/understand the XML, its elements and the attributes?
- Am I delusional and barking up the wrong tree?
Answer 1
Score: 0
You can search for ISBN as an XML tag or attribute:
import xml.etree.ElementTree as ET

# `file` is the path (or an open file object) of one saved XML page.
# 'end' events are used because elem.text is only guaranteed to be set
# once the element has been fully read.
for event, elem in ET.iterparse(file, events=('end',)):
    if elem.tag == 'isbn':
        print('ISBN Tag:', elem.text)
    if 'isbn' in elem.attrib:
        print('ISBN Attrib:', elem.get('isbn'))
Output:
ISBN Attrib: 9780747532699
ISBN Tag: 9780747532699
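The same streaming idea can be extended from one known name to every name, which is closer to what the question asks for. The following is only a sketch under a few assumptions: `local_name` and `count_names` are hypothetical helper names, the two in-memory pages stand in for saved API responses, and the namespace stripping is there because SOAP responses are usually namespaced and ElementTree then reports tags as `{uri}name`.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(qualified):
    """Drop the '{namespace}' prefix that ElementTree puts on namespaced names."""
    return qualified.rsplit('}', 1)[-1]


def count_names(sources):
    """Count element names and attribute names over several XML documents."""
    tags, attrs = Counter(), Counter()
    for source in sources:
        for _, elem in ET.iterparse(source, events=('end',)):
            tags[local_name(elem.tag)] += 1
            for name in elem.attrib:
                attrs[local_name(name)] += 1
            elem.clear()  # release the content of elements already counted
    return tags, attrs


# Two in-memory "pages" standing in for saved API responses (made-up data).
pages = [
    io.BytesIO(b'<Collection><book isbn="9780747532699"><title>A</title></book></Collection>'),
    io.BytesIO(b'<Collection><book><isbn>9780747532699</isbn><title>B</title></book></Collection>'),
]
tags, attrs = count_names(pages)
print(tags)   # how often each element name occurred
print(attrs)  # how often each attribute name occurred
```

A name that never shows up in either counter after all pages have been processed can be left out of the extraction script.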
Answer 2
Score: 0
One option is to process all of the elements and attributes in the XML and capture the unique XPath for each one. Getting the XPath of an element is easy with lxml.
This basically maps out the entire structure for you, showing not only the element and attribute names but also where they appear in the tree.
Example:
from pprint import pprint
from lxml import etree
sample_xml1 = """
<book isbn="9780747532699">
<title>Harry Potter and the Philosopher's Stone</title>
<author>J.K. Rowling</author>
<publicationYear>1997</publicationYear>
<genre>Fantasy</genre>
<publisher>Bloomsbury Publishing</publisher>
<language>English</language>
<price>19.99</price>
</book>
"""
sample_xml2 = """
<book>
<isbn>9780747532699</isbn>
<title>Harry Potter and the Philosopher's Stone</title>
<author>J.K. Rowling</author>
<publicationYear>1997</publicationYear>
<genre>Fantasy</genre>
<publisher>Bloomsbury Publishing</publisher>
<language>English</language>
<price>19.99</price>
</book>
"""
sample_xml = [sample_xml1, sample_xml2]
analysis_results = set()
for xml in sample_xml:
    tree = etree.ElementTree(etree.fromstring(xml))
    for elem in tree.xpath("//*"):
        # Record the path of every element ...
        xpath = tree.getpath(elem)
        analysis_results.add(xpath)
        # ... and of every attribute it carries.
        for attr in elem.attrib:
            analysis_results.add(f"{xpath}/@{attr}")
pprint(analysis_results)
Printed output:
{'/book',
'/book/@isbn',
'/book/author',
'/book/genre',
'/book/isbn',
'/book/language',
'/book/price',
'/book/publicationYear',
'/book/publisher',
'/book/title'}
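One thing to keep in mind with real pages: `getpath` adds positional indexes such as `book[2]` when an element has same-named siblings, so a page with many records yields one path per record rather than one per kind of element. A possible variation, sketched here with the standard library only, builds index-free paths by hand and counts how often each occurs. `local_name` and `collect_paths` are hypothetical helper names and the sample page is made up from the snippets in the question.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(qualified):
    """Drop the '{namespace}' prefix that ElementTree puts on namespaced names."""
    return qualified.rsplit('}', 1)[-1]


def collect_paths(elem, parent_path, counts):
    """Recursively count index-free paths of elements and attributes."""
    path = f"{parent_path}/{local_name(elem.tag)}"
    counts[path] += 1
    for attr in elem.attrib:
        counts[f"{path}/@{local_name(attr)}"] += 1
    for child in elem:
        collect_paths(child, path, counts)


# Made-up page combining the structures shown in the question.
page = """
<Collection>
  <book isbn="9780747532699"><title>Pride and Prejudice</title></book>
  <novel>
    <title>1984</title>
    <abstracts><abstract it="Italian">Un romanzo distopico.</abstract></abstracts>
  </novel>
  <book><isbn>9780747532699</isbn><title>Some title</title></book>
</Collection>
"""

counts = Counter()
collect_paths(ET.fromstring(page), "", counts)
for path, n in sorted(counts.items()):
    print(n, path)
```

Reusing one `Counter` across all pages gives a single inventory for the whole list, and the counts hint at which elements are optional (for example, fewer `/Collection/book/isbn` entries than `/Collection/book` entries).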