May 11, 2023 11:00:40

parsing XML within HTML using python

Question

I have an HTML file that contains XML at the bottom, enclosed in a comment. It looks like this:

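<!DOCTYPE html>
<html>
<head>
	***
</head>
<body>
    <div class="panel panel-primary call__report-modal-panel">
        <div class="panel-heading text-center custom-panel-heading">
            <h2>Report</h2>
        </div>
        <div class="panel-body">
            <div class="panel panel-default">
                <div class="panel-heading">
                    <div class="panel-title">Info</div>
                </div>
                <div class="panel-body">
                    <table class="table table-bordered table-page-break-auto table-layout-fixed">
                        <tr>
                            <td class="col-sm-4">ID</td>
                            <td class="col-sm-8">1</td>
                        </tr>

            </table>
        </div>
    </div>
</body>
</html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>

</ROOTTAG>
-->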

The requirement is to parse the XML that sits inside the comment in the HTML above. So far I have read the HTML file into a string and done the following:

with open('my_html.html', 'rb') as file:
    d = str(file.read())
    d2 = d[d.index('<!--') + 4:d.index('-->')]
    d3 = "'''" + d2 + "'''"

This returns the XML portion in the string d3, wrapped in triple single quotes.

I then try to read it via ElementTree:

ET.fromstring(d3)

But it is failing with the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2

I need help with the following:

1. Read the HTML.
2. Extract the commented-out XML snippet at the bottom of the HTML.
3. Pass that string to ET.fromstring(). My understanding is that this function takes a triple-quoted string, but the string is not being formatted properly, hence the error.


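Answer 1

Score: 3

With the built-in html.parser module from the standard library, the XML comment arrives as a string in handle_comment(), which you can then parse with xml.etree.ElementTree: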

from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class MyHTMLParser(HTMLParser):
        
    def handle_comment(self, data):
        xml_str = data
        tree = ET.fromstring(xml_str)
        for elem in tree.iter():
            print(elem.tag, elem.text)

parser = MyHTMLParser()

with open("your.html", "r") as f:
    lines = f.readlines()
    
for line in lines:
    parser.feed(line)

Output:

ROOTTAG 
  
mytag 
    
headername BASE
fieldname NAME
val Testcase
mytag 
    
headername BASE
fieldname AGE
val 5
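
The handler above tries to parse every comment it sees, so a page containing an ordinary comment would raise a ParseError. A minimal sketch of a more defensive variant follows; the class name XMLCommentParser, the roots attribute, and the inline sample are hypothetical and only serve as illustration. It skips comments that do not look like markup and collects the parsed roots instead of printing inside the handler.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class XMLCommentParser(HTMLParser):
    """Collects every HTML comment that parses as XML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.roots = []  # parsed root elements, one per XML comment

    def handle_comment(self, data):
        text = data.strip()
        if not text.startswith("<"):
            return  # ordinary comment, not XML
        try:
            self.roots.append(ET.fromstring(text))
        except ET.ParseError:
            pass  # looked like markup but is not well-formed XML


# Shortened stand-in for the HTML file from the question.
sample = """<html><body><!-- a plain remark --><p>Report</p></body></html>
<!--<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
</ROOTTAG>
-->"""

parser = XMLCommentParser()
parser.feed(sample)  # the whole document can be fed in one call
parser.close()

root = parser.roots[0]
pairs = [(m.findtext("fieldname"), m.findtext("val")) for m in root.iter("mytag")]
print(root.tag, pairs)
```

Keeping the parsed elements on the parser object lets the calling code decide what to do with them, which is usually easier to test than printing from inside the callback.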


Answer 2

Score: 1

First, split your HTML and XML apart by reading the file line by line and using str.startswith to find the comment block:

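with open('xmlfile.xml') as fh:
    html, xml = [], []

    for line in fh:
        # stop at the line that opens the comment block
        if line.startswith('<!--'):
            break

        html.append(line)

    # keep the comment-opening line; it also carries the XML declaration
    xml.append(line)

    # keep iterating over the rest of the file
    for line in fh:
        # stop at the line that closes the comment block
        if line.startswith('-->'):
            break
        xml.append(line)

# join the lines, using the 4: slice to strip off the leading '<!--';
# the closing root tag is already among the collected lines, so appending
# another one would make the XML invalid
xml = ''.join(xml)[4:]
html = ''.join(html)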

Now you should be able to parse them independently using the parser of your choice.
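
As one possible way to parse the XML half, the sketch below assumes the xml string holds a well-formed document shaped like the one in the question. The helper name fields_from_xml and the inline sample string are hypothetical; the sample stands in for the result of the splitting step.

```python
import xml.etree.ElementTree as ET

# Stand-in for the `xml` string produced by the splitting step (assumption).
xml = """<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>
</ROOTTAG>
"""


def fields_from_xml(xml_text):
    """Map each <fieldname> to its <val> text (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {m.findtext("fieldname"): m.findtext("val") for m in root.iter("mytag")}


fields = fields_from_xml(xml)
print(fields)
```

ElementTree returns CDATA content as ordinary element text, so no special handling is needed for the val elements.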


Answer 3

Score: 1

You were already on the right path. I put your HTML in a file, and the following works fine:

import xml.etree.ElementTree as ET

with open('extract_xml.html') as handle:
    content = handle.read()
    xml = content[content.index('<!--')+4: content.index('-->')]
    document = ET.fromstring(xml)

    for element in document.findall("./mytag"):
        for child in element:
            print(child, child.text)
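
The differences from the attempt in the question are small but matter. Calling str() on bytes returns their repr rather than the decoded text, and triple quotes are only Python source syntax for string literals, not something ET.fromstring() expects inside the data. The sketch below illustrates both points under stated assumptions: the raw bytes are a hypothetical stand-in for what file.read() returns in 'rb' mode, and the windows-1252 codec is taken from the XML declaration and may not match the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bytes returned by file.read() in 'rb' mode.
raw = (b'<html></html>\r\n'
       b'<!--<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>\r\n'
       b'<ROOTTAG><val><![CDATA[5]]></val></ROOTTAG>\r\n-->')

# str() on bytes yields the repr: a b'...' wrapper with newlines turned
# into literal backslash sequences, which is not valid XML content.
as_repr = str(raw)

# Decoding yields the actual text (the codec is an assumption taken from
# the XML declaration; the real file may be encoded differently).
text = raw.decode("windows-1252")
xml_text = text[text.index("<!--") + 4:text.index("-->")]

root = ET.fromstring(xml_text)  # a plain str is enough
value = root.findtext("val")

# Adding literal quote characters places text before the XML declaration,
# so the parser rejects the input.
try:
    ET.fromstring("'''" + xml_text + "'''")
    quoted_ok = True
except ET.ParseError:
    quoted_ok = False

print(value, quoted_ok)
```

Opening the file in text mode, as in the answer above, avoids the manual decode step entirely.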


Answer 4

Score: 1

If you read the file one line at a time, you'll find this easier to manage.

import xml.etree.ElementTree as ET

START_COMMENT = '<!--'
END_COMMENT = '-->'

def getxml(filename):
    with open(filename) as data:
        lines = []
        inxml = False
        for line in data.readlines():
            if inxml:
                if line.startswith(END_COMMENT):
                    break
                lines.append(line)
            elif line.startswith(START_COMMENT):
                inxml = True
        return ''.join(lines)

ET.fromstring(xml := getxml('/Volumes/G-Drive/foo.html'))
print(xml)

Output:

<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>
</ROOTTAG>

