Parsing XML within HTML using Python

Question

I have an HTML file that contains XML at the bottom, enclosed in a comment. It looks like this:

<!DOCTYPE html>
<html>
<head>
	***
</head>
<body>
    <div class="panel panel-primary call__report-modal-panel">
        <div class="panel-heading text-center custom-panel-heading">
            <h2>Report</h2>
        </div>
        <div class="panel-body">
            <div class="panel panel-default">
                <div class="panel-heading">
                    <div class="panel-title">Info</div>
                </div>
                <div class="panel-body">
                    <table class="table table-bordered table-page-break-auto table-layout-fixed">
                        <tr>
                            <td class="col-sm-4">ID</td>
                            <td class="col-sm-8">1</td>
                        </tr>

            </table>
        </div>
    </div>
</body>
</html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>

</ROOTTAG>
-->

The requirement is to parse the XML that sits inside the comment in the HTML above.
So far I have read the HTML file into a string and done the following:

with open('my_html.html', 'rb') as file:
    d = str(file.read())
    d2 = d[d.index('<!--') + 4:d.index('-->')]
    d3 = "'''"+d2+"'''"

This returns the XML portion in the string d3, wrapped in three single quotes.

Then I try to read it with ElementTree:

ET.fromstring(d3)

But it fails with the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2

I need some help to basically:

  • Read the HTML
  • Take out the snippet of XML that is commented out at the bottom of the HTML
  • Take that string and pass it to the ET.fromstring() function, but since this function takes a string with triple quotes, it is not formatting it properly and hence throws the error

Answer 1

Score: 3

With the built-in html.parser module (Doc) you get the XML comment as a string, which you can then parse with xml.etree.ElementTree:

from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class MyHTMLParser(HTMLParser):
        
    def handle_comment(self, data):
        xml_str = data
        tree = ET.fromstring(xml_str)
        for elem in tree.iter():
            print(elem.tag, elem.text)

parser = MyHTMLParser()

with open("your.html", "r") as f:
    lines = f.readlines()
    
for line in lines:
    parser.feed(line)

Output:

ROOTTAG 
  
mytag 
    
headername BASE
fieldname NAME
val Testcase
mytag 
    
headername BASE
fieldname AGE
val 5
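
One thing to watch with the parser above: handle_comment() is called for every comment in the page, so an ordinary HTML comment elsewhere in the file would be handed to ET.fromstring() and raise a ParseError. Here is a minimal sketch of one way around that, assuming the embedded XML always begins with an `<?xml` declaration. The names `XMLCommentCollector` and `extract_xml_comments` are made up for this example, and the sample string is a shortened stand-in for the real file.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class XMLCommentCollector(HTMLParser):
    """Collect comments that look like XML documents (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.xml_comments = []

    def handle_comment(self, data):
        # Assumption: the embedded XML starts with an XML declaration.
        if data.lstrip().startswith("<?xml"):
            self.xml_comments.append(data.strip())


def extract_xml_comments(html_text):
    """Return the text of every XML-looking comment in html_text."""
    parser = XMLCommentCollector()
    parser.feed(html_text)
    parser.close()
    return parser.xml_comments


# Shortened stand-in for the HTML file from the question.
sample = """<html><body><!-- plain comment --><p>Report</p></body></html>
<!--<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
</ROOTTAG>
-->"""

for xml_text in extract_xml_comments(sample):
    root = ET.fromstring(xml_text)
    for tag in root.iter("mytag"):
        print(tag.findtext("fieldname"), tag.findtext("val"))
```

Collecting the strings first and parsing them afterwards also keeps the XML handling out of the HTML callback, which makes it easier to decide what to do when a comment turns out not to be valid XML.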

Answer 2

Score: 1

First, split your HTML and XML apart by reading the file line by line and using str.startswith to detect the comment block:

with open('xmlfile.xml') as fh:
    html, xml = [], []

    for line in fh:
        # check for that comment line
        if line.startswith('<!--'):
            break

        html.append(line)

    # append current line
    xml.append(line)

    # keep iterating
    for line in fh:
        # check for ending block comment
        if line.startswith('-->'):
            break
        xml.append(line)

# Join the lines, using the 4: slice to strip off the leading '<!--'.
# With the file from the question, the closing root tag is already among
# the collected lines, so it is not appended a second time (a duplicate
# closing tag would make the XML ill-formed).
xml = ''.join(xml)[4:]
html = ''.join(html)

Now you should be able to parse them independently using your parser of choice.
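
As an illustration of that last step, here is a minimal sketch that uses the standard library's xml.etree.ElementTree on the extracted XML text. It assumes the layout shown in the question (`mytag` elements that each hold a `fieldname` and a `val` child). `fields_from_xml` is a made-up helper name, and the inline `xml_text` stands in for the string produced by the splitting code.

```python
import xml.etree.ElementTree as ET


def fields_from_xml(xml_text):
    """Map each <fieldname> to the text of its sibling <val> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        tag.findtext("fieldname"): tag.findtext("val")
        for tag in root.findall("mytag")
    }


# Stand-in for the XML string obtained after stripping the comment markers.
xml_text = """<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>
</ROOTTAG>
"""

print(fields_from_xml(xml_text))  # {'NAME': 'Testcase', 'AGE': '5'}
```

The CDATA sections come back as plain text, so the values are ordinary strings. Converting 'AGE' to an integer would be a separate step.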

Answer 3

Score: 1

You were already on the right path. I put your HTML in a file, and it works fine as follows.

import xml.etree.ElementTree as ET

with open('extract_xml.html') as handle:
    content = handle.read()
    xml = content[content.index('<!--')+4: content.index('-->')]
    document = ET.fromstring(xml)

    for element in document.findall("./mytag"):
        for child in element:
            print(child, child.text)
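
Why this works while the attempt in the question does not seems to come down to two details. Calling str() on a bytes object returns its repr (text that starts with b' and has newlines escaped as a backslash followed by n), and ET.fromstring() takes an ordinary string, so wrapping it in triple quotes only adds stray quote characters. The sketch below shows both paths side by side on a shortened in-memory stand-in for the file. The cp1252 codec is an assumption taken from the Windows-1252 declaration in the question.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the bytes returned by open(..., 'rb').read().
raw = (
    b'<html></html>\n'
    b'<!--<?xml version="1.0" encoding="Windows-1252"?>\n'
    b'<ROOTTAG><mytag><val>5</val></mytag></ROOTTAG>\n'
    b'-->\n'
)

# What the question's code does: str() of bytes yields the repr of the bytes.
as_repr = str(raw)
broken = "'''" + as_repr[as_repr.index("<!--") + 4:as_repr.index("-->")] + "'''"

# Decoding the bytes (or opening the file in text mode) yields real text.
text = raw.decode("cp1252")
xml_text = text[text.index("<!--") + 4:text.index("-->")]

try:
    ET.fromstring(broken)
except ET.ParseError as exc:
    print("question's version fails:", exc)

print("decoded version parses, root tag:", ET.fromstring(xml_text).tag)
```

Opening the file without 'rb', as in the answer above, has the same effect as the explicit decode, with the encoding chosen by open().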

Answer 4

Score: 1

If you read the file one line at a time you'll find this easier to manage.

import xml.etree.ElementTree as ET

START_COMMENT = '<!--'
END_COMMENT = '-->'

def getxml(filename):
    with open(filename) as data:
        lines = []
        inxml = False
        for line in data.readlines():
            if inxml:
                if line.startswith(END_COMMENT):
                    break
                lines.append(line)
            elif line.startswith(START_COMMENT):
                inxml = True
        return ''.join(lines)

ET.fromstring(xml := getxml('/Volumes/G-Drive/foo.html'))
print(xml)

Output:

<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>
</ROOTTAG>
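
The line-based approach relies on the comment markers sitting at the very start of a line. Since the question says the XML is at the bottom of the file, a possible alternative is to search backwards for the last comment, which also tolerates indentation and earlier comments in the page. This is only a sketch under that assumption, and `last_comment` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def last_comment(text):
    """Return the body of the last <!-- ... --> block in text (hypothetical helper)."""
    start = text.rfind("<!--")
    end = text.find("-->", start) if start != -1 else -1
    if start == -1 or end == -1:
        raise ValueError("no complete comment found")
    # Skip the 4 characters of '<!--' and trim surrounding whitespace.
    return text[start + 4:end].strip()


# Stand-in input: an ordinary comment first, then an indented XML comment.
sample = "<!-- first --><p>x</p>\n  <!--<ROOTTAG><a>1</a></ROOTTAG>\n  -->\n"

root = ET.fromstring(last_comment(sample))
print(root.tag, root.find("a").text)
```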

Posted by huangapple on 2023-05-11 11:00:40
Original link: https://go.coder-hub.com/76223870.html