Parsing XML within HTML using Python
Question
I have an HTML file that contains XML at the bottom, enclosed in a comment. It looks like this:
<!DOCTYPE html>
<html>
<head>
***
</head>
<body>
<div class="panel panel-primary call__report-modal-panel">
<div class="panel-heading text-center custom-panel-heading">
<h2>Report</h2>
</div>
<div class="panel-body">
<div class="panel panel-default">
<div class="panel-heading">
<div class="panel-title">Info</div>
</div>
<div class="panel-body">
<table class="table table-bordered table-page-break-auto table-layout-fixed">
<tr>
<td class="col-sm-4">ID</td>
<td class="col-sm-8">1</td>
</tr>
</table>
</div>
</div>
</body>
</html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
<mytag>
<headername>BASE</headername>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
<mytag>
<headername>BASE</headername>
<fieldname>AGE</fieldname>
<val><![CDATA[5]]></val>
</mytag>
</ROOTTAG>
-->
The requirement is to parse the XML that sits inside the comment in the HTML above.
So far I have read the HTML file into a string and done the following:
with open('my_html.html', 'rb') as file:
    d = str(file.read())
d2 = d[d.index('<!--') + 4:d.index('-->')]
d3 = "'''"+d2+"'''"
This returns the XML piece of data in string d3, wrapped in three single quotes.
Then I try to read it via ElementTree:
ET.fromstring(d3)
but it fails with the following error:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2
I need some help to basically:
- Read the HTML
- Take out the snippet with the XML piece that is commented out at the bottom of the HTML
- Take that string and pass it to ET.fromstring(), but since this function takes a string with triple quotes, it is not formatting it properly and hence throws the error
Answer 1
Score: 3
With the built-in html.parser module (docs), you get the XML comment as a string, which you can then parse with xml.etree.ElementTree:
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class MyHTMLParser(HTMLParser):
    # Called once per <!-- ... --> comment; data is the text between the markers
    def handle_comment(self, data):
        xml_str = data
        tree = ET.fromstring(xml_str)
        for elem in tree.iter():
            print(elem.tag, elem.text)

parser = MyHTMLParser()
with open("your.html", "r") as f:
    lines = f.readlines()
    for line in lines:
        # The parser buffers incomplete markup, so feeding line by line is fine
        parser.feed(line)
Output:
ROOTTAG
mytag
headername BASE
fieldname NAME
val Testcase
mytag
headername BASE
fieldname AGE
val 5
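If the page may contain other, ordinary HTML comments, calling ET.fromstring() on every comment would raise on the first non-XML one. A minimal sketch of one way around that, assuming it is acceptable to skip any comment that does not parse as XML; the names CommentCollector and xml_roots_from_comments and the inline sample are made up for illustration:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def xml_roots_from_comments(html_text):
    """Return the root Element of each comment that is well-formed XML."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    roots = []
    for comment in collector.comments:
        try:
            roots.append(ET.fromstring(comment.strip()))
        except ET.ParseError:
            # An ordinary HTML comment, not XML: skip it.
            continue
    return roots


# Inline sample standing in for the real file (assumption for the demo).
sample = """<html><body><!-- just a note --><p>Report</p></body></html>
<!--<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
<mytag>
<headername>BASE</headername>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
</ROOTTAG>
-->"""

roots = xml_roots_from_comments(sample)
for mytag in roots[0].findall("mytag"):
    print(mytag.findtext("fieldname"), mytag.findtext("val"))
```

Collecting the comments first and parsing afterwards also keeps the parsing logic out of the HTMLParser callback, which makes it easier to reuse.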
Answer 2
Score: 1
First, split up your HTML and XML by reading the file line by line and using str.startswith to find the comment block:
with open('xmlfile.xml') as fh:
    html, xml = [], []
    for line in fh:
        # stop at the line that opens the comment block
        if line.startswith('<!--'):
            # keep this line, minus the leading '<!--'
            xml.append(line[4:])
            break
        html.append(line)
    # keep iterating over the same file handle
    for line in fh:
        # stop at the line that closes the comment block
        if line.startswith('-->'):
            break
        xml.append(line)

# The root element is already closed inside the comment,
# so no extra closing tag is needed
xml = ''.join(xml)
html = ''.join(html)
Now you should be able to parse them independently using your parser of choice.
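As a rough illustration of parsing the two pieces independently, the sketch below uses xml.etree.ElementTree for the XML part and a small html.parser subclass for the HTML part. The CellText class and the shortened inline strings are assumptions made for the demo, not part of the original answer:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class CellText(HTMLParser):
    """Collects the text of every <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


# Shortened stand-ins for the html and xml strings produced by the split.
html = '<table><tr><td class="col-sm-4">ID</td><td class="col-sm-8">1</td></tr></table>'
xml = "<ROOTTAG><mytag><fieldname>AGE</fieldname><val><![CDATA[5]]></val></mytag></ROOTTAG>"

# HTML side: pull out the table cells.
cell_parser = CellText()
cell_parser.feed(html)
cell_parser.close()
print(cell_parser.cells)

# XML side: read a value by path.
root = ET.fromstring(xml)
print(root.findtext("mytag/val"))
```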
Answer 3
Score: 1
You were already on the right path. I put your HTML in a file, and it works fine as follows: open the file in text mode and pass the sliced string to ET.fromstring() directly, without str() or extra quotes.
import xml.etree.ElementTree as ET

with open('extract_xml.html') as handle:
    content = handle.read()

# Slice out everything between the comment markers
xml = content[content.index('<!--')+4: content.index('-->')]
document = ET.fromstring(xml)
for element in document.findall("./mytag"):
    for child in element:
        print(child.tag, child.text)
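To see why the attempt in the question fails, it may help to look at what str() does to bytes and what the added quote characters do to the parser. This is a small sketch on an inline byte string rather than the real file; the cp1252 decoding is an assumption taken from the encoding named in the XML declaration:

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the bytes returned by file.read() in 'rb' mode.
raw = b'<!--<?xml version="1.0"?>\n<ROOTTAG><val>5</val></ROOTTAG>\n-->'

# str() on bytes returns the repr: a b'...' wrapper with newlines
# turned into the two characters backslash + n.
as_repr = str(raw)
print(as_repr)

# Decoding (or opening the file in text mode) gives the real text.
text = raw.decode("cp1252")
xml = text[text.index("<!--") + 4 : text.index("-->")]
root = ET.fromstring(xml)
print(root.find("val").text)

# Triple quotes are source-code syntax only. Adding them to the string
# puts literal quote characters in front of the XML, which is not well-formed.
quoted = "'''" + xml + "'''"
quoted_failed = False
try:
    ET.fromstring(quoted)
except ET.ParseError as exc:
    quoted_failed = True
    print("ParseError:", exc)
```

ET.fromstring() takes any ordinary str (or bytes); how the string literal was quoted in source code plays no role.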
Answer 4
Score: 1
If you read the file one line at a time you'll find this easier to manage.
import xml.etree.ElementTree as ET

START_COMMENT = '<!--'
END_COMMENT = '-->'

def getxml(filename):
    with open(filename) as data:
        lines = []
        inxml = False
        for line in data.readlines():
            if inxml:
                # Stop at the line that closes the comment
                if line.startswith(END_COMMENT):
                    break
                lines.append(line)
            elif line.startswith(START_COMMENT):
                # The opening line (with the XML declaration) is skipped
                inxml = True
        return ''.join(lines)

ET.fromstring(xml := getxml('/Volumes/G-Drive/foo.html'))
print(xml)
Output:
<ROOTTAG>
<mytag>
<headername>BASE</headername>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
<mytag>
<headername>BASE</headername>
<fieldname>AGE</fieldname>
<val><![CDATA[5]]></val>
</mytag>
</ROOTTAG>
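Once the XML string has been extracted, the fieldname/val pairs can be gathered into a dictionary for easier lookup. A minimal sketch, assuming every mytag element has exactly one fieldname and one val child as in the sample; the helper name fields_by_name is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Same XML as in the output above, inlined so the example runs on its own.
xml_text = """<ROOTTAG>
<mytag>
<headername>BASE</headername>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
<mytag>
<headername>BASE</headername>
<fieldname>AGE</fieldname>
<val><![CDATA[5]]></val>
</mytag>
</ROOTTAG>"""


def fields_by_name(xml_string):
    """Map each <fieldname> to the text of its sibling <val>."""
    root = ET.fromstring(xml_string)
    return {
        mytag.findtext("fieldname"): mytag.findtext("val")
        for mytag in root.iter("mytag")
    }


fields = fields_by_name(xml_text)
print(fields)
```

The CDATA wrappers disappear during parsing, so the values come back as plain strings; converting '5' to an integer is left to the caller.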