Parsing XML within HTML using Python

Question

I have an HTML file that contains XML at the bottom, enclosed in a comment. It looks like this:

    <!DOCTYPE html>
    <html>
    <head>
    ***
    </head>
    <body>
    <div class="panel panel-primary call__report-modal-panel">
    <div class="panel-heading text-center custom-panel-heading">
    <h2>Report</h2>
    </div>
    <div class="panel-body">
    <div class="panel panel-default">
    <div class="panel-heading">
    <div class="panel-title">Info</div>
    </div>
    <div class="panel-body">
    <table class="table table-bordered table-page-break-auto table-layout-fixed">
    <tr>
    <td class="col-sm-4">ID</td>
    <td class="col-sm-8">1</td>
    </tr>
    </table>
    </div>
    </div>
    </body>
    </html>
    <!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
    <ROOTTAG>
    <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
    </mytag>
    <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
    </mytag>
    </ROOTTAG>
    -->

The requirement is to parse the XML that sits inside the comment in the HTML above.
So far I have read the HTML file into a string and done the following:

    with open('my_html.html', 'rb') as file:
        d = str(file.read())
    d2 = d[d.index('<!--') + 4:d.index('-->')]
    d3 = "'''" + d2 + "'''"

This returns the XML portion in the string d3, wrapped in three single quotes.

Then I try to read it via ElementTree:

    ET.fromstring(d3)

But it fails with the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2

I need some help to basically:

  • Read the HTML
  • Take out the snippet with the XML, which is commented out at the bottom of the HTML
  • Take that string and pass it to the ET.fromstring() function; but since this function takes a string with triple quotes, it is not being formatted properly and hence throws the error

Answer 1

Score: 3

With the built-in html.parser module (see the Python docs), the HTMLParser class hands you the comment as a string in handle_comment(), which you can then parse with xml.etree.ElementTree:

    from html.parser import HTMLParser
    import xml.etree.ElementTree as ET

    class MyHTMLParser(HTMLParser):
        # Called once per <!-- ... --> comment; data is the text between the markers
        def handle_comment(self, data):
            xml_str = data
            tree = ET.fromstring(xml_str)
            for elem in tree.iter():
                print(elem.tag, elem.text)

    parser = MyHTMLParser()
    with open("your.html", "r") as f:
        lines = f.readlines()
        # The parser buffers incomplete markup, so feeding line by line is fine
        for line in lines:
            parser.feed(line)

Output:

    ROOTTAG
    mytag
    headername BASE
    fieldname NAME
    val Testcase
    mytag
    headername BASE
    fieldname AGE
    val 5
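
If the page might also contain ordinary comments, a variation on the same idea is to collect every comment first and keep only those that parse as XML. This is a minimal sketch, not part of the answer above: the names CommentCollector and xml_roots_from_html and the inline SAMPLE_HTML are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def xml_roots_from_html(html_text):
    """Return the root element of each comment that parses as XML."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    roots = []
    for comment in collector.comments:
        try:
            roots.append(ET.fromstring(comment.strip()))
        except ET.ParseError:
            # Ordinary comments (notes and the like) are skipped.
            continue
    return roots


# Shortened stand-in for the file in the question (assumed content).
SAMPLE_HTML = """<!DOCTYPE html>
<html><body><!-- just a note --><p>Report</p></body></html>
<!--<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
<mytag><headername>BASE</headername><fieldname>NAME</fieldname><val><![CDATA[Testcase]]></val></mytag>
<mytag><headername>BASE</headername><fieldname>AGE</fieldname><val><![CDATA[5]]></val></mytag>
</ROOTTAG>
-->"""

roots = xml_roots_from_html(SAMPLE_HTML)
# Map each fieldname to its val for the first XML document found.
fields = {m.findtext("fieldname"): m.findtext("val") for m in roots[0].iter("mytag")}
print(fields)
```

Returning the parsed roots instead of printing inside handle_comment() keeps the parsing and the reporting separate, which tends to be easier to reuse.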

Answer 2

Score: 1

First, split your HTML and XML apart by reading the file line by line and using str.startswith() to find the comment block:

    with open('xmlfile.xml') as fh:
        html, xml = [], []

        for line in fh:
            # stop at the line that opens the comment
            if line.startswith('<!--'):
                break
            html.append(line)

        # keep the opening comment line: it also carries the XML declaration
        xml.append(line)

        # keep iterating from where the first loop stopped
        for line in fh:
            # stop at the line that closes the comment
            if line.startswith('-->'):
                break
            xml.append(line)

    # join, using the 4: slice to strip the leading '<!--'
    # (the sample already ends with </ROOTTAG>, so no closing tag is appended)
    xml = ''.join(xml)[4:]
    html = ''.join(html)

Now you should be able to parse them independently using your parser of choice.
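
As an illustration of parsing the two pieces independently, here is a small sketch using only the standard library. The html and xml values are shortened stand-ins for the strings produced by the split, and CellText is a hypothetical helper, not something from the answer.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Stand-ins for the two strings produced by the split (assumed values).
html = '<table><tr><td class="col-sm-4">ID</td><td class="col-sm-8">1</td></tr></table>'
xml = (
    "<ROOTTAG>"
    "<mytag><headername>BASE</headername><fieldname>NAME</fieldname>"
    "<val><![CDATA[Testcase]]></val></mytag>"
    "</ROOTTAG>"
)


class CellText(HTMLParser):
    """Collects the text of every <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only text that appears inside a <td> is kept.
        if self._in_td:
            self.cells[-1] += data


# HTML side: pull the table cells out.
cell_parser = CellText()
cell_parser.feed(html)
cell_parser.close()
print(cell_parser.cells)

# XML side: read each <mytag> as a (fieldname, val) pair.
root = ET.fromstring(xml)
pairs = [(m.findtext("fieldname"), m.findtext("val")) for m in root.iter("mytag")]
print(pairs)
```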

Answer 3

Score: 1

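You were already on the right track. I put your HTML in a file and the following works fine. The differences from your attempt are that the file is opened in text mode, so read() returns a str rather than bytes, and that the extracted string is passed to ET.fromstring() as-is, without adding quotes.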

    import xml.etree.ElementTree as ET

    with open('extract_xml.html') as handle:
        content = handle.read()

    # Everything between the first '<!--' and the first '-->'
    xml = content[content.index('<!--')+4: content.index('-->')]
    document = ET.fromstring(xml)

    for element in document.findall("./mytag"):
        for child in element:
            # child.tag is the element name; printing child itself shows only its repr
            print(child.tag, child.text)
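
To see why the attempt in the question fails, it can help to look at what str() does to bytes. The sketch below uses a made-up, shortened byte string in place of the real file, and it assumes the file is actually encoded as Windows-1252, as the XML declaration claims; if it is not, the codec passed to decode() would need to change.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the bytes returned by open(..., 'rb').read() (assumed content).
raw = b'<!--<?xml version="1.0" encoding="Windows-1252"?>\r\n<ROOTTAG><val>5</val></ROOTTAG>\r\n-->'

# str() on bytes returns the repr: a b'...' wrapper with literal backslash escapes.
as_repr = str(raw)
print(as_repr[:12])

# Slicing that repr and adding quotes, as in the question, is not valid XML.
bad = as_repr[as_repr.index("<!--") + 4 : as_repr.index("-->")]
failed = False
try:
    ET.fromstring("'''" + bad + "'''")
except ET.ParseError as exc:
    failed = True
    print("ParseError:", exc)

# Decoding the bytes gives the real text, which parses without extra quoting.
text = raw.decode("windows-1252")
xml = text[text.index("<!--") + 4 : text.index("-->")]
root = ET.fromstring(xml)
print(root.find("val").text)
```

Triple quotes are only Python source syntax for writing a string literal; ET.fromstring() takes an ordinary str, so nothing needs to be wrapped around the extracted text.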

Answer 4

Score: 1

If you read the file one line at a time you'll find this easier to manage.

    import xml.etree.ElementTree as ET

    START_COMMENT = '<!--'
    END_COMMENT = '-->'

    def getxml(filename):
        with open(filename) as data:
            lines = []
            inxml = False
            for line in data.readlines():
                if inxml:
                    # stop at the line that closes the comment
                    if line.startswith(END_COMMENT):
                        break
                    lines.append(line)
                elif line.startswith(START_COMMENT):
                    # the opening line itself (with the XML declaration) is skipped
                    inxml = True
            return ''.join(lines)

    # The walrus operator (:=) needs Python 3.8 or later
    ET.fromstring(xml := getxml('/Volumes/G-Drive/foo.html'))
    print(xml)

Output:

    <ROOTTAG>
    <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
    </mytag>
    <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
    </mytag>
    </ROOTTAG>
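
Note that this approach discards the whole line that starts with <!--, which is why the XML declaration is missing from the output. If something useful could share that line, one possible variant keeps whatever follows the comment marker. This is only a sketch: getxml_keep_first_line is a hypothetical name, and it reads from any iterable of lines (an io.StringIO here) rather than from a filename.

```python
import io
import xml.etree.ElementTree as ET

START_COMMENT = "<!--"
END_COMMENT = "-->"


def getxml_keep_first_line(lines):
    """Line-based extraction that keeps the text following '<!--' on the opening line."""
    collected = []
    inxml = False
    for line in lines:
        if inxml:
            if line.startswith(END_COMMENT):
                break
            collected.append(line)
        elif line.startswith(START_COMMENT):
            inxml = True
            # Keep the remainder of the opening line instead of dropping it.
            collected.append(line[len(START_COMMENT):])
    return "".join(collected)


# Shortened stand-in for the file in the question (assumed content).
sample = io.StringIO(
    "<html><body>Report</body></html>\n"
    '<!--<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>\n'
    "<ROOTTAG>\n"
    "<mytag><fieldname>AGE</fieldname><val><![CDATA[5]]></val></mytag>\n"
    "</ROOTTAG>\n"
    "-->\n"
)

xml_text = getxml_keep_first_line(sample)
root = ET.fromstring(xml_text)
print(root.tag, root.findtext("mytag/val"))
```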

Posted by huangapple on May 11, 2023, 11:00:40
Please keep this link when reposting: https://go.coder-hub.com/76223870.html