How to Unmarshal XML containing dirty HTML in Go


I have some XML I want to Unmarshal but it contains dirty HTML in a field I don't even care about. I posted an example here:

Is there a way to tell the Decoder to skip or ignore these errors? I tried the non-strict Decoder described in the docs, but no combination of AutoClose and Entity values got it working. I should mention that this XML comes from a third party I have no control over, and its contents always vary, so I'm not sure compiling a static list of elements to skip would be feasible. Adding Description to the struct with the xml:"-" tag makes no difference.

I was able to parse this with Python 2.7, so I hope it is possible in Go too, which I'd prefer for my use case. I am using Google's App Engine for this, so the solution has to be native Go and cannot rely on external C libraries.

Relevant code:

    var XMLData = []byte(`<?xml version="1.0" encoding="UTF-8"?>
    <soapenv:Envelope xmlns:soapenv="" xmlns:xsd="" xmlns:xsi="">
    <soapenv:Body>
    <Container>
    <Timestamp>2014-01-15T21:07:07.217Z</Timestamp>
    <Item>
    <Description>
    <table width="100%" border=0 ><tr><td><table width="100%"><tr><td><!-- Begin Description -->
    <TABLE cellSpacing=27 cellPadding=0 width="100%"><TBODY><TR><TD vAlign=top><P align=center>
    <TABLE cellPadding=15 width="86%" border=1><TBODY><TR><TD><H3><P>
    <H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H><H2><H2>
    <IMG SRC=>
    <BR><BR>
    <IMG SRC=>
    </Description>
    </Item>
    <Container>
    </soapenv:Body>
    </soapenv:Envelope>`)

    type Data struct {
        Timestamp string `xml:"Body>Container>Timestamp"`
    }

    var o Data
    decoder := xml.NewDecoder(bytes.NewBuffer(XMLData))
    decoder.Strict = false
    decoder.AutoClose = xml.HTMLAutoClose
    decoder.Entity = xml.HTMLEntity
    if err := decoder.Decode(&o); err != nil {
        fmt.Println("Error: ", err)
    } else {
        fmt.Println("Timestamp: ", o.Timestamp)
    }

Error: XML syntax error on line 14: expected /> in element

Thank you.


Score: 1




As an alternative to the xml package, if you have libxml2 installed, you can use Gokogiri to harness its parsing flexibility in Go.

For example, evaluating using an XPath:

    package main

    import (
        "fmt"
        ""
        ""
        ""
    )

    func main() {
        var XMLData = []byte(`<?xml version="1.0" encoding="UTF-8"?>
    <soapenv:Envelope xmlns:soapenv="" xmlns:xsd="" xmlns:xsi="">
    <soapenv:Body>
    <Container>
    <Timestamp>2014-01-15T21:07:07.217Z</Timestamp>
    <Item>
    <Description>
    <table width="100%" border=0 ><tr><td><table width="100%"><tr><td><!-- Begin Description -->
    <TABLE cellSpacing=27 cellPadding=0 width="100%"><TBODY><TR><TD vAlign=top><P align=center>
    <TABLE cellPadding=15 width="86%" border=1><TBODY><TR><TD><H3><P>
    <H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H><H2><H2>
    <IMG SRC=>
    <BR><BR>
    <IMG SRC=>
    </Description>
    </Item>
    <Container>
    </soapenv:Body>
    </soapenv:Envelope>`)

        doc, err := gokogiri.ParseXml(XMLData)
        if err != nil {
            fmt.Printf("XML document could not be parsed")
            return
        }

        nxpath := xpath.NewXPath(doc.DocPtr())
        nodes, err := nxpath.Evaluate(doc.DocPtr(), xpath.Compile("//Timestamp"))
        if err != nil {
            fmt.Printf("XPath could not be evaluated")
            return
        }
        if len(nodes) == 0 {
            fmt.Printf("Elements matching XPath not found")
            return
        }

        timestamp := xml.NewNode(nodes[0], doc).InnerHtml()
        fmt.Printf("%s", timestamp) // "2014-01-15T21:07:07.217Z"
    }

This works with Go v1.2 on OS X 10.9.1. The Gokogiri package also includes a CSS selector converter, but I've never used it and can't vouch for it.


Score: 0


Your decoder code is fine (you can actually remove the decoder.AutoClose = xml.HTMLAutoClose line). The problem is that the img tags don't have quotes around the src attributes. See this playground.
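
To make the quoting point concrete: as far as I can tell from the `encoding/xml` source, a non-strict decoder accepts an unquoted attribute value only while it consists of letters, digits, `_`, `:` and `-`, so a bare URL stops at its first `/` and produces the "expected /> in element" error. The sketch below shows the failure and one possible workaround, a naive regular expression that quotes such values before decoding. The `quoteAttrs` and `decode` helpers, the trimmed sample, and the example.com URL are assumptions, not part of the original answer.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"regexp"
)

// raw is a trimmed stand-in for the reply; the image URL is made up.
var raw = []byte(`<Container>
<Timestamp>2014-01-15T21:07:07.217Z</Timestamp>
<Item><Description>
<TABLE cellPadding=15 width="86%" border=1><TR><TD>
<IMG SRC=http://example.com/a.png>
<BR><BR>
</TD></TR></TABLE>
</Description></Item>
</Container>`)

type Data struct {
	Timestamp string `xml:"Timestamp"`
}

// unquotedAttr matches name=value pairs whose value is not quoted.
// It is deliberately naive and can also rewrite text that merely looks
// like an attribute, for example inside an already quoted value.
var unquotedAttr = regexp.MustCompile(`(\s[A-Za-z_:][-A-Za-z0-9_:.]*)=([^\s"'>]+)`)

// quoteAttrs (hypothetical helper) wraps unquoted attribute values in quotes.
func quoteAttrs(b []byte) []byte {
	return unquotedAttr.ReplaceAll(b, []byte(`${1}="${2}"`))
}

// decode uses the same non-strict settings as the question.
func decode(b []byte) (Data, error) {
	var d Data
	dec := xml.NewDecoder(bytes.NewReader(b))
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity
	err := dec.Decode(&d)
	return d, err
}

func main() {
	if _, err := decode(raw); err != nil {
		fmt.Println("raw input:", err)
	}
	d, err := decode(quoteAttrs(raw))
	if err != nil {
		fmt.Println("quoted input:", err)
		return
	}
	fmt.Println("Timestamp:", d.Timestamp)
}
```

A regular expression cannot tell markup from text reliably, so treat this as a stopgap for feeds whose only defect is missing quotes.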


Score: 0






Consider using package - for me, it parsed your sample data just fine.

The problem with this package, as I perceive it, is that it returns a hierarchy of "nodes" (one per HTML element) which you're supposed to traverse. That is, there is no unmarshaling to a struct, at least at first sight. Thus you might have better luck with something like html-query or goquery, which should allow you to query the parsed DOM using the().so().called().fluent().style()...

go-html-transform is yet another possible option.

In other words, my key idea is to treat the whole SOAP reply you're dealing with as HTML, not XML, because that's what it really is, and to hope an HTML parser will be able to cope with it thanks to HTML's laxer formatting rules and consequently more permissive parsers.

  • Posted on 2014-01-24 08:21:12