How to Unmarshal XML containing dirty HTML in Go

Question

I have some XML I want to Unmarshal but it contains dirty HTML in a field I don't even care about. I posted an example here: http://play.golang.org/p/caKCAYyXX2

Is there a way I can tell the Decoder to skip or ignore these errors? I tried making a non-strict Decoder as described in the docs, but no combination of AutoClose or Entity values got this working. I should mention that this XML comes from a third party I have no control over, and the contents are always variable, so I'm not sure compiling a static list of elements to skip would be feasible. Adding Description to the struct with the xml:"-" tag makes no difference.

I was able to parse this using Python 2.7, so I hope it is possible in Go as well, which I'd prefer for my use case. I am using Google's App Engine for this, so the solution has to be in native Go and cannot rely on external C libraries.

Relevant code:

    package main

    import (
        "bytes"
        "encoding/xml"
        "fmt"
    )

    // Sample response; the Description field holds the dirty HTML.
    var XMLData = []byte(`<?xml version="1.0" encoding="UTF-8"?>
    <soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <soapenv:Body>
    <Container>
    <Timestamp>2014-01-15T21:07:07.217Z</Timestamp>
    <Item>
    <Description>
    <table width="100%" border=0 ><tr><td><table width="100%"><tr><td><!-- Begin Description -->
    <TABLE cellSpacing=27 cellPadding=0 width="100%"><TBODY><TR><TD vAlign=top><P align=center>
    <TABLE cellPadding=15 width="86%" border=1><TBODY><TR><TD><H3><P>
    <H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H><H2><H2>
    <IMG SRC=http://www.REMOVED.com/simage/j6x516.jpg>
    <BR><BR>
    <IMG SRC=http://www.REMOVED.com/simage/j6x517.jpg>
    </Description>
    </Item>
    <Container>
    </soapenv:Body>
    </soapenv:Envelope>`)

    // Only the timestamp is of interest; Description is deliberately not mapped.
    type Data struct {
        Timestamp string `xml:"Body>Container>Timestamp"`
    }

    func main() {
        var o Data
        // Non-strict decoder configured for HTML-like input, as described in the docs.
        decoder := xml.NewDecoder(bytes.NewBuffer(XMLData))
        decoder.Strict = false
        decoder.AutoClose = xml.HTMLAutoClose
        decoder.Entity = xml.HTMLEntity
        if err := decoder.Decode(&o); err != nil {
            fmt.Println("Error: ", err)
        } else {
            fmt.Println("Timestamp: ", o.Timestamp)
        }
    }

Result:
Error: XML syntax error on line 14: expected /> in element

Thank you.

Answer 1

Score: 1

As an alternative to the xml package, if you have libxml2 installed, you can use Gokogiri to harness its parsing flexibility in Go.

For example, evaluating using an XPath:

    package main

    import (
        "fmt"

        "github.com/moovweb/gokogiri"
        "github.com/moovweb/gokogiri/xml"
        "github.com/moovweb/gokogiri/xpath"
    )

    func main() {
        var XMLData = []byte(`<?xml version="1.0" encoding="UTF-8"?>
    <soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <soapenv:Body>
    <Container>
    <Timestamp>2014-01-15T21:07:07.217Z</Timestamp>
    <Item>
    <Description>
    <table width="100%" border=0 ><tr><td><table width="100%"><tr><td><!-- Begin Description -->
    <TABLE cellSpacing=27 cellPadding=0 width="100%"><TBODY><TR><TD vAlign=top><P align=center>
    <TABLE cellPadding=15 width="86%" border=1><TBODY><TR><TD><H3><P>
    <H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H><H2><H2>
    <IMG SRC=http://www.REMOVED.com/simage/j6x516.jpg>
    <BR><BR>
    <IMG SRC=http://www.REMOVED.com/simage/j6x517.jpg>
    </Description>
    </Item>
    <Container>
    </soapenv:Body>
    </soapenv:Envelope>`)

        doc, err := gokogiri.ParseXml(XMLData)
        if err != nil {
            fmt.Printf("XML document could not be parsed")
            return
        }

        nxpath := xpath.NewXPath(doc.DocPtr())
        nodes, err := nxpath.Evaluate(doc.DocPtr(), xpath.Compile("//Timestamp"))
        if err != nil {
            fmt.Printf("XPath could not be evaluated")
            return
        }
        if len(nodes) == 0 {
            fmt.Printf("Elements matching XPath not found")
            return
        }

        timestamp := xml.NewNode(nodes[0], doc).InnerHtml()
        fmt.Printf("%s", timestamp) // "2014-01-15T21:07:07.217Z"
    }

This works with Go v1.2 on OS X 10.9.1. The Gokogiri package also includes a CSS selector converter, but I've never used it and can't vouch for it.

Answer 2

Score: 0

Your decoder code is fine (you can actually remove the decoder.AutoClose = xml.HTMLAutoClose line). The problem is that the img tags don't have quotes around the src attributes. See this playground.
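To illustrate the point about quoting, here is a minimal sketch that runs the same non-strict decoder settings over two shortened, made-up samples that differ only in whether the SRC value is quoted. The sample markup, the example.com URL, and the helper name decodeTimestamp are assumptions for illustration and are not taken from the linked playground. With the unquoted URL the decoder is expected to return a syntax error, while the quoted variant should decode even though most tags are never closed.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Data mirrors the struct from the question.
type Data struct {
	Timestamp string `xml:"Body>Container>Timestamp"`
}

// sampleFormat is a hypothetical, shortened stand-in for the third-party
// response. The %s placeholder receives the SRC attribute value.
const sampleFormat = `<Envelope><Body><Container>
<Timestamp>2014-01-15T21:07:07.217Z</Timestamp>
<Item><Description>
<table border=0><tr><td>
<IMG SRC=%s>
<BR>
</Description></Item>
</Container></Body></Envelope>`

var (
	// Unquoted URL, as in the question's data.
	unquotedSample = fmt.Sprintf(sampleFormat, `http://www.example.com/a.jpg`)
	// Same document with quotes added around the URL.
	quotedSample = fmt.Sprintf(sampleFormat, `"http://www.example.com/a.jpg"`)
)

// decodeTimestamp uses the decoder settings from the question.
func decodeTimestamp(data string) (string, error) {
	var o Data
	d := xml.NewDecoder(strings.NewReader(data))
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity
	if err := d.Decode(&o); err != nil {
		return "", err
	}
	return o.Timestamp, nil
}

func main() {
	for name, sample := range map[string]string{
		"unquoted": unquotedSample,
		"quoted":   quotedSample,
	} {
		ts, err := decodeTimestamp(sample)
		if err != nil {
			fmt.Printf("%s: error: %v\n", name, err)
			continue
		}
		fmt.Printf("%s: timestamp %s\n", name, ts)
	}
}
```

The likely reason is that, in non-strict mode, the decoder accepts unquoted attribute values only while they consist of letters, digits, `_`, `:` and `-`, so the value stops at the first `/` of the URL and the parser then expects `/>`. Since the feed comes from a third party, quoting would have to happen in a preprocessing step before decoding, which this sketch does not attempt.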

Answer 3

Score: 0

Consider using the go.net/html package (now golang.org/x/net/html); for me, it parsed your sample data just fine.

The problem with this package, as I see it, is that it returns a hierarchy of "nodes" (one per HTML element) which you are supposed to traverse; there is no unmarshaling into a struct, at least at first sight. Thus you might have better luck with something like html-query or goquery, which should allow you to query the parsed DOM using the().so().called().fluent().style()...

go-html-transform is yet another possible option.

In other words, my key idea is to treat the whole SOAP reply you're dealing with as HTML rather than XML, because that's what it really is, and to hope an HTML parser can cope with it thanks to HTML's laxer formatting rules and consequently more permissive parsers.

Posted by huangapple on January 24, 2014 at 08:21:12
Original link: https://go.coder-hub.com/21322093.html