How to Unmarshal XML containing dirty HTML in Go
Question
I have some XML that I want to unmarshal, but one field, which I don't even care about, contains dirty HTML. I posted an example here: http://play.golang.org/p/caKCAYyXX2
Is there a way to tell the Decoder to skip or ignore these errors? I tried the non-strict Decoder described in the docs, but no combination of AutoClose or Entity values got it working. The XML comes from a third party that I have no control over, and its contents always vary, so I doubt that compiling a static list of elements to skip is feasible. Adding Description to the struct with the xml:"-" tag makes no difference.
I was able to parse this with Python 2.7, so I hope it is possible in Go, which I'd prefer for my use case. I am running on Google App Engine, so the solution has to be native Go and cannot rely on external C libraries.
Relevant code:
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

var XMLData = []byte(`<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Body>
<Container>
<Timestamp>2014-01-15T21:07:07.217Z</Timestamp>
<Item>
<Description>
<table width="100%" border=0 ><tr><td><table width="100%"><tr><td><!-- Begin Description -->
<TABLE cellSpacing=27 cellPadding=0 width="100%"><TBODY><TR><TD vAlign=top><P align=center>
<TABLE cellPadding=15 width="86%" border=1><TBODY><TR><TD><H3><P>
<H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H><H2><H2>
<IMG SRC=http://www.REMOVED.com/simage/j6x516.jpg>
<BR><BR>
<IMG SRC=http://www.REMOVED.com/simage/j6x517.jpg>
</Description>
</Item>
<Container>
</soapenv:Body>
</soapenv:Envelope>`)

// Data holds the only field of interest from the SOAP response.
type Data struct {
	Timestamp string `xml:"Body>Container>Timestamp"`
}

func main() {
	var o Data
	// Non-strict decoder with the HTML auto-close and entity tables.
	decoder := xml.NewDecoder(bytes.NewBuffer(XMLData))
	decoder.Strict = false
	decoder.AutoClose = xml.HTMLAutoClose
	decoder.Entity = xml.HTMLEntity
	if err := decoder.Decode(&o); err != nil {
		fmt.Println("Error: ", err)
	} else {
		fmt.Println("Timestamp: ", o.Timestamp)
	}
}
Result:
Error: XML syntax error on line 14: expected /> in element
Thank you.
Answer 1
Score: 1
As an alternative to the xml package, if you have libxml2 installed, you can use Gokogiri to harness its parsing flexibility in Go.
For example, evaluating an XPath expression:
package main

import (
	"fmt"

	"github.com/moovweb/gokogiri"
	"github.com/moovweb/gokogiri/xml"
	"github.com/moovweb/gokogiri/xpath"
)

func main() {
	var XMLData = []byte(`<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Body>
<Container>
<Timestamp>2014-01-15T21:07:07.217Z</Timestamp>
<Item>
<Description>
<table width="100%" border=0 ><tr><td><table width="100%"><tr><td><!-- Begin Description -->
<TABLE cellSpacing=27 cellPadding=0 width="100%"><TBODY><TR><TD vAlign=top><P align=center>
<TABLE cellPadding=15 width="86%" border=1><TBODY><TR><TD><H3><P>
<H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H><H2><H2>
<IMG SRC=http://www.REMOVED.com/simage/j6x516.jpg>
<BR><BR>
<IMG SRC=http://www.REMOVED.com/simage/j6x517.jpg>
</Description>
</Item>
<Container>
</soapenv:Body>
</soapenv:Envelope>`)

	doc, err := gokogiri.ParseXml(XMLData)
	if err != nil {
		fmt.Printf("XML document could not be parsed")
		return
	}

	nxpath := xpath.NewXPath(doc.DocPtr())
	nodes, err := nxpath.Evaluate(doc.DocPtr(), xpath.Compile("//Timestamp"))
	if err != nil {
		fmt.Printf("XPath could not be evaluated")
		return
	}
	if len(nodes) == 0 {
		fmt.Printf("Elements matching XPath not found")
		return
	}

	timestamp := xml.NewNode(nodes[0], doc).InnerHtml()
	fmt.Printf("%s", timestamp) // "2014-01-15T21:07:07.217Z"
}
This works with Go v1.2 on OS X 10.9.1. The Gokogiri package also includes a CSS selector converter, but I've never used it and can't vouch for it.
Answer 2
Score: 0
Your decoder code is fine (you can actually remove the decoder.AutoClose = xml.HTMLAutoClose line). The problem is that the img tags don't have quotes around their src attributes. See this playground.
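To illustrate the point about unquoted attributes, here is a minimal sketch (not taken from the linked playground) that quotes bare attribute values with a regular expression before handing the input to a non-strict decoder. As far as I can tell, the non-strict decoder tolerates unquoted values only when they consist of letters, digits, '_', ':' and '-', so a bare URL stops being read at its first '/'. The quoteAttrs and decodeTimestamp helpers, the pattern, and the small sample document are assumptions made for illustration. The pattern is deliberately naive: it will also rewrite name=value text that appears outside tags or inside already quoted values.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"regexp"
)

// bareAttr matches name=value pairs whose value is not quoted.
// Hypothetical, deliberately naive pattern (assumption, not from the answer).
var bareAttr = regexp.MustCompile(`(\s[\w:-]+)=([^\s"'>]+)`)

// quoteAttrs wraps unquoted attribute values in double quotes.
func quoteAttrs(in []byte) []byte {
	return bareAttr.ReplaceAll(in, []byte(`${1}="${2}"`))
}

// Data mirrors the struct from the question, with a shorter path
// because the sample below has no SOAP envelope.
type Data struct {
	Timestamp string `xml:"Timestamp"`
}

// decodeTimestamp quotes bare attributes, then decodes non-strictly.
func decodeTimestamp(in []byte) (string, error) {
	d := xml.NewDecoder(bytes.NewReader(quoteAttrs(in)))
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity
	var o Data
	if err := d.Decode(&o); err != nil {
		return "", err
	}
	return o.Timestamp, nil
}

func main() {
	// Reduced stand-in for the third-party response (assumption).
	sample := []byte(`<Root><Timestamp>2014-01-15T21:07:07.217Z</Timestamp>` +
		`<Description><IMG SRC=http://example.com/a.jpg><BR></Description></Root>`)
	ts, err := decodeTimestamp(sample)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Timestamp:", ts)
}
```

Preprocessing the bytes keeps the rest of the code on encoding/xml, which suits the pure-Go constraint in the question, but a real HTML tokenizer would be more robust than a regular expression for arbitrary input.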
Answer 3
Score: 0
Consider using the go.net/html package (now golang.org/x/net/html); it parsed your sample data just fine for me.
The problem with this package, as I see it, is that it returns a hierarchy of "nodes" (one per HTML element) that you are supposed to traverse yourself. That is, there is no unmarshaling into a struct, at least at first sight. You might therefore have better luck with something like html-query or goquery, which should let you query the parsed DOM in the().so().called().fluent().style(). go-html-transform is yet another option.
In other words, my key idea is to treat the whole SOAP reply as HTML rather than XML, because that is what it really is, and to hope that an HTML parser can cope with it thanks to HTML's laxer formatting rules and, consequently, more permissive parsers.