2014-01-24 08:21:12 · go

How to Unmarshal XML containing dirty HTML in Go

Question

I have some XML I want to Unmarshal but it contains dirty HTML in a field I don't even care about. I posted an example here: http://play.golang.org/p/caKCAYyXX2

Is there a way to tell the Decoder to skip or ignore these errors? I tried making a non-strict Decoder as described in the docs, but no combination of AutoClose and Entity values got it working. I should mention that this XML comes from a third party I have no control over, and its contents vary, so I'm not sure compiling a static list of elements to skip would be feasible. Adding Description to the struct with the xml:"-" tag makes no difference.

I was able to parse this using Python 2.7, so I hope it is possible in Go, which I'd prefer for my use case. I am using Google App Engine, so the solution has to be native Go and cannot rely on external C libraries.

Relevant code:

var XMLData = []byte(`<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <soapenv:Body>
  <Container>
   <Timestamp>2014-01-15T21:07:07.217Z</Timestamp>
   <Item>
    <Description>
<table  width="100%" border=0 ><tr><td><table width="100%"><tr><td><!-- Begin Description -->
<TABLE cellSpacing=27 cellPadding=0 width="100%"><TBODY><TR><TD vAlign=top><P align=center>
<TABLE cellPadding=15 width="86%" border=1><TBODY><TR><TD><H3><P>
<H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H><H2><H2>
<IMG SRC=http://www.REMOVED.com/simage/j6x516.jpg>
<BR><BR>
<IMG SRC=http://www.REMOVED.com/simage/j6x517.jpg>
    </Description>
   </Item>
  <Container>
 </soapenv:Body>
</soapenv:Envelope>`)

type Data struct {
	Timestamp string `xml:"Body>Container>Timestamp"`
}

var o Data
decoder := xml.NewDecoder(bytes.NewBuffer(XMLData))
decoder.Strict = false
decoder.AutoClose = xml.HTMLAutoClose
decoder.Entity = xml.HTMLEntity
if err := decoder.Decode(&o); err != nil {
	fmt.Println("Error: ", err)
} else {
	fmt.Println("Timestamp: ", o.Timestamp)
}

Result:
Error: XML syntax error on line 14: expected /> in element

Thank you.

Answer 1

Score: 1

As an alternative to the xml package, if you have libxml2 installed, you can use Gokogiri to harness its parsing flexibility in Go.

For example, evaluating using an XPath:

package main

import (
	"fmt"

	"github.com/moovweb/gokogiri"
	"github.com/moovweb/gokogiri/xml"
	"github.com/moovweb/gokogiri/xpath"
)

func main() {
	var XMLData = []byte(`<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Body>
<Container>
<Timestamp>2014-01-15T21:07:07.217Z</Timestamp>
<Item>
	<Description>
<table  width="100%" border=0 ><tr><td><table width="100%"><tr><td><!-- Begin Description -->
<TABLE cellSpacing=27 cellPadding=0 width="100%"><TBODY><TR><TD vAlign=top><P align=center>
<TABLE cellPadding=15 width="86%" border=1><TBODY><TR><TD><H3><P>
<H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H2><H><H2><H2>
<IMG SRC=http://www.REMOVED.com/simage/j6x516.jpg>
<BR><BR>
<IMG SRC=http://www.REMOVED.com/simage/j6x517.jpg>
	</Description>
</Item>
<Container>
</soapenv:Body>
</soapenv:Envelope>`)
	doc, err := gokogiri.ParseXml(XMLData)
	if err != nil {
		fmt.Printf("XML document could not be parsed")
		return
	}
	nxpath := xpath.NewXPath(doc.DocPtr())
	nodes, err := nxpath.Evaluate(doc.DocPtr(), xpath.Compile("//Timestamp"))
	if err != nil {
		fmt.Printf("XPath could not be evaluated")
		return
	}
	if len(nodes) == 0 {
		fmt.Printf("Elements matching XPath not found")
		return
	}
	timestamp := xml.NewNode(nodes[0], doc).InnerHtml()
	fmt.Printf("%s", timestamp) // "2014-01-15T21:07:07.217Z"
}

This works with Go v1.2 on OS X 10.9.1. The Gokogiri package also includes a CSS selector converter, but I've never used it and can't vouch for it.

Answer 2

Score: 0

Your decoder code is fine (you can actually remove the decoder.AutoClose = xml.HTMLAutoClose line). The problem is that the img tags don't have quotes around the src attributes. See this playground.
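
To illustrate the point about quoting, here is a minimal sketch, not the code from the linked playground. As far as I can tell, a non-strict encoding/xml Decoder accepts an unquoted attribute value only while it consists of letters, digits, `_`, `:` and `-`, so `border=0` goes through but a URL stops at the first `/`. The `sample` payload, the `unquotedAttr` regular expression and the `quoteAttrs` helper are assumptions made up for this example; the regular expression is a rough heuristic and may also rewrite `name=value` text that appears inside an already quoted value.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"regexp"
)

// Data mirrors the struct from the question.
type Data struct {
	Timestamp string `xml:"Body>Container>Timestamp"`
}

// unquotedAttr is a hypothetical heuristic: whitespace, an attribute name,
// "=", then a value that does not start with a quote.
var unquotedAttr = regexp.MustCompile(`(\s[A-Za-z_:][-A-Za-z0-9_:.]*=)([^\s"'<>]+)`)

// quoteAttrs wraps every unquoted attribute value in double quotes.
func quoteAttrs(b []byte) []byte {
	return unquotedAttr.ReplaceAll(b, []byte(`${1}"${2}"`))
}

// decodeTimestamp runs the question's non-strict decoder settings over b.
func decodeTimestamp(b []byte) (string, error) {
	var o Data
	d := xml.NewDecoder(bytes.NewReader(b))
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity
	err := d.Decode(&o)
	return o.Timestamp, err
}

// sample is a shortened, made-up stand-in for the third-party payload.
var sample = []byte(`<Envelope><Body><Container>
<Timestamp>2014-01-15T21:07:07.217Z</Timestamp>
<Item><Description>
<TABLE border=0><TR><TD>
<IMG SRC=http://www.example.com/a.jpg>
<BR>
</Description></Item>
</Container></Body></Envelope>`)

func main() {
	// The unquoted URL should make the decoder fail on the IMG tag.
	if _, err := decodeTimestamp(sample); err != nil {
		fmt.Println("raw:", err)
	}
	// After quoting the attribute values, the same decoder should succeed.
	ts, err := decodeTimestamp(quoteAttrs(sample))
	fmt.Println("quoted:", ts, err)
}
```

The unclosed TABLE, TR and TD elements are left as they are; in non-strict mode the decoder closes them itself when it reaches the Description end tag. Whether a pre-processing pass like this is safe depends on what else the third party puts into the payload.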

Answer 3

Score: 0

Consider using the go.net/html package (now golang.org/x/net/html); for me, it parsed your sample data just fine.

The problem with this package, as I see it, is that it returns a hierarchy of "nodes" (one per HTML element) which you are supposed to traverse. In other words, there is no unmarshaling to a struct, at least at first sight. You might therefore have better luck with something like html-query or goquery, which should let you query the parsed DOM in the so-called fluent style of chained().method().calls().

go-html-transform is yet another possible option.

In other words, my key idea is to treat the whole SOAP reply you're dealing with as HTML rather than XML, because that's what it really is, and to hope an HTML parser can cope with it thanks to HTML's laxer formatting rules and consequently more permissive parsers.
