How to read bad XML with Go

Question

I'd like to use Go to read an XML file. The problem is that it's a bad XML file -- it doesn't conform to the spec. Here's a sample:

<?xml version="1.0" encoding="UTF-8"?>
<something abc="1" def="2">
    <0 x="a"/>
    <1 x="b"/>
    <2 x="c"/>
    <26 x="z"/>
</something>

My Go program correctly gives an error when trying to read this:

$ go run rs.go <real.xml
chardata: '
'
start: name.local='something'
start {{ something} [{{ abc} 1} {{ def} 2}]}
'abc'='1'
'def'='2'
offset=66
chardata: '
    '
XML syntax error on line 3: invalid XML name: 0
exit status 1

Here's the little Go program:

package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"os"
)

// <something abc="1" def="2">
type Something struct {
	abc   string `xml:"abc"`
	def   string `xml:"def"`
	spots []Spot
}

// <0 x="a"/>
type Spot struct {
	num  int    // ??
	xval string `xml:"x"`
}

func main() {
	dec := xml.NewDecoder(os.Stdin)
	// dec.Strict = false		// doesn't help <0 ...> problem
	// dec.Entity = xml.HTMLEntity

	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		} else if err != nil {
			fmt.Fprintf(os.Stderr, "%v\n", err)
			os.Exit(1)
		}

		switch tok := tok.(type) {
		case xml.StartElement:
			fmt.Printf("start: name.local='%s'\n", tok.Name.Local)
			fmt.Printf("start %v\n", tok)
			for _, a := range tok.Attr {
				fmt.Printf("'%s'='%s'\n", a.Name.Local, a.Value)
			}
			fmt.Printf("offset=%d\n", dec.InputOffset())
		case xml.EndElement:
			fmt.Printf("end: name.local='%s'\n", tok.Name.Local)
		case xml.CharData:
			fmt.Printf("chardata: '%s'\n", tok)
		case xml.Comment:
			fmt.Printf("comment: '%s'\n", tok)
		}
	}
}

Is there a Go expert out there who can help me figure out how to get Go to read this goofy XML file? Thanks!

Answer 1

Score: 2

Posting my comment as an answer.

It doesn't look like you can use the Go xml package directly here, because its decoder rejects element names that start with a digit. But you could:

  • fork the xml package and change the isName function to allow your format, or
  • sanitize the XML first, turning it into valid XML, and then parse it with the Go xml package, or
  • implement your own parser (probably a good option, depending on how wild your "XML" input is), as explained on the Gopher Academy blog: advent-2014/parsers-lexers
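The second option can be done on the fly, without loading the whole file. The following is only a minimal sketch under stated assumptions: the input never contains "<" followed by a digit inside comments, CDATA, or text, and renaming <0 .../> to <n0 .../> is acceptable. The names sanitizer, newSanitizer, elementNames, and the "n" prefix are hypothetical and chosen for illustration; they are not part of encoding/xml.

```go
package main

import (
	"bufio"
	"encoding/xml"
	"fmt"
	"io"
	"os"
	"regexp"
	"strings"
)

// numericTag matches the start of an opening or closing tag whose name
// begins with a digit, such as "<0" or "</26".
var numericTag = regexp.MustCompile(`<(/?)([0-9])`)

// sanitizer is a hypothetical io.Reader wrapper that rewrites numeric tag
// names one line at a time, so the whole file never has to be in memory.
type sanitizer struct {
	br      *bufio.Reader
	pending []byte // sanitized bytes not yet returned to the caller
	err     error  // sticky error from the underlying reader
}

func newSanitizer(r io.Reader) *sanitizer {
	return &sanitizer{br: bufio.NewReader(r)}
}

func (s *sanitizer) Read(p []byte) (int, error) {
	for len(s.pending) == 0 {
		if s.err != nil {
			return 0, s.err
		}
		var line []byte
		line, s.err = s.br.ReadBytes('\n')
		// "<0" becomes "<n0" and "</0" becomes "</n0".
		s.pending = numericTag.ReplaceAll(line, []byte("<${1}n${2}"))
	}
	n := copy(p, s.pending)
	s.pending = s.pending[n:]
	return n, nil
}

// elementNames returns the local name of every start element in doc,
// decoding through the sanitizer with the same Token loop as the question.
func elementNames(doc string) ([]string, error) {
	dec := xml.NewDecoder(newSanitizer(strings.NewReader(doc)))
	var names []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return names, nil
		} else if err != nil {
			return names, err
		}
		if start, ok := tok.(xml.StartElement); ok {
			names = append(names, start.Name.Local)
		}
	}
}

func main() {
	const sample = `<?xml version="1.0" encoding="UTF-8"?>
<something abc="1" def="2">
    <0 x="a"/>
    <26 x="z"/>
</something>`
	names, err := elementNames(sample)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(names)
}
```

Because the rewrite happens inside Read, the decoder only ever sees valid names, and os.Stdin or a file can be passed to newSanitizer in place of the string reader. Working line by line is safe for this pattern since "<" and the digit after it cannot be split across a newline.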

Answer 2

Score: 1

Thanks to your pointers and suggestions, I was able to read the XML files.
I just rewrite the bad entries into valid ones and let xml.Unmarshal do its job.
The malformed files I have are small (less than 10k), and this approach holds the whole file in memory,
so it might not be a good choice if the XML file were 100 MB.

// raw holds the file contents; every "<N" becomes "<splat n="N"".
re := regexp.MustCompile("<([0-9]+)")
s := re.ReplaceAllString(string(raw), "<splat n=\"${1}\"")

x := Something{Abc: "0"}
err = xml.Unmarshal([]byte(s), &x)
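The snippet above omits the struct definitions and the file reading. A self-contained sketch of the same idea might look like the following. The struct layout is an assumption, since the answer does not show it: the fields are exported so that encoding/xml can fill them, attributes use the ",attr" option, and the Spots field is mapped to the rewritten splat elements. The helper parseSomething is a hypothetical name.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"os"
	"regexp"
)

// Something maps <something abc="1" def="2">. Fields must be exported for
// encoding/xml to set them, and attributes need the ",attr" option.
type Something struct {
	Abc   string `xml:"abc,attr"`
	Def   string `xml:"def,attr"`
	Spots []Spot `xml:"splat"`
}

// Spot maps a rewritten entry such as <splat n="0" x="a"/>.
type Spot struct {
	Num  int    `xml:"n,attr"`
	Xval string `xml:"x,attr"`
}

// numericTag captures the digits of a tag like "<26".
var numericTag = regexp.MustCompile(`<([0-9]+)`)

// parseSomething rewrites every "<N" into `<splat n="N"` and then lets
// xml.Unmarshal decode the now-valid document.
func parseSomething(raw []byte) (Something, error) {
	fixed := numericTag.ReplaceAll(raw, []byte(`<splat n="${1}"`))
	var s Something
	err := xml.Unmarshal(fixed, &s)
	return s, err
}

func main() {
	raw := []byte(`<?xml version="1.0" encoding="UTF-8"?>
<something abc="1" def="2">
    <0 x="a"/>
    <26 x="z"/>
</something>`)
	s, err := parseSomething(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", s)
}
```

Moving the number into an attribute keeps it available as Spot.Num after decoding. The pattern only handles self-closing entries like those in the sample; an input with closing tags such as </0> would need an additional rewrite.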

Thank you!

huangapple
  • Posted on February 17, 2016, 10:57:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/35447040.html