How to read bad XML with Go
Question
I'd like to use Go to read an XML file. The problem is that it's a bad XML file -- it doesn't conform to the spec. Here's a sample:
<?xml version="1.0" encoding="UTF-8"?>
<something abc="1" def="2">
<0 x="a"/>
<1 x="b"/>
<2 x="c"/>
<26 x="z"/>
</something>
My Go program correctly gives an error when trying to read this:
$ go run rs.go <real.xml
chardata: '
'
start: name.local='something'
start {{ something} [{{ abc} 1} {{ def} 2}]}
'abc'='1'
'def'='2'
offset=66
chardata: '
'
XML syntax error on line 3: invalid XML name: 0
exit status 1
Here's the little Go program:
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"os"
)

// <something abc="1" def="2">
type Something struct {
	abc   string `xml:"abc"`
	def   string `xml:"def"`
	spots []Spot
}

// <0 x="a"/>
type Spot struct {
	num  int    // ??
	xval string `xml:"x"`
}

func main() {
	dec := xml.NewDecoder(os.Stdin)
	// dec.Strict = false // doesn't help <0 ...> problem
	// dec.Entity = xml.HTMLEntity
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		} else if err != nil {
			fmt.Fprintf(os.Stderr, "%v\n", err)
			os.Exit(1)
		}
		switch tok := tok.(type) {
		case xml.StartElement:
			fmt.Printf("start: name.local='%s'\n", tok.Name.Local)
			fmt.Printf("start %v\n", tok)
			for _, a := range tok.Attr {
				fmt.Printf("'%s'='%s'\n", a.Name.Local, a.Value)
			}
			fmt.Printf("offset=%d\n", dec.InputOffset())
		case xml.EndElement:
			fmt.Printf("end: name.local='%s'\n", tok.Name.Local)
		case xml.CharData:
			fmt.Printf("chardata: '%s'\n", tok)
		case xml.Comment:
			fmt.Printf("comment: '%s'\n", tok)
		}
	}
}
Is there a Go expert out there who can help me figure out how to get Go to read this goofy XML file? Thanks!
Answer 1
Score: 2
Posting my comment as an answer.
It doesn't seem like you would be able to use the Go xml package directly here. But you could:
- consider forking the xml package and changing the isName function to allow your format, or
- sanitize the XML first, changing it into valid XML, and then use the Go xml package to do the parsing.
- Yet another option (probably a good one, depending on how wild your "XML" input is) is to implement your own parser, as explained on the Gopher Academy blog: advent-2014/parsers-lexers
Answer 2
Score: 1
Thanks to your pointers and suggestions, I was able to read the XML files.
Just rewrite the bad entries into valid ones and let Unmarshal do its job.
The malformed files I have are small (less than 10k), so this might not be a good choice if the XML file were 100 MB.
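// raw holds the whole file as []byte; each <N ...> becomes <splat n="N" ...>.
re := regexp.MustCompile("<([0-9]+)")
s := re.ReplaceAllString(string(raw), "<splat n=\"${1}\"")
// Something needs exported fields (Abc, ...) for Unmarshal to fill them in.
x := Something{Abc: "0"}
err = xml.Unmarshal([]byte(s), &x)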
Thank you!
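For anyone landing here later, the snippet above can be fleshed out into a complete program roughly like this. It is a sketch, not the exact code I ran: the struct tags, the parseBad helper and the numericTag variable are assumptions filled in for illustration. Note that encoding/xml only populates exported fields, and attributes need the ",attr" option in the tag.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"regexp"
)

// Something mirrors <something abc="1" def="2">.
type Something struct {
	Abc   string `xml:"abc,attr"`
	Def   string `xml:"def,attr"`
	Spots []Spot `xml:"splat"`
}

// Spot mirrors a rewritten entry such as <splat n="0" x="a"/>.
type Spot struct {
	Num  int    `xml:"n,attr"`
	Xval string `xml:"x,attr"`
}

// numericTag matches the start of an element whose name is all digits.
var numericTag = regexp.MustCompile(`<([0-9]+)`)

// parseBad (hypothetical helper) rewrites <N ...> to <splat n="N" ...>
// and then lets xml.Unmarshal do the real parsing.
func parseBad(raw []byte) (Something, error) {
	fixed := numericTag.ReplaceAll(raw, []byte(`<splat n="${1}"`))
	var s Something
	err := xml.Unmarshal(fixed, &s)
	return s, err
}

func main() {
	raw := []byte(`<?xml version="1.0" encoding="UTF-8"?>
<something abc="1" def="2">
<0 x="a"/>
<26 x="z"/>
</something>`)
	s, err := parseBad(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", s)
}
```

The regexp rewrites every "<" that is directly followed by digits, so it would also touch such text inside comments or CDATA sections; that was not an issue for my files.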