How to read bad XML with Go

Question

I'd like to use Go to read an XML file. The problem is that it's a bad XML file -- it doesn't conform to the spec. Here's a sample:

    <?xml version="1.0" encoding="UTF-8"?>
    <something abc="1" def="2">
    <0 x="a"/>
    <1 x="b"/>
    <2 x="c"/>
    <26 x="z"/>
    </something>

My Go program correctly gives an error when trying to read this:

    $ go run rs.go <real.xml
    chardata: '
    '
    start: name.local='something'
    start {{ something} [{{ abc} 1} {{ def} 2}]}
    'abc'='1'
    'def'='2'
    offset=66
    chardata: '
    '
    XML syntax error on line 3: invalid XML name: 0
    exit status 1

Here's the little Go program:

    package main

    import (
        "encoding/xml"
        "fmt"
        "io"
        "os"
    )

    // <something abc="1" def="2">
    type Something struct {
        abc   string `xml:"abc"`
        def   string `xml:"def"`
        spots []Spot
    }

    // <0 x="a"/>
    type Spot struct {
        num  int    // ??
        xval string `xml:"x"`
    }

    func main() {
        dec := xml.NewDecoder(os.Stdin)
        // dec.Strict = false // doesn't help <0 ...> problem
        // dec.Entity = xml.HTMLEntity
        for {
            tok, err := dec.Token()
            if err == io.EOF {
                break
            } else if err != nil {
                fmt.Fprintf(os.Stderr, "%v\n", err)
                os.Exit(1)
            }
            switch tok := tok.(type) {
            case xml.StartElement:
                fmt.Printf("start: name.local='%s'\n", tok.Name.Local)
                fmt.Printf("start %v\n", tok)
                for _, a := range tok.Attr {
                    fmt.Printf("'%s'='%s'\n", a.Name.Local, a.Value)
                }
                fmt.Printf("offset=%d\n", dec.InputOffset())
            case xml.EndElement:
                fmt.Printf("end: name.local='%s'\n", tok.Name.Local)
            case xml.CharData:
                fmt.Printf("chardata: '%s'\n", tok)
            case xml.Comment:
                fmt.Printf("comment: '%s'\n", tok)
            }
        }
    }

Is there a Go expert out there who can help me figure out how to get Go to read this goofy XML file? Thanks!

Answer 1

Score: 2

Posting my comment as an answer.

It doesn't seem like you would be able to use the Go xml package directly here. But you could:

  • consider forking the xml package and changing the isName function to allow your format, or
  • sanitize the XML first, changing it into valid XML, and then use the Go xml package to do the parsing.
  • Yet another option (probably a good one, depending on how wild your "XML" input is), is to implement your own parser, as explained on the Gopher Academy blog: advent-2014/parsers-lexers
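
The second option above (sanitize first, then parse) could be sketched as a streaming filter, so the input never has to be held in memory all at once. This is only a sketch under stated assumptions: the replacement element name `spot`, the attribute `n`, and the helper names `sanitize` and `spotPairs` are made up for illustration; only opening tags such as `<26 .../>` are rewritten (closing tags like `</26>` are not handled); the text `<` followed by digits is assumed not to occur anywhere else in the file; and every line is assumed to fit within bufio.Scanner's default token limit.

```go
package main

import (
	"bufio"
	"encoding/xml"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// digitTag matches the start of a tag whose name begins with digits, e.g. "<26".
var digitTag = regexp.MustCompile(`<([0-9]+)`)

// sanitize returns a reader that rewrites <N ...> as <spot n="N" ...>,
// one line at a time. "spot" and "n" are hypothetical names.
func sanitize(r io.Reader) *io.PipeReader {
	pr, pw := io.Pipe()
	go func() {
		sc := bufio.NewScanner(r)
		for sc.Scan() {
			line := digitTag.ReplaceAllString(sc.Text(), `<spot n="${1}"`)
			if _, err := io.WriteString(pw, line+"\n"); err != nil {
				return // the reading side was closed early
			}
		}
		// A nil error here makes the reading side see io.EOF.
		pw.CloseWithError(sc.Err())
	}()
	return pr
}

// spotPairs decodes the sanitized input with the standard decoder and
// returns one "n=x" string per <spot> element.
func spotPairs(input string) ([]string, error) {
	r := sanitize(strings.NewReader(input))
	defer r.Close() // unblocks the sanitizing goroutine on early return

	dec := xml.NewDecoder(r)
	var pairs []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return pairs, nil
		} else if err != nil {
			return nil, err
		}
		se, ok := tok.(xml.StartElement)
		if !ok || se.Name.Local != "spot" {
			continue
		}
		var n, x string
		for _, a := range se.Attr {
			switch a.Name.Local {
			case "n":
				n = a.Value
			case "x":
				x = a.Value
			}
		}
		pairs = append(pairs, n+"="+x)
	}
}

func main() {
	const input = `<?xml version="1.0" encoding="UTF-8"?>
<something abc="1" def="2">
<0 x="a"/>
<26 x="z"/>
</something>
`
	pairs, err := spotPairs(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(pairs)
}
```

In a real program, `os.Stdin` or an opened file would take the place of `strings.NewReader(input)`; the string input here just keeps the sketch self-contained.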

Answer 2

Score: 1

Thanks to your pointers and suggestions, I was able to read the XML files. I just rewrite the bad entries into valid ones and let xml.Unmarshal do its job. The malformed files I have are small (less than 10k), so holding the whole file in memory as a string is fine; this might not be a good choice if the XML file were 100 MB.

    re := regexp.MustCompile("<([0-9]+)")
    s := re.ReplaceAllString(string(raw), "<splat n=\"${1}\"")
    x := Something{Abc: "0"}
    err = xml.Unmarshal([]byte(s), &x)

Thank you!
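
The snippet in this answer does not show the type definitions or how `raw` is obtained. A self-contained version might look like the sketch below. The field layout of `Something` and `Splat`, the slice name `Splats`, and the helper `parse` are assumptions, not the answerer's actual code; only the regular expression, the `splat`/`n` rewrite, and the call to `xml.Unmarshal` come from the answer.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"regexp"
)

// Something mirrors <something abc="1" def="2">. The fields are exported and
// tagged ",attr" because abc and def are attributes, not child elements.
type Something struct {
	Abc    string  `xml:"abc,attr"`
	Def    string  `xml:"def,attr"`
	Splats []Splat `xml:"splat"` // assumed name for the rewritten children
}

// Splat mirrors a rewritten entry such as <splat n="26" x="z"/>.
type Splat struct {
	N int    `xml:"n,attr"`
	X string `xml:"x,attr"`
}

// digitTag matches the start of a tag whose name begins with digits.
var digitTag = regexp.MustCompile(`<([0-9]+)`)

// parse rewrites <N ...> to <splat n="N" ...> and unmarshals the result.
// The whole document is held in memory, as in the answer.
func parse(raw []byte) (Something, error) {
	fixed := digitTag.ReplaceAll(raw, []byte(`<splat n="${1}"`))
	var s Something
	err := xml.Unmarshal(fixed, &s)
	return s, err
}

func main() {
	raw := []byte(`<?xml version="1.0" encoding="UTF-8"?>
<something abc="1" def="2">
<0 x="a"/>
<1 x="b"/>
<2 x="c"/>
<26 x="z"/>
</something>
`)
	s, err := parse(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", s)
}
```

Note that encoding/xml only fills exported struct fields, so lowercase field names like those in the question's `Something` and `Spot` types would be left empty by `xml.Unmarshal`.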

huangapple
  • Posted on February 17, 2016 at 10:57:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/35447040.html