Parsing a malformed XML file in Go

Question

I have a large number of XML files to parse, and some of them contain unclosed tags nested inside a properly closed tag. For example:

<submission>
<first-name>Henry
<last-name>Donald
<id>4224
</submission>

I set decoder.Strict = false, but the decoder is still unable to parse the whole file correctly.

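package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// sub is assumed to hold the malformed sample shown above.
const sub = `<submission>
<first-name>Henry
<last-name>Donald
<id>4224
</submission>`

type Submission struct {
	FirstName string `xml:"first-name"`
	LastName  string `xml:"last-name"`
	ID        string `xml:"id"`
}

func main() {
	dec := xml.NewDecoder(bytes.NewReader([]byte(sub)))
	// Relax the parser and reuse the HTML auto-close and entity tables.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var s Submission
	err := dec.Decode(&s)
	if err != nil {
		fmt.Println(err)
	}

	fmt.Println(s)
}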

Playground: https://play.golang.org/p/-_chEpDhzX

I know there is an HTML tokenizer that I could try, but I would prefer to use the xml package because the majority of the files are well-formed.

Answer 1

Score: 2

The following worked for me, although it is probably only practical if you know which tags are problematic. Strangely, it stops working if I also add first-name.

dec.AutoClose = append(dec.AutoClose, "last-name")
dec.AutoClose = append(dec.AutoClose, "id")

Answer 2

Score: -1

There is no way around it. You need your own decoder: http://play.golang.org/p/Kr7nq64f-c

Posted by huangapple on 2015-04-08 02:14:45
When reposting, please keep the link to this article: https://go.coder-hub.com/29498353.html