Parsing XML with Go, ignoring nested elements?


Question

I am trying to parse an HTML document with Go's xml parser. I have managed to extract all the <li> elements, but if an element contains a link <a>, the content of the link is ignored. I would like to ignore the nested <a> tag itself and display its content as plain text, but I don't know how.

Here is my code:

d := xml.NewDecoder(resp.Body)
d.Strict = false
d.AutoClose = xml.HTMLAutoClose
d.Entity = xml.HTMLEntity

type list_item struct {
	Data string `xml:",chardata"`
}

for {
	t, _ := d.Token()
	if t == nil {
		break
	}

	switch se := t.(type) {
	case xml.StartElement:
		if se.Name.Local == "li" {
			var q list_item
			d.DecodeElement(&q, &se)

			c.Infof("%+v\n", q)
		}
	}
}

Is there any way to just ignore nested elements and display their content?

Answer 1

Score: 1

Consider using a specialized package for parsing HTML. In general, HTML is not XML (XHTML 1.0 is, but documents written in it are not very common, and that standard has been deprecated).

An even better approach in my opinion, given your apparent use case, would be to use XPath to extract the necessary information with a query.

As to the question as stated, I think there's no built-in way to do what you want: xml.Decoder implements the Skip() method, but it only allows you to skip over unneeded content; there's no decoder method that returns the "inner XML" as is. You could roll this yourself using xml.Decoder's RawToken(): render whatever it returns immediately, until it returns something denoting the end element you're looking for (you'll have to implement support for handling nested elements yourself).
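
For illustration, here is a minimal sketch of that token-walking idea. It is not the answerer's code: it uses Token() instead of RawToken(), so the decoder does the AutoClose bookkeeping, and it tracks nesting with a simple depth counter. The names liTexts and liTextsFromString are hypothetical, and the sample markup is made up. A nested <li> is folded into the text of its outermost <li>.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// liTexts returns the text of every top-level <li> element read from r,
// including text that sits inside nested elements such as <a>.
// (Hypothetical helper, written for this example.)
func liTexts(r io.Reader) ([]string, error) {
	d := xml.NewDecoder(r)
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var (
		items []string
		buf   strings.Builder
		depth int // 0 means we are outside any <li>
	)
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return items, nil
		}
		if err != nil {
			return items, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			// Enter an <li>, or go one level deeper inside one.
			if depth > 0 || t.Name.Local == "li" {
				depth++
			}
		case xml.EndElement:
			if depth > 0 {
				depth--
				if depth == 0 { // the outermost <li> just closed
					items = append(items, strings.TrimSpace(buf.String()))
					buf.Reset()
				}
			}
		case xml.CharData:
			// Keep text at any depth inside an <li>; copy it right away
			// because the token's bytes are reused by the next Token call.
			if depth > 0 {
				buf.Write(t)
			}
		}
	}
}

// liTextsFromString is a convenience wrapper for in-memory documents.
func liTextsFromString(doc string) ([]string, error) {
	return liTexts(strings.NewReader(doc))
}

func main() {
	const page = `<ul><li>plain</li><li>see <a href="http://example.com">the link</a> here</li></ul>`
	items, err := liTextsFromString(page)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, it := range items {
		fmt.Printf("%q\n", it)
	}
}
```

In the question's setting you would pass resp.Body to liTexts directly. Real-world HTML that is badly malformed may still trip the xml decoder, which is one more reason to prefer a dedicated HTML parser.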

Answer 2

Score: 0

I found a library that offers jQuery-style querying of HTML documents: http://godoc.org/github.com/PuerkitoBio/goquery

I used it, and it solved the problem.

huangapple
  • Posted on 2015-03-29 18:09:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/29327863.html