Question

I am trying to parse an HTML document with Go's xml parser. I have managed to extract all the <li> elements, but if an element contains a link <a>, the content of the link is ignored. I would like to ignore the nested <a> tag itself and display its content as plain text, but I don't know how.

Here is my code:

d := xml.NewDecoder(resp.Body)
d.Strict = false
d.AutoClose = xml.HTMLAutoClose
d.Entity = xml.HTMLEntity

type list_item struct {
	Data string `xml:",chardata"`
}

for {
	t, _ := d.Token()
	if t == nil {
		break
	}

	switch se := t.(type) {
	case xml.StartElement:
		if se.Name.Local == "li" {
			var q list_item
			d.DecodeElement(&q, &se)

			c.Infof("%+v\n", q)
		}
	}
}

Is there any way to just ignore nested elements and display their content?

Answer 1

Score: 1

Consider using a specialized package for parsing HTML. In general, HTML is not XML (XHTML 1.0 is, but documents formatted using it are not very common, and that standard has been deprecated).

An even better approach in my opinion, given your apparent use case, would be to use XPath to extract the necessary information with a query.

As to the question as stated, I think there's no built-in way to do what you want: xml.Decoder implements the Skip() method, but it only allows you to skip over unneeded content; there's nothing that returns the "inner XML" as is. You could roll this yourself using xml.Decoder's RawToken(): immediately render whatever it returns until it returns the end element you're looking for (you'll have to implement support for handling nested elements).

Answer 2

Score: 0

I found a library that offers a jQuery-style API for extracting information from HTML: http://godoc.org/github.com/PuerkitoBio/goquery

I used it and it solved the problem.
