2012-10-14 22:20:08 go

Go Parse HTML table

Question

I have a table in HTML that I would like to parse, something like the one at the following link:
http://sprunge.us/IJUC
However, I'm not sure of a good way to extract the information. I've seen a couple of HTML parsers, but they seem to require that everything you want to parse has a special tag, like <tag id="specialid">info to grab</tag>. Most of my info, however, is inside <td></td> elements.

Does anyone have a suggestion for parsing this information out?

Answer 1

Score: 16

Shameless plug: My goquery library. It's the jQuery syntax brought to Go (requires Go's experimental html package, see instructions in the README of the library).

So you can do things like this (assuming your HTML document is loaded in doc, a *goquery.Document):

doc.Find("td").Each(func(i int, s *goquery.Selection) {
	fmt.Printf("Content of cell %d: %s\n", i, s.Text())
})

Edit: Changed doc.Root.Find to doc.Find in the example, since a goquery Document is now a Selection too (new in the v0.2/master branch).

Answer 2

Score: 2

You may also be interested in Go's experimental HTML parser:
https://code.google.com/p/go.net/html

The package definition according to the godoc:

> Package html implements an HTML5-compliant tokenizer and parser

I haven't used it myself, but it seems pretty straightforward:

> Parsing is done by calling Parse with an io.Reader, which returns the
> root of the parse tree (the document element) as a *Node. It is the
> caller's responsibility to ensure that the Reader provides UTF-8
> encoded HTML.

go get code.google.com/p/go.net/html

import "code.google.com/p/go.net/html"

doc, err := html.Parse(r)

It is not part of any current release, but it can be used if you install Go from source or use the golang-tip Ubuntu apt repo.

EDIT: You can also use this mirror of the experimental Go packages: https://github.com/kless/go-exp

go get github.com/kless/go-exp/html

import (
    "github.com/kless/go-exp/html"
)

Answer 3

Score: -1

If your HTML is well-formed, you can use the built-in XML parser:

http://golang.org/pkg/encoding/xml/
