Go Parse HTML table

Question

I have a table in HTML that I would like to parse, something like the one at the following link:
http://sprunge.us/IJUC
However, I'm not sure of a good way to parse out the information. I've seen a couple of HTML parsers, but those seem to require that everything you want to parse has a special tag, like <tag id="specialid">info to grab</tag>; however, the majority of my info is within <td></td>.

Does anyone have a suggestion for parsing this information out?

Answer 1

Score: 16

Shameless plug: my goquery library. It brings jQuery-style syntax to Go (it requires Go's experimental html package; see the instructions in the library's README).

So you can do things like this (assuming your HTML document is loaded in doc, a *goquery.Document):

doc.Find("td").Each(func(i int, s *goquery.Selection) {
  fmt.Printf("Content of cell %d: %s\n", i, s.Text())
})

Edit: Changed doc.Root.Find to doc.Find in the example, since a goquery Document is now a Selection too (new in the v0.2/master branch).

Answer 2

Score: 2

You may also be interested in Go's experimental HTML parser:
https://code.google.com/p/go.net/html

The package definition according to the godoc:

> Package html implements an HTML5-compliant tokenizer and parser

I haven't used it myself, but it seems pretty straightforward:

> Parsing is done by calling Parse with an io.Reader, which returns the
> root of the parse tree (the document element) as a *Node. It is the
> caller's responsibility to ensure that the Reader provides UTF-8
> encoded HTML.

go get code.google.com/p/go.net/html

import "code.google.com/p/go.net/html"

doc, err := html.Parse(r)

It is not part of any current release, but can be used if you install from source, or use the golang-tip ubuntu apt repo.

EDIT: You can also use this mirror of the experimental Go packages: https://github.com/kless/go-exp

go get github.com/kless/go-exp/html

import (
    "github.com/kless/go-exp/html"
)

Answer 3

Score: -1

If your HTML is well-formed, you can use the built-in XML parser:

http://golang.org/pkg/encoding/xml/

Posted by huangapple on 2012-10-14 22:20:08