Extract links from a web page using Go lang

Question

I am learning Google's Go programming language. Does anyone know the best practice for extracting all URLs from an HTML web page?

Coming from the Java world, there are libraries that do the job, for example jsoup, htmlparser, etc. But for Go, I guess no similar library is available yet?

Answer 1

Score: 26

If you know jQuery, you'll love GoQuery.

Honestly, it's the easiest, most powerful HTML utility I've found in Go, and it's based on the html package in the go.net repository. (Okay, so it's higher-level than just a parser, as it doesn't expose raw HTML tokens and the like, but if you want to actually get anything done with an HTML document, this package will help.)

Answer 2

Score: 21

Go's standard package for HTML parsing is still a work in progress and is not part of the current release. A third-party package you might try, though, is go-html-transform. It is being actively maintained.

Answer 3

Score: 17

While the Go package for HTML parsing is indeed still in progress, it is available in the go.net repository.

Its sources are at github.com/golang/net (formerly code.google.com/p/go.net/html) and it is being actively developed.

It is mentioned in this recent go-nuts discussion.


Note that with Go 1.4 (Dec 2014), as I mentioned in this answer, the package is now golang.org/x/net (see godoc).

Answer 4

Score: 6

I've searched around and found that there is a library called Gokogiri, which sounds like Nokogiri for Ruby. I think the project is active too.

Answer 5

Score: 0

I just published an open source, event-based, HTML 5.0 compliant parsing package for Go. You can find it here.

Here is the sample code to get all the links from a page (from A elements):

links := make([]string, 0)

parser := NewParser(htmlContent)

// Keep the href value of every A element.
parser.Parse(nil, func(e *HtmlElement, isEmpty bool) {
    if e.TagName == "a" {
        link, _ := e.GetAttributeValue("href")
        if link != "" {
            links = append(links, link)
        }
    }
}, nil)

A few things to keep in mind:

  • These are relative links, not full URLs

  • Dynamically generated links will not be collected

  • There are other links not being collected (META tags, images, iframes, etc.). It's pretty easy to modify this code to collect those.

Answer 6

Score: 0

You can also use Colly (documentation), which is commonly used for web scraping.

Features

  1. Clean API
  2. Fast (>1k request/sec on a single core)
  3. Manages request delays and maximum concurrency per domain
  4. Automatic cookie and session handling
  5. Sync/async/parallel scraping
  6. Distributed scraping
  7. Caching
  8. Automatic encoding of non-unicode responses
  9. Robots.txt support
  10. Google App Engine support

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	// Log each request before it is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}

huangapple
  • Posted on June 18, 2012 at 18:24:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/11080936.html