Extract links from a web page using Go

Question

I am learning Google's Go programming language. Does anyone know the best practice for extracting all URLs from an HTML web page?

In the Java world, there are libraries to do the job, for example jsoup, htmlparser, etc. Is there a similar library available for Go yet?

Answer 1

Score: 26

If you know jQuery, you'll love GoQuery.

Honestly, it's the easiest, most powerful HTML utility I've found in Go, and it's based on the html package in the go.net repository. (Okay, it's higher-level than just a parser, since it doesn't expose raw HTML tokens and the like, but if you want to actually get anything done with an HTML document, this package will help.)

Answer 2

Score: 21

Go's standard package for HTML parsing is still a work in progress and is not part of the current release. A third-party package you might try, though, is go-html-transform; it is being actively maintained.

Answer 3

Score: 17

While the Go package for HTML parsing is indeed still in progress, it is available in the go.net repository.

Its sources are at github.com/golang/net (formerly code.google.com/p/go.net/html), and it is being actively developed.

It is mentioned in this recent go-nuts discussion.


Note that with Go 1.4 (Dec 2014), as I mentioned in this answer, the package is now golang.org/x/net (see godoc).

Answer 4

Score: 6

I've searched around and found a library called Gokogiri, which sounds a lot like Nokogiri for Ruby. I think the project is active, too.

Answer 5

Score: 0

I just published an open-source, event-based, HTML 5.0-compliant parsing package for Go. You can find it here

Here is the sample code to get all the links from a page (from A elements):

var links []string

parser := NewParser(htmlContent)

parser.Parse(nil, func(e *HtmlElement, isEmpty bool) {
    if e.TagName == "a" {
        link, _ := e.GetAttributeValue("href")
        if link != "" {
            links = append(links, link)
        }
    }
}, nil)

A few things to keep in mind:

  • These are relative links, not full URLs

  • Dynamically generated links will not be collected

  • There are other links not being collected (META tags, images, iframes, etc.). It's pretty easy to modify this code to collect those.
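On the first point: relative links can be resolved against the page's URL with the standard net/url package. A small sketch, where the `absolutize` helper name and the example URLs are mine:

```go
package main

import (
	"fmt"
	"net/url"
)

// absolutize resolves a (possibly relative) href against the page's base URL.
func absolutize(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference implements RFC 3986 reference resolution,
	// so "../", "./", absolute paths, and full URLs all work.
	return b.ResolveReference(h).String(), nil
}

func main() {
	abs, _ := absolutize("https://example.com/docs/index.html", "../images/logo.png")
	fmt.Println(abs) // https://example.com/images/logo.png
}
```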

Answer 6

Score: 0

You may also use Colly (documentation), which is commonly used for web scraping.

Features

  1. Clean API
  2. Fast (>1k requests/sec on a single core)
  3. Manages request delays and maximum concurrency per domain
  4. Automatic cookie and session handling
  5. Sync/async/parallel scraping
  6. Distributed scraping
  7. Caching
  8. Automatic encoding of non-Unicode responses
  9. robots.txt support
  10. Google App Engine support

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Find and visit all links
    c.OnHTML("a", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("http://go-colly.org/")
}

huangapple
  • Posted on 2012-06-18 18:24:34
  • Permalink: https://go.coder-hub.com/11080936.html