Extract URLs from Google search result page

Question

I'm trying to grab all the URLs from a Google search results page. I can think of two ways to do it, but I don't really know how to implement either of them.

First, I could simply scrape the links inside the .r elements and get the href attribute of each one. However, this gives me a really long string that I would have to parse to get the URL. Here's an example of what would have to be parsed:

> https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA

The URL I want out of this is:

> https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/

So I would have to extract the substring between https and &sa, which I'm not sure how to do: each long string Google gives me has a different length, so slicing off a fixed number of characters wouldn't work.

Second, underneath each link in a Google search result the URL is shown in green text. Right-clicking it and inspecting the element gives <cite class="_Rm">, which I don't know how to find with goquery, because looking for cite with my small function just gives me more long strings of characters.

Here is my small function. It currently does the first option without the parsing and gives me a long string that just takes me to the search page:

func GetUrls(url string) {
	doc, err := goquery.NewDocument(url)
	if err != nil {
		panic(err)
	}

	doc.Find(".r").Each(func(i int, s *goquery.Selection) {
		doc.Find(".r a").Each(func(i int, s *goquery.Selection) {
			Link, _ := s.Attr("href")
			Link = url + Link
			fmt.Printf("link is [%s]\n", Link)
		})
	})
}

Answer 1

Score: 1

The standard library has support for parsing URLs. Check out the net/url package. Using this package, we can get query parameters from URLs.

Note that your original raw URL contains the URL you want to extract in the "aqs" parameter in the form of

chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/

This value is basically another URL: a relative one with its own query string.
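
To see why the nested URL ends up inside "aqs", it can help to list how net/url splits a query string on "&". The following is only a minimal sketch: the helper name queryKeys and the shortened example.com URL are assumptions made for illustration, not part of the question.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// queryKeys is a hypothetical helper that returns the sorted names of the
// query parameters found in raw.
func queryKeys(raw string) ([]string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	keys := make([]string, 0)
	for k := range u.Query() {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys, nil
}

func main() {
	// Shortened stand-in for the long URL in the question (an assumption).
	raw := "https://www.google.com/search?q=mh4u%20items&aqs=chrome.0/url?q=https://example.com/page/&sa=U"

	keys, err := queryKeys(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(keys)

	// Everything between "aqs=" and the next "&" belongs to "aqs",
	// including the second "?q=..." part.
	u, _ := url.Parse(raw)
	fmt.Println(u.Query().Get("aqs"))
}
```

Only the first "?" starts the query string, so the second one is treated as ordinary data inside the "aqs" value.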

Let's write a little helper function that gets a parameter from raw URL text:

func getParam(raw, param string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}

	// Query() parses the query string; Get returns "" when the key is absent.
	v := u.Query().Get(param)
	if v == "" {
		return "", fmt.Errorf("param %q not found", param)
	}
	return v, nil
}

Using this, we can get the "aqs" parameter from the original URL, and by applying it again to that value we can get the "q" parameter, which is exactly your desired URL:

raw := "https://www.google.com/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=mh4u%20items&oq=mh4u%20items&aqs=chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA"
aqs, err := getParam(raw, "aqs")
if err != nil {
	panic(err)
}
fmt.Println(aqs)

result, err := getParam(aqs, "q")
if err != nil {
	panic(err)
}
fmt.Println(result)

Output (try it on the Go Playground):

chrome.0.0l2j69i59j69i60j0l2.1754j0j7/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/
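
As an aside, the long string in the question appears to be the search URL with the link's href appended (the question's code does Link = url + Link), which would mean the href itself starts at /url?q=. Assuming that is the case, the href could be parsed directly without the concatenation. This is a sketch under that assumption, and targetFromHref is a hypothetical helper name:

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// targetFromHref is a hypothetical helper that extracts the "q" parameter
// from a redirect-style href such as "/url?q=https://example.com/&sa=U".
func targetFromHref(href string) (string, error) {
	u, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	target := u.Query().Get("q")
	if target == "" {
		return "", errors.New("no q parameter in href")
	}
	return target, nil
}

func main() {
	// Assumed shape of the href attribute, taken from the tail of the
	// long string in the question.
	href := "/url?q=https://youknowumsayin.wordpress.com/2015/03/16/the-inventory-and-you-what-items-should-i-bring-mh4u/&sa=U&ei=n8NvVdSvBMOsyATSzYKoCQ&ved=0CEUQFjAL&usg=AFQjCNGyD5NjsqOncyLElJt9C0hqVQ7gyA"

	target, err := targetFromHref(href)
	if err != nil {
		panic(err)
	}
	fmt.Println(target)
}
```

Inside the goquery callback, the value returned by s.Attr("href") could be passed to such a helper instead of being prefixed with the search URL.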

huangapple
  • Posted on June 4, 2015, 13:02:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/30635272.html