December 6, 2016 02:15:19 · go

How can I circumvent bot protection when scraping full NYTimes articles?

Question

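I am trying to scrape full book reviews from the New York Times so that I can run sentiment analysis on them. I know about the NY Times API and use it to get the book review URLs, but I need a scraper to get the full article text, because the API only returns a snippet. I believe nytimes.com has bot protection to keep scrapers out, but I know there are ways around it.

I found this Python scraper that works and can pull the full text from nytimes.com, but I would prefer to implement my solution in Go. Should I just port it to Go, or is that solution unnecessarily complex? I have already tried changing the User-Agent header, but everything I do in Go ends in an infinite redirect loop error.

Code: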

package main

import (
    "io/ioutil"
    "log"
    "math/rand"
    "net/http"
    "time"
)

func main() {

    rand.Seed(time.Now().Unix())

    // Pool of browser User-Agent strings; one is picked at random per run.
    userAgents := [5]string{
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0",
    }

    url := "http://www.nytimes.com/2015/10/25/books/review/the-tsar-of-love-and-techno-by-anthony-marra.html"

    // Plain client with no cookie jar.
    client := &http.Client{}

    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatalln(err)
    }

    req.Header.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])

    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }

    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }

    log.Println(string(body))
}

Results in:

2016/12/05 21:57:53 Get http://www.nytimes.com/2015/10/25/books/review/the-tsar-of-love-and-techno-by-anthony-marra.html?_r=4: stopped after 10 redirects
exit status 1

Any help is appreciated! Thank you!

Answer 1

Score: 0

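You just have to add a cookie jar to your client, so that cookies set during the redirects are sent back on the following requests:

// Requires the "net/http/cookiejar" import.
var cookieJar, _ = cookiejar.New(nil)
var client = &http.Client{Jar: cookieJar}

resp, err := client.Do(req)
if err != nil {
    log.Fatalln(err)
}
// The response now contains everything you need;
// you can print it to the console or save it to a file.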
