How can I circumvent bot protection when scraping full NYTimes articles?
Question
I am trying to scrape full book reviews from the New York Times in order to perform sentiment analysis on them. I am aware of the NY Times API and am using it to get book review URLs, but I need to devise a scraper to get the full article text, as the API only returns a snippet. I believe that nytimes.com has bot protection to prevent bots from scraping the website, but I know there are ways to circumvent it.
I found this Python scraper that works and can pull full text from nytimes.com, but I would prefer to implement my solution in Go. Should I just port it to Go, or is that solution unnecessarily complex? I have already played around with changing the User-Agent header, but everything I do in Go ends in an infinite redirect loop error.
Code:
package main

import (
    "io/ioutil"
    "log"
    "math/rand"
    "net/http"
    "time"
)

func main() {
    rand.Seed(time.Now().Unix())
    userAgents := [5]string{
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0",
    }

    url := "http://www.nytimes.com/2015/10/25/books/review/the-tsar-of-love-and-techno-by-anthony-marra.html"

    client := &http.Client{}
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatalln(err)
    }
    req.Header.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])

    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }
    log.Println(string(body))
}
Results in:
2016/12/05 21:57:53 Get http://www.nytimes.com/2015/10/25/books/review/the-tsar-of-love-and-techno-by-anthony-marra.html?_r=4: stopped after 10 redirects
exit status 1
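The ?_r=4 in the final URL hints that the server keeps bouncing the request while something it expects (likely a cookie it sets) never comes back. To see the whole chain, a CheckRedirect hook can log each hop. This is only a debugging sketch, meant as a drop-in replacement for the client := &http.Client{} line in the code above (it also needs "errors" added to the import list):

client := &http.Client{
    // Log every redirect the client is asked to follow.
    CheckRedirect: func(req *http.Request, via []*http.Request) error {
        log.Println("redirect to:", req.URL)
        if len(via) >= 10 {
            return errors.New("stopping after 10 redirects")
        }
        return nil
    },
}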
Any help is appreciated! Thank you!
Answer 1
Score: 0
You just have to add a cookie jar to your client, so that the cookies nytimes.com sets while redirecting are sent back on the following requests (cookiejar is part of the standard library: import "net/http/cookiejar"):

var cookieJar, _ = cookiejar.New(nil)
var client = &http.Client{Jar: cookieJar}

resp, err := client.Do(req)
if err != nil {
    log.Fatalln(err)
}
// now the response contains all you need and
// you can show it on the console or save it to a file
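Putting the pieces together, below is a minimal end-to-end sketch, not a guaranteed fix: it assumes the redirect loop is indeed cookie-driven, reuses the article URL and one of the User-Agent strings from the question, and keeps ioutil.ReadAll from the original code:

package main

import (
    "io/ioutil"
    "log"
    "net/http"
    "net/http/cookiejar"
)

func main() {
    // The jar stores cookies set by nytimes.com during the redirect
    // chain, so follow-up requests send them back and the loop ends.
    jar, err := cookiejar.New(nil)
    if err != nil {
        log.Fatalln(err)
    }
    client := &http.Client{Jar: jar}

    url := "http://www.nytimes.com/2015/10/25/books/review/the-tsar-of-love-and-techno-by-anthony-marra.html"
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatalln(err)
    }
    // One of the browser User-Agent strings listed in the question.
    req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36")

    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }
    log.Println(string(body))
}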