How to avoid some sites rejecting HTTP GET requests made with Go

Question

We have a script that checks, on a daily basis, every web link in our database records (users want to be notified when a link goes out of date).

A couple of sites work fine in a web browser from this IP address, but when fetched with Go they either drop the connection before the request completes or return an HTTP authorization-denied response.

I assume some sort of firewall (F5) is filtering or blocking the requests. It happens even when I change the HTTP request to use a common user agent. What can we do to make a Go request look like one from a standard browser?

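// fetch_url issues a GET request for url and returns the HTTP status code.
// The timeout d covers the whole exchange, including reading the body.
func fetch_url(url string, d time.Duration) (int, error) {
	client := &http.Client{
		Timeout: d,
	}

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return 0, err
	}

	// Present a browser-style User-Agent instead of Go's default.
	req.Header.Set("User-Agent", "Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53")

	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	// Close the body on every return path so the connection can be reused.
	defer resp.Body.Close()

	return resp.StatusCode, nil
}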

Answer 1

Score: 3

Try matching the exact headers of a request sent by your web browser, to eliminate other factors. A smart firewall may apply heuristics to decide what looks like a web browser and what looks like a robot.

Note that the Go HTTP client sends only a minimal HTTP request:

GET /foo HTTP/1.1
Host: localhost:3030
User-Agent: Go 1.1 package http
Accept-Encoding: gzip

A web browser, by contrast, sends considerably more headers:

GET /foo HTTP/1.1
Host: localhost:3030
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
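
As a minimal sketch of this advice, the program below copies the browser headers shown above onto a Go request and prints what would go on the wire, so it can be compared line by line with the browser capture. The helper name `newBrowserLikeRequest`, the `browserHeaders` map, and the localhost URL are illustrative assumptions, and the header values should be replaced with whatever your own browser sends. Matching headers may not be enough to satisfy every firewall.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httputil"
)

// browserHeaders mirrors the browser capture above (illustrative values;
// substitute the headers your own browser actually sends).
// Accept-Encoding is deliberately left out: when it is set by hand, Go's
// transport no longer decompresses gzip responses automatically.
var browserHeaders = map[string]string{
	"Accept":          "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
	"User-Agent":      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36",
	"Accept-Language": "en-US,en;q=0.8",
}

// newBrowserLikeRequest is a hypothetical helper that builds a GET request
// carrying every header in browserHeaders.
func newBrowserLikeRequest(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	for name, value := range browserHeaders {
		req.Header.Set(name, value)
	}
	return req, nil
}

func main() {
	// Assumed URL, taken from the example captures above.
	req, err := newBrowserLikeRequest("http://localhost:3030/foo")
	if err != nil {
		log.Fatal(err)
	}

	// DumpRequestOut renders the request as the transport would send it,
	// including headers the transport adds itself (such as Accept-Encoding).
	// It does not open a network connection.
	dump, err := httputil.DumpRequestOut(req, false)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", dump)
}
```

In the question's `fetch_url`, the same loop over `browserHeaders` could replace the single `req.Header.Set("User-Agent", ...)` call. Running the dump without the extra headers is also a quick way to check what your Go version sends by default, since the default `User-Agent` string differs between Go releases.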

huangapple
  • Published on March 24, 2015 at 06:37:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/29221752.html