Max Rate limit of StackOverflow

Question

I have been trying to access StackOverflow at 30 requests per second, but it does not work: I get blocked after a few seconds, even though the StackOverflow documentation says the maximum rate limit for StackExchange is 30 requests per second.

I am using the gocolly library. Here is my code:

package main

import (
	"fmt"
	"log"
	"strconv"
	"time"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func finish() {
	fmt.Println("Finish")
}

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("stackoverflow.com"),
		colly.MaxDepth(1),
		colly.Async(true),
		colly.Debugger(&debug.LogDebugger{}),
	)

	c.Limit(&colly.LimitRule{DomainGlob: "*stackoverflow.*", Parallelism: 10, Delay: 1 * time.Second})
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
	})

	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})
	c.OnHTML("#questions", func(e *colly.HTMLElement) {
		e.ForEach(".s-post-summary.js-post-summary", func(i int, el *colly.HTMLElement) {
			link := el.ChildAttr("a[href]", "href")
			e.Request.Visit("https://stackoverflow.com" + link)
		})
	})

	for i := 0; i <= 1000; i++ {
		var link = "https://stackoverflow.com/questions?tab=votes&page=" + strconv.Itoa(i)
		c.Visit(link)
		c.Wait()
	}

	finish()
}

I hope someone can help me.

Answer 1

Score: 1

Unfortunately, I was not able to reproduce your issue on my machine. Still, I'll point out a few things that should improve your solution. First, here is my working version:

package main

import (
	"fmt"
	"log"
	"strconv"
	"time"

	"github.com/gocolly/colly/v2"
)

func finish() {
	fmt.Println("Finish")
}

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("stackoverflow.com"),
		colly.MaxDepth(1),
		colly.Async(true),
		// colly.Debugger(&debug.LogDebugger{}),
	)

	// Allow at most 8 parallel requests to matching domains, with a 1s delay.
	c.Limit(&colly.LimitRule{DomainGlob: "*stackoverflow.*", Parallelism: 8, Delay: 1 * time.Second})
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36")
	})

	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})
	c.OnHTML("#questions", func(e *colly.HTMLElement) {
		e.ForEach(".s-post-summary.js-post-summary", func(i int, el *colly.HTMLElement) {
			link := el.ChildAttr("a[href]", "href")
			e.Request.Visit("https://stackoverflow.com" + link)
		})
	})

	// Queue every listing page first; the collector is async, so Visit returns immediately.
	for i := 0; i <= 29; i++ {
		link := "https://stackoverflow.com/questions?tab=votes&page=" + strconv.Itoa(i)
		c.Visit(link)
	}

	// Wait once, after all requests have been queued.
	c.Wait()
	finish()
}

The changes are:

  1. Reduced the number of potential concurrent threads (Parallelism) from 10 to 8.
  2. Used my own User-Agent value.
  3. Moved the c.Wait call outside the for loop.

The last change is the most important, because c.Wait was being misused. It blocks until all the threads started so far have finished (e.g. up to 8 concurrent threads working on your crawl requests with the Parallelism setting above). If you call it inside the loop, each iteration waits only for the thread it has just started, so the crawl effectively runs synchronously.

You can verify this with a couple of runs. With c.Wait inside the for loop, the pages are visited in order. With it outside the loop, they are visited in no particular order.
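The difference can be reproduced without touching the network. The following is only a sketch using the standard library: the `crawl` function and its sleeping goroutines are hypothetical stand-ins for asynchronous `c.Visit` calls, and `sync.WaitGroup` plays the role of `c.Wait`. It is not colly's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// crawl simulates queuing n asynchronous page fetches and returns the order
// in which they finished. Each goroutine is a stand-in for an async Visit.
func crawl(n int, waitInsideLoop bool) []int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		order []int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(page int) {
			defer wg.Done()
			// Later pages finish sooner, so any overlap shows up as reordering.
			time.Sleep(time.Duration(n-page) * 5 * time.Millisecond)
			mu.Lock()
			order = append(order, page)
			mu.Unlock()
		}(i)
		if waitInsideLoop {
			// Blocks until the fetch just started is done: no overlap at all.
			wg.Wait()
		}
	}
	// Waiting once here lets all fetches overlap.
	wg.Wait()
	return order
}

func main() {
	fmt.Println("wait inside loop: ", crawl(5, true))
	fmt.Println("wait outside loop:", crawl(5, false))
}
```

With the wait inside the loop, the pages always complete in queue order, one at a time. With the wait outside, all fetches run concurrently and the completion order depends on how long each one takes.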

Let me know whether your solution works with these changes. Thanks!

huangapple
  • Posted on January 9, 2023 at 15:41:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75054327.html