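Question

I have been trying to access StackOverflow at 30 requests per second, but it does not work: I get blocked after a few seconds, even though the StackOverflow documentation says the maximum rate limit for StackExchange is 30 requests per second.

The library I am using is gocolly. Here is my code: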

package main

import (
	"fmt"
	"log"
	"strconv"

	"time"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func finish() {
	fmt.Println("Finish")
}

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("stackoverflow.com"),
		colly.MaxDepth(1),
		colly.Async(true),
		colly.Debugger(&debug.LogDebugger{}),
	)

	c.Limit(&colly.LimitRule{DomainGlob: "*stackoverflow.*", Parallelism: 10, Delay: 1 * time.Second})
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
	})

	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})
	c.OnHTML("#questions", func(e *colly.HTMLElement) {
		e.ForEach(".s-post-summary.js-post-summary", func(i int, el *colly.HTMLElement) {
			link := el.ChildAttr("a[href]", "href")
			e.Request.Visit("https://stackoverflow.com" + link)
		})
	})

	for i := 0; i <= 1000; i++ {
		var link = "https://stackoverflow.com/questions?tab=votes&page=" + strconv.Itoa(i)
		c.Visit(link)
		c.Wait()
	}

	finish()
}

I hope someone can help me.

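Answer 1

Score: 1

Unfortunately, I was not able to reproduce your issue on my machine. Still, I'll point out a few things that will improve your solution. First, let me share my working solution: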

package main

import (
	"fmt"
	"log"
	"strconv"
	"time"

	"github.com/gocolly/colly/v2"
)

func finish() {
	fmt.Println("Finish")
}

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("stackoverflow.com"),
		colly.MaxDepth(1),
		colly.Async(true),
		// colly.Debugger(&debug.LogDebugger{}),
	)

	c.Limit(&colly.LimitRule{DomainGlob: "*stackoverflow.*", Parallelism: 8, Delay: 1 * time.Second})
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36")
	})

	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})
	c.OnHTML("#questions", func(e *colly.HTMLElement) {
		e.ForEach(".s-post-summary.js-post-summary", func(i int, el *colly.HTMLElement) {
			link := el.ChildAttr("a[href]", "href")
			e.Request.Visit("https://stackoverflow.com" + link)
		})
	})

	for i := 0; i <= 29; i++ {
		link := "https://stackoverflow.com/questions?tab=votes&page=" + strconv.Itoa(i)
		c.Visit(link)
	}

	c.Wait()
	finish()
}

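The changes I made are:

Reduced the number of potential concurrent requests (Parallelism) from 10 to 8.
Used my own User-Agent value.
Moved the c.Wait call outside the for loop.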

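The last change is the most important one, because you misunderstood how c.Wait is used. With colly.Async(true), c.Visit only starts the request and returns immediately; c.Wait then blocks until every request started so far has finished (for example, with Parallelism: 8, up to eight requests can be in flight at once). If you put c.Wait inside the loop, each iteration waits for the request it has just started (and anything that request spawned) before the next page is requested, so the listing pages are effectively fetched synchronously.

You can see this easily by running both variants a few times. With c.Wait inside the for loop, the pages are visited in order. With c.Wait after the for loop, the pages are visited in no particular order.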
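The same effect can be reproduced without colly. The following is only a minimal sketch, assuming that c.Wait behaves like a sync.WaitGroup wait over the requests started so far; the helper name peakConcurrency and the 20 ms sleep that stands in for an HTTP request are made up for illustration and are not part of colly.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakConcurrency starts n simulated fetches and reports the highest number
// that were in flight at the same moment. When waitInsideLoop is true it
// waits after every launch, mirroring c.Wait() placed inside the for loop.
// (Hypothetical helper, not a colly API.)
func peakConcurrency(n int, waitInsideLoop bool) int64 {
	var wg sync.WaitGroup
	var inFlight, peak int64

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			cur := atomic.AddInt64(&inFlight, 1)
			// Record a new peak if this goroutine observed one.
			for {
				old := atomic.LoadInt64(&peak)
				if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
					break
				}
			}
			time.Sleep(20 * time.Millisecond) // stand-in for an HTTP request
			atomic.AddInt64(&inFlight, -1)
		}()
		if waitInsideLoop {
			wg.Wait() // blocks until the fetch just started has finished
		}
	}
	wg.Wait() // waits for everything still in flight
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("wait inside loop, peak in flight:", peakConcurrency(8, true))
	fmt.Println("wait after loop, peak in flight:", peakConcurrency(8, false))
}
```

With the wait inside the loop, the peak is always 1, because each simulated fetch must finish before the next one starts. With the wait after the loop, the fetches overlap and the peak is normally well above 1.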

Let me know whether your solution also works with these changes. Thanks!
