Max Rate limit of StackOverflow
Question
I have been trying to access Stack Overflow at 30 requests per second, but it is not working: I get blocked after a few seconds, even though the Stack Overflow documentation says the maximum rate limit for Stack Exchange is 30 requests per second.
The library I am using is gocolly.
Here is my code:
package main

import (
	"fmt"
	"log"
	"strconv"
	"time"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func finish() {
	fmt.Println("Finish")
}

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("stackoverflow.com"),
		colly.MaxDepth(1),
		colly.Async(true),
		colly.Debugger(&debug.LogDebugger{}),
	)
	c.Limit(&colly.LimitRule{DomainGlob: "*stackoverflow.*", Parallelism: 10, Delay: 1 * time.Second})
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
	})
	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})
	c.OnHTML("#questions", func(e *colly.HTMLElement) {
		e.ForEach(".s-post-summary.js-post-summary", func(i int, el *colly.HTMLElement) {
			link := el.ChildAttr("a[href]", "href")
			e.Request.Visit("https://stackoverflow.com" + link)
		})
	})
	for i := 0; i <= 1000; i++ {
		var link = "https://stackoverflow.com/questions?tab=votes&page=" + strconv.Itoa(i)
		c.Visit(link)
		c.Wait()
	}
	finish()
}
I hope someone can help me.
Answer 1
Score: 1
Unfortunately, I was not able to reproduce your issue on my machine. Still, I'll point out a few things that should improve your solution. First, let me share my working version:
package main

import (
	"fmt"
	"log"
	"strconv"
	"time"

	"github.com/gocolly/colly/v2"
)

func finish() {
	fmt.Println("Finish")
}

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("stackoverflow.com"),
		colly.MaxDepth(1),
		colly.Async(true),
		// colly.Debugger(&debug.LogDebugger{}),
	)
	// Parallelism lowered from 10 to 8.
	c.Limit(&colly.LimitRule{DomainGlob: "*stackoverflow.*", Parallelism: 8, Delay: 1 * time.Second})
	c.OnRequest(func(r *colly.Request) {
		// A regular desktop browser User-Agent.
		r.Headers.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36")
	})
	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})
	c.OnHTML("#questions", func(e *colly.HTMLElement) {
		e.ForEach(".s-post-summary.js-post-summary", func(i int, el *colly.HTMLElement) {
			link := el.ChildAttr("a[href]", "href")
			e.Request.Visit("https://stackoverflow.com" + link)
		})
	})
	// Queue every listing page first...
	for i := 0; i <= 29; i++ {
		link := "https://stackoverflow.com/questions?tab=votes&page=" + strconv.Itoa(i)
		c.Visit(link)
	}
	// ...then wait once, outside the loop, for all queued requests to finish.
	c.Wait()
	finish()
}
The changes I made are:
- Decreased the number of potential concurrent threads from 10 to 8.
- Used my own User-Agent value.
- Moved the c.Wait call outside the for loop.
The last change is the most important one, because you misunderstood how c.Wait is meant to be used. Basically, it blocks until every request started so far has finished (with the limit rule above, up to 8 of them may be running concurrently). If you call it inside the loop, each iteration waits for the request it has just started before the next one is issued, so the crawl effectively runs synchronously.
> You can easily see this with a couple of runs. If you leave c.Wait inside the for loop, the pages are visited in order. If you move it outside the for loop, the pages are visited in no particular order.
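The difference between waiting inside and outside the loop can be sketched without any network access. The example below is a stand-in, not colly itself: it uses a plain sync.WaitGroup, and the function name peakInFlight and the 50ms sleep that simulates a request are assumptions made for the illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakInFlight starts n simulated requests and reports the highest number
// that were running at the same time. The sleep stands in for network latency.
func peakInFlight(n int, waitInsideLoop bool) int64 {
	var wg sync.WaitGroup
	var running, peak int64

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			cur := atomic.AddInt64(&running, 1)
			// Record a new peak if this goroutine pushed the count higher.
			for {
				old := atomic.LoadInt64(&peak)
				if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
					break
				}
			}
			time.Sleep(50 * time.Millisecond)
			atomic.AddInt64(&running, -1)
		}()
		if waitInsideLoop {
			// Blocks until the request that was just started has finished,
			// so the next one cannot begin before then.
			wg.Wait()
		}
	}
	// Waiting once here lets all requests overlap.
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("wait inside loop, peak in flight:", peakInFlight(5, true))
	fmt.Println("wait after loop, peak in flight:", peakInFlight(5, false))
}
```

With the wait inside the loop the peak is always 1, which is the synchronous behaviour described above. With a single wait after the loop the simulated requests overlap, so the peak is normally higher and the completion order is no longer fixed.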
Let me know whether your solution also works with these changes. Thanks!