Golang Colly Scraping - Website Captcha Catches My Scrape
Question
I wrote a scraper for Amazon product titles, but Amazon's captcha catches it. I ran go run main.go 10 times: 8 runs hit the captcha and 2 returned the product title.
I researched this but found solutions only for Python, none for Go. Is there a solution I can use?
package main

import (
	"fmt"
	"strings"

	"github.com/gocolly/colly"
)

func main() {
	// Create a Collector restricted to the Amazon domains
	c := colly.NewCollector(
		colly.AllowedDomains("www.amazon.com", "amazon.com"),
	)
	// For every div, print the product title (if any) and the div's full text;
	// the div text is where the captcha message shows up
	c.OnHTML("div", func(h *colly.HTMLElement) {
		captcha := h.Text
		title := h.ChildText("span#productTitle")
		fmt.Println(strings.TrimSpace(title))
		fmt.Println(strings.TrimSpace(captcha))
	})
	// Start the collector
	c.Visit("https://www.amazon.com/Bluetooth-Over-Ear-Headphones-Foldable-Prolonged/dp/B07K5214NZ")
}
> Output:
>
> Enter the characters you see below Sorry, we just need to make sure
> you're not a robot. For best results, please make sure your browser is
> accepting cookies.
Answer 1
Score: 2
If you don't mind using a different package, I wrote one for searching HTML
(essentially a thin wrapper around github.com/tdewolff/parse):
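package main

import (
	"net/http"
	"os"

	"github.com/89z/parse/html"
)

func main() {
	req, err := http.NewRequest(
		"GET", "https://www.amazon.com/dp/B07K5214NZ", nil,
	)
	if err != nil {
		panic(err)
	}
	// Send a browser-like User-Agent instead of Go's default
	req.Header = http.Header{
		"User-Agent": {"Mozilla"},
	}
	// Issue the request directly through a Transport (no redirects, no cookies)
	res, err := new(http.Transport).RoundTrip(req)
	if err != nil {
		panic(err)
	}
	defer res.Body.Close()
	// Advance to the element with id="productTitle" and print its contents
	lex := html.NewLexer(res.Body)
	lex.NextAttr("id", "productTitle")
	os.Stdout.Write(lex.Bytes())
}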
Result:
Bluetooth Headphones Over-Ear, Zihnic Foldable Wireless and Wired Stereo
Headset Micro SD/TF, FM for Cell Phone,PC,Soft Earmuffs &Light Weight for
Prolonged Waring(Rose Gold)
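
The same idea can be sketched with only the standard library: send a browser-like User-Agent and check whether the response is the captcha page before trying to parse it. This is a minimal sketch, not a guaranteed way past Amazon's bot detection. The helper names fetch and looksLikeCaptcha, the captchaMarker constant, and the local test server standing in for the real site are all assumptions made for illustration; the marker text is taken from the output quoted in the question.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// captchaMarker is the text of the robot-check page quoted in the question.
// Assumption: the real page may change its wording at any time.
const captchaMarker = "Enter the characters you see below"

// looksLikeCaptcha (hypothetical helper) reports whether body contains
// the robot-check text rather than a product page.
func looksLikeCaptcha(body string) bool {
	return strings.Contains(body, captchaMarker)
}

// fetch (hypothetical helper) issues a GET with the given User-Agent
// and returns the response body as a string.
func fetch(client *http.Client, url, userAgent string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", userAgent)
	res, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()
	b, err := io.ReadAll(res.Body)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Local stand-in for the real site (assumed behavior): it serves the
	// captcha text unless the User-Agent starts with "Mozilla".
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.HasPrefix(r.UserAgent(), "Mozilla") {
			fmt.Fprint(w, `<span id="productTitle">Example title</span>`)
			return
		}
		fmt.Fprint(w, captchaMarker)
	}))
	defer srv.Close()

	// Compare Go's default-style User-Agent with a browser-like one.
	for _, ua := range []string{"Go-http-client/1.1", "Mozilla/5.0"} {
		body, err := fetch(srv.Client(), srv.URL, ua)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s captcha=%v\n", ua, looksLikeCaptcha(body))
	}
}
```

Against a real site, a check like looksLikeCaptcha lets the scraper skip or retry a blocked response instead of printing the captcha text as if it were the title.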