Getting blocked while using Scrapy (With User Agent)

Question

I am trying to scrape a casual sports-team website in my country that keeps blocking my Scrapy attempts. I have tried setting a User Agent, but without any success: as soon as I run Scrapy, I get a 429 Unknown Status, and not one 200 success. I am able to visit the website in my browser, so I know my IP is not blocked. Any help would be appreciated.

Here is the code I am using:

import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class QuoteSpider(CrawlSpider):
    name = "Quote"
    allowed_domains = ["avaldsnes.spoortz.no"]
    start_urls = ["https://avaldsnes.spoortz.no/portal/arego/club/7"]

    # Follow every link on the site; an empty allow pattern matches all URLs.
    rules = (Rule(LinkExtractor(allow="")),)
    # Spoof only the user agent; all other request headers stay at Scrapy's defaults.
    custom_settings = {"USER_AGENT": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}

    def parse(self, response):
        print(response.request.headers)

I tried crawling the website for its links, but not one attempt succeeded. Right now the user agent is set to Googlebot, but I have tried regular ones as well.

Answer 1

Score: 1

In this case you need to set the headers (and not just the user agent).

from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class QuoteSpider(CrawlSpider):
    name = "Quote"
    allowed_domains = ["avaldsnes.spoortz.no"]
    start_urls = ["https://avaldsnes.spoortz.no/portal/arego/club/7"]

    # A full set of headers copied from a real browser session, so the request
    # looks like ordinary browser traffic rather than a bare Scrapy client.
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "DNT": "1",
        "Host": "avaldsnes.spoortz.no",
        "Pragma": "no-cache",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
    }

    rules = (Rule(LinkExtractor(allow="")),)

    # Send these headers with every request the spider makes.
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': headers
    }

    def parse(self, response):
        print(response.request.headers)

Output:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://avaldsnes.spoortz.no/portal/arego/club/7> (referer: None)
...
...
...
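
An alternative, if you prefer to keep the headers on the requests themselves instead of in the project-wide DEFAULT_REQUEST_HEADERS setting, is to attach them in start_requests. A minimal sketch (reusing the headers dict from the answer, shortened here):

import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class QuoteSpider(CrawlSpider):
    name = "Quote"
    allowed_domains = ["avaldsnes.spoortz.no"]
    start_urls = ["https://avaldsnes.spoortz.no/portal/arego/club/7"]

    # Browser-like headers as in the answer above (shortened for brevity).
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    }

    rules = (Rule(LinkExtractor(allow="")),)

    def start_requests(self):
        # Attach the headers to the initial requests explicitly. Requests
        # generated later by the rules fall back to the project defaults
        # unless a process_request hook is added to the Rule.
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.headers)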
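Worth noting: HTTP status 429 means "Too Many Requests", i.e. the server is rate-limiting the client, so even with browser-like headers an aggressive crawl may still get blocked. As a minimal sketch (the values are illustrative, not from the original post), throttling settings can be added to the same custom_settings dict in the spider above:

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': headers,
        # Spread requests out so the server's rate limiter is less likely
        # to answer with 429 Too Many Requests.
        'DOWNLOAD_DELAY': 1.0,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
        # Let Scrapy adapt the delay to the server's observed response times.
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 1.0,
    }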
