Spider Run from Script will not Crawl pages from start_urls seed

I have an issue crawling websites when I try to run the spider from a Google Colab script:

Code for the spider class:


    import scrapy


    class Campaign_Spider(scrapy.Spider):
        # name of the spider
        name = "crowdfunder"

        # first start url
        start_urls = ["https://www.crowdfunder.co.uk/search/projects?category=Business&map=off"]

        npages = 83  # for the full list of listed campaigns we could set this to 83

        # this mimics getting the pages via the "next" button
        for i in range(2, npages + 2):
            start_urls.append("https://www.crowdfunder.co.uk/search/projects?page=" + str(i) + "&category=Business&map=off")
    
        def parse(self, response):
            #print('This is the response' + response.url)
    
            for href in response.xpath("//a[contains(@class, 'cf-pod__link')]//@href"):
                url = href.extract()
    
                yield scrapy.Request(url, callback=self.parse_page)
      
        def parse_page(self, response):
            pass
            #href = response.xpath("//a[contains(@class, 'cf-pod__link')]//@href")
            # Extract the information
            # ...
            #yield {
                #'url': response.request.meta['referer'],
                # ...
            #}

Code for the wrapper that runs the crawl:


    from multiprocessing import Process, Queue

    from scrapy.crawler import CrawlerProcess
    from twisted.internet import reactor


    # the wrapper, so the crawl can be run more than once from the notebook
    def run_spider(spider):
        def f(q):
            try:
                runner = CrawlerProcess(settings={
                    'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
                    'FEEDS': {'crowdfunder.csv': {'format': 'csv', 'overwrite': True}},
                })

                deferred = runner.crawl(spider)
                deferred.addBoth(lambda _: reactor.stop())
                reactor.run()
                q.put(None)
            except Exception as e:
                q.put(e)

        q = Queue()
        p = Process(target=f, args=(q,))
        p.start()
        result = q.get()
        p.join()

        if result is not None:
            raise result
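
For context, a minimal sketch of how the wrapper is then invoked from a Colab cell (assuming Campaign_Spider and run_spider are both defined in the notebook):

    run_spider(Campaign_Spider)

Each call spawns a fresh process, so the Twisted reactor can be started again on every run of the cell.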

I have tried printing the response inside parse(), and what I want is for each url in the start_urls list to lead the spider on to further webpages to explore. I would appreciate any advice, as I would like to run this spider from a Google Colab script.

The output from config_log():

config_log() Heading -> https://i.stack.imgur.com/7PjEc.png
Response/no crawling -> https://i.stack.imgur.com/yW5vN.png
Spider stats output log -> https://i.stack.imgur.com/MATrA.png

Answer 1

Score: 2

Here are the steps I followed to get to the solution:

  1. Open the website in a browser with JavaScript turned off. Search for the urls and see that they don't load. (Then turn JavaScript on again.)
  2. I opened the Developer Tools in my browser, clicked the Network tab and opened the webpage.
  3. I searched for a JSON file, and I copied the request's headers, body, and url.
  4. Now we just need to recreate the request. Notice that I left the Content-Length out of the headers, and that the request to the API is a POST request. Scrapy needs the payload to be a string, so we use the json.dumps() function on the payload. The response that we get is also JSON, so in order to parse it I use response.json().
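
For reference, this is roughly the shape of the JSON that parse() consumes, reduced to the fields it actually reads (an illustrative, hypothetical fragment inferred from the parsing code below, not a verbatim API response):

    # Hypothetical response body, reduced to the keys parse() below relies on.
    example_json = {
        "results": [
            {"hits": [{"uri": "/p/example-campaign"}]}
        ]
    }

    base_url = "https://www.crowdfunder.co.uk"
    # Each hit's "uri" is joined onto the site's base url to get a campaign page.
    urls = [base_url + hit.get("uri") for hit in example_json.get("results")[0].get("hits")]
    print(urls)  # ['https://www.crowdfunder.co.uk/p/example-campaign']

The full spider: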
    import scrapy
    import json


    class Campaign_Spider(scrapy.Spider):
        name = "crowdfunder"
        npages = 83  # for the full list of listed campaigns we could set this to 83

        # https://docs.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned
        custom_settings = {
            'DOWNLOAD_DELAY': 0.6
        }

        def start_requests(self):
            api_url = 'https://7izdzrqwm2-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia for JavaScript (3.35.1); Browser (lite)&x-algolia-application-id=7IZDZRQWM2&x-algolia-api-key=9767ce6d672cff99e513892e0b798ae2'
            # headers for the API request
            headers = {
                "Accept": "application/json",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "en-US,en;q=0.5",
                "Cache-Control": "no-cache",
                "Connection": "keep-alive",
                "content-type": "application/x-www-form-urlencoded",
                "DNT": "1",
                "Host": "7izdzrqwm2-dsn.algolia.net",
                "Origin": "https://www.crowdfunder.co.uk",
                "Pragma": "no-cache",
                "Referer": "https://www.crowdfunder.co.uk/",
                "Sec-Fetch-Dest": "empty",
                "Sec-Fetch-Mode": "cors",
                "Sec-Fetch-Site": "cross-site",
                "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
            }

            for i in range(1, self.npages + 2):
                payload = {
                    "requests": [{
                        "indexName": "frontendsearch",
                        "params": f"facetFilters=%5B%22category%3ABusiness%22%5D&hitsPerPage=12&page={str(i)}&aroundPrecision=1000&distinct=true&query=&insideBoundingBox=&facets=%5B%5D&tagFilters="
                    }]
                }
                yield scrapy.Request(url=api_url, body=json.dumps(payload), method='POST', headers=headers)

        def parse(self, response):
            json_data = response.json()
            base_url = 'https://www.crowdfunder.co.uk'

            for hit in json_data.get('results')[0].get('hits'):
                url = f"{base_url}{hit.get('uri')}"
                # don't forget to add whatever headers you need
                yield scrapy.Request(url, callback=self.parse_page)

        def parse_page(self, response):
            # parse whatever you want here from each webpage
            pass
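
To run this spider from the Colab notebook, the run_spider() wrapper from the question can be reused unchanged (run_spider(Campaign_Spider)). Outside a notebook, the standard CrawlerProcess pattern is enough; a minimal sketch, reusing the feed settings from the question:

    from scrapy.crawler import CrawlerProcess

    if __name__ == '__main__':
        process = CrawlerProcess(settings={
            'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
            'FEEDS': {'crowdfunder.csv': {'format': 'csv', 'overwrite': True}},
        })
        process.crawl(Campaign_Spider)
        process.start()  # blocks here until the crawl finishes

CrawlerProcess starts and stops the Twisted reactor itself, which is why the multiprocessing wrapper is only needed when the same notebook kernel has to run the crawl more than once.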
