Spider Run from Script will not Crawl pages from start_urls seed

I have an issue crawling websites when I try to run the spider from a Google Colab script:

Code for the spider class:


    import scrapy


    class Campaign_Spider(scrapy.Spider):
        # name of the spider
        name = "crowdfunder"

        # first start url
        start_urls = ["https://www.crowdfunder.co.uk/search/projects?category=Business&map=off"]

        npages = 83  # for the full list of listed campaigns we could set this to 83

        # this mimics getting the pages via the "next" button
        for i in range(2, npages + 2):
            start_urls.append("https://www.crowdfunder.co.uk/search/projects?page=" + str(i) + "&category=Business&map=off")
    
        def parse(self, response):
            #print('This is the response' + response.url)
    
            for href in response.xpath("//a[contains(@class, 'cf-pod__link')]//@href"):
                url = href.extract()
    
                yield scrapy.Request(url, callback=self.parse_page)
      
        def parse_page(self, response):
            pass
            #href = response.xpath("//a[contains(@class, 'cf-pod__link')]//@href")
            # Extract the information
            # ...
            #yield {
                #'url': response.request.meta['referer'],
                # ...
            #}

Code for the wrapper that runs the crawl:


    from multiprocessing import Process, Queue

    from scrapy.crawler import CrawlerProcess
    from twisted.internet import reactor


    # the wrapper, so the crawl can be run more than once from the notebook
    def run_spider(spider):
        def f(q):
            try:
                runner = CrawlerProcess(settings={
                    'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
                    'FEEDS': {'crowdfunder.csv': {'format': 'csv', 'overwrite': True}},
                })

                deferred = runner.crawl(spider)
                deferred.addBoth(lambda _: reactor.stop())
                reactor.run()
                q.put(None)
            except Exception as e:
                q.put(e)

        q = Queue()
        p = Process(target=f, args=(q,))
        p.start()
        result = q.get()
        p.join()

        if result is not None:
            raise result
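
For context, a minimal sketch of how the wrapper is then invoked from a Colab cell (assuming Campaign_Spider and run_spider are both defined in the notebook):

    run_spider(Campaign_Spider)

Each call spawns a fresh process, so the Twisted reactor can be started again on every run of the cell.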

I have tried printing the response inside parse(), and what I want is for each url in the start_urls list to lead the spider on to further webpages to explore. I would appreciate any advice, as I would like to run this spider from a Google Colab script.

The output from config_log():

config_log() Heading -> https://i.stack.imgur.com/7PjEc.png
Response/no crawling -> https://i.stack.imgur.com/yW5vN.png
Spider stats output log -> https://i.stack.imgur.com/MATrA.png

Answer 1

Score: 2

Here are the steps I followed to get to the solution:

  1. Open the website in a browser with JavaScript turned off. Search for the urls and see that they don't load. (Then turn JavaScript on again.)
  2. I opened the Developer Tools in my browser, clicked the Network tab and opened the webpage.
  3. I searched for a JSON file, and I copied the request's headers, body, and url.
  4. Now we just need to recreate the request. Notice that I left the Content-Length out of the headers, and that the request to the API is a POST request. Scrapy needs the payload to be a string, so we use the json.dumps() function on the payload. The response that we get is also JSON, so in order to parse it I use response.json().
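
For reference, this is roughly the shape of the JSON that parse() consumes, reduced to the fields it actually reads (an illustrative, hypothetical fragment inferred from the parsing code below, not a verbatim API response):

    # Hypothetical response body, reduced to the keys parse() below relies on.
    example_json = {
        "results": [
            {"hits": [{"uri": "/p/example-campaign"}]}
        ]
    }

    base_url = "https://www.crowdfunder.co.uk"
    # Each hit's "uri" is joined onto the site's base url to get a campaign page.
    urls = [base_url + hit.get("uri") for hit in example_json.get("results")[0].get("hits")]
    print(urls)  # ['https://www.crowdfunder.co.uk/p/example-campaign']

The full spider: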
    import scrapy
    import json


    class Campaign_Spider(scrapy.Spider):
        name = "crowdfunder"
        npages = 83  # for the full list of listed campaigns we could set this to 83

        # https://docs.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned
        custom_settings = {
            'DOWNLOAD_DELAY': 0.6
        }

        def start_requests(self):
            api_url = 'https://7izdzrqwm2-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia for JavaScript (3.35.1); Browser (lite)&x-algolia-application-id=7IZDZRQWM2&x-algolia-api-key=9767ce6d672cff99e513892e0b798ae2'
            # headers for the API request
            headers = {
                "Accept": "application/json",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "en-US,en;q=0.5",
                "Cache-Control": "no-cache",
                "Connection": "keep-alive",
                "content-type": "application/x-www-form-urlencoded",
                "DNT": "1",
                "Host": "7izdzrqwm2-dsn.algolia.net",
                "Origin": "https://www.crowdfunder.co.uk",
                "Pragma": "no-cache",
                "Referer": "https://www.crowdfunder.co.uk/",
                "Sec-Fetch-Dest": "empty",
                "Sec-Fetch-Mode": "cors",
                "Sec-Fetch-Site": "cross-site",
                "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
            }

            for i in range(1, self.npages + 2):
                payload = {
                    "requests": [{
                        "indexName": "frontendsearch",
                        "params": f"facetFilters=%5B%22category%3ABusiness%22%5D&hitsPerPage=12&page={str(i)}&aroundPrecision=1000&distinct=true&query=&insideBoundingBox=&facets=%5B%5D&tagFilters="
                    }]
                }
                yield scrapy.Request(url=api_url, body=json.dumps(payload), method='POST', headers=headers)

        def parse(self, response):
            json_data = response.json()
            base_url = 'https://www.crowdfunder.co.uk'

            for hit in json_data.get('results')[0].get('hits'):
                url = f"{base_url}{hit.get('uri')}"
                # don't forget to add whatever headers you need
                yield scrapy.Request(url, callback=self.parse_page)

        def parse_page(self, response):
            # parse whatever you want here from each webpage
            pass
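
To run this spider from the Colab notebook, the run_spider() wrapper from the question can be reused unchanged (run_spider(Campaign_Spider)). Outside a notebook, the standard CrawlerProcess pattern is enough; a minimal sketch, reusing the feed settings from the question:

    from scrapy.crawler import CrawlerProcess

    if __name__ == '__main__':
        process = CrawlerProcess(settings={
            'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
            'FEEDS': {'crowdfunder.csv': {'format': 'csv', 'overwrite': True}},
        })
        process.crawl(Campaign_Spider)
        process.start()  # blocks here until the crawl finishes

CrawlerProcess starts and stops the Twisted reactor itself, which is why the multiprocessing wrapper is only needed when the same notebook kernel has to run the crawl more than once.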
