Spider Run from Script will not Crawl pages from start_urls seed


Question

I have an issue crawling websites when I try to run the spider from a Google Colab script:

Code for the spider class:

    import scrapy

    class Campaign_Spider(scrapy.Spider):
        # name of the spider
        name = "crowdfunder"
        # First start url
        start_urls = ["https://www.crowdfunder.co.uk/search/projects?category=Business&map=off"]

        npages = 83  # For the full list of listed campaigns we could set this to 83
        # This mimics getting the pages using the next button.
        for i in range(2, npages + 2):
            start_urls.append("https://www.crowdfunder.co.uk/search/projects?page=" + str(i) + "&category=Business&map=off")

        def parse(self, response):
            # print('This is the response ' + response.url)
            for href in response.xpath("//a[contains(@class, 'cf-pod__link')]/@href"):
                url = href.extract()
                yield scrapy.Request(url, callback=self.parse_page)

        def parse_page(self, response):
            pass
            # href = response.xpath("//a[contains(@class, 'cf-pod__link')]/@href")
            # Extract the information
            # ...
            # yield {
            #     'url': response.request.meta['referer'],
            #     ...
            # }

Code for the wrapper that runs the crawl:

    from multiprocessing import Process, Queue

    from scrapy.crawler import CrawlerProcess
    from twisted.internet import reactor

    # the wrapper to make it run more times
    def run_spider(spider):
        def f(q):
            try:
                runner = CrawlerProcess(settings={
                    'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
                    'FEEDS': {'crowdfunder.csv': {'format': 'csv', 'overwrite': True}},
                })
                deferred = runner.crawl(spider)
                deferred.addBoth(lambda _: reactor.stop())
                reactor.run()
                q.put(None)
            except Exception as e:
                q.put(e)

        # run the crawl in a separate process so it can be repeated in the same notebook session
        q = Queue()
        p = Process(target=f, args=(q,))
        p.start()
        result = q.get()
        p.join()
        if result is not None:
            raise result
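For context, the wrapper is then invoked from a Colab cell roughly like this (the call itself is not shown in the question, so treat it as an assumed usage); each call writes crowdfunder.csv as configured in FEEDS:

    # assumed usage (not shown in the question): run one crawl of the spider class defined above
    run_spider(Campaign_Spider)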

I have tried printing the response inside the spider, and I would like each of the urls in the start_urls list to lead the spider on to further webpages to explore. I would appreciate any advice, as I want to run this spider from a Google Colab script.
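(As a quick way to check whether those campaign links are present at all in the raw HTML that Scrapy receives, the XPath can be tested outside the spider. This is only a sketch: it uses requests, which Colab ships with, and parsel, which is installed together with Scrapy, against the first start_urls entry.)

    import requests
    from parsel import Selector

    # fetch the first listing page the way a non-JavaScript client sees it
    html = requests.get("https://www.crowdfunder.co.uk/search/projects?category=Business&map=off").text
    links = Selector(text=html).xpath("//a[contains(@class, 'cf-pod__link')]/@href").getall()
    print(len(links), links[:3])  # an empty list suggests the links are injected by JavaScript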

The output from config_log():

config_log() Heading -> https://i.stack.imgur.com/7PjEc.png
Response/no crawling -> https://i.stack.imgur.com/yW5vN.png
Spider stats output log -> https://i.stack.imgur.com/MATrA.png
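(config_log() is a helper the question refers to but does not include; presumably it just wires Scrapy's logging into the notebook output. A minimal sketch, assuming that is all it does:)

    import logging
    from scrapy.utils.log import configure_logging

    def config_log():
        # assumed helper, not taken from the question: show Scrapy log messages in the notebook output
        configure_logging(install_root_handler=False)
        logging.basicConfig(level=logging.DEBUG, format='%(levelname)s: %(message)s')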


Answer 1

Score: 2


Here are the steps I followed to get to the solution.

  1. Open the website in a browser with JavaScript turned off. Search for the urls and see that they don't load. (Then turn JavaScript back on.)
  2. I opened the Developer Tools in my browser, clicked the Network tab and opened the webpage.
  3. I searched for a JSON file, and I copied the request's headers, body, and url.
  4. Now we just need to recreate the request. Notice that I left out the Content-Length header, and that the request to the API is a POST request. Scrapy needs the payload to be a string, so we use the json.dumps() function on the payload. The response we get is also JSON, so to parse it I use response.json().
    import scrapy
    import json


    class Campaign_Spider(scrapy.Spider):
        name = "crowdfunder"
        npages = 83  # For the full list of listed campaigns we could set this to 83

        # https://docs.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned
        custom_settings = {
            'DOWNLOAD_DELAY': 0.6
        }

        def start_requests(self):
            api_url = 'https://7izdzrqwm2-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia for JavaScript (3.35.1); Browser (lite)&x-algolia-application-id=7IZDZRQWM2&x-algolia-api-key=9767ce6d672cff99e513892e0b798ae2'

            # headers for the API request
            headers = {
                "Accept": "application/json",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "en-US,en;q=0.5",
                "Cache-Control": "no-cache",
                "Connection": "keep-alive",
                "content-type": "application/x-www-form-urlencoded",
                "DNT": "1",
                "Host": "7izdzrqwm2-dsn.algolia.net",
                "Origin": "https://www.crowdfunder.co.uk",
                "Pragma": "no-cache",
                "Referer": "https://www.crowdfunder.co.uk/",
                "Sec-Fetch-Dest": "empty",
                "Sec-Fetch-Mode": "cors",
                "Sec-Fetch-Site": "cross-site",
                "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
            }

            for i in range(1, self.npages + 2):
                payload = {
                    "requests": [{
                        "indexName": "frontendsearch",
                        "params": f"facetFilters=%5B%22category%3ABusiness%22%5D&hitsPerPage=12&page={str(i)}&aroundPrecision=1000&distinct=true&query=&insideBoundingBox=&facets=%5B%5D&tagFilters="
                    }]
                }
                yield scrapy.Request(url=api_url, body=json.dumps(payload), method='POST', headers=headers)

        def parse(self, response):
            json_data = response.json()
            base_url = 'https://www.crowdfunder.co.uk'
            for hit in json_data.get('results')[0].get('hits'):
                url = f"{base_url}{hit.get('uri')}"
                # don't forget to add whatever headers you need
                yield scrapy.Request(url, callback=self.parse_page)

        def parse_page(self, response):
            # parse whatever you want here from each webpage
            pass
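parse_page() is left as a stub above. Purely as an illustration (the selector is a guess and needs to be checked against the real campaign page markup), it could yield items along these lines, which the FEEDS export from the question's wrapper would then write to crowdfunder.csv:

        def parse_page(self, response):
            # illustrative only: the h1 selector is an assumption; inspect the actual page first
            yield {
                'url': response.url,
                'title': response.css('h1::text').get(),
            }

The spider can still be driven from Colab with the same run_spider() wrapper shown in the question.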
