Spider Run from Script will not Crawl pages from start_urls seed
Question
I have an issue crawling websites when I try to run the spider from a Google Colab script:
Code for the spider class:
import scrapy

class Campaign_Spider(scrapy.Spider):
    # name of the spider
    name = "crowdfunder"
    # First start url
    start_urls = ["https://www.crowdfunder.co.uk/search/projects?category=Business&map=off"]
    npages = 83  # For the full list of listed campaigns we could set this to 83

    # This mimics getting the pages using the next button.
    for i in range(2, npages + 2):
        start_urls.append("https://www.crowdfunder.co.uk/search/projects?page=" + str(i) + "&category=Business&map=off")

    def parse(self, response):
        # print('This is the response ' + response.url)
        for href in response.xpath("//a[contains(@class, 'cf-pod__link')]/@href"):
            url = href.extract()
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        pass
        # href = response.xpath("//a[contains(@class, 'cf-pod__link')]/@href")
        # Extract the information
        # ...
        # yield {
        #     'url': response.request.meta['referer'],
        #     ...
        # }
Code for the wrapper that runs the crawl:
from multiprocessing import Process, Queue

from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor

# the wrapper to make it possible to run the crawl more than once
def run_spider(spider):
    def f(q):
        try:
            runner = CrawlerProcess(settings={
                'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
                'FEEDS': {'crowdfunder.csv': {'format': 'csv', 'overwrite': True}},
            })
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    # run the crawl in a separate process so the Twisted reactor can be restarted
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()
    if result is not None:
        raise result
I have tried printing the response in the code, and I would like each of the URLs in the start_urls list to lead the spider to further webpages to explore. I would appreciate any advice, as I would like to run this spider from a Google Colab script.
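For reference, the wrapper is then called from a Colab cell roughly like this (a minimal sketch, assuming the class and wrapper above have already been run in the notebook):

# Sketch of the Colab cell that kicks off the crawl; Campaign_Spider and
# run_spider are the class and wrapper defined above.
run_spider(Campaign_Spider)
# The FEEDS setting inside the wrapper should then write any scraped items
# to crowdfunder.csv in the Colab working directory.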
The output from the config_log:
config_log() Heading -> https://i.stack.imgur.com/7PjEc.png
Response/no crawling -> https://i.stack.imgur.com/yW5vN.png
Spider stats output log -> https://i.stack.imgur.com/MATrA.png
Answer 1
Score: 2
Here are the steps I followed to get to the solution.

- Open the website in a browser with JavaScript turned off. Search for the urls and see that they don't load. (Then turn JavaScript back on.)
- I opened the Developer Tools in my browser, clicked the Network tab and opened the webpage.
- I searched for a JSON file, and I copied the request's headers, body, and url.
- Now we just need to recreate the request. Notice that I left the Content-Length out of the headers, and that the request to the API is a POST request. Scrapy needs the payload to be a string, so we use the json.dumps() function on the payload. The response that we get is also JSON, so in order to parse it I use response.json().
import scrapy
import json


class Campaign_Spider(scrapy.Spider):
    name = "crowdfunder"
    npages = 83  # For the full list of listed campaigns we could set this to 83

    # https://docs.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned
    custom_settings = {
        'DOWNLOAD_DELAY': 0.6
    }

    def start_requests(self):
        api_url = 'https://7izdzrqwm2-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia for JavaScript (3.35.1); Browser (lite)&x-algolia-application-id=7IZDZRQWM2&x-algolia-api-key=9767ce6d672cff99e513892e0b798ae2'

        # headers for the API request
        headers = {
            "Accept": "application/json",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "content-type": "application/x-www-form-urlencoded",
            "DNT": "1",
            "Host": "7izdzrqwm2-dsn.algolia.net",
            "Origin": "https://www.crowdfunder.co.uk",
            "Pragma": "no-cache",
            "Referer": "https://www.crowdfunder.co.uk/",
            "Sec-Fetch-Dest": "empty",
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "cross-site",
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
        }

        for i in range(1, self.npages + 2):
            payload = {
                "requests": [{
                    "indexName": "frontendsearch",
                    "params": f"facetFilters=%5B%22category%3ABusiness%22%5D&hitsPerPage=12&page={str(i)}&aroundPrecision=1000&distinct=true&query=&insideBoundingBox=&facets=%5B%5D&tagFilters="
                }]
            }
            yield scrapy.Request(url=api_url, body=json.dumps(payload), method='POST', headers=headers)

    def parse(self, response):
        json_data = response.json()
        base_url = 'https://www.crowdfunder.co.uk'

        for hit in json_data.get('results')[0].get('hits'):
            url = f"{base_url}{hit.get('uri')}"
            # don't forget to add whatever headers you need
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # parse whatever you want here from each webpage
        pass
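This spider can be dropped into the same run_spider wrapper from the question to launch it from a Colab cell; a minimal sketch, assuming both snippets are defined in the notebook:

# Sketch: run the API-based spider through the question's multiprocessing wrapper.
# Once parse_page yields items, the wrapper's FEEDS setting writes them to crowdfunder.csv.
run_spider(Campaign_Spider)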