scrape hidden pages if search yields more results than displayed
Question
Some of the search queries entered under https://www.comparis.ch/carfinder/default would yield more than 1'000 results (shown dynamically on the search page). The results, however, only show a maximum of 100 pages with 10 results each, so I'm trying to scrape the remaining data for a query that yields more than 1'000 results.
The code to scrape the IDs of the first 100 pages is shown below (it takes approx. 2 minutes to run through all 100 pages):
from bs4 import BeautifulSoup
import requests

# as the max number of pages is limited to 100
number_of_pages = 100
# initiate empty dict
car_dict = {}
# parse every search results page and extract every car ID
for page in range(0, number_of_pages + 1, 1):
    newest_secondhand_cars = 'https://www.comparis.ch/carfinder/marktplatz/occasion'
    newest_secondhand_cars = requests.get(newest_secondhand_cars + str('?page=') + str(page))
    newest_secondhand_cars = newest_secondhand_cars.content
    soup = BeautifulSoup(newest_secondhand_cars, "lxml")
    for car in list(soup.find('div', {'id': 'cf-result-list'}).find_all('h2')):
        car_id = int(car.decode().split('href="')[1].split('">')[0].split('/')[-1])
        car_dict[car_id] = {}
So I obviously tried just passing a str(page) greater than 100, which does not yield additional results.
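A minimal sketch of that check (run after the snippet above so that car_dict is already populated; the page number 150 is just an arbitrary value beyond the cap):

import requests
from bs4 import BeautifulSoup

# request a page beyond the 100-page cap and see whether it contains
# any car IDs that are not already in car_dict
url = 'https://www.comparis.ch/carfinder/marktplatz/occasion?page=150'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
result_list = soup.find('div', {'id': 'cf-result-list'})
ids_beyond_cap = set()
if result_list is not None:
    for car in result_list.find_all('h2'):
        ids_beyond_cap.add(int(car.decode().split('href="')[1].split('">')[0].split('/')[-1]))
print(ids_beyond_cap - set(car_dict))  # empty set -> nothing new beyond page 100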
How could I access the remaining results, if at all?
Answer 1
Score: 1
It seems that your website loads its data dynamically while the client is browsing. There are probably a number of ways to fix this. One option could be to utilize Scrapy Splash.
Assuming you use Scrapy, you can do the following:
- Start a Splash server using Docker - make a note of the <ip-address> (a sketch of the command and the corresponding settings follows the middleware snippet below).
- In settings.py add SPLASH_URL = <splash-server-ip-address>.
- In settings.py add this code to the middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
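To make the first two steps concrete, here is a minimal settings.py sketch; the local address, the default port 8050, and the settings beyond DOWNLOADER_MIDDLEWARES are assumptions based on the scrapy-splash README rather than part of this answer:

# settings.py - minimal sketch, assuming Splash was started locally with the
# official image, e.g.: docker run -d -p 8050:8050 scrapinghub/splash
SPLASH_URL = 'http://localhost:8050'  # replace with your <splash-server-ip-address>

# the scrapy-splash README additionally recommends these settings
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'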
- Import from scrapy_splash import SplashRequest in your spider.py.
- Set start_urls in your spider.py to iterate over the pages, e.g. like this:
base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'
start_urls = [
    base_url + '?page=' + str(page) for page in range(0, 100)
]
- Redirect the URLs to the Splash server by modifying def start_requests(self), e.g. like this:
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse,
                            endpoint='render.html',
                            args={'wait': 0.5},
                            )
- Parse the response like you do now (a complete spider sketch that puts these steps together follows below).
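Putting these steps together, a minimal spider.py could look like the sketch below. The class name, the use of Scrapy's CSS selectors instead of BeautifulSoup, and yielding each ID as an item are illustrative choices rather than something prescribed above; the ID extraction assumes, as in the question, that each result link ends with the numeric car ID.

# spider.py - minimal sketch, assuming the settings shown above are in place
import scrapy
from scrapy_splash import SplashRequest

BASE_URL = 'https://www.comparis.ch/carfinder/marktplatz/occasion'


class CarfinderSpider(scrapy.Spider):
    name = 'carfinder'
    # the site caps the result list at 100 pages of 10 results each
    start_urls = [BASE_URL + '?page=' + str(page) for page in range(0, 100)]

    def start_requests(self):
        # route every result page through the Splash server so the
        # dynamically loaded content is rendered before parsing
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5})

    def parse(self, response):
        # same extraction idea as the BeautifulSoup code in the question:
        # the car ID is the last path segment of each link in the result list
        for href in response.css('#cf-result-list h2 a::attr(href)').getall():
            yield {'car_id': int(href.rstrip('/').split('/')[-1])}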
Let me know how that works out for you.