scrape hidden pages if search yields more results than displayed
Question
Some of the search queries entered under https://www.comparis.ch/carfinder/default would yield more than 1'000 results (shown dynamically on the search page). The results, however, only show a maximum of 100 pages with 10 results each, so I'm trying to scrape the remaining data for a query that yields more than 1'000 results.
The code to scrape the IDs of the first 100 pages is shown below (it takes approx. 2 minutes to run through all 100 pages):
from bs4 import BeautifulSoup
import requests

# as the max number of pages is limited to 100
number_of_pages = 100
# initiate empty dict
car_dict = {}
# parse every search results page and extract every car ID
for page in range(0, number_of_pages + 1, 1):
    newest_secondhand_cars = 'https://www.comparis.ch/carfinder/marktplatz/occasion'
    newest_secondhand_cars = requests.get(newest_secondhand_cars + str('?page=') + str(page))
    newest_secondhand_cars = newest_secondhand_cars.content
    soup = BeautifulSoup(newest_secondhand_cars, "lxml")
    for car in list(soup.find('div', {'id': 'cf-result-list'}).find_all('h2')):
        car_id = int(car.decode().split('href="')[1].split('">')[0].split('/')[-1])
        car_dict[car_id] = {}
So I obviously tried just passing a str(page) greater than 100, which does not yield additional results.
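A minimal sketch of that check (run after the snippet above so that car_dict is already populated; the page number 150 is just an arbitrary value beyond the cap):

import requests
from bs4 import BeautifulSoup

# request a page beyond the 100-page cap and see whether it contains
# any car IDs that are not already in car_dict
url = 'https://www.comparis.ch/carfinder/marktplatz/occasion?page=150'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
result_list = soup.find('div', {'id': 'cf-result-list'})
ids_beyond_cap = set()
if result_list is not None:
    for car in result_list.find_all('h2'):
        ids_beyond_cap.add(int(car.decode().split('href="')[1].split('">')[0].split('/')[-1]))
print(ids_beyond_cap - set(car_dict))  # empty set -> nothing new beyond page 100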
How could I access the remaining results, if at all?
Answer 1
Score: 1
It seems that your website loads its data dynamically while the client is browsing. There are probably a number of ways to fix this. One option could be to utilize Scrapy Splash.
Assuming you use Scrapy, you can do the following:
- Start a Splash server using Docker - make a note of the <ip-address> (a sketch of the command and the corresponding settings follows the middleware snippet below).
- In settings.py add SPLASH_URL = <splash-server-ip-address>.
- In settings.py add this code to the middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
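To make the first two steps concrete, here is a minimal settings.py sketch; the local address, the default port 8050, and the settings beyond DOWNLOADER_MIDDLEWARES are assumptions based on the scrapy-splash README rather than part of this answer:

# settings.py - minimal sketch, assuming Splash was started locally with the
# official image, e.g.: docker run -d -p 8050:8050 scrapinghub/splash
SPLASH_URL = 'http://localhost:8050'  # replace with your <splash-server-ip-address>

# the scrapy-splash README additionally recommends these settings
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'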
- Import from scrapy_splash import SplashRequest in your spider.py.
- Set start_urls in your spider.py to iterate over the pages, e.g. like this:
base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'
start_urls = [
    base_url + '?page=' + str(page) for page in range(0, 100)
]
- Redirect the URLs to the Splash server by modifying def start_requests(self), e.g. like this:
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse,
                            endpoint='render.html',
                            args={'wait': 0.5},
                            )
- Parse the response like you do now (a complete spider sketch that puts these steps together follows below).
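Putting these steps together, a minimal spider.py could look like the sketch below. The class name, the use of Scrapy's CSS selectors instead of BeautifulSoup, and yielding each ID as an item are illustrative choices rather than something prescribed above; the ID extraction assumes, as in the question, that each result link ends with the numeric car ID.

# spider.py - minimal sketch, assuming the settings shown above are in place
import scrapy
from scrapy_splash import SplashRequest

BASE_URL = 'https://www.comparis.ch/carfinder/marktplatz/occasion'


class CarfinderSpider(scrapy.Spider):
    name = 'carfinder'
    # the site caps the result list at 100 pages of 10 results each
    start_urls = [BASE_URL + '?page=' + str(page) for page in range(0, 100)]

    def start_requests(self):
        # route every result page through the Splash server so the
        # dynamically loaded content is rendered before parsing
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5})

    def parse(self, response):
        # same extraction idea as the BeautifulSoup code in the question:
        # the car ID is the last path segment of each link in the result list
        for href in response.css('#cf-result-list h2 a::attr(href)').getall():
            yield {'car_id': int(href.rstrip('/').split('/')[-1])}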
Let me know how that works out for you.