May 8, 2023 01:36:01

A Python script built upon the requests module throws a KeyError when it goes for the next page after grabbing content from the first page

Question

I'm trying to scrape the names of different restaurants and bars from this webpage using the requests module. The script parses the content from the first page without errors. However, it throws a KeyError when it attempts to grab content from page two, even though there are several more pages to go.

Here is what I've tried:

import requests
from pprint import pprint
link = 'https://2gis.ae/dubai/search/Bars/rubricId/159'
url = 'https://catalog.api.2gis.ru/3.0/items'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
params = {
    'page': 1,
    'page_size': 12,
    'rubric_id': 159,
    'fields': 'items.locale,items.flags,search_attributes,items.adm_div,items.city_alias,items.region_id,items.segment_id,items.reviews,items.point,request_type,context_rubrics,query_context,items.links,items.name_ex,items.name_back,items.org,items.group,items.external_content,items.comment,items.ads.options,items.email_for_sending.allowed,items.stat,items.description,items.geometry.centroid,items.geometry.selection,items.geometry.style,items.timezone_offset,items.context,items.address,items.is_paid,items.access,items.access_comment,items.for_trucks,items.is_incentive,items.paving_type,items.capacity,items.schedule,items.floors,dym,ad,items.rubrics,items.routes,items.reply_rate,items.purpose,items.attribute_groups,items.route_logo,items.has_goods,items.has_apartments_info,items.has_pinned_goods,items.has_realty,items.has_payments,items.is_promoted,items.delivery,items.order_with_cart,search_type,items.has_discount,items.metarubrics,broadcast,items.detailed_subtype,items.temporary_unavailable_atm_services,items.poi_category',
    'key': 'rurbbn3446',
    'locale': 'en_AE',
    'search_device_type': 'desktop',
    'search_user_hash': '7233966692562515761',
    'viewpoint1': '55.09734166196474,25.248071810295556',
    'viewpoint2': '55.421438338035266,25.16107818970444',
    'stat[sid]': '3de514cd-c705-4753-91b6-e997c2227aa7',
    'stat[user]': '1723a6dd-6008-4197-a074-c1e34e27f785',
    'shv': '2023-05-02-14',
    'r': '2843121634'
}
with requests.Session() as s:
    s.headers.update(headers)
    while True:
        print(f"processing page ==============> {params['page']}")
        res = s.get(url, params=params)
        try:
            res.json()['result']['items']
        except KeyError:
            break
        for item in res.json()['result']['items']:
            print(item['name'])
        params['page'] += 1

Here is the full traceback:

Traceback (most recent call last):
  File "C:\Users\C.L\Desktop\Python basic\python scripts\demo.py", line 35, in <module>
    res.json()['result']['items']
KeyError: 'result'

Answer 1

Score: 2

This is not a trivial page to scrape.

The second request gives this error:

{
  "meta": {
    "code": 403,
    "error": {
      "message": "Authorization error (key is blocked, please contact api@2gis.ru)",
      "type": "apiKeyIsBlocked"
    },
    "api_version": "2.0.1086066",
    "issue_date": "20190201"
  }
}
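
Since the 403 body has a meta key but no result key, indexing ['result'] is what raises the KeyError. A minimal sketch of a helper that surfaces the API's own message instead might look like this. The name extract_items is hypothetical, and the only assumption about the response shape is what the body above and the question's code show (meta.code, meta.error, result.items):

```python
def extract_items(payload):
    """Return the items from a decoded response, or raise with the API's own error."""
    meta = payload.get("meta", {})
    result = payload.get("result")
    if result is None:
        # No 'result' key: report whatever error details the API sent back.
        error = meta.get("error", {})
        raise RuntimeError(
            f"API error {meta.get('code')}: "
            f"{error.get('type', 'unknown')}: {error.get('message', 'no message')}"
        )
    return result.get("items", [])


# The 403 body from the second request, as a Python dict.
blocked = {
    "meta": {
        "code": 403,
        "error": {
            "message": "Authorization error (key is blocked, please contact api@2gis.ru)",
            "type": "apiKeyIsBlocked",
        },
        "api_version": "2.0.1086066",
        "issue_date": "20190201",
    }
}

try:
    extract_items(blocked)
except RuntimeError as exc:
    print(exc)
```

In the question's loop, this would replace the try/except block: iterate over extract_items(res.json()) and let the RuntimeError show why pagination stopped.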

The value for search_user_hash is unique and might have something to do with the value of r.

Take a good look at the XHR filter in the Network tab of your Dev Tools.

The value of r changes with subsequent requests:

r: 3249794585

which might be the crux of the request because if you remove it, you'll get the same error. Also, keeping the value from the previous request doesn't change anything and produces the same 403 error.
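
To see which parameters change between requests, you could copy two request URLs from the Network tab and compare their query strings. This is only a sketch: diff_query_params is a hypothetical helper, the two URLs are shortened for illustration, and pairing r=3249794585 with page 2 is an assumption.

```python
from urllib.parse import parse_qsl, urlsplit


def diff_query_params(url_a, url_b):
    """Map each query parameter that differs to its (value in a, value in b) pair."""
    a = dict(parse_qsl(urlsplit(url_a).query, keep_blank_values=True))
    b = dict(parse_qsl(urlsplit(url_b).query, keep_blank_values=True))
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Shortened, illustrative URLs; real requests carry many more parameters.
first = "https://catalog.api.2gis.ru/3.0/items?page=1&page_size=12&rubric_id=159&r=2843121634"
second = "https://catalog.api.2gis.ru/3.0/items?page=2&page_size=12&rubric_id=159&r=3249794585"

print(diff_query_params(first, second))
# {'page': ('1', '2'), 'r': ('2843121634', '3249794585')}
```

Any parameter that shows up here besides page is one the script would have to regenerate for each request.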

So, IMHO, you either reverse engineer how the value of r is populated,

or

you contact api@2gis.ru and ask for your own API key.

A third option is to use selenium.

EDIT:

A few more words on the API:

The map data is powered by https://dev.urbi.ae/
This is the data you get from the API (basically, that's what you're trying to scrape) - https://dev.2gis.com/data/
They have a free version of the API for public and open projects - https://api.2gis.ru/doc/maps/en/quickstart/
As well as paid plans for commercial and educational projects - https://dev.2gis.com/price/
