A Python script built upon the requests module throws a KeyError when it goes for the next page after grabbing content from the first page

Question

I'm trying to scrape the names of different restaurants and bars from this webpage using the requests module. The script parses the content of the first page without errors. However, it throws a KeyError when it attempts to grab content from page two, even though there are several more pages to go.

Here is what I've tried:

import requests
from pprint import pprint

link = 'https://2gis.ae/dubai/search/Bars/rubricId/159'
url = 'https://catalog.api.2gis.ru/3.0/items'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
params = {
    'page': 1,
    'page_size': 12,
    'rubric_id': 159,
    'fields': 'items.locale,items.flags,search_attributes,items.adm_div,items.city_alias,items.region_id,items.segment_id,items.reviews,items.point,request_type,context_rubrics,query_context,items.links,items.name_ex,items.name_back,items.org,items.group,items.external_content,items.comment,items.ads.options,items.email_for_sending.allowed,items.stat,items.description,items.geometry.centroid,items.geometry.selection,items.geometry.style,items.timezone_offset,items.context,items.address,items.is_paid,items.access,items.access_comment,items.for_trucks,items.is_incentive,items.paving_type,items.capacity,items.schedule,items.floors,dym,ad,items.rubrics,items.routes,items.reply_rate,items.purpose,items.attribute_groups,items.route_logo,items.has_goods,items.has_apartments_info,items.has_pinned_goods,items.has_realty,items.has_payments,items.is_promoted,items.delivery,items.order_with_cart,search_type,items.has_discount,items.metarubrics,broadcast,items.detailed_subtype,items.temporary_unavailable_atm_services,items.poi_category',
    'key': 'rurbbn3446',
    'locale': 'en_AE',
    'search_device_type': 'desktop',
    'search_user_hash': '7233966692562515761',
    'viewpoint1': '55.09734166196474,25.248071810295556',
    'viewpoint2': '55.421438338035266,25.16107818970444',
    'stat[sid]': '3de514cd-c705-4753-91b6-e997c2227aa7',
    'stat[user]': '1723a6dd-6008-4197-a074-c1e34e27f785',
    'shv': '2023-05-02-14',
    'r': '2843121634'
}

with requests.Session() as s:
    s.headers.update(headers)

    while True:
        print(f"processing page ==============> {params['page']}")

        res = s.get(url, params=params)
        try:
            res.json()['result']['items']
        except KeyError:
            break

        for item in res.json()['result']['items']:
            print(item['name'])

        params['page'] += 1

Here is the full traceback:

Traceback (most recent call last):
  File "C:\Users\C.L\Desktop\Python basic\python scripts\demo.py", line 35, in <module>
    res.json()['result']['items']
KeyError: 'result'

Answer 1

Score: 2

This is not a trivial page to scrape.

The second request gives this error:

{
  "meta": {
    "code": 403,
    "error": {
      "message": "Authorization error (key is blocked, please contact api@2gis.ru)",
      "type": "apiKeyIsBlocked"
    },
    "api_version": "2.0.1086066",
    "issue_date": "20190201"
  }
}
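To make this visible from the script itself rather than as a bare KeyError, something like the following could work. It is only a sketch: the helper name extract_items is made up for illustration, and it assumes that every failed response has the meta/error shape of the single 403 sample above.

```python
def extract_items(payload):
    """Return the list of items, or raise with the API's own error details.

    Assumption: a response without a 'result' key is an error payload
    shaped like the 403 sample (meta.code, meta.error.type, meta.error.message).
    """
    if "result" not in payload:
        meta = payload.get("meta", {})
        error = meta.get("error", {})
        raise RuntimeError(
            f"API error {meta.get('code')}: "
            f"{error.get('type')} ({error.get('message')})"
        )
    return payload["result"].get("items", [])


# The error body returned for the second request, as a Python dict.
blocked = {
    "meta": {
        "code": 403,
        "error": {
            "message": "Authorization error (key is blocked, please contact api@2gis.ru)",
            "type": "apiKeyIsBlocked",
        },
        "api_version": "2.0.1086066",
        "issue_date": "20190201",
    }
}

try:
    extract_items(blocked)
except RuntimeError as exc:
    print(exc)  # shows the 403 code and the apiKeyIsBlocked type
```

In the loop from the question, calling extract_items(res.json()) once per page would replace both res.json()['result']['items'] lookups and report why pagination stopped.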

The value of search_user_hash is unique and might have something to do with the value of r.

Take a good look at the XHR filter in the Network tab of your Dev Tools.

The value of r changes with subsequent requests, for example:

r: 3249794585

This might be the crux of the request, because if you remove it, you'll get the same error. Reusing the value from the previous request doesn't help either and produces the same 403 error.
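One way to spot parameters like r is to copy two consecutive request URLs from the Network tab and diff their query strings. A rough sketch follows; the helper name changed_params is hypothetical, the URLs are shortened to three parameters for readability, and pairing r=3249794585 with page 2 is an assumption made for illustration.

```python
from urllib.parse import parse_qsl, urlsplit


def changed_params(first_url, second_url):
    """Return {name: (old, new)} for query parameters that differ."""
    first = dict(parse_qsl(urlsplit(first_url).query, keep_blank_values=True))
    second = dict(parse_qsl(urlsplit(second_url).query, keep_blank_values=True))
    return {
        name: (first.get(name), second.get(name))
        for name in sorted(first.keys() | second.keys())
        if first.get(name) != second.get(name)
    }


# Shortened, illustrative URLs: the real requests carry many more parameters.
page_1 = "https://catalog.api.2gis.ru/3.0/items?page=1&key=rurbbn3446&r=2843121634"
page_2 = "https://catalog.api.2gis.ru/3.0/items?page=2&key=rurbbn3446&r=3249794585"

for name, (old, new) in changed_params(page_1, page_2).items():
    print(f"{name}: {old} -> {new}")
```

Anything besides page that shows up in the diff is a candidate for a value the site computes per request.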

So, IMHO, you can either reverse engineer how the value of r is generated

or

contact api@2gis.ru and ask for your own API key.

A third option is to use Selenium.

EDIT:

A few more words on the API:

Posted on 2023-05-08 01:36:01