A Python script built upon the requests module throws a KeyError when it goes for the next page after grabbing content from the first page
Question
I'm trying to scrape the names of restaurants and bars from this webpage using the requests module. The script parses the first page without errors. However, it throws a KeyError when it requests the second page, even though there are several more pages to go.
Here is what I've tried:
import requests
from pprint import pprint
link = 'https://2gis.ae/dubai/search/Bars/rubricId/159'
url = 'https://catalog.api.2gis.ru/3.0/items'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
params = {
    'page': 1,
    'page_size': 12,
    'rubric_id': 159,
    'fields': 'items.locale,items.flags,search_attributes,items.adm_div,items.city_alias,items.region_id,items.segment_id,items.reviews,items.point,request_type,context_rubrics,query_context,items.links,items.name_ex,items.name_back,items.org,items.group,items.external_content,items.comment,items.ads.options,items.email_for_sending.allowed,items.stat,items.description,items.geometry.centroid,items.geometry.selection,items.geometry.style,items.timezone_offset,items.context,items.address,items.is_paid,items.access,items.access_comment,items.for_trucks,items.is_incentive,items.paving_type,items.capacity,items.schedule,items.floors,dym,ad,items.rubrics,items.routes,items.reply_rate,items.purpose,items.attribute_groups,items.route_logo,items.has_goods,items.has_apartments_info,items.has_pinned_goods,items.has_realty,items.has_payments,items.is_promoted,items.delivery,items.order_with_cart,search_type,items.has_discount,items.metarubrics,broadcast,items.detailed_subtype,items.temporary_unavailable_atm_services,items.poi_category',
    'key': 'rurbbn3446',
    'locale': 'en_AE',
    'search_device_type': 'desktop',
    'search_user_hash': '7233966692562515761',
    'viewpoint1': '55.09734166196474,25.248071810295556',
    'viewpoint2': '55.421438338035266,25.16107818970444',
    'stat[sid]': '3de514cd-c705-4753-91b6-e997c2227aa7',
    'stat[user]': '1723a6dd-6008-4197-a074-c1e34e27f785',
    'shv': '2023-05-02-14',
    'r': '2843121634'
}
with requests.Session() as s:
    s.headers.update(headers)
    while True:
        print(f"processing page ==============> {params['page']}")
        res = s.get(url, params=params)
        try:
            res.json()['result']['items']
        except KeyError:
            break
        for item in res.json()['result']['items']:
            print(item['name'])
        params['page'] += 1
Here is the full traceback:
Traceback (most recent call last):
  File "C:\Users\C.L\Desktop\Python basic\python scripts\demo.py", line 35, in <module>
    res.json()['result']['items']
KeyError: 'result'
Answer 1
Score: 2
This is not a trivial page to scrape.
The second request gives this error:
{
  "meta": {
    "code": 403,
    "error": {
      "message": "Authorization error (key is blocked, please contact api@2gis.ru)",
      "type": "apiKeyIsBlocked"
    },
    "api_version": "2.0.1086066",
    "issue_date": "20190201"
  }
}
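The KeyError hides this message, because the script indexes 'result' without first checking what the API sent back. A minimal sketch of a check that surfaces the API's own error follows. The helper name extract_items is hypothetical, and it assumes that a failed response carries an "error" object under "meta", as in the body above, and that a successful one carries "result":

```python
def extract_items(payload: dict) -> list:
    """Return result.items, or raise with the error the API reported.

    Assumption: failures look like the 403 body above, i.e. they have
    meta.error and no top-level "result" key.
    """
    meta = payload.get("meta", {})
    if "error" in meta or "result" not in payload:
        error = meta.get("error", {})
        raise RuntimeError(
            f"API error {meta.get('code')} ({error.get('type')}): "
            f"{error.get('message')}"
        )
    return payload["result"].get("items", [])


# The error body returned for the second request, trimmed to the relevant keys.
blocked = {
    "meta": {
        "code": 403,
        "error": {
            "message": "Authorization error (key is blocked, please contact api@2gis.ru)",
            "type": "apiKeyIsBlocked",
        },
    }
}

try:
    extract_items(blocked)
except RuntimeError as exc:
    print(exc)
```

In the original loop this would replace the bare try/except KeyError, so the script stops with the 403 reason instead of silently breaking out.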
The value of search_user_hash is unique and might have something to do with the value of r.
Take a good look at the XHR filter in the Network tab of your Dev Tools.
The value of r changes with each subsequent request, for example:
r: 3249794585
This might be the crux of the request, because if you remove r, you get the same error. Reusing the value from the previous request doesn't help either and produces the same 403 error.
So, IMHO, you can either reverse engineer how the value of r is populated, or contact api@2gis.ru and ask for your own API key.
A third option is to use Selenium.
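If you go the own-key route, the pagination loop could be sketched roughly as below. This is an illustration under assumptions, not a verified client: the function name iter_names and the max_pages cap are hypothetical, the endpoint and the page, page_size, rubric_id and key parameters are taken from the question, and which parameters a personal key actually requires is not covered here. How the API signals the last page is also an assumption: the sketch stops on an empty item list and raises on any reported error.

```python
API_URL = "https://catalog.api.2gis.ru/3.0/items"  # endpoint from the question


def iter_names(session, api_key, rubric_id=159, page_size=12, max_pages=50):
    """Yield item names page by page.

    `session` is anything with a requests-style get(url, params=...) method
    whose response has .json(). Stops on an empty page (assumed end-of-data
    signal) and raises if the API reports an error in meta.error.
    """
    params = {
        "page": 1,
        "page_size": page_size,
        "rubric_id": rubric_id,
        "key": api_key,
    }
    while params["page"] <= max_pages:  # hard cap so a bad response can't loop forever
        payload = session.get(API_URL, params=params).json()
        meta = payload.get("meta", {})
        if "error" in meta:
            error = meta["error"]
            raise RuntimeError(
                f"page {params['page']}: {error.get('type')}: {error.get('message')}"
            )
        items = payload.get("result", {}).get("items", [])
        if not items:
            break
        for item in items:
            yield item["name"]
        params["page"] += 1
```

With requests installed, the call would look like `for name in iter_names(requests.Session(), your_key): print(name)`, where your_key is the key you received from 2GIS.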
EDIT:
A few more words on the API:
- The map data is powered by https://dev.urbi.ae/
- This is the data you get from the API (basically, that's what you're trying to scrape) - https://dev.2gis.com/data/
- They have a free version of the API for public and open projects - https://api.2gis.ru/doc/maps/en/quickstart/
- As well as commercial and educational projects - https://dev.2gis.com/price/