Issues with scraping a website with paging in Python
Question
I am experiencing issues with scraping a website that has paging in it, showing thousands of records that I need to read.
When I scrape the main URL, I get the first 50 results. However, when I add the paging to the URL, the content of the page is returned without the relevant records at all, as if adding the paging to the URL is breaking something.
If I take the exact URL and put it in the browser, I see the results, and if I view the page source, I see the results as well.
Main URL: http://www.camp.bicnirrh.res.in/seqDb.php
URL with paging: http://www.camp.bicnirrh.res.in/seqDb.php?page=15
What am I missing?

My code:
import re
import requests

lstCAMP_IDs = []
for i in range(1, 500):
    url = f'http://www.camp.bicnirrh.res.in/seqDb.php?page={i}'
    r = requests.get(url)
    try:
        content = r.content.decode("utf-8")
    except Exception as e:
        print(e)
        print(i)
    ampAccnPrec = re.findall("(CAMPSQ[0-9]+)", content)
    if ampAccnPrec == None:
        print(i, "error ampAccnPrec")
        ampAccn = ""
        continue
    else:
        lstCAMP_IDs.append(ampAccnPrec)
Answer 1

Score: 0
You need cookies to traverse pages, so the best way to scrape it is to create a session. And, you need to start at page 0, not 1.
Here's the updated code:
import re
import requests

session = requests.session()

lstCAMP_IDs = []
for i in range(0, 500):
    url = f'http://www.camp.bicnirrh.res.in/seqDb.php?page={i}'
    r = session.get(url)
    try:
        content = r.content.decode('utf-8')
    except Exception as e:
        print(e)
        print(i)
        continue  # skip this page; otherwise `content` is stale or undefined
    ampAccnPrec = re.findall("CAMPSQ[0-9]+", content)
    if not ampAccnPrec:  # re.findall returns an empty list, never None
        print(i, "error ampAccnPrec")
        continue
    lstCAMP_IDs.append(ampAccnPrec)

print(lstCAMP_IDs)
It returns these sequences from the first page (page 0):
[['CAMPSQ12088', 'CAMPSQ12088', 'CAMPSQ12089', 'CAMPSQ12089', 'CAMPSQ12090', 'CAMPSQ12090', 'CAMPSQ12091', 'CAMPSQ12091', 'CAMPSQ12092', 'CAMPSQ12092', 'CAMPSQ12093', 'CAMPSQ12093', 'CAMPSQ12094', 'CAMPSQ12094', 'CAMPSQ12095', 'CAMPSQ12095', 'CAMPSQ12096', 'CAMPSQ12096', 'CAMPSQ12097', 'CAMPSQ12097', 'CAMPSQ12098', 'CAMPSQ12098', 'CAMPSQ12099', 'CAMPSQ12099', 'CAMPSQ12100', 'CAMPSQ12100', 'CAMPSQ12101', 'CAMPSQ12101', 'CAMPSQ12102', 'CAMPSQ12102', 'CAMPSQ12103', 'CAMPSQ12103', 'CAMPSQ12104', 'CAMPSQ12104', 'CAMPSQ12105', 'CAMPSQ12105', 'CAMPSQ12106', 'CAMPSQ12106', 'CAMPSQ12107', 'CAMPSQ12107', 'CAMPSQ12108', 'CAMPSQ12108', 'CAMPSQ12345', 'CAMPSQ12345', 'CAMPSQ12346', 'CAMPSQ12346', 'CAMPSQ12347', 'CAMPSQ12347', 'CAMPSQ12348', 'CAMPSQ12348']]
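Note that the result is a list of per-page lists, and each accession appears twice per page (the ID occurs twice in the table markup). A minimal sketch of how the collected IDs could be flattened and de-duplicated afterwards, assuming `lstCAMP_IDs` has the shape shown above:

```python
from itertools import chain

# Hypothetical scrape output, shaped like lstCAMP_IDs above:
# one list per page, each accession duplicated within its page.
lstCAMP_IDs = [
    ["CAMPSQ12088", "CAMPSQ12088", "CAMPSQ12089", "CAMPSQ12089"],
    ["CAMPSQ12090", "CAMPSQ12090"],
]

# Flatten the list of lists into one sequence of IDs.
flat = list(chain.from_iterable(lstCAMP_IDs))

# De-duplicate while preserving first-seen order
# (dict keys keep insertion order in Python 3.7+).
unique_ids = list(dict.fromkeys(flat))
print(unique_ids)  # → ['CAMPSQ12088', 'CAMPSQ12089', 'CAMPSQ12090']
```

Using `dict.fromkeys` rather than `set` keeps the accessions in the order the pages were scraped.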
Comments