Issues with scraping a website with paging in Python


Question

I am having trouble scraping a paginated website that displays thousands of records I need to read.
When I scrape the main URL, I get the first 50 results. However, when I add the page number to the URL, the page content comes back without the relevant records at all, as if adding the paging parameter breaks something.

If I paste the exact same URL into a browser, I see the results, and they appear in the page source as well.

Main URL: http://www.camp.bicnirrh.res.in/seqDb.php

URL with paging: http://www.camp.bicnirrh.res.in/seqDb.php?page=15

What am I missing?

My code:

import re
import requests

lstCAMP_IDs = []
for i in range(1, 500):
    url = f'http://www.camp.bicnirrh.res.in/seqDb.php?page={i}'
    r = requests.get(url)
    try:
        content = r.content.decode("utf-8")
    except Exception as e:
        print(e)
        print(i)
    ampAccnPrec = re.findall("(CAMPSQ[0-9]+)", content)
    if ampAccnPrec == None:
        print(i, "error ampAccnPrec")
        ampAccn = ""
        continue
    else:
        lstCAMP_IDs.append(ampAccnPrec)

Answer 1

Score: 0


The site requires cookies to traverse pages, so the best way to scrape it is to create a session. Also, you need to start at page 0, not page 1.

Here's the updated code:

import re
import requests

session = requests.session()
lstCAMP_IDs = []
for i in range(0, 500):
    url = f'http://www.camp.bicnirrh.res.in/seqDb.php?page={i}'
    r = session.get(url)

    try:
        content = r.content.decode('utf-8')
    except Exception as e:
        print(e)
        print(i)
        continue  # content is undefined here, so skip this page

    # re.findall() returns a list (empty on no match), never None
    ampAccnPrec = re.findall("CAMPSQ[0-9]+", content)
    if not ampAccnPrec:
        print(i, "error ampAccnPrec")
    else:
        lstCAMP_IDs.append(ampAccnPrec)
        print(lstCAMP_IDs)

These sequences came back from the first page, which is page 0:

[['CAMPSQ12088', 'CAMPSQ12088', 'CAMPSQ12089', 'CAMPSQ12089', 'CAMPSQ12090', 'CAMPSQ12090', 'CAMPSQ12091', 'CAMPSQ12091', 'CAMPSQ12092', 'CAMPSQ12092', 'CAMPSQ12093', 'CAMPSQ12093', 'CAMPSQ12094', 'CAMPSQ12094', 'CAMPSQ12095', 'CAMPSQ12095', 'CAMPSQ12096', 'CAMPSQ12096', 'CAMPSQ12097', 'CAMPSQ12097', 'CAMPSQ12098', 'CAMPSQ12098', 'CAMPSQ12099', 'CAMPSQ12099', 'CAMPSQ12100', 'CAMPSQ12100', 'CAMPSQ12101', 'CAMPSQ12101', 'CAMPSQ12102', 'CAMPSQ12102', 'CAMPSQ12103', 'CAMPSQ12103', 'CAMPSQ12104', 'CAMPSQ12104', 'CAMPSQ12105', 'CAMPSQ12105', 'CAMPSQ12106', 'CAMPSQ12106', 'CAMPSQ12107', 'CAMPSQ12107', 'CAMPSQ12108', 'CAMPSQ12108', 'CAMPSQ12345', 'CAMPSQ12345', 'CAMPSQ12346', 'CAMPSQ12346', 'CAMPSQ12347', 'CAMPSQ12347', 'CAMPSQ12348', 'CAMPSQ12348']]
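Note that each accession appears twice in the page markup, and `append` produces a list of per-page lists. If a flat, de-duplicated list is wanted, the result can be post-processed afterwards; a minimal sketch, using a small sample of the nested output shown above:

```python
# Sample of the nested result produced by lstCAMP_IDs.append(...),
# with the duplicate accessions the page markup contains.
pages = [['CAMPSQ12088', 'CAMPSQ12088', 'CAMPSQ12089'],
         ['CAMPSQ12089', 'CAMPSQ12090']]

# Flatten the per-page lists, then drop duplicates while preserving
# the order in which accessions first appeared (dict keys keep
# insertion order in Python 3.7+).
flat = [acc for page in pages for acc in page]
unique = list(dict.fromkeys(flat))
print(unique)  # ['CAMPSQ12088', 'CAMPSQ12089', 'CAMPSQ12090']
```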

huangapple
  • Published 2023-06-27 19:34:16
  • Please retain this link when reposting: https://go.coder-hub.com/76564438.html