Issues with scraping a website with paging in Python


Question

I am having trouble scraping a paginated website that displays thousands of records I need to read.
When I scrape the main URL, I get the first 50 results. However, when I add the page number to the URL, the page content comes back without the relevant records at all, as if adding the paging parameter breaks something.

If I paste the exact same URL into a browser, I see the results, and they appear in the page source as well.

Main URL: http://www.camp.bicnirrh.res.in/seqDb.php

URL with paging: http://www.camp.bicnirrh.res.in/seqDb.php?page=15

What am I missing?

My code:

import re
import requests

lstCAMP_IDs = []
for i in range(1, 500):
    url = f'http://www.camp.bicnirrh.res.in/seqDb.php?page={i}'
    r = requests.get(url)
    try:
        content = r.content.decode("utf-8")
    except Exception as e:
        print(e)
        print(i)
    ampAccnPrec = re.findall("(CAMPSQ[0-9]+)", content)
    if ampAccnPrec == None:
        print(i, "error ampAccnPrec")
        ampAccn = ""
        continue
    else:
        lstCAMP_IDs.append(ampAccnPrec)

Answer 1

Score: 0


The site requires cookies to traverse pages, so the best way to scrape it is to create a session. Also, you need to start at page 0, not page 1.

Here's the updated code:

import re
import requests

session = requests.session()
lstCAMP_IDs = []
for i in range(0, 500):
    url = f'http://www.camp.bicnirrh.res.in/seqDb.php?page={i}'
    r = session.get(url)

    try:
        content = r.content.decode('utf-8')
    except Exception as e:
        print(e)
        print(i)
        continue  # content is undefined here, so skip this page

    # re.findall() returns a list (empty on no match), never None
    ampAccnPrec = re.findall("CAMPSQ[0-9]+", content)
    if not ampAccnPrec:
        print(i, "error ampAccnPrec")
    else:
        lstCAMP_IDs.append(ampAccnPrec)
        print(lstCAMP_IDs)

These sequences came back from the first page, which is page 0:

[['CAMPSQ12088', 'CAMPSQ12088', 'CAMPSQ12089', 'CAMPSQ12089', 'CAMPSQ12090', 'CAMPSQ12090', 'CAMPSQ12091', 'CAMPSQ12091', 'CAMPSQ12092', 'CAMPSQ12092', 'CAMPSQ12093', 'CAMPSQ12093', 'CAMPSQ12094', 'CAMPSQ12094', 'CAMPSQ12095', 'CAMPSQ12095', 'CAMPSQ12096', 'CAMPSQ12096', 'CAMPSQ12097', 'CAMPSQ12097', 'CAMPSQ12098', 'CAMPSQ12098', 'CAMPSQ12099', 'CAMPSQ12099', 'CAMPSQ12100', 'CAMPSQ12100', 'CAMPSQ12101', 'CAMPSQ12101', 'CAMPSQ12102', 'CAMPSQ12102', 'CAMPSQ12103', 'CAMPSQ12103', 'CAMPSQ12104', 'CAMPSQ12104', 'CAMPSQ12105', 'CAMPSQ12105', 'CAMPSQ12106', 'CAMPSQ12106', 'CAMPSQ12107', 'CAMPSQ12107', 'CAMPSQ12108', 'CAMPSQ12108', 'CAMPSQ12345', 'CAMPSQ12345', 'CAMPSQ12346', 'CAMPSQ12346', 'CAMPSQ12347', 'CAMPSQ12347', 'CAMPSQ12348', 'CAMPSQ12348']]
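Note that each accession appears twice in the page markup, and `append` produces a list of per-page lists. If a flat, de-duplicated list is wanted, the result can be post-processed afterwards; a minimal sketch, using a small sample of the nested output shown above:

```python
# Sample of the nested result produced by lstCAMP_IDs.append(...),
# with the duplicate accessions the page markup contains.
pages = [['CAMPSQ12088', 'CAMPSQ12088', 'CAMPSQ12089'],
         ['CAMPSQ12089', 'CAMPSQ12090']]

# Flatten the per-page lists, then drop duplicates while preserving
# the order in which accessions first appeared (dict keys keep
# insertion order in Python 3.7+).
flat = [acc for page in pages for acc in page]
unique = list(dict.fromkeys(flat))
print(unique)  # ['CAMPSQ12088', 'CAMPSQ12089', 'CAMPSQ12090']
```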

huangapple
  • Published 2023-06-27 19:34:16
  • Please retain this link when reposting: https://go.coder-hub.com/76564438.html