Web scraper not collecting URLs
# Question
I am scraping [this website](https://www.judiciary.uk/prevention-of-future-death-reports/) to create a database.

I [have Python code that successfully did this](https://github.com/georgiarichards/georgiarichards.github.io/blob/master/data/Web_scraper_PFDs.ipynb), but the HTML structure has completely changed.

I am trying to update my code so I can refresh the database, but it can no longer locate the URLs.

This is my new code:
```python
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

base_url = 'https://www.judiciary.uk/prevention-of-future-death-reports/page/{}/'
page_count = 442

with requests.Session() as session:
    record_urls = []
    for page in tqdm(range(1, page_count + 1)):
        url = base_url.format(page)
        try:
            response = session.get(url)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            record_urls += [h5.a['href'] for h5 in soup.find_all('h5', {'class': 'entry-title'})]
        except (requests.exceptions.RequestException, ValueError, AttributeError) as e:
            print(f"Failed to process page {page}: {e}")

print(f"Collected {len(record_urls)} URLs")
```
Could anyone advise which HTML element contains the URLs, or how to find it, so my code can collect the ~4,420 URLs?

Thank you!

I used SelectorGadget and the browser's Inspect tool to locate the new structure. I tried swapping 'entry-title' for 'card__link' and then 'li', but neither helped me identify where the URLs now sit.
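For reference, one way to survey candidate selectors from code, rather than clicking through the inspector, is to tally the `class` attributes of every anchor on a single listing page. A minimal diagnostic sketch, assuming the listing links are rendered server-side rather than injected by JavaScript:

```python
import requests
from bs4 import BeautifulSoup
from collections import Counter

# Fetch one listing page and count how often each anchor class appears;
# with ~10 report links per page, the class appearing about that often
# is likely the one carrying the report URLs.
url = 'https://www.judiciary.uk/prevention-of-future-death-reports/page/1/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

classes = Counter(
    ' '.join(a.get('class', [])) for a in soup.find_all('a', href=True)
)
for cls, count in classes.most_common():
    print(f'{count:3d}  {cls or "(no class)"}')
```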
# Answer 1

**Score**: 1

Change this line in your code:

```python
record_urls += [h5.a['href'] for h5 in soup.find_all('h5', {'class': 'entry-title'})]
```

to:

```python
record_urls += [a['href'] for a in soup.find_all('a', {'class': 'card__link'})]
```

This should extract the URLs on all pages.
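Equivalently, you can use a CSS selector; `soup.select` is standard BeautifulSoup, and the `[href]` guard skips any matching anchors that lack a link:

```python
record_urls += [a['href'] for a in soup.select('a.card__link[href]')]
```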
Full code:
```python
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

base_url = 'https://www.judiciary.uk/prevention-of-future-death-reports/page/{}/'
page_count = 442

with requests.Session() as session:
    record_urls = []
    for page in tqdm(range(1, page_count + 1)):
        url = base_url.format(page)
        try:
            response = session.get(url)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            record_urls += [a['href'] for a in soup.find_all('a', {'class': 'card__link'})]
        except (requests.exceptions.RequestException, ValueError, AttributeError) as e:
            print(f"Failed to process page {page}: {e}")

print(f"Collected {len(record_urls)} URLs")
```