Web scraper not collecting URLs


# Question

I am scraping [this website](https://www.judiciary.uk/prevention-of-future-death-reports/) to create a database.
I [have Python code that successfully did this](https://github.com/georgiarichards/georgiarichards.github.io/blob/master/data/Web_scraper_PFDs.ipynb), but the HTML structure has completely changed.
I am trying to update my code so I can refresh the database, but it can no longer locate the URLs.
This is my new code:

```python
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

base_url = 'https://www.judiciary.uk/prevention-of-future-death-reports/page/{}/'
page_count = 442

with requests.Session() as session:
    record_urls = []
    for page in tqdm(range(1, page_count + 1)):
        url = base_url.format(page)
        try:
            response = session.get(url)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            record_urls += [h5.a['href'] for h5 in soup.find_all('h5', {'class': 'entry-title'})]
        except (requests.exceptions.RequestException, ValueError, AttributeError) as e:
            print(f"无法处理第 {page} 页:{e}")
    print(f"已收集 {len(record_urls)} 个URL")

Could anyone advise which HTML element now contains the URLs, or how to find it, so that my code can collect the ~4420 URLs?
Thank you!

I used SelectorGadget and the browser's Inspect tool to locate the new structure. I tried swapping 'entry-title' for 'card__link' and then 'li', but neither helped me identify where the URLs now sit.
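
Besides eyeballing the markup, one way to locate the new structure programmatically is to tally the class names on every anchor of a single listing page; a class that appears once per listed report is a good selector candidate. A minimal diagnostic sketch (not part of the original question), assuming the same libraries used above:

```python
import requests
from bs4 import BeautifulSoup
from collections import Counter

# Page 1 of the same listing used by the scraper above.
url = 'https://www.judiciary.uk/prevention-of-future-death-reports/page/1/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Tally the class combinations carried by every anchor that has an href.
counts = Counter(' '.join(a.get('class', [])) for a in soup.find_all('a', href=True))
for classes, n in counts.most_common(10):
    print(n, repr(classes))
```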




# Answer 1
**Score**: 1

Change this line in your code from

```python
record_urls += [h5.a['href'] for h5 in soup.find_all('h5', {'class': 'entry-title'})]
```

to

```python
record_urls += [a['href'] for a in soup.find_all('a', {'class': 'card__link'})]
```

This should extract the URLs on all pages.

Full code:

```python
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

base_url = 'https://www.judiciary.uk/prevention-of-future-death-reports/page/{}/'
page_count = 442

with requests.Session() as session:
    record_urls = []
    for page in tqdm(range(1, page_count + 1)):
        url = base_url.format(page)
        try:
            response = session.get(url)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            record_urls += [a['href'] for a in soup.find_all('a', {'class': 'card__link'})]

        except (requests.exceptions.RequestException, ValueError, AttributeError) as e:
            print(f"Failed to process page {page}: {e}")

    print(f"Collected {len(record_urls)} URLs")
```

