Web scraper not collecting URLs

# Question

I am scraping [this website](https://www.judiciary.uk/prevention-of-future-death-reports/) to create a database.
I [have Python code that successfully did this](https://github.com/georgiarichards/georgiarichards.github.io/blob/master/data/Web_scraper_PFDs.ipynb), but the site's HTML structure has completely changed.
I am trying to update my code so I can refresh the database, but it can no longer locate the URLs.
This is my new code:

```python
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

base_url = 'https://www.judiciary.uk/prevention-of-future-death-reports/page/{}/'
page_count = 442

with requests.Session() as session:
    record_urls = []
    for page in tqdm(range(1, page_count + 1)):
        url = base_url.format(page)
        try:
            response = session.get(url)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            # This selector matched the old page layout and now returns nothing.
            record_urls += [h5.a['href'] for h5 in soup.find_all('h5', {'class': 'entry-title'})]
        except (requests.exceptions.RequestException, ValueError, AttributeError) as e:
            print(f"Failed to process page {page}: {e}")
    print(f"Collected {len(record_urls)} URLs")
```

Could anyone advise which HTML element contains the URLs, or how to find it, so my code can collect the ~4,420 URLs?
Thank you!

I used SelectorGadget and the browser's Inspect tool to locate the new structure. I tried swapping 'entry-title' for 'card__link' and then 'li', but neither helped me identify where the URLs now sit.
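For cases like this, it can also help to survey a page's links programmatically rather than by eye. A minimal diagnostic sketch (my addition, not part of the original scraper; it assumes page 1 of the listing follows the `base_url` pattern above):

```python
import requests
from bs4 import BeautifulSoup
from collections import Counter

url = 'https://www.judiciary.uk/prevention-of-future-death-reports/page/1/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Tally the class string of every <a> tag that carries an href, so the
# selector actually holding the report links stands out in the output.
classes = Counter(' '.join(a.get('class', [])) for a in soup.find_all('a', href=True))
for cls, count in classes.most_common():
    print(f"{count:3d}  a.{cls or '(no class)'}")
```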




# Answer 1

**Score**: 1

Change this line in your code from:

```python
record_urls += [h5.a['href'] for h5 in soup.find_all('h5', {'class': 'entry-title'})]
```

to:

```python
record_urls += [a['href'] for a in soup.find_all('a', {'class': 'card__link'})]
```

This should extract the URLs on all pages.

Full code:

```python
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

base_url = 'https://www.judiciary.uk/prevention-of-future-death-reports/page/{}/'
page_count = 442

with requests.Session() as session:
    record_urls = []
    for page in tqdm(range(1, page_count + 1)):
        url = base_url.format(page)
        try:
            response = session.get(url)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            # The report links now sit on <a class="card__link"> elements.
            record_urls += [a['href'] for a in soup.find_all('a', {'class': 'card__link'})]
        except (requests.exceptions.RequestException, ValueError, AttributeError) as e:
            print(f"Failed to process page {page}: {e}")

    print(f"Collected {len(record_urls)} URLs")
```


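A possible follow-up once the URLs are collected (an assumption on my part; the hrefs may already be absolute and unique): normalise and de-duplicate the list before writing it to the database. A short sketch:

```python
from urllib.parse import urljoin

site_root = 'https://www.judiciary.uk/'

# urljoin leaves already-absolute URLs untouched, so this is safe either way;
# dict.fromkeys de-duplicates while preserving first-seen order.
unique_urls = list(dict.fromkeys(urljoin(site_root, href) for href in record_urls))
print(f"{len(unique_urls)} unique URLs")
```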

Tags: beautifulsoup, html, python, url, web-scraping