Scraping a table of data from a webpage with inconsistently nested HTML tags

Question

I am trying to scrape some data from the tables at https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/
Specifically, I want to scrape the 'Metropolitan tram' table. However, the HTML elements aren't structured well, and I am unsure how to identify the table by name and scrape its content.

This is what I have tried:

import requests
from bs4 import BeautifulSoup

URL = "https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")


tables = soup.find_all("div", class_="mceTmpl table__wrapper")
for table in tables:
    print("NEXT-------------------------------------------")
    print(table, end="\n"*2)

Answer 1

Score: 2

You can use pandas.read_html() to scrape tables. It is a convenient, widely used approach that parses the HTML with lxml or BeautifulSoup under the hood and returns a list of DataFrames, from which you select your table by index.
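
Selecting by index can break if the page gains or reorders tables. Here is a minimal sketch of picking the table by its visible heading instead. It assumes each table follows an h3 heading, as the selector in this answer implies. The HTML fragment is a hypothetical stand-in for the real page (the train row holds placeholders, not real figures), and the helper name table_after_heading is my own, not part of any library.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical fragment imitating the assumed page layout: an <h3> heading
# followed by wrapper <div>s that contain the table.
SAMPLE_HTML = """
<h3><a name="metrotrain"></a>Metropolitan train</h3>
<div><div class="mceTmpl table__wrapper"><table>
  <tr><th></th><th>% timetable delivered</th></tr>
  <tr><td>placeholder date</td><td>n/a</td></tr>
</table></div></div>
<h3><a name="metrotram"></a>Metropolitan tram</h3>
<div><div class="mceTmpl table__wrapper"><table>
  <tr><th></th><th>% timetable delivered</th><th>% services on-time at timing points</th></tr>
  <tr><td>Sunday, 5 February 2023</td><td>99.4%</td><td>83.3%</td></tr>
  <tr><td>Saturday, 4 February 2023</td><td>99.4%</td><td>81.8%</td></tr>
</table></div></div>
"""


def table_after_heading(html, heading_text):
    """Return the first table that follows the <h3> containing heading_text."""
    soup = BeautifulSoup(html, "html.parser")
    # The heading holds an <a> plus text, so match on get_text() rather than .string.
    heading = soup.find(
        lambda tag: tag.name == "h3" and heading_text in tag.get_text()
    )
    if heading is None:
        raise ValueError(f"no <h3> containing {heading_text!r}")
    table = heading.find_next("table")
    if table is None:
        raise ValueError(f"no <table> after heading {heading_text!r}")
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, *body = rows
    header[0] = header[0] or "date"  # the first header cell is empty on the page
    return pd.DataFrame(body, columns=header)


tram = table_after_heading(SAMPLE_HTML, "Metropolitan tram")
print(tram)
```

On the live page you would pass the response text from requests.get() in place of SAMPLE_HTML, provided the real markup matches this assumed layout.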

Alternatively, use CSS selectors:

soup.select('h3:has(a[name="metrotram"]) + div > div:first-of-type tr')
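
To show what that selector returns, here is a small sketch that runs it against an inline fragment and collects the cell text of each row. The fragment is a hypothetical imitation of the page structure the selector implies (the train row holds placeholders), and rows_for_anchor is a name I made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: each <h3> carries a named anchor and is immediately
# followed by a <div> whose first child <div> wraps the table.
SAMPLE_HTML = """
<h3><a name="metrotrain"></a>Metropolitan train</h3>
<div>
  <div class="mceTmpl table__wrapper">
    <table>
      <tr><th></th><th>% timetable delivered</th></tr>
      <tr><td>placeholder date</td><td>n/a</td></tr>
    </table>
  </div>
</div>
<h3><a name="metrotram"></a>Metropolitan tram</h3>
<div>
  <div class="mceTmpl table__wrapper">
    <table>
      <tr><th></th><th>% timetable delivered</th><th>% services on-time at timing points</th></tr>
      <tr><td>Sunday, 5 February 2023</td><td>99.4%</td><td>83.3%</td></tr>
      <tr><td>Saturday, 4 February 2023</td><td>99.4%</td><td>81.8%</td></tr>
    </table>
  </div>
</div>
"""


def rows_for_anchor(html, anchor_name):
    """Return the cell text of every row in the table under the named anchor."""
    soup = BeautifulSoup(html, "html.parser")
    # h3 containing the anchor -> adjacent sibling div -> its first child div -> all rows
    selector = f'h3:has(a[name="{anchor_name}"]) + div > div:first-of-type tr'
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.select(selector)
    ]


tram_rows = rows_for_anchor(SAMPLE_HTML, "metrotram")
for row in tram_rows:
    print(row)
```

The first list is the header row; the remaining lists are the data rows, which you could feed into a DataFrame or a csv writer.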

Example

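from io import StringIO

import pandas as pd
import requests

URL = 'https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/'

# Send an explicit user agent with the request.
response = requests.get(URL, headers={'user-agent': 'some agent'})

# read_html returns one DataFrame per <table>; wrapping the markup in StringIO
# avoids the deprecation warning newer pandas versions emit for literal HTML strings.
tables = pd.read_html(StringIO(response.text), header=0)

# Index 1 was the 'Metropolitan tram' table when this answer was written.
tables[1]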

Output

                   Unnamed: 0  % timetable delivered  % services on-time at timing points
0     Sunday, 5 February 2023                  99.4%                                83.3%
1   Saturday, 4 February 2023                  99.4%                                81.8%
2     Friday, 3 February 2023                  98.4%                                79.7%
3   Thursday, 2 February 2023                  97.9%                                72.8%
4  Wednesday, 1 February 2023                  98.9%                                79.1%
5    Tuesday, 31 January 2023                  99.0%                                81.4%
6     Monday, 30 January 2023                  99.3%                                90.2%
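
The percentages in that output are strings such as "99.4%", and the date column has no real name. If you want to compute with the values, a possible clean-up step looks like the sketch below. The small DataFrame is rebuilt by hand from the first two output rows so the snippet runs without network access, and the column name "date" is my own choice.

```python
import pandas as pd

# Hand-built stand-in for the scraped table (first two rows of the output above).
df = pd.DataFrame(
    {
        "Unnamed: 0": ["Sunday, 5 February 2023", "Saturday, 4 February 2023"],
        "% timetable delivered": ["99.4%", "99.4%"],
        "% services on-time at timing points": ["83.3%", "81.8%"],
    }
)

# Give the unnamed first column a meaningful name and parse it as dates.
df = df.rename(columns={"Unnamed: 0": "date"})
df["date"] = pd.to_datetime(df["date"], format="%A, %d %B %Y")

# Strip the trailing percent sign and convert each percentage column to float.
pct_cols = [col for col in df.columns if col.startswith("%")]
for col in pct_cols:
    df[col] = df[col].str.rstrip("%").astype(float)

print(df.dtypes)
print(df)
```

The date format string assumes English day and month names, as shown in the output.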

huangapple
  • Posted on 6 February 2023, 09:22:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75356619.html