Scraping a table of data from a webpage with inconsistently nested HTML tags
Question
I am trying to scrape data from the tables at https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/
Specifically, I want to scrape the 'Metropolitan tram' table. However, the HTML elements are not well structured, and I am unsure how to identify the table by name and scrape its contents.
This is what I have tried:
import requests
from bs4 import BeautifulSoup

URL = "https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# Every table on the page sits inside a wrapper div with these classes
tables = soup.find_all("div", class_="mceTmpl table__wrapper")
for table in tables:
    print("NEXT-------------------------------------------")
    print(table, end="\n"*2)
Answer 1
Score: 2
You can use pandas.read_html() to scrape tables. It is a common approach: it parses every <table> on the page (using lxml or BeautifulSoup under the hood) and returns a list of DataFrames, from which you select your table by index.
Alternatively, use CSS selectors:
soup.select('h3:has(a[name="metrotram"]) + div > div:first-of-type tr')
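As a minimal sketch of how that selector could be turned into rows of cell text, the snippet below runs it against a small inline HTML sample rather than the live page. The sample markup, its placeholder values, and the helper name extract_rows are assumptions made for illustration; the real page may nest its elements differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the selector expects:
# an <h3> holding a named anchor, followed by a <div> that wraps the table.
SAMPLE_HTML = """
<h3><a name="metrotrain"></a>Metropolitan train</h3>
<div>
  <div class="mceTmpl table__wrapper">
    <table>
      <tr><th>Date</th><th>% timetable delivered</th></tr>
      <tr><td>Day 1</td><td>11.1%</td></tr>
    </table>
  </div>
</div>
<h3><a name="metrotram"></a>Metropolitan tram</h3>
<div>
  <div class="mceTmpl table__wrapper">
    <table>
      <tr><th>Date</th><th>% timetable delivered</th></tr>
      <tr><td>Day 1</td><td>22.2%</td></tr>
    </table>
  </div>
</div>
"""


def extract_rows(html, anchor_name):
    """Return the cell text of each row in the table that follows the named anchor."""
    soup = BeautifulSoup(html, "html.parser")
    selector = f'h3:has(a[name="{anchor_name}"]) + div > div:first-of-type tr'
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in soup.select(selector)
    ]


rows = extract_rows(SAMPLE_HTML, "metrotram")
for row in rows:
    print(row)
```

Selecting by the anchor name instead of a positional index means the lookup should keep working if tables are added or reordered, as long as the heading structure stays the same.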
Example
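from io import StringIO

import pandas as pd
import requests

URL = 'https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/'
# Send an explicit user-agent header, since some sites reject requests without one
response = requests.get(URL, headers={'user-agent': 'some agent'})
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string.
# read_html returns one DataFrame per <table>; index 1 selects the second table on the page.
pd.read_html(StringIO(response.text), header=0)[1]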
Output
| | Unnamed: 0 | % timetable delivered | % services on-time at timing points |
|---|---|---|---|
| 0 | Sunday, 5 February 2023 | 99.4% | 83.3% |
| 1 | Saturday, 4 February 2023 | 99.4% | 81.8% |
| 2 | Friday, 3 February 2023 | 98.4% | 79.7% |
| 3 | Thursday, 2 February 2023 | 97.9% | 72.8% |
| 4 | Wednesday, 1 February 2023 | 98.9% | 79.1% |
| 5 | Tuesday, 31 January 2023 | 99.0% | 81.4% |
| 6 | Monday, 30 January 2023 | 99.3% | 90.2% |
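The percentages in this output look like strings such as "99.4%", so they would probably need converting before any numeric work. The following is a rough sketch of one way to do that, assuming the columns do come back as strings; the DataFrame is rebuilt by hand from the first two rows above so the snippet runs without network access, and the column name "Date" is a replacement chosen here, not something the page provides.

```python
import pandas as pd

# First two rows of the output above, rebuilt by hand for illustration
df = pd.DataFrame({
    "Unnamed: 0": ["Sunday, 5 February 2023", "Saturday, 4 February 2023"],
    "% timetable delivered": ["99.4%", "99.4%"],
    "% services on-time at timing points": ["83.3%", "81.8%"],
})

# Give the unnamed date column a readable name (our own choice of label)
df = df.rename(columns={"Unnamed: 0": "Date"})

# Strip the trailing "%" and convert every percentage column to float
percent_columns = [col for col in df.columns if col.startswith("%")]
for col in percent_columns:
    df[col] = df[col].str.rstrip("%").astype(float)

print(df.dtypes)
```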