6 February 2023 09:22:53

Scraping table of data from webpage with inconsistently nested html tags

Question

I am trying to scrape some data from the tables at https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/
Specifically, I want to scrape the 'Metropolitan tram' table. However, the HTML elements aren't structured well, and I am unsure how to identify the table by name and scrape its content.

This is what I have tried:

import requests
from bs4 import BeautifulSoup
URL = "https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
tables = soup.find_all("div", class_="mceTmpl table__wrapper")
for table in tables:
    print("NEXT-------------------------------------------")
    print(table, end="\n"*2)

Answer 1

Score: 2

You can use pandas.read_html() to scrape tables, which is the usual approach. It parses the page with lxml or BeautifulSoup under the hood and returns a list of DataFrames, from which you select your table by index.

Alternatively, use CSS selectors:

soup.select('h3:has(a[name="metrotram"]) + div > div:first-of-type tr')
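To see what that selector matches, here is a minimal sketch that runs it against a small inline HTML sample instead of the live page. The sample markup, the "metrotrain" anchor and the helper name rows_for_anchor are assumptions made up for illustration; only the selector and the "metrotram" anchor come from the answer, and the real page may be nested differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the page: each table follows an <h3>
# that contains a named anchor. The real page may differ.
SAMPLE_HTML = """
<h3><a name="metrotrain"></a>Metropolitan train</h3>
<div class="mceTmpl table__wrapper">
  <div><table>
    <tr><th></th><th>% timetable delivered</th></tr>
    <tr><td>placeholder day</td><td>n/a</td></tr>
  </table></div>
</div>
<h3><a name="metrotram"></a>Metropolitan tram</h3>
<div class="mceTmpl table__wrapper">
  <div><table>
    <tr><th></th><th>% timetable delivered</th><th>% services on-time at timing points</th></tr>
    <tr><td>Sunday, 5 February 2023</td><td>99.4%</td><td>83.3%</td></tr>
    <tr><td>Saturday, 4 February 2023</td><td>99.4%</td><td>81.8%</td></tr>
  </table></div>
</div>
"""


def rows_for_anchor(html, anchor_name):
    """Return the cell texts of every row in the table that follows the
    <h3> containing <a name=anchor_name> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    selector = f'h3:has(a[name="{anchor_name}"]) + div > div:first-of-type tr'
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.select(selector)
    ]


rows = rows_for_anchor(SAMPLE_HTML, "metrotram")
for row in rows:
    print(row)
```

Because the selector starts from the named anchor rather than a position in the page, it keeps working if tables are added or reordered, as long as the heading-then-wrapper layout stays the same.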

Example

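from io import StringIO

import pandas as pd
import requests

# Fetch the page, sending an explicit user agent with the request
html = requests.get(
    'https://www.ptv.vic.gov.au/footer/data-and-reporting/network-performance/daily-performance/',
    headers={'user-agent': 'some agent'}
).text

# Newer pandas versions expect a file-like object rather than a raw HTML string.
# read_html returns a list of DataFrames; index 1 is the tram table shown below.
pd.read_html(StringIO(html), header=0)[1]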

Output

	Unnamed: 0	% timetable delivered	% services on-time at timing points
0	Sunday, 5 February 2023	99.4%	83.3%
1	Saturday, 4 February 2023	99.4%	81.8%
2	Friday, 3 February 2023	98.4%	79.7%
3	Thursday, 2 February 2023	97.9%	72.8%
4	Wednesday, 1 February 2023	98.9%	79.1%
5	Tuesday, 31 January 2023	99.0%	81.4%
6	Monday, 30 January 2023	99.3%	90.2%
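The values in that output are strings, and the date column has no header. As a rough sketch of one possible clean-up step (not part of the original answer), the snippet below rebuilds the first two rows of the output by hand, renames the unnamed column to "date" (a name chosen here for illustration) and converts the percentages to floats.

```python
import pandas as pd

# First two rows copied from the output above, rebuilt by hand so the
# snippet runs without network access.
df = pd.DataFrame({
    "Unnamed: 0": ["Sunday, 5 February 2023", "Saturday, 4 February 2023"],
    "% timetable delivered": ["99.4%", "99.4%"],
    "% services on-time at timing points": ["83.3%", "81.8%"],
})

# Give the date column a real name ("date" is an arbitrary choice)
df = df.rename(columns={"Unnamed: 0": "date"})

# Parse strings such as "Sunday, 5 February 2023" into timestamps
df["date"] = pd.to_datetime(df["date"], format="%A, %d %B %Y")

# Strip the trailing "%" and convert the remaining columns to floats
for col in df.columns[1:]:
    df[col] = df[col].str.rstrip("%").astype(float)

print(df.dtypes)
```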
