Web scraping: how to access the tbody of this table with Python?

Question
In my first try with web scraping with Python I'm trying to extract the content from this webpage:
https://www.sixnationsrugby.com/report/conway-at-the-double-as-ireland-defeat-wales-in-dublin#match-stats.
In particular, I'm trying to take the set of numbers of the table.
I've tried the most common solutions with Pandas and BeautifulSoup, but I can't access the tr tags in the tbody.
Can someone tell me what I'm doing wrong?
Here's the code I tried:
[code attached as a screenshot; not reproduced here]
Answer 1

Score: 2
The table data is not contained at that URL; it is fetched from elsewhere by your web browser.
You can use your web browser's Network Tools to search through the requests; here we've searched for the first player in the table.
That search returns https://g9u7p6f6.ssl.hwcdn.net/api/custom/statsPlayer/fixture/21IW8573?lang=en_GB as the source of the table data, which is in JSON format.
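This is also why the attempt in the question fails: the HTML that requests downloads contains only the table shell, and the rows are injected client-side. A minimal illustration (the markup below is made up; only the idea of an empty tbody matches the real page):

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for what the server actually sends: a table
# shell whose <tbody> has no rows until JavaScript fills it in.
html = '<table id="match-stats"><tbody></tbody></table>'
soup = BeautifulSoup(html, 'html.parser')

rows = soup.select('#match-stats tbody tr')
print(rows)  # [] - no <tr> elements for BeautifulSoup to find
```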
The URL is assembled from information contained in the HTML:
var STATS_API_VARS = {"base_url":"https:\/\/g9u7p6f6.ssl.hwcdn.net\/api\/"};
data-article-fixguid="21IW8573"
You can extract the URL using string slicing:
start = r.text.find('var STATS_API_VARS')
end = r.text.find('}', start) + 1
start = r.text.find('{', start)
r.text[start:end]
# '{"base_url":"https:\\/\\/g9u7p6f6.ssl.hwcdn.net\\/api\\/"}'
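The slicing can be checked offline against a stand-in for r.text (the STATS_API_VARS assignment is copied from the snippet above; the surrounding text is invented):

```python
import json

# Stand-in for r.text: the real STATS_API_VARS line embedded in made-up
# surrounding JavaScript.
text = ('window.foo = 1; '
        'var STATS_API_VARS = {"base_url":"https:\\/\\/g9u7p6f6.ssl.hwcdn.net\\/api\\/"}; '
        'window.bar = 2;')

# Same three-step slice as above: find the assignment, then the closing
# brace after it, then the opening brace.
start = text.find('var STATS_API_VARS')
end = text.find('}', start) + 1
start = text.find('{', start)

base_url = json.loads(text[start:end])['base_url']
print(base_url)  # https://g9u7p6f6.ssl.hwcdn.net/api/
```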
This can be loaded using the json module:
>>> json.loads(r.text[start:end])['base_url']
'https://g9u7p6f6.ssl.hwcdn.net/api/'
The fixguid can be extracted with BeautifulSoup:
>>> soup.find(attrs={'data-article-fixguid': True})['data-article-fixguid']
'21IW8573'
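This too can be tried without the network, on a made-up element carrying the attribute shown above (only the data-article-fixguid attribute matches the real page):

```python
from bs4 import BeautifulSoup

# Made-up element; the tag name and class are invented, the attribute
# name and value come from the page source quoted earlier.
html = '<article class="report" data-article-fixguid="21IW8573"><p>...</p></article>'
soup = BeautifulSoup(html, 'html.parser')

# Find any element that has the attribute, then read its value.
fixguid = soup.find(attrs={'data-article-fixguid': True})['data-article-fixguid']
print(fixguid)  # 21IW8573
```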
Combining these steps:

import json

import requests
from bs4 import BeautifulSoup

game_url = 'https://www.sixnationsrugby.com/report/conway-at-the-double-as-ireland-defeat-wales-in-dublin#match-stats'
r = requests.get(game_url)
soup = BeautifulSoup(r.text, 'html.parser')

# locate the STATS_API_VARS JSON in the page source
start = r.text.find('var STATS_API_VARS')
end = r.text.find('}', start) + 1
start = r.text.find('{', start)

api_url = json.loads(r.text[start:end])['base_url']
stats = 'custom/statsPlayer/fixture/'
fixguid = soup.find(attrs={'data-article-fixguid': True})['data-article-fixguid']
player_stats_table = requests.get(api_url + stats + fixguid).json()
An example of loading a subset of the data:
import pandas as pd

pd.json_normalize(
    player_stats_table['data']['playerStats']['playerStatistics']['teamA']
).filter(regex='player.*Name')
player.firstName player.lastName
0 Hugo Keenan
1 Andrew Conway
2 Garry Ringrose
3 Bundee Aki
4 Mack Hansen
5 Johnny Sexton
6 Jamison Gibson-Park
7 Andrew Porter
8 Ronan Kelleher
9 Tadhg Furlong
10 Tadhg Beirne
11 James Ryan
12 Caelan Doris
13 Josh van der Flier
14 Jack Conan
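The json_normalize / filter step can also be seen in isolation on a hand-made fragment shaped like the teamA list (the nested layout is inferred from the column names in the output above; the extra number field is invented):

```python
import pandas as pd

# Hand-made fragment mimicking the API's teamA list: each entry holds a
# nested "player" dict, which json_normalize flattens into dotted columns.
team_a = [
    {"player": {"firstName": "Hugo", "lastName": "Keenan", "number": 15}},
    {"player": {"firstName": "Andrew", "lastName": "Conway", "number": 14}},
]

# filter(regex=...) keeps only the columns whose names match the pattern,
# so player.number is dropped.
df = pd.json_normalize(team_a).filter(regex='player.*Name')
print(df.columns.tolist())  # ['player.firstName', 'player.lastName']
```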