Web scraping: how to access the tbody of this table with Python?

Question

In my first attempt at web scraping with Python, I'm trying to extract content from this webpage:
https://www.sixnationsrugby.com/report/conway-at-the-double-as-ireland-defeat-wales-in-dublin#match-stats
In particular, I want the set of numbers in the table.

I've tried the most common solutions with Pandas and BeautifulSoup, but I can't access the tr tags inside the tbody.
Can someone tell me what I'm doing wrong?

Here's the code I tried:
[image of the attempted code; not available]

Answer 1

Score: 2

The table data is not contained in the page at that URL; your web browser fetches it from elsewhere.

You can use your web browser's network tools to search through the requests it makes. Searching for the first player in the table returns https://g9u7p6f6.ssl.hwcdn.net/api/custom/statsPlayer/fixture/21IW8573?lang=en_GB as the source of the table data, which is in JSON format.

The URL is assembled from information contained in the HTML:

var STATS_API_VARS = {"base_url":"https:\/\/g9u7p6f6.ssl.hwcdn.net\/api\/"};
data-article-fixguid="21IW8573"

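You can extract the base URL using string slicing:

# r is the response returned by requests.get() for the article page
start = r.text.find('var STATS_API_VARS')  # position of the variable name
end = r.text.find('}', start) + 1          # just past the closing brace of its value
start = r.text.find('{', start)            # opening brace of its value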

r.text[start:end]
# '{"base_url":"https:\\/\\/g9u7p6f6.ssl.hwcdn.net\\/api\\/"}'

Which can be loaded using the json module:

>>> json.loads(r.text[start:end])['base_url']
'https://g9u7p6f6.ssl.hwcdn.net/api/'
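
The slicing above assumes the variable is present and that the first `}` after it closes the object. A slightly more defensive variant can be sketched with a regular expression and only the standard library. The function name `extract_base_url` and the `SAMPLE_HTML` string are illustrative; the sample is just the single line of page source quoted earlier, not the full page.

```python
import json
import re

# One line of page source, as quoted in the answer above.
SAMPLE_HTML = r'var STATS_API_VARS = {"base_url":"https:\/\/g9u7p6f6.ssl.hwcdn.net\/api\/"};'

# Capture the (non-nested) object literal assigned to STATS_API_VARS.
_STATS_VARS_RE = re.compile(r'var\s+STATS_API_VARS\s*=\s*(\{.*?\})\s*;', re.DOTALL)


def extract_base_url(html):
    """Return the API base URL embedded in the page, or None if it is absent."""
    match = _STATS_VARS_RE.search(html)
    if match is None:
        return None
    # json.loads turns the escaped "\/" sequences back into plain "/".
    return json.loads(match.group(1)).get("base_url")


print(extract_base_url(SAMPLE_HTML))
```

Returning `None` rather than raising lets the caller decide how to handle a page whose layout has changed.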

The fixguid can be extracted with BeautifulSoup:

>>> soup.find(attrs={'data-article-fixguid': True})['data-article-fixguid']
'21IW8573'

Combining these steps, you can build the request for the table data:

...

game_url = 'https://www.sixnationsrugby.com/report/conway-at-the-double-as-ireland-defeat-wales-in-dublin#match-stats'

r = requests.get(game_url)

...

api_url = json.loads(r.text[start:end])['base_url']
stats = 'custom/statsPlayer/fixture/'
fixguid = soup.find(attrs={'data-article-fixguid': True})['data-article-fixguid']

player_stats_table = requests.get(api_url + stats + fixguid).json()
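
For reference, the pieces can be put together as one self-contained sketch. It assumes the page still embeds `STATS_API_VARS` and a `data-article-fixguid` attribute as shown above. The function names `build_stats_url` and `fetch_player_stats`, the `html.parser` backend, the timeouts and the error handling are choices made for this illustration and are not part of the original answer.

```python
import json

import requests
from bs4 import BeautifulSoup

GAME_URL = (
    "https://www.sixnationsrugby.com/report/"
    "conway-at-the-double-as-ireland-defeat-wales-in-dublin#match-stats"
)
# Path observed in the browser's network tools (see above).
STATS_PATH = "custom/statsPlayer/fixture/"


def build_stats_url(html):
    """Assemble the stats API URL from the article page's HTML."""
    # Slice out the object literal assigned to STATS_API_VARS.
    marker = html.find("var STATS_API_VARS")
    if marker == -1:
        raise ValueError("STATS_API_VARS not found in page")
    start = html.find("{", marker)
    end = html.find("}", start) + 1
    base_url = json.loads(html[start:end])["base_url"]

    # The fixture id is stored in a data attribute on one of the page's tags.
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(attrs={"data-article-fixguid": True})
    if tag is None:
        raise ValueError("data-article-fixguid not found in page")
    return base_url + STATS_PATH + tag["data-article-fixguid"]


def fetch_player_stats(game_url=GAME_URL):
    """Download the article page, then the JSON behind its stats table."""
    page = requests.get(game_url, timeout=30)
    page.raise_for_status()
    stats = requests.get(build_stats_url(page.text), timeout=30)
    stats.raise_for_status()
    return stats.json()
```

Calling `fetch_player_stats()` performs two HTTP requests and returns the parsed JSON. Keeping `build_stats_url` separate means the URL logic can be checked against saved HTML without touching the network.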

An example of loading a subset of the data:

pd.json_normalize(
   player_stats_table['data']['playerStats']['playerStatistics']['teamA']
).filter(regex='player.*Name')
   player.firstName player.lastName
0              Hugo          Keenan
1            Andrew          Conway
2             Garry        Ringrose
3            Bundee             Aki
4              Mack          Hansen
5            Johnny          Sexton
6           Jamison     Gibson-Park
7            Andrew          Porter
8             Ronan        Kelleher
9             Tadhg         Furlong
10            Tadhg          Beirne
11            James            Ryan
12           Caelan           Doris
13             Josh   van der Flier
14             Jack           Conan

huangapple
  • Posted on May 14, 2023 at 00:42:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76243862.html