Web scraping: how to access the tbody of this table with Python?

Question
In my first try with web scraping with Python I'm trying to extract the content from this webpage:
https://www.sixnationsrugby.com/report/conway-at-the-double-as-ireland-defeat-wales-in-dublin#match-stats.
In particular, I'm trying to take the set of numbers of the table.
I've tried the most common solutions with Pandas and BeautifulSoup, but I can't access the tr tags in the tbody.
Can someone tell me what I'm doing wrong?
Here's the code I tried:
[code attached as a screenshot; not reproduced here]
Answer 1

Score: 2
The table data is not contained at that URL; it is fetched from elsewhere by your web browser.
You can use your web browser's Network Tools to search through the requests; here we've searched for the first player in the table.
That search returns https://g9u7p6f6.ssl.hwcdn.net/api/custom/statsPlayer/fixture/21IW8573?lang=en_GB as the source of the table data, which is in JSON format.
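This is also why the attempt in the question fails: the HTML that requests downloads contains only the table shell, and the rows are injected client-side. A minimal illustration (the markup below is made up; only the idea of an empty tbody matches the real page):

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for what the server actually sends: a table
# shell whose <tbody> has no rows until JavaScript fills it in.
html = '<table id="match-stats"><tbody></tbody></table>'
soup = BeautifulSoup(html, 'html.parser')

rows = soup.select('#match-stats tbody tr')
print(rows)  # [] - no <tr> elements for BeautifulSoup to find
```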
The URL is assembled from information contained in the HTML:
var STATS_API_VARS = {"base_url":"https:\/\/g9u7p6f6.ssl.hwcdn.net\/api\/"};
data-article-fixguid="21IW8573"
You can extract the URL using string slicing:
start = r.text.find('var STATS_API_VARS')
end = r.text.find('}', start) + 1
start = r.text.find('{', start)
r.text[start:end]
# '{"base_url":"https:\\/\\/g9u7p6f6.ssl.hwcdn.net\\/api\\/"}'
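The slicing can be checked offline against a stand-in for r.text (the STATS_API_VARS assignment is copied from the snippet above; the surrounding text is invented):

```python
import json

# Stand-in for r.text: the real STATS_API_VARS line embedded in made-up
# surrounding JavaScript.
text = ('window.foo = 1; '
        'var STATS_API_VARS = {"base_url":"https:\\/\\/g9u7p6f6.ssl.hwcdn.net\\/api\\/"}; '
        'window.bar = 2;')

# Same three-step slice as above: find the assignment, then the closing
# brace after it, then the opening brace.
start = text.find('var STATS_API_VARS')
end = text.find('}', start) + 1
start = text.find('{', start)

base_url = json.loads(text[start:end])['base_url']
print(base_url)  # https://g9u7p6f6.ssl.hwcdn.net/api/
```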
This can be loaded using the json module:
>>> json.loads(r.text[start:end])['base_url']
'https://g9u7p6f6.ssl.hwcdn.net/api/'
The fixguid can be extracted with BeautifulSoup:
>>> soup.find(attrs={'data-article-fixguid': True})['data-article-fixguid']
'21IW8573'
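This too can be tried without the network, on a made-up element carrying the attribute shown above (only the data-article-fixguid attribute matches the real page):

```python
from bs4 import BeautifulSoup

# Made-up element; the tag name and class are invented, the attribute
# name and value come from the page source quoted earlier.
html = '<article class="report" data-article-fixguid="21IW8573"><p>...</p></article>'
soup = BeautifulSoup(html, 'html.parser')

# Find any element that has the attribute, then read its value.
fixguid = soup.find(attrs={'data-article-fixguid': True})['data-article-fixguid']
print(fixguid)  # 21IW8573
```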
Combining these steps:

import json

import requests
from bs4 import BeautifulSoup

game_url = 'https://www.sixnationsrugby.com/report/conway-at-the-double-as-ireland-defeat-wales-in-dublin#match-stats'
r = requests.get(game_url)
soup = BeautifulSoup(r.text, 'html.parser')

# locate the STATS_API_VARS JSON in the page source
start = r.text.find('var STATS_API_VARS')
end = r.text.find('}', start) + 1
start = r.text.find('{', start)

api_url = json.loads(r.text[start:end])['base_url']
stats = 'custom/statsPlayer/fixture/'
fixguid = soup.find(attrs={'data-article-fixguid': True})['data-article-fixguid']
player_stats_table = requests.get(api_url + stats + fixguid).json()
An example of loading a subset of the data:
import pandas as pd

pd.json_normalize(
    player_stats_table['data']['playerStats']['playerStatistics']['teamA']
).filter(regex='player.*Name')
player.firstName player.lastName
0 Hugo Keenan
1 Andrew Conway
2 Garry Ringrose
3 Bundee Aki
4 Mack Hansen
5 Johnny Sexton
6 Jamison Gibson-Park
7 Andrew Porter
8 Ronan Kelleher
9 Tadhg Furlong
10 Tadhg Beirne
11 James Ryan
12 Caelan Doris
13 Josh van der Flier
14 Jack Conan
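The json_normalize / filter step can also be seen in isolation on a hand-made fragment shaped like the teamA list (the nested layout is inferred from the column names in the output above; the extra number field is invented):

```python
import pandas as pd

# Hand-made fragment mimicking the API's teamA list: each entry holds a
# nested "player" dict, which json_normalize flattens into dotted columns.
team_a = [
    {"player": {"firstName": "Hugo", "lastName": "Keenan", "number": 15}},
    {"player": {"firstName": "Andrew", "lastName": "Conway", "number": 14}},
]

# filter(regex=...) keeps only the columns whose names match the pattern,
# so player.number is dropped.
df = pd.json_normalize(team_a).filter(regex='player.*Name')
print(df.columns.tolist())  # ['player.firstName', 'player.lastName']
```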