Web scraping with Python. Get href from "a" elements
Question
With the following code I can get all data from the noted number of pages at the given URL:
import pandas as pd

F, L = 1, 2  # first and last pages
dico = {}
for page in range(F, L + 1):
    url = f'https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page={page}&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    # sub_df.insert(0, "page_number", page)
    sub_df.insert(1, "Year", "AT")
    sub_df.insert(2, "Ind_Out", "I")
    sub_df.insert(3, "Gender", "M")
    sub_df.insert(4, "Event", "MILLA")
    sub_df.insert(5, "L_N", "L")
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
out.to_csv('WA_AT_I_M_MILLA_L.csv')
But I also need to get each athlete's code (field "Competitor").
How could I insert a field with the href of each competitor?
Answer 1
Score: 1
I'm not really sure why you're doing everything you're doing in your code, but to get the table on that page with an additional column for the competitor code from the link, I would do this (in this example, just for the first page, but you can obviously extend it):
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = "https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page=1&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17"
req = requests.get(url)

# this gets you the whole table, as is:
sub_df = pd.read_html(req.text)[0]

# we need this to extract the codes:
soup = bs(req.text, "html.parser")
codes = [comp['href'].split('=')[1] for comp in soup.select('table.records-table td[data-th="Competitor"] a')]

# we then insert the codes as a new column in the df
sub_df.insert(3, 'Code', codes)
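The comprehension keeps whatever follows the first "=" in each link's href. A minimal illustration of that string operation, using a hypothetical href shape (the actual athlete URLs on the page may differ):

# hypothetical href for illustration only; real links may be shaped differently
href = 'https://www.worldathletics.org/athletes/example?competitor=14208000'
code = href.split('=')[1]  # '14208000'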
You should now have a new column right after "Competitor". You can drop whatever columns you don't want, add other columns, and so on.
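To cover all pages, the same idea folds into the loop from the question. A minimal sketch, assuming every page uses the same table markup and that the table rows stay aligned with the "Competitor" links; the output filename is an illustrative variant of the original:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

F, L = 1, 2  # first and last pages
dico = {}
for page in range(F, L + 1):
    url = f'https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page={page}&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17'
    req = requests.get(url)
    # one request per page: parse the table and the links from the same HTML
    sub_df = pd.read_html(req.text, parse_dates=True)[0]
    soup = bs(req.text, "html.parser")
    codes = [a['href'].split('=')[1]
             for a in soup.select('table.records-table td[data-th="Competitor"] a')]
    sub_df.insert(3, 'Code', codes)
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
out.to_csv('WA_AT_I_M_MILLA_L_codes.csv')  # illustrative filename

The extra metadata columns from the question ("Year", "Ind_Out", and so on) can be added back with the same sub_df.insert calls inside the loop.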