Web scraping with Python. Get href from "a" elements

Question


With the following code I can get all the data from the specified range of pages at the given URL:

import pandas as pd

F, L = 1, 2 # first and last pages

dico = {}
for page in range(F, L+1):
    url = f'https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page={page}&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    #sub_df.insert(0, "page_number", page)
    sub_df.insert(1, "Year", "AT")
    sub_df.insert(2, "Ind_Out", "I")
    sub_df.insert(3, "Gender", "M")
    sub_df.insert(4, "Event", "MILLA")
    sub_df.insert(5, "L_N", "L")
    dico[page] = sub_df
    
out = pd.concat(dico, ignore_index=True)
out.to_csv('WA_AT_I_M_MILLA_L.csv')

But I also need to get each athlete's code (the "Competitor" field).

How could I insert a field with the href of each competitor?

Answer 1

Score: 1


I'm not really sure why you're doing everything you're doing in your code, but to get the table on that page with an additional column for the competitor code from the link, I would do this (in this example, just for the first page, but you can obviously extend it):

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = "https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page=1&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17"
req = requests.get(url)

# this gets you the whole table, as is:
sub_df = pd.read_html(req.text)[0]

# we need this to extract the codes:
soup = bs(req.text, "html.parser")
codes = [comp['href'].split('=')[1] for comp in soup.select('table.records-table td[data-th="Competitor"] a')]

# we then insert the codes as a new column in the df
sub_df.insert(3, 'Code', codes)

You should now have a new "Code" column right after "Competitor". You can drop whatever columns you don't want, add other columns, and so on.
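To extend this to all pages, here is a minimal sketch that folds the extraction into the paging loop from the question. It assumes every row's Competitor cell contains exactly one link (so the extracted codes line up one-to-one with the table rows) and that the table structure and href format are the same on every page. Inserting 'Code' before the other columns keeps it right after 'Competitor' once those columns shift everything to the right:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

F, L = 1, 2  # first and last pages, as in the question

dico = {}
for page in range(F, L + 1):
    url = f'https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page={page}&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17'
    req = requests.get(url)

    # parse the table from the already-fetched HTML instead of downloading it twice
    sub_df = pd.read_html(req.text, parse_dates=True)[0]

    # pull the competitor codes out of the same response
    soup = bs(req.text, "html.parser")
    codes = [comp['href'].split('=')[1]
             for comp in soup.select('table.records-table td[data-th="Competitor"] a')]

    # insert the codes first, so 'Code' still ends up right after 'Competitor'
    # after the label columns below shift the later columns to the right
    sub_df.insert(3, 'Code', codes)

    # the label columns from the question, unchanged
    sub_df.insert(1, "Year", "AT")
    sub_df.insert(2, "Ind_Out", "I")
    sub_df.insert(3, "Gender", "M")
    sub_df.insert(4, "Event", "MILLA")
    sub_df.insert(5, "L_N", "L")
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
out.to_csv('WA_AT_I_M_MILLA_L.csv')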
