Web scraping with Python. Get href from "a" elements

Question


With the following code I can get all the data from the specified range of pages at the given URL:

import pandas as pd

F, L = 1, 2 # first and last pages

dico = {}
for page in range(F, L+1):
    url = f'https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page={page}&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    #sub_df.insert(0, "page_number", page)
    sub_df.insert(1, "Year", "AT")
    sub_df.insert(2, "Ind_Out", "I")
    sub_df.insert(3, "Gender", "M")
    sub_df.insert(4, "Event", "MILLA")
    sub_df.insert(5, "L_N", "L")
    dico[page] = sub_df
    
out = pd.concat(dico, ignore_index=True)
out.to_csv('WA_AT_I_M_MILLA_L.csv')

But I also need to get each athlete's code (the "Competitor" field).

How could I insert a field with the href of each competitor?

Answer 1

Score: 1


I'm not really sure why you're doing everything you're doing in your code, but to get the table on that page with an additional column for the competitor code from the link, I would do this (in this example, just for the first page, but you can obviously extend it):

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = "https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page=1&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17"
req = requests.get(url)

# this gets you the whole table, as is:
sub_df = pd.read_html(req.text)[0]

# we need this to extract the codes:
soup = bs(req.text, "html.parser")
codes = [comp['href'].split('=')[1] for comp in soup.select('table.records-table td[data-th="Competitor"] a')]

# we then insert the codes as a new column in the df
sub_df.insert(3, 'Code', codes)

You should now have a new "Code" column right after "Competitor". You can drop whatever columns you don't want, add other columns, and so on.
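To extend this to all pages, here is a minimal sketch that folds the extraction into the paging loop from the question. It assumes every row's Competitor cell contains exactly one link (so the extracted codes line up one-to-one with the table rows) and that the table structure and href format are the same on every page. Inserting 'Code' before the other columns keeps it right after 'Competitor' once those columns shift everything to the right:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

F, L = 1, 2  # first and last pages, as in the question

dico = {}
for page in range(F, L + 1):
    url = f'https://www.worldathletics.org/records/all-time-toplists/middle-long/one-mile/indoor/men/senior?regionType=world&page={page}&bestResultsOnly=false&oversizedTrack=regular&firstDay=1899-12-31&lastDay=2023-02-17'
    req = requests.get(url)

    # parse the table from the already-fetched HTML instead of downloading it twice
    sub_df = pd.read_html(req.text, parse_dates=True)[0]

    # pull the competitor codes out of the same response
    soup = bs(req.text, "html.parser")
    codes = [comp['href'].split('=')[1]
             for comp in soup.select('table.records-table td[data-th="Competitor"] a')]

    # insert the codes first, so 'Code' still ends up right after 'Competitor'
    # after the label columns below shift the later columns to the right
    sub_df.insert(3, 'Code', codes)

    # the label columns from the question, unchanged
    sub_df.insert(1, "Year", "AT")
    sub_df.insert(2, "Ind_Out", "I")
    sub_df.insert(3, "Gender", "M")
    sub_df.insert(4, "Event", "MILLA")
    sub_df.insert(5, "L_N", "L")
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
out.to_csv('WA_AT_I_M_MILLA_L.csv')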
