Posted 2023-07-27 15:25:46

Webscraping FBRef for Individual Player Stats in R

Question

I am attempting to scrape individual player stats from FBRef, but I have run into a problem I am unable to solve.

Let's say I have a list of just 2 players, Marcus Rashford and Erling Haaland. Ordinarily I would take each player's name, convert it to the relevant URL, something like https://fbref.com/en/players/matchlogs/2022-2023/Marcus-Rashford-Match-Logs, and then scrape it.

However, the real URL contains a seemingly random segment that I don't know how to generate, e.g.
https://fbref.com/en/players/a1d5bd30/matchlogs/2022-2023/Marcus-Rashford-Match-Logs
https://fbref.com/en/players/1f44ac21/matchlogs/2022-2023/Erling-Haaland-Match-Logs

The issue is that I don't know how to determine the a1d5bd30 and 1f44ac21 parts automatically, as they appear random.

I have scraped basketball stats from the sister site Basketball Reference, but there that part of the URL is very simple: it is just the first letter of the player's last name, e.g. https://www.basketball-reference.com/players/b/brownke03/gamelog/2023/

Does anyone know how to solve this, or has anyone scraped this data successfully before?

Many thanks

Answer 1

Score: 1

You can use the FBRef search.

A search with a single match returns a redirect response. The example below collects the redirect targets from the Location response header without actually following the redirects (req_options(followlocation = FALSE)), which saves a few requests if you want to modify the URL anyway. If the search returns zero or multiple players, there is no Location header, so the corresponding entry in the returned list is NULL.

req_throttle(1) limits the request rate to one request per second.

library(httr2)
library(purrr)
# "Marcus" - page with multiple matches
# "Super Mario" - no match
players <- c("Marcus Rashford", "Erling Haaland", "Marcus", "Super Mario")
set_names(players) |>
  map(\(player) request("https://fbref.com/en/search/search.fcgi") |>
        req_url_query(search = player) |>
        req_options(followlocation = FALSE) |>
        req_throttle(1) |>
        req_perform() |>
        resp_header("Location")
      , .progress = TRUE)
#> ■■■■■■■■■■■■■■■■■■■■■■■ 75% | ETA: 1s
#> $`Marcus Rashford`
#> [1] "/en/players/a1d5bd30/Marcus-Rashford"
#>
#> $`Erling Haaland`
#> [1] "/en/players/1f44ac21/Erling-Braut-Haland"
#>
#> $Marcus
#> NULL
#>
#> $`Super Mario`
#> NULL

Created on 2023-07-27 with reprex v2.0.2
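
As a possible follow-up, here is a minimal sketch of how the redirect paths could be turned into match-log URLs of the form shown in the question. The helper build_matchlog_url() and its season argument are hypothetical names introduced here for illustration, not part of the original answer. It assumes every redirect path has the shape /en/players/<id>/<slug>, and that the site accepts the slug from the redirect in the match-log URL (for Haaland the redirect gives Erling-Braut-Haland, while the URL in the question uses Erling-Haaland); neither assumption has been verified against the live site.

```r
# Hypothetical helper: turn a redirect path such as
# "/en/players/a1d5bd30/Marcus-Rashford" into a match-log URL for one season.
build_matchlog_url <- function(location, season = "2022-2023") {
  # No Location header (zero or multiple search matches): nothing to build
  if (is.null(location) || is.na(location)) {
    return(NA_character_)
  }
  # Assumed shape: /en/players/<id>/<slug>
  parts <- regmatches(
    location,
    regexec("^/en/players/([^/]+)/([^/]+)$", location)
  )[[1]]
  # Anything that does not fit the assumed shape is treated as "no match"
  if (length(parts) != 3) {
    return(NA_character_)
  }
  sprintf(
    "https://fbref.com/en/players/%s/matchlogs/%s/%s-Match-Logs",
    parts[2], season, parts[3]
  )
}

# Example input mirroring the search results above (NULL = no unique match)
locations <- list(
  "Marcus Rashford" = "/en/players/a1d5bd30/Marcus-Rashford",
  "Erling Haaland"  = "/en/players/1f44ac21/Erling-Braut-Haland",
  "Marcus"          = NULL
)
urls <- vapply(locations, build_matchlog_url, character(1))
```

Players without a unique search match end up as NA in urls, so they can be dropped with urls[!is.na(urls)] before any pages are requested.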
