Webscraping FBRef for Individual Player Stats in R

Question

I am trying to scrape individual player stats from FBRef, but I have run into a problem I can't solve.

Say I have a list of just two players, Marcus Rashford and Erling Haaland. Ordinarily I would take each player's name, convert it to the relevant URL, something like https://fbref.com/en/players/matchlogs/2022-2023/Marcus-Rashford-Match-Logs, and then scrape it.

However, the actual URL contains a seemingly random segment that I don't know how to generate, e.g.
https://fbref.com/en/players/a1d5bd30/matchlogs/2022-2023/Marcus-Rashford-Match-Logs
https://fbref.com/en/players/1f44ac21/matchlogs/2022-2023/Erling-Haaland-Match-Logs

The issue is that I don't know how to determine the a1d5bd30 and 1f44ac21 parts automatically, as they appear random.

I have scraped basketball stats from the sister site Basketball Reference, but there that part of the URL is very simple: it is just the first letter of the player's last name, e.g. https://www.basketball-reference.com/players/b/brownke03/gamelog/2023/

Does anyone know how to solve this, or has anyone scraped this data successfully before?

Many thanks

Answer 1

Score: 1

You can use the FBRef search.

A search with a single match returns a redirect response. The example below collects those locations from the response headers without actually following the redirects (req_options(followlocation = FALSE)), which saves a few requests if you want to modify the URL anyway. If the search returns zero or multiple players, there is no Location header, so the corresponding entry in the returned list is NULL.

req_throttle(1) limits the request rate to 1 per second.

library(httr2)
library(purrr)

# "Marcus" - page with multiple matches
# "Super Mario" - no match
players <- c("Marcus Rashford", "Erling Haaland", "Marcus", "Super Mario")

set_names(players) |>
  map(
    \(player) request("https://fbref.com/en/search/search.fcgi") |>
      req_url_query(search = player) |>
      req_options(followlocation = FALSE) |>
      req_throttle(1) |>
      req_perform() |>
      resp_header("Location"),
    .progress = TRUE
  )
#> ■■■■■■■■■■■■■■■■■■■■■■■ 75% | ETA: 1s
#> $`Marcus Rashford`
#> [1] "/en/players/a1d5bd30/Marcus-Rashford"
#>
#> $`Erling Haaland`
#> [1] "/en/players/1f44ac21/Erling-Braut-Haland"
#>
#> $Marcus
#> NULL
#>
#> $`Super Mario`
#> NULL

Created on 2023-07-27 with reprex v2.0.2
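
To get from those redirect paths to the match-log URLs in the question, something along these lines might work. This is only a sketch: build_matchlog_urls() and its arguments are made-up names, the URL pattern is copied from the two example links in the question, and it assumes the path always looks like /en/players/<id>/<slug>. Note that the slug in the redirect for Haaland ("Erling-Braut-Haland") differs from the one in the question's URL; whether FBRef cares about the slug as long as the ID is right is an assumption you would need to check.

```r
# Hypothetical helper: turn the named list of redirect paths (with NULL for
# players that did not resolve to a single match) into a data frame of
# player IDs and match-log URLs. Uses base R only.
build_matchlog_urls <- function(locations, season = "2022-2023") {
  players <- names(locations)

  # NULL entries (no match or several matches) become NA so lengths line up
  paths <- vapply(
    locations,
    function(x) if (is.null(x)) NA_character_ else as.character(x),
    character(1),
    USE.NAMES = FALSE
  )

  # Assumed path layout: /en/players/<id>/<slug>
  pattern <- "^/en/players/([^/]+)/([^/]+)$"
  ok <- !is.na(paths) & grepl(pattern, paths)

  id   <- ifelse(ok, sub(pattern, "\\1", paths), NA_character_)
  slug <- ifelse(ok, sub(pattern, "\\2", paths), NA_character_)

  # URL pattern taken from the example links in the question
  url <- ifelse(
    ok,
    sprintf("https://fbref.com/en/players/%s/matchlogs/%s/%s-Match-Logs",
            id, season, slug),
    NA_character_
  )

  data.frame(player = players, id = id, url = url, stringsAsFactors = FALSE)
}

# Same shape as the list returned by the search code above
locations <- list(
  `Marcus Rashford` = "/en/players/a1d5bd30/Marcus-Rashford",
  `Erling Haaland`  = "/en/players/1f44ac21/Erling-Braut-Haland",
  Marcus            = NULL,
  `Super Mario`     = NULL
)

urls <- build_matchlog_urls(locations)
print(urls)
```

Rows with NA in the url column are the players whose search did not resolve to a single match, so they can be filtered out or retried with a more specific search term before scraping.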

huangapple
  • Posted on 2023-07-27 15:25:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76777385.html