Webscraping FBRef for Individual Player Stats in R
Question
I am trying to scrape individual player stats from FBRef, but I have run into a problem I cannot solve.
Let's say I have a list of just 2 players, Marcus Rashford and Erling Haaland. Ordinarily I would take each player's name, convert it to the relevant URL, something like https://fbref.com/en/players/matchlogs/2022-2023/Marcus-Rashford-Match-Logs, and then scrape it.
However, the real URLs contain an extra segment that I don't know how to generate, i.e.
https://fbref.com/en/players/a1d5bd30/matchlogs/2022-2023/Marcus-Rashford-Match-Logs
https://fbref.com/en/players/1f44ac21/matchlogs/2022-2023/Erling-Haaland-Match-Logs
The issue is that I don't know how to automatically determine the a1d5bd30 and 1f44ac21 parts, as they appear random.
I have scraped basketball stats from the same family of sites, but there that part of the URL is very simple: it is just the first letter of the player's last name, e.g. https://www.basketball-reference.com/players/b/brownke03/gamelog/2023/
Does anyone know how to solve this, or has anyone scraped this data successfully before?
Many thanks
Answer 1
Score: 1
You can use the FBref search.
A search with a single match returns a redirect response. The example below collects the target locations from the response headers without actually following the redirects (req_options(followlocation = FALSE)), which saves a few requests if you are going to modify the URL anyway. If the search returns zero or multiple players, there is no Location header, so the corresponding element in the returned list is NULL.
req_throttle(1) sets the request rate to 1 request per second.
library(httr2)
library(purrr)
# "Marcus" - page with multiple matches
# "Super Mario" - no match
players <- c("Marcus Rashford", "Erling Haaland", "Marcus", "Super Mario")
set_names(players) |>
  map(
    \(player) request("https://fbref.com/en/search/search.fcgi") |>
      req_url_query(search = player) |>
      req_options(followlocation = FALSE) |>
      req_throttle(1) |>
      req_perform() |>
      resp_header("Location"),
    .progress = TRUE
  )
#> $`Marcus Rashford`
#> [1] "/en/players/a1d5bd30/Marcus-Rashford"
#>
#> $`Erling Haaland`
#> [1] "/en/players/1f44ac21/Erling-Braut-Haland"
#>
#> $Marcus
#> NULL
#>
#> $`Super Mario`
#> NULL
Created on 2023-07-27 with reprex v2.0.2
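To get from those Location paths to the match-log URLs in the question, something like the following might work. It is only a sketch: build_matchlog_url() is a hypothetical helper, not part of httr2 or purrr, and it assumes the URL pattern shown in the question (/en/players/<id>/matchlogs/<season>/<slug>-Match-Logs). Note that the slug returned by the search for Haaland (Erling-Braut-Haland) differs from the one in the question's URL, so whether FBref accepts either slug is an assumption worth checking before scraping in bulk.

```r
# Hypothetical helper (not from the original answer): turn a relative path
# taken from the Location header into a match-log URL.
# Assumes the URL pattern from the question:
#   /en/players/<id>/matchlogs/<season>/<slug>-Match-Logs
build_matchlog_url <- function(location, season = "2022-2023") {
  # NULL means the search did not redirect to a single player
  if (is.null(location)) return(NA_character_)

  # "/en/players/a1d5bd30/Marcus-Rashford" splits into
  # "", "en", "players", "<id>", "<slug>"
  parts <- strsplit(location, "/", fixed = TRUE)[[1]]
  if (length(parts) != 5 || parts[3] != "players") return(NA_character_)

  sprintf(
    "https://fbref.com/en/players/%s/matchlogs/%s/%s-Match-Logs",
    parts[4], season, parts[5]
  )
}

# Paths copied from the output above; "Marcus" had no single match
locations <- list(
  `Marcus Rashford` = "/en/players/a1d5bd30/Marcus-Rashford",
  `Erling Haaland`  = "/en/players/1f44ac21/Erling-Braut-Haland",
  Marcus            = NULL
)

# Named character vector of URLs, NA where no unique player was found
urls <- vapply(locations, build_matchlog_url, character(1))
```

Names that come back as NA are the ones to resolve by hand, for example by searching with a more specific name.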