R rvest Scraping of URL-Related Data No Longer Working
Question
In R, I am using the rvest package to scrape player data from the URL below:
"https://www.covers.com/sport/basketball/nba/teams/main/boston-celtics/2022-2023/roster"
On this page there are many URLs, and I want to focus on getting all the player-specific URLs (and then storing them). An example is:
"https://www.covers.com/sport/basketball/nba/players/238239/jd-davison"
In Dec 2022, I used the following code to generate the list (covers_page is the URL I specified above):
library(xml2)
library(rvest)
library(tidyverse)
library(lubridate)
library(janitor)
tmp <- read_html(covers_page)
href <- as_tibble(html_attr(html_nodes(tmp, "a"), "href")) %>%
  filter(grepl("/players/", value))
The output of the above is empty, since the html_attr/html_nodes combination is not returning any of the URLs associated with the individual players on the screen. It returns every other URL node on the page, just not these.
This worked before, as I have an output file which details what I am looking for.
Has something changed in the rvest world in how html_attr/html_nodes are used? I don't understand why it is not "grabbing" these URLs while grabbing the others.
Answer 1
Score: 2
What you're encountering here is dynamically loaded data. When the browser connects to this page, it starts a background request to get the player roster and then uses JavaScript to update the page with this new data.
If you fire up your browser's devtools (usually the F12 key) and take a look at the Network tab (XHR section), you can see a request that returns the HTML data of the players.
To scrape this, you need to replicate this POST request in R. Unfortunately, rvest doesn't support POST requests, so you need to use an alternative HTTP client such as httr:
library("httr")
# Define the endpoint URL
url <- "https://www.covers.com/sport/basketball/nba/teams/main/Boston%20Celtics/tab/roster"
# Define the form data to be posted
data <- list(teamId = "98", seasonId = "3996", seasonName="2022-2023", leagueName="NBA")
# Make the POST request
response <- POST(url, body = data, encode="form", add_headers("X-Requested-With" = "XMLHttpRequest"))
content(response)
# then you can load the HTML into rvest and parse it as usual
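Once the response comes back, the same rvest pattern from the question applies to the returned fragment. A minimal sketch of that parsing step, run here against a small hard-coded HTML snippet standing in for the roster fragment (the anchor markup below is an assumption for illustration, not the site's actual structure):

```r
library(rvest)
library(tidyverse)

# Hypothetical stand-in for the HTML fragment returned by the POST request;
# the real fragment's markup may differ.
roster_html <- '<table>
  <tr><td><a href="/sport/basketball/nba/players/238239/jd-davison">JD Davison</a></td></tr>
  <tr><td><a href="/sport/basketball/nba/teams/main/boston-celtics">Boston Celtics</a></td></tr>
</table>'

tmp <- read_html(roster_html)

# Same pattern as in the question: collect all hrefs, keep only player links
href <- as_tibble(html_attr(html_nodes(tmp, "a"), "href")) %>%
  filter(grepl("/players/", value))

print(href)
```

In the live case, `read_html(content(response, "text"))` feeds the fragment from the POST request straight into this same pipeline.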