
r RVEST Scraping of URL Related Data no Longer working


In R, I am using the rvest package to scrape player data from the URL below:
"https://www.covers.com/sport/basketball/nba/teams/main/boston-celtics/2022-2023/roster"

On this page, there are many URLs, and I want to focus on getting all the player-specific URLs (and then storing them). An example is:
"https://www.covers.com/sport/basketball/nba/players/238239/jd-davison"

In Dec 2022, I used the following code to generate the list (covers_page is the URL I specified above):

library(xml2)
library(rvest)
library(tidyverse)
library(lubridate)
library(janitor)

tmp <- read_html(covers_page)

href <- as_tibble(html_attr(html_nodes(tmp, "a"), "href")) %>%
  filter(grepl("/players/", value))

The output of the above is empty, since the html_attr/html_nodes combination does not return any of the URLs associated with the individual players on the screen. It returns every other URL node on the page, just not these.
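One way to check whether the html_attr/html_nodes pipeline itself still behaves as expected, independent of this page, is to run it against a small in-memory document. A minimal sketch; the links below are made-up stand-ins for the roster page:

```r
library(rvest)
library(tidyverse)

# A tiny in-memory page with made-up links, standing in for the roster HTML
page <- minimal_html('
  <a href="/sport/basketball/nba/players/238239/jd-davison">JD Davison</a>
  <a href="/sport/basketball/nba/teams/main/boston-celtics">Celtics</a>
')

href <- as_tibble(html_attr(html_nodes(page, "a"), "href")) %>%
  filter(grepl("/players/", value))

nrow(href)  # only the /players/ link survives the filter
```

If this returns the single /players/ link, the selector logic is intact and the problem lies with the page content itself.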

This worked before as I have an output file which details what I am looking for.

Has something changed in the rvest world in how html_attr/html_nodes are used? I don't understand how it is not "grabbing" these URLs while grabbing the others.

Answer 1

Score: 2


What you're encountering here is dynamically loaded data. When the browser connects to this page, it starts a background request to fetch the player roster and then uses JavaScript to update the page with this new data.

If you fire up your browser's devtools (usually the F12 key) and take a look at the Network tab (XHR section), you can see a background request that returns the players' HTML data.

To scrape this you need to replicate this POST request in R. Unfortunately, rvest doesn't support POST requests, so you need to use an alternative HTTP client such as httr:

library("httr")
# Define the endpoint URL
url <- "https://www.covers.com/sport/basketball/nba/teams/main/Boston%20Celtics/tab/roster"

# Define the form data to be posted
data <- list(teamId = "98", seasonId = "3996", seasonName = "2022-2023", leagueName = "NBA")

# Make the POST request
response <- POST(url, body = data, encode = "form", add_headers("X-Requested-With" = "XMLHttpRequest"))
content(response)
# then you can load the HTML into rvest and parse it as expected HTML
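Once the response is back, the returned fragment can be handed straight to rvest and filtered with the same selector logic as in the question. A sketch, assuming the endpoint and form fields above are still valid:

```r
library(httr)
library(rvest)
library(tidyverse)

url <- "https://www.covers.com/sport/basketball/nba/teams/main/Boston%20Celtics/tab/roster"
data <- list(teamId = "98", seasonId = "3996",
             seasonName = "2022-2023", leagueName = "NBA")

response <- POST(url, body = data, encode = "form",
                 add_headers("X-Requested-With" = "XMLHttpRequest"))

# Parse the returned HTML fragment and pull out the player links
roster <- read_html(content(response, as = "text"))
player_urls <- as_tibble(html_attr(html_nodes(roster, "a"), "href")) %>%
  filter(grepl("/players/", value))
```

Since the site controls the teamId/seasonId values, these should be re-checked in devtools if the request stops working.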

huangapple
  • Posted on 2023-02-10 06:16:53
  • When reposting, please keep the original link: https://go.coder-hub.com/75405030.html