
r RVEST Scraping of URL Related Data no Longer working


In R, I am using the rvest package to scrape player data from the URL below:
"https://www.covers.com/sport/basketball/nba/teams/main/boston-celtics/2022-2023/roster"

On this page, there are many URLs, and I want to focus on getting all the player-specific URLs (and then storing them). An example is:
"https://www.covers.com/sport/basketball/nba/players/238239/jd-davison"

In Dec 2022, I used the following code to generate the list (covers_page is the URL I specified above):

library(xml2)
library(rvest)
library(tidyverse)
library(lubridate)
library(janitor)

tmp <- read_html(covers_page)

href <- as_tibble(html_attr(html_nodes(tmp, "a"), "href")) %>%
  filter(grepl("/players/", value))

The output of the above is empty, since the html_attr/html_nodes combination does not return any of the URLs associated with the individual players on the screen. It returns every other URL node on the page, just not these.
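One way to check whether the html_attr/html_nodes pipeline itself still behaves as expected, independent of this page, is to run it against a small in-memory document. A minimal sketch; the links below are made-up stand-ins for the roster page:

```r
library(rvest)
library(tidyverse)

# A tiny in-memory page with made-up links, standing in for the roster HTML
page <- minimal_html('
  <a href="/sport/basketball/nba/players/238239/jd-davison">JD Davison</a>
  <a href="/sport/basketball/nba/teams/main/boston-celtics">Celtics</a>
')

href <- as_tibble(html_attr(html_nodes(page, "a"), "href")) %>%
  filter(grepl("/players/", value))

nrow(href)  # only the /players/ link survives the filter
```

If this returns the single /players/ link, the selector logic is intact and the problem lies with the page content itself.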

This worked before as I have an output file which details what I am looking for.

Has something changed in the rvest world in how html_attr/html_nodes are used? I don't understand how it is not "grabbing" these URLs while grabbing the others.

Answer 1

Score: 2


What you're encountering here is dynamically loaded data. When the browser connects to this page, it starts a background request to fetch the player roster and then uses JavaScript to update the page with this new data.

If you fire up your browser's devtools (usually the F12 key) and take a look at the Network tab (XHR section), you can see a background request that returns the players' HTML data.

To scrape this you need to replicate this POST request in R. Unfortunately, rvest doesn't support POST requests, so you need to use an alternative HTTP client such as httr:

library("httr")
# Define the endpoint URL
url <- "https://www.covers.com/sport/basketball/nba/teams/main/Boston%20Celtics/tab/roster"

# Define the form data to be posted
data <- list(teamId = "98", seasonId = "3996", seasonName = "2022-2023", leagueName = "NBA")

# Make the POST request
response <- POST(url, body = data, encode = "form", add_headers("X-Requested-With" = "XMLHttpRequest"))
content(response)
# then you can load the HTML into rvest and parse it as expected HTML
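Once the response is back, the returned fragment can be handed straight to rvest and filtered with the same selector logic as in the question. A sketch, assuming the endpoint and form fields above are still valid:

```r
library(httr)
library(rvest)
library(tidyverse)

url <- "https://www.covers.com/sport/basketball/nba/teams/main/Boston%20Celtics/tab/roster"
data <- list(teamId = "98", seasonId = "3996",
             seasonName = "2022-2023", leagueName = "NBA")

response <- POST(url, body = data, encode = "form",
                 add_headers("X-Requested-With" = "XMLHttpRequest"))

# Parse the returned HTML fragment and pull out the player links
roster <- read_html(content(response, as = "text"))
player_urls <- as_tibble(html_attr(html_nodes(roster, "a"), "href")) %>%
  filter(grepl("/players/", value))
```

Since the site controls the teamId/seasonId values, these should be re-checked in devtools if the request stops working.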

huangapple
  • Posted on 2023-02-10 06:16:53
  • When reposting, please keep the original link: https://go.coder-hub.com/75405030.html