February 16, 2023 05:10:09

Scraping Dynamic JSON Data in R

Question

On pgatour.com/stats I am trying to scrape multiple stats across multiple tournaments and multiple years. Unfortunately, I am struggling to scrape data for past years or tournament IDs. In the past, the PGA Tour's stat URLs looked like this:

https://www.pgatour.com/stats/stat.STAT_ID.y.YEAR_ID.eoff.TOURNAMENT_ID.html

STAT_ID, YEAR_ID, and TOURNAMENT_ID would all change to the corresponding unique IDs as you selected a particular stat, year, and tournament. Because of this, I was able to use a function that iterated over all combinations of stat_id, year_id, and tournament_id to scrape the website.
Now the URL only changes with the stat_id being viewed. If I change the tournament or year through the dropdowns, the stats load, but the URL remains unchanged. This prevents me from targeting different tournaments or years.

https://www.pgatour.com/stats/detail/02675 - 02675 being an example stat_id

@Dave2e has been very helpful in showing me that the PGA Tour site uses JavaScript and how to access some of the JSON data. I combined his advice with my earlier code to scrape all stats for the most recent tournament. However, I can't figure out how to get the stats for past years or tournaments. In the str() output of the JSON I see that there are IDs under $tournamentId and $year, but I'm uncertain how to use this information to request past tournaments and years.

How can I access the tournament and year IDs to scrape past data on pgatour.com? Should I be trying to access this data with RSelenium as opposed to a package like rvest?

Code

library(tidyverse)
library(rvest)

# One row per stat ID, with the matching stat detail URL
df23 <- expand.grid(
  stat_id = c("02568", "02675", "101")
) %>%
  mutate(
    links = paste0(
      "https://www.pgatour.com/stats/detail/",
      stat_id
    )
  ) %>%
  as_tibble()

get_info <- function(link, stat_id) {
  # Parse the JSON embedded in the page's __NEXT_DATA__ script tag
  data <- link %>%
    read_html() %>%
    html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
    html_text() %>%
    jsonlite::fromJSON()

  # NAs in player name stop data from being collected, so drop those rows
  answer <- data$props$pageProps$statDetails$rows %>%
    drop_na(playerName)

  # Get the list of data frames into a single data frame, then merge back with the original
  answer2 <- bind_rows(answer$stats, .id = "column_label") %>%
    select(-color) %>%
    pivot_wider(
      values_from = statValue,
      names_from = statName
    )

  # All stats combined and unnested
  dplyr::bind_cols(answer, answer2)
}

test_stats <- df23 %>%
  mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble()))) %>%
  unnest(everything())

Simplified code courtesy of @Dave2e

# Read the page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")
# Find the script with the correct id tag and strip the HTML
datascript <- page %>% html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>% html_text()
# Convert from JSON
output <- jsonlite::fromJSON(datascript)
# Explore the output
str(output)
# Get the main table
answer <- output$props$pageProps$statDetails$rows

Answer 1

Score: 1

If you open the developer tools (F12 in your browser) and watch the Network tab while you select a different year, you can see a background request being made to retrieve that year's data.

It returns a JSON dataset similar to the one in your original post.
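Since the returned JSON resembles the __NEXT_DATA__ payload, the flattening step from the question can probably be reused on it. The sketch below runs on a made-up toy response, not real PGA Tour data; the data$statDetails$rows path and the field names are assumptions modeled on the question's code, so check them against the actual response in the Network tab.

```r
library(jsonlite)
library(dplyr)
library(tidyr)

# ASSUMPTION: toy JSON with invented values; the real response may nest
# its rows under a different path or use different field names.
toy_response <- '{"data":{"statDetails":{"rows":[
  {"playerName":"Player A","rank":1,
   "stats":[{"statName":"Avg","statValue":"1.50"},
            {"statName":"Rounds","statValue":"4"}]},
  {"playerName":"Player B","rank":2,
   "stats":[{"statName":"Avg","statValue":"1.20"},
            {"statName":"Rounds","statValue":"4"}]}
]}}}'

# fromJSON() turns the rows array into a data frame whose `stats` column
# is a list of small data frames (one per player)
rows <- fromJSON(toy_response)$data$statDetails$rows

# Expand the nested stats, then spread each stat name into its own column
stats_wide <- rows %>%
  unnest(stats) %>%
  pivot_wider(names_from = statName, values_from = statValue)
```

The result has one row per player and one column per stat name, which matches the shape the question's get_info() function aims for.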

To scrape this, you need to replicate this GraphQL POST request in your R program. Note that it sends a JSON document with the query details, which include the tournament code and the year.
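A minimal sketch of replicating such a request with httr might look like the following. The endpoint URL, the query text, and the variable names (statId, year, tournamentId) are all placeholders rather than the site's real values; copy the actual ones from the request payload shown in the Network tab.

```r
library(httr)
library(jsonlite)

# ASSUMPTION: placeholder endpoint and query; replace both with the values
# from the request you see in the browser's Network tab.
graphql_url <- "https://example.com/graphql"
query_text  <- "query StatDetails { placeholder }"

# Build the JSON document sent in the POST body.
# The variable names here are hypothetical and must match the real query.
build_payload <- function(stat_id, year, tournament_id) {
  list(
    query = query_text,
    variables = list(
      statId = stat_id,
      year = year,
      tournamentId = tournament_id
    )
  )
}

# Send the POST request and return the parsed JSON response
fetch_stat <- function(stat_id, year, tournament_id) {
  resp <- POST(
    graphql_url,
    body = build_payload(stat_id, year, tournament_id),
    encode = "json"  # serializes the list as a JSON request body
  )
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (not run): res <- fetch_stat("02675", 2022, "<tournament id>")
```

Keeping the payload builder separate from the network call makes it easy to loop over combinations of stat, year, and tournament, much like the URL-based approach the old site allowed.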

Finally, to make sure your GraphQL request succeeds, send the same headers from your R program that you see in the inspector, in particular Origin, Referer, and the X- prefixed ones.

(You can probably hardcode these.)
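One way to hardcode the headers with httr is sketched below. The header values and the x-api-key name are assumptions for illustration only; use the exact names and values listed under the request headers in the inspector.

```r
library(httr)

# ASSUMPTION: the X- prefixed header name and all values are placeholders;
# copy the real ones from the request headers shown in the inspector.
graphql_headers <- function(api_key, site = "https://www.pgatour.com") {
  add_headers(
    Origin = site,
    Referer = paste0(site, "/"),
    `x-api-key` = api_key  # hypothetical name for an X- prefixed header
  )
}

hdrs <- graphql_headers("PASTE_VALUE_FROM_NETWORK_TAB")

# Usage (not run): pass the headers along with the request, e.g.
# POST(graphql_url, hdrs, body = payload, encode = "json")
```

If the request still fails, compare the full header list in the inspector against what your program sends.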
