Scraping Dynamic JSON Data in R

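Question

On pgatour.com/stats I am trying to scrape multiple stats across multiple tournaments and multiple years. Unfortunately, I am struggling to scrape data for past years or tournament IDs. Previously, the PGA Tour site's URLs looked like this:

https://www.pgatour.com/stats/stat.STAT_ID.y.YEAR_ID.eoff.TOURNAMENT_ID.html

STAT_ID, YEAR_ID, and TOURNAMENT_ID changed to match the unique IDs of the selected stat, year, and tournament. Because of this, I could use a function that iterated over all combinations of stat_id, year_id, and tournament_id to scrape the site. Now the URL changes only with the stat_id being viewed. If I change the tournament or year through the dropdowns, the stats load, but the URL stays the same, so I cannot target a different tournament or year.

https://www.pgatour.com/stats/detail/02675 (02675 is an example stat_id)

@Dave2e has been very helpful in showing me that the PGA site uses JavaScript and how to access some of the JSON data. I combined his advice with my earlier code to scrape all stats for the most recent tournament. However, I can't figure out how to get the stats for past years or tournaments. In the parsed JSON I see fields for $tournamentId and $year, but I'm not sure how to use them to look up past tournaments and years.

How can I access the tournament and year IDs to scrape past data on pgatour.com? Should I be using RSelenium for this instead of a package like rvest?

Code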

library(tidyverse)  # attaches dplyr, tidyr, and purrr
library(rvest)

df23 <- expand.grid(
  stat_id = c("02568","02675", "101")
) %>% 
  mutate(
    links = paste0(
      "https://www.pgatour.com/stats/detail/",
      stat_id
    )
  ) %>% 
  as_tibble()

get_info <- function(link, stat_id) {
  # Parse the JSON embedded in the page's __NEXT_DATA__ script tag
  data <- link %>%
    read_html() %>%
    html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
    html_text() %>%
    jsonlite::fromJSON()

  # Rows with NA in playerName stop the data from being collected, so drop them
  answer <- data$props$pageProps$statDetails$rows %>%
    drop_na(playerName)

  # Combine the list of stat data frames into a single data frame,
  # then widen it so each stat name becomes a column
  answer2 <- bind_rows(answer$stats, .id = "column_label") %>%
    select(-color) %>%
    pivot_wider(
      values_from = statValue,
      names_from = statName
    )

  # Return the player rows with all stats combined and unnested
  bind_cols(answer, answer2)
}

test_stats <- df23 %>%
  mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))

test_stats <- test_stats %>% 
  unnest(everything())

Simplified code courtesy of @Dave2e

#read page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")

#find the script with the correct id tag, strip the html code
datascript <- page %>% html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>% html_text()

#convert from JSON 
output <- jsonlite::fromJSON(datascript)
#explore the output
str(output)

#get the main table 
answer <- output$props$pageProps$statDetails$rows

Answer 1

Score: 1

If you open the browser's developer tools (F12) and watch the Network tab while you select a different year, you can see a background request being made to retrieve that year's data.

It returns a JSON dataset similar to the one in your original post.

To scrape this, you need to replicate this GraphQL POST request in your R program. Note that it sends a JSON document with the query details, which include the tournament codes and the year.

Finally, to make sure the GraphQL request succeeds, match the headers you see in the inspector in your R program, in particular Origin, Referer, and the X- prefixed ones.

(you can probably hardcode these)
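
The steps above could be sketched roughly as follows. This is only an outline under stated assumptions: the endpoint URL, the query string, the header values, and the variable names (statId, year, tournamentId) are not given in this thread and must be copied from the request you see in the Network tab. The helper names build_stat_payload and fetch_stat are made up for illustration.

```r
# Build the JSON body of the GraphQL request.
# `query` is the query string copied verbatim from the request payload in the
# Network tab. The variable names used here (statId, year, tournamentId) are
# assumptions; use whatever names appear in that payload.
build_stat_payload <- function(query, stat_id, year, tournament_id) {
  list(
    query = query,
    variables = list(
      statId = stat_id,
      year = year,
      tournamentId = tournament_id
    )
  )
}

# Send the POST request and parse the JSON response.
# `endpoint` is the request URL and `headers` is a named character vector of
# the headers (Origin, Referer, and the X- prefixed ones), both copied from
# the Network tab.
fetch_stat <- function(endpoint, headers, payload) {
  resp <- httr::POST(
    endpoint,
    httr::add_headers(.headers = headers),
    body = payload,
    encode = "json"  # serializes the list as a JSON document
  )
  httr::stop_for_status(resp)  # fail loudly on HTTP errors
  jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Example usage (all values are placeholders to be replaced):
# payload <- build_stat_payload(copied_query, "02675", 2022, "TOURNAMENT_ID")
# result  <- fetch_stat(copied_endpoint, copied_headers, payload)
# str(result)
```

Because the year and tournament are now ordinary function arguments, the earlier approach of looping over all combinations (for example with expand.grid and map) should carry over, with fetch_stat taking the place of read_html.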


Posted on February 16, 2023, 05:10:09
When reposting, please keep this article's link: https://go.coder-hub.com/75465462.html