Scraping with {rvest} yields "character (empty)"?

Question

I've been scraping a file, but now there's a new URL. I tried changing the URL and the CSS selector, but my link object no longer contains a search path, just "character (empty)". What seems to be the problem?

Site: https://arbetsformedlingen.se/statistik/statistik-om-varsel

I want to grab the file "Tillfällig statistik per län och bransch, januari-april 2023" from the Antal varsel och berörda personer box.

R code:

``` r
library(tidyverse)
library(stringr)
library(rio)   # import() function
library(rvest) # read_html() function

# Link to target site
url <- "https://arbetsformedlingen.se/statistik/statistik-om-varsel"

## Parse the HTML content
doc <- read_html(url)

## Find the data you want to scrape
### Select CSS locator
link <- html_elements(doc, css = '#cardContainer > app-downloads:nth-child(3) > div > div:nth-child(3) > div > digi-link-internal > digi-link > a') %>%
  html_attr("href")

# Create URL for file download
url2 <- "https://arbetsformedlingen.se"
full_link <- sprintf("%s%s", url2, link)

# Get and save file locally
td <- tempdir()  # create temporary folder
varsel_fil <- tempfile(tmpdir = td, fileext = ".xlsx")
download.file(full_link, destfile = varsel_fil, mode = "wb")

# Read file into a data frame
df_imported <- import(varsel_fil, which = 1)  # which - select sheet number
```

Previously, the css argument in the html_elements() call was `#svid12_142311c317a09e842af1a94 > div.sv-text-portlet-content > p:nth-child(20) > strong > a`.

-> So the beginning is quite different, but I don't understand what that implies.

Thanks for any assistance!

Answer 1 (score: 2)

That page is now mostly rendered by JavaScript, and most of that content is not included in the page source; you can check by disabling JS for the site in your browser. The list of files in the described box is pulled from `https://arbetsformedlingen.se/rest/analysportalen/va/sitevision`.
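
You can also check this from R itself; a minimal sketch, assuming the file name shown in the box still contains the phrase "januari-april":

``` r
library(rvest)
doc <- read_html("https://arbetsformedlingen.se/statistik/statistik-om-varsel")
# FALSE means the phrase is nowhere in the static HTML,
# i.e. the box is rendered client-side and rvest alone can't see it
grepl("januari-april", as.character(doc), fixed = TRUE)
```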

A quick way to find this API endpoint is through the network tab of the browser's developer tools: after launching dev tools, refresh the page to capture all requests and search for some phrase that can't be found in the source of the main page, e.g. "januari-april"; it looks something [like this][1]. Once the API endpoint with the file list is identified, we can extract the file URL and proceed with the download:

``` r
library(dplyr)
url_ <- "https://arbetsformedlingen.se/rest/analysportalen/va/sitevision"
xlsx_response <- jsonlite::fromJSON(url_, simplifyVector = FALSE) %>%
  # there are 3 files listed; naively picking the last one
  # may or may not work in the long run
  dplyr::last() %>%
  # we could also use purrr::keep() **before** last() to keep only
  # the record with the matching name, like
  # purrr::keep(~ .x$name == "Länktext till varsel tillfällig statistik")
  purrr::pluck("properties", "link") %>%
  # combine the first part of url_ and the extracted link to get the full file URL;
  # the last parameter, ".", is where the output of the previous pipe ends up,
  # so the next expression evaluates as:
  # str_replace("https://arbetsformedlingen.se/rest/analysportalen/va/sitevision", "(?<=\\w)/.*", "/download/18.793fa1821869801540c14b3/1683719045927/web-varsel-bransch-lan-2023.xlsx")
  stringr::str_replace(url_, "(?<=\\w)/.*", .) %>%
  httr2::request() %>%
  # we can store the file on disk by setting the path
  httr2::req_perform(path = file.path(tempdir(), basename(.$url)))

# httr2 response:
xlsx_response
#> <httr2_response>
#> GET
#> https://arbetsformedlingen.se/download/18.793fa1821869801540c14b3/1683719045927/web-varsel-bransch-lan-2023.xlsx
#> Status: 200 OK
#> Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
#> Body: On disk 'body'

# open file location:
# browseURL(xlsx_response$body[1])
```

Downloaded file:

``` r
readxl::read_xlsx(xlsx_response$body[1], 1, "A4:Y26") %>% glimpse()
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> • `` -> `...25`
#> Rows: 22
#> Columns: 25
#> $ ...1  <chr> "SNI-kod", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"…
#> $ ...2  <chr> "Näringsgren", "Jordbruk, skogsbruk och fiske", "Utvinning av mi…
#> $ AB    <chr> "Stockholms län", "5", NA, "56", NA, NA, "256", "384", "141", "3…
#> $ C     <chr> "Uppsala län", NA, NA, NA, NA, NA, "18", NA, NA, NA, NA, NA, NA,…
#> $ D     <chr> "Södermanlands län", NA, NA, NA, NA, "8", NA, NA, NA, NA, NA, NA…
#> $ E     <chr> "Östergötlands län", NA, NA, "20", NA, NA, "61", NA, "17", NA, N…
#> $ F     <chr> "Jönköpings län", NA, "5", "246", NA, NA, NA, "21", NA, NA, NA, …
#> $ G     <chr> "Kronobergs län", NA, NA, "201", NA, NA, "19", NA, NA, NA, NA, N…
#> $ H     <chr> "Kalmar län", NA, NA, NA, NA, NA, NA, NA, "15", NA, NA, NA, NA, …
#> $ I     <chr> "Gotlands län", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ K     <chr> "Blekinge län", NA, NA, "9", NA, NA, NA, "18", NA, NA, NA, NA, N…
#> $ M     <chr> "Skåne län", NA, NA, "30", NA, NA, "57", "63", NA, NA, "46", NA,…
#> $ N     <chr> "Hallands län", NA, NA, "24", NA, NA, "15", "6", NA, NA, NA, NA,…
#> $ O     <chr> "Västra Götalands län", NA, NA, "62", NA, NA, "120", "27", "96",…
#> $ S     <chr> "Värmlands län", NA, NA, "20", NA, NA, NA, NA, NA, "10", NA, NA,…
#> $ T     <chr> "Örebro län", NA, NA, "47", NA, NA, NA, "8", NA, "18", NA, NA, "…
#> $ U     <chr> "Västmanlands län", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ W     <chr> "Dalarnas län", NA, NA, NA, NA, NA, "15", NA, NA, NA, NA, NA, NA…
#> $ X     <chr> "Gävleborgs län", NA, NA, "32", NA, NA, NA, NA, NA, "50", NA, NA…
#> $ Y     <chr> "Västernorrlands län", NA, NA, "10", NA, NA, NA, NA, NA, NA, NA,…
#> $ Z     <chr> "Jämtlands län", NA, NA, "14", NA, NA, NA, NA, NA, NA, "6", NA, …
#> $ AC    <chr> "Västerbottens län", NA, NA, "50", NA, NA, "9", NA, "16", NA, NA…
#> $ BD    <chr> "Norrbottens län", NA, NA, "6", NA, NA, "15", "6", NA, NA, NA, N…
#> $ `-`   <chr> "Uppgift saknas", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ ...25 <chr> "Riket", "5", "5", "827", NA, "8", "585", "533", "285", "125", "…
```

<sup>Created on 2023-05-20 with reprex v2.0.2</sup>
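
Since the original code used rio::import(), the file downloaded via httr2 can be read the same way; a small sketch:

``` r
# xlsx_response$body[1] holds the on-disk path of the downloaded file
df_imported <- rio::import(xlsx_response$body[1], which = 1)
```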

A more base-like approach would perhaps be:

``` r
download.file(paste0("https://arbetsformedlingen.se",
                     jsonlite::read_json("https://arbetsformedlingen.se/rest/analysportalen/va/sitevision")[[3]]$properties$link),
              file.path(tempdir(), "out.xlsx"), mode = "wb")
```
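
If the order of the records ever changes, selecting by name, as the purrr::keep() comment above suggests, is a bit more change-proof; a sketch, assuming the record's name field still matches this string:

``` r
url_ <- "https://arbetsformedlingen.se/rest/analysportalen/va/sitevision"
link <- jsonlite::fromJSON(url_, simplifyVector = FALSE) %>%
  # keep only the record whose name matches, then extract its link
  purrr::keep(~ .x$name == "Länktext till varsel tillfällig statistik") %>%
  purrr::pluck(1, "properties", "link")
```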

Answer 2 (score: 0)

So thanks to @margusl, this code would suffice to do what my original code did:

``` r
# Create temp dir, download file locally
td <- tempdir()  # create temporary folder
varsel_fil <- tempfile(tmpdir = td, fileext = ".xlsx")

## Download file
download.file(paste0("https://arbetsformedlingen.se",
                     jsonlite::read_json("https://arbetsformedlingen.se/rest/analysportalen/va/sitevision")[[3]]$properties$link),
              destfile = varsel_fil, mode = "wb")

# Import file as df
df <- import(varsel_fil, which = 1)  # which - sheet number
```

This code selects the third list from the JSON response; it's a good idea to first create an object and inspect the JSON response. The first suggestion (with more code) is a bit more rigid and selects the list based on criteria.
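
For example, a quick inspection sketch before hard-coding the `[[3]]` index:

``` r
# store the response first and look at it before indexing blindly
resp <- jsonlite::read_json("https://arbetsformedlingen.se/rest/analysportalen/va/sitevision")
length(resp)                      # how many file records are listed
sapply(resp, function(x) x$name)  # record names; pick the index you need
str(resp[[3]], max.level = 2)     # structure of the record used above
```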
