Scraping with {rvest} yields "character (empty)"?

Question
I've been scraping a file, but now there's a new URL. I just tried to change the URL and the CSS selector, but my `link` object doesn't result in a search path; it is "character (empty)". What seems to be the problem?
Site: https://arbetsformedlingen.se/statistik/statistik-om-varsel
I want to grab the file "Tillfällig statistik per län och bransch, januari-april 2023" in the "Antal varsel och berörda personer" box.
R code:

``` r
library(tidyverse)
library(stringr)
library(rio)   # import() function
library(rvest) # read_html() function
# Link to target site
url <- "https://arbetsformedlingen.se/statistik/statistik-om-varsel"
## Parse the HTML content
doc <- read_html(url)
## Find the data you want to scrape
### Select CSS locator
link <- html_elements(doc, css = '#cardContainer > app-downloads:nth-child(3) > div > div:nth-child(3) > div > digi-link-internal > digi-link > a') %>%
  html_attr("href")
# Create URL for file download
url2 <- "https://arbetsformedlingen.se"
full_link <- sprintf("%s%s", url2, link)
# Get and save file locally
td <- tempdir() # create a temporary folder
varsel_fil <- tempfile(tmpdir = td, fileext = ".xlsx")
download.file(full_link, destfile = varsel_fil, mode = "wb")
# Read file into a df
df_imported <- import(varsel_fil, which = 1) # which selects the sheet number
```
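For reference, the empty result is easy to confirm directly (a minimal check, not part of my original code):

``` r
# html_attr() on an empty node set returns character(0),
# which the RStudio environment pane displays as "character (empty)"
length(link)                    # 0 when the selector matches nothing
identical(link, character(0))   # TRUE
```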
Previously the `css` argument in the `html_elements()` call was `#svid12_142311c317a09e842af1a94 > div.sv-text-portlet-content > p:nth-child(20) > strong > a`, so the beginning is quite different. I don't understand what that implies, though.
Thanks for any assistance!
Answer 1

Score: 2
"That page is now mostly rendered by JavaScript and most of that content is not included in the page source, you can check by disabling JS for the site in your browser. List of files in the described box is pulled from https://arbetsformedlingen.se/rest/analysportalen/va/sitevision
.
A quick way to find this API endpoint would be through the network tab of browser's developer tools -- after launching dev tools, refresh the page to capture all requests and search for some phrase that can't be found from the source of the main page, i.e. 'januari-april', looks something like this. Once the API endpoint with file list is identified, we can extract the file URL and proceed with the download:
library(dplyr)
url_ <- "https://arbetsformedlingen.se/rest/analysportalen/va/sitevision"
xlsx_response <- jsonlite::fromJSON(url_, simplifyVector = FALSE) %>%
# there are 3 files listed, naively picking the last one;
# may or may not work in a long run
dplyr::last() %>%
# we could also use purrr::keep() **before** last() to keep only
# the record with matching name, like
# purrr::keep(~ .x$name == "Länktext till varsel tillfällig statistik")
purrr::pluck("properties", "link") %>%
# combine first part of url_ and extracted link to get full URL for the file
# last parameter, ".", is where the output of previous pipe ends up
# next expression evaluates as:
# str_replace("https://arbetsformedlingen.se/rest/analysportalen/va/sitevision", "(?<=\\w)/.*", "/download/18.793fa1821869801540c14b3/1683719045927/web-varsel-bransch-lan-2023.xlsx")
stringr::str_replace(url_, "(?<=\\w)/.*", .) %>%
httr2::request() %>%
# we can store the file by setting the path
httr2::req_perform(path = file.path(tempdir(), basename(.$url)))
# httr2 response:
xlsx_response
#> <httr2_response>
#> GET
#> https://arbetsformedlingen.se/download/18.793fa1821869801540c14b3/1683719045927/web-varsel-bransch-lan-2023.xlsx
#> Status: 200 OK
#> Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
#> Body: On disk 'body'
# open file location:
# browseURL(xlsx_response$body[1])
Downloaded file:
readxl::read_xlsx(xlsx_response$body[1], 1, "A4:Y26") %>% glimpse()
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> • `` -> `...25`
#> Rows: 22
#> Columns: 25
#> $ ...1 <chr> "SNI-kod", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"…
#> $ ...2 <chr> "Näringsgren", "Jordbruk, skogsbruk och fiske", "Utvinning av mi…
#> $ AB <chr> "Stockholms län", "5", NA, "56", NA, NA, "256", "384", "141", "3…
#> $ C <chr> "Uppsala län", NA, NA, NA, NA, NA, "18", NA, NA, NA, NA, NA, NA,…
#> $ D <chr> "Södermanlands län", NA, NA, NA, NA, "8", NA, NA, NA, NA, NA, NA…
#> $ E <chr> "Östergötlands län", NA, NA, "20", NA, NA, "61", NA, "17", NA, N…
#> $ F <chr> "Jönköpings län", NA, "5", "246", NA, NA, NA, "21", NA, NA, NA, …
#> $ G <chr> "Kronobergs län", NA, NA, "201", NA, NA, "19", NA, NA, NA, NA, N…
#> $ H <chr> "Kalmar län", NA, NA, NA, NA, NA, NA, NA, "15", NA, NA, NA, NA, …
#> $ I <chr> "Gotlands län", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ K <chr> "Blekinge län", NA, NA, "9", NA, NA, NA, "18", NA, NA, NA, NA, N…
#> $ M <chr> "Skåne län", NA, NA, "30", NA, NA, "57", "63", NA, NA, "46", NA,…
#> $ N <chr> "Hallands län", NA, NA, "24", NA, NA, "15", "6", NA, NA, NA, NA,…
#> $ O <chr> "Västra Götalands län", NA, NA, "62", NA, NA, "120", "27", "96",…
#> $ S <chr> "Värmlands län", NA, NA, "20", NA, NA, NA, NA, NA, "10", NA, NA,…
#> $ T <chr> "Örebro län", NA, NA, "47", NA, NA, NA, "8", NA, "18", NA, NA, "…
#> $ U <chr> "Västmanlands län", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ W <chr> "Dalarnas län", NA, NA, NA, NA, NA, "15", NA, NA, NA, NA, NA, NA…
#> $ X <chr> "Gävleborgs län", NA, NA, "32", NA, NA, NA, NA, NA, "50", NA, NA…
#> $ Y <chr> "Västernorrlands län", NA, NA, "10", NA, NA, NA, NA, NA, NA, NA,…
#> $ Z <chr> "Jämtlands län", NA, NA, "14", NA, NA, NA, NA, NA
<details>
<summary>英文:</summary>
That page is now mostly rendered by JavaScript, and most of that content is not included in the page source; you can check by disabling JS for the site in your browser. The list of files in the described box is pulled from `https://arbetsformedlingen.se/rest/analysportalen/va/sitevision`.
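One quick way to confirm this (a minimal check; it assumes the phrase "januari-april" is visible on the rendered page, as described in the question) is to fetch the raw source and search it for text that the browser shows:

``` r
library(rvest)

# Serialize the static page source and search it for a phrase that is
# visible in the browser; FALSE indicates the content is injected by JS
page_src <- as.character(read_html("https://arbetsformedlingen.se/statistik/statistik-om-varsel"))
grepl("januari-april", page_src, fixed = TRUE)
```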
A quick way to find this API endpoint is through the network tab of the browser's developer tools: after launching dev tools, refresh the page to capture all requests, then search for some phrase that can't be found in the source of the main page, e.g. "januari-april"; it looks something [like this][1]. Once the API endpoint with the file list is identified, we can extract the file URL and proceed with the download:
``` r
library(dplyr)
url_ <- "https://arbetsformedlingen.se/rest/analysportalen/va/sitevision"
xlsx_response <- jsonlite::fromJSON(url_, simplifyVector = FALSE) %>%
  # there are 3 files listed; naively picking the last one
  # may or may not work in the long run
  dplyr::last() %>%
  # we could also use purrr::keep() **before** last() to keep only
  # the record with a matching name, like
  # purrr::keep(~ .x$name == "Länktext till varsel tillfällig statistik")
  purrr::pluck("properties", "link") %>%
  # combine the first part of url_ and the extracted link to get the full
  # URL for the file; the last parameter, ".", is where the output of the
  # previous pipe ends up, so the next expression evaluates as:
  # str_replace("https://arbetsformedlingen.se/rest/analysportalen/va/sitevision", "(?<=\\w)/.*", "/download/18.793fa1821869801540c14b3/1683719045927/web-varsel-bransch-lan-2023.xlsx")
  stringr::str_replace(url_, "(?<=\\w)/.*", .) %>%
  httr2::request() %>%
  # we can store the file by setting the path
  httr2::req_perform(path = file.path(tempdir(), basename(.$url)))
# httr2 response:
xlsx_response
#> <httr2_response>
#> GET
#> https://arbetsformedlingen.se/download/18.793fa1821869801540c14b3/1683719045927/web-varsel-bransch-lan-2023.xlsx
#> Status: 200 OK
#> Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
#> Body: On disk 'body'
# open file location:
# browseURL(xlsx_response$body[1])
```
Downloaded file:

``` r
readxl::read_xlsx(xlsx_response$body[1], 1, "A4:Y26") %>% glimpse()
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> • `` -> `...25`
#> Rows: 22
#> Columns: 25
#> $ ...1 <chr> "SNI-kod", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"…
#> $ ...2 <chr> "Näringsgren", "Jordbruk, skogsbruk och fiske", "Utvinning av mi…
#> $ AB <chr> "Stockholms län", "5", NA, "56", NA, NA, "256", "384", "141", "3…
#> $ C <chr> "Uppsala län", NA, NA, NA, NA, NA, "18", NA, NA, NA, NA, NA, NA,…
#> $ D <chr> "Södermanlands län", NA, NA, NA, NA, "8", NA, NA, NA, NA, NA, NA…
#> $ E <chr> "Östergötlands län", NA, NA, "20", NA, NA, "61", NA, "17", NA, N…
#> $ F <chr> "Jönköpings län", NA, "5", "246", NA, NA, NA, "21", NA, NA, NA, …
#> $ G <chr> "Kronobergs län", NA, NA, "201", NA, NA, "19", NA, NA, NA, NA, N…
#> $ H <chr> "Kalmar län", NA, NA, NA, NA, NA, NA, NA, "15", NA, NA, NA, NA, …
#> $ I <chr> "Gotlands län", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ K <chr> "Blekinge län", NA, NA, "9", NA, NA, NA, "18", NA, NA, NA, NA, N…
#> $ M <chr> "Skåne län", NA, NA, "30", NA, NA, "57", "63", NA, NA, "46", NA,…
#> $ N <chr> "Hallands län", NA, NA, "24", NA, NA, "15", "6", NA, NA, NA, NA,…
#> $ O <chr> "Västra Götalands län", NA, NA, "62", NA, NA, "120", "27", "96",…
#> $ S <chr> "Värmlands län", NA, NA, "20", NA, NA, NA, NA, NA, "10", NA, NA,…
#> $ T <chr> "Örebro län", NA, NA, "47", NA, NA, NA, "8", NA, "18", NA, NA, "…
#> $ U <chr> "Västmanlands län", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ W <chr> "Dalarnas län", NA, NA, NA, NA, NA, "15", NA, NA, NA, NA, NA, NA…
#> $ X <chr> "Gävleborgs län", NA, NA, "32", NA, NA, NA, NA, NA, "50", NA, NA…
#> $ Y <chr> "Västernorrlands län", NA, NA, "10", NA, NA, NA, NA, NA, NA, NA,…
#> $ Z <chr> "Jämtlands län", NA, NA, "14", NA, NA, NA, NA, NA, NA, "6", NA, …
#> $ AC <chr> "Västerbottens län", NA, NA, "50", NA, NA, "9", NA, "16", NA, NA…
#> $ BD <chr> "Norrbottens län", NA, NA, "6", NA, NA, "15", "6", NA, NA, NA, N…
#> $ `-` <chr> "Uppgift saknas", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ ...25 <chr> "Riket", "5", "5", "827", NA, "8", "585", "533", "285", "125", "…
```
<sup>Created on 2023-05-20 with reprex v2.0.2</sup>
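As the comments in the pipe above note, selecting the record by name instead of position is less fragile. A sketch of that variant (the record name is taken from the comment above and may well change over time):

``` r
library(purrr) # keep(), pluck(); %>% is already available via dplyr

files <- jsonlite::fromJSON(url_, simplifyVector = FALSE)
link <- files %>%
  # keep only the record whose name matches, then drill into its link
  keep(~ .x$name == "Länktext till varsel tillfällig statistik") %>%
  pluck(1, "properties", "link")
```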
A more base-like approach would perhaps be:

``` r
download.file(paste0("https://arbetsformedlingen.se",
                     jsonlite::read_json("https://arbetsformedlingen.se/rest/analysportalen/va/sitevision")[[3]]$properties$link),
              file.path(tempdir(), "out.xlsx"), mode = "wb")
```
Answer 2

Score: 0
So, thanks to @margusl, this code suffices to do what my original code did:

``` r
library(rio) # import() function

# Create temp dir, download file locally
td <- tempdir() ## create a temporary folder
varsel_fil <- tempfile(tmpdir = td, fileext = ".xlsx")

## Download file
download.file(paste0("https://arbetsformedlingen.se",
                     jsonlite::read_json("https://arbetsformedlingen.se/rest/analysportalen/va/sitevision")[[3]]$properties$link),
              destfile = varsel_fil, mode = "wb")

# Import the file as a data frame
df <- import(varsel_fil, which = 1) # which selects the sheet number
```
This code selects the third list element from the JSON response; it's a good idea to first assign the response to an object and inspect it. The first suggestion (with more code) is a bit more robust, since it selects the list element based on a criterion rather than its position.
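For example, inspecting the response first might look like this (a minimal sketch; the record names are whatever the API returns at the time):

``` r
# Look at the response before hard-coding an index
resp <- jsonlite::read_json("https://arbetsformedlingen.se/rest/analysportalen/va/sitevision")
length(resp)                      # number of records listed
sapply(resp, function(x) x$name)  # record names, to pick the right one
str(resp[[3]], max.level = 2)     # structure of the record used above
```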