How can I extract and display images from web-scraped data in R using the rvest package?

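Question

I am currently trying to scrape the names and logos of all the law schools in the U.S. from a Wikipedia page. I was able to create a table of university names and image links. However, I would like to create a table of university names with the actual images. I am using R.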

library(tidyverse)
library(rvest)

# Read in the data
url = 'https://en.wikipedia.org/wiki/List_of_law_schools_in_the_United_States'
school_url = 'https://en.wikipedia.org/wiki/List_of_law_schools_in_the_United_States' %>%
  read_html() %>%
  html_elements('.wikitable a') %>%
  html_attr('href')
school_url = paste('https://en.wikipedia.org/', school_url, sep='')

top3_school_urls = head(school_url, 3)

# Web scraper
results = list()
for (school_url in top3_school_urls) {
  message('Scraping URL: ', school_url)
  school_html = read_html(school_url) 
  School = school_html %>%
    html_element('h1 .mw-page-title-main') %>%
    html_text2()
  Logo = school_html %>%
    html_element('.infobox-image a') %>%
    html_attr('href')
  Logo = paste('https://en.wikipedia.org/', Logo, sep='')

  school_tibble = tibble(School, Logo)
  results[[school_url]] = school_tibble
}

d = bind_rows(results, .id = 'url')
d

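I read the data into R and parsed it with the rvest package. I was able to extract the links to the images, but I would like to take it a step further and have the actual images in the table.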
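
A side note on the link-building step in the code above: the hrefs on the list page appear to start with a slash already, so pasting them onto a base URL that ends in a slash gives a double slash. A minimal sketch of one way around that, assuming the xml2 package (a dependency of rvest) is available; the two hrefs are sample values and the variable names are illustrative only:

```r
library(xml2)

# Sample inputs; in the scraper these would come from html_attr('href').
base_url <- "https://en.wikipedia.org/"
hrefs <- c("/wiki/Birmingham_School_of_Law", "/wiki/Cumberland_School_of_Law")

# paste() keeps both slashes: "https://en.wikipedia.org//wiki/..."
pasted <- paste(base_url, hrefs, sep = "")

# url_absolute() resolves each href against the base URL instead.
resolved <- url_absolute(hrefs, base_url)
resolved
```

The same call could be applied to the logo href inside the loop, since it is built the same way.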


Answer 1

Score: 1

One quick and easy option is kableExtra, used here from an R Markdown document:

---
title: "law schools"
output: html_document
---
```{r echo=FALSE, warning=FALSE}
suppressPackageStartupMessages({
  library(rvest)
  library(stringr)
  library(purrr)
  library(dplyr)
  library(kableExtra)
})

mw_api <- list(page  = "https://en.wikipedia.org/api/rest_v1/page/html/",
               media = "https://en.wikipedia.org/api/rest_v1/page/media-list/")

top_3 <- read_html(paste0(mw_api$page, "List_of_law_schools_in_the_United_States")) %>% 
  # extract link elements only from 2nd column of the table
  html_elements("table.wikitable tbody tr > td:nth-child(2) > a") %>% 
  # keep only top 3
  head(3) %>% 
  # get link / wiki title and link text from single elements
  map(~ tibble::tibble_row( wikititle = html_attr(.x, "href") %>% str_remove("^\\./"),
                            title = html_text(.x) %>% str_trim())
  ) %>% list_rbind() %>% 
  # request media list for titles
  mutate(logo = map_chr(wikititle, ~ paste0(mw_api$media,.x) %>% 
                          jsonlite::read_json() %>% 
                          pluck("items", 1, "srcset", 1, "src")),
         logo = paste0("https:", logo))

top_3 %>% 
  mutate(logo = "") %>% 
  kbl(booktabs = TRUE) %>%
  kable_paper(full_width = FALSE) %>%
  column_spec(3, image = top_3$logo)

top_3
#> # A tibble: 3 × 3
#>   wikititle                title                    logo                        
#>   <chr>                    <chr>                    <chr>                       
#> 1 Birmingham_School_of_Law Birmingham School of Law https://upload.wikimedia.or…
#> 2 Cumberland_School_of_Law Cumberland School of Law https://upload.wikimedia.or…
#> 3 Samford_University       Samford University       https://upload.wikimedia.or…
```

Renders as:
[Screenshot of the rendered HTML table, with each school's logo shown in the third column]

When rendering to PDF, the image files must be downloaded first:

dir.create("tmp_img/")
# Download all files listed in top_3$logo and update 
# logo to hold local file paths instead of URLs
top_3 <- top_3 %>% 
  mutate(logo = map_chr(logo, ~ {
    destfile = file.path("tmp_img", basename(.x))
    download.file(.x, destfile = destfile, mode = "wb")
    destfile
    }))

top_3 %>% 
  mutate(logo = "") %>% 
  kbl(booktabs = TRUE) %>%
  # we can use "scale_down" for slightly better fit, but most likely  
  # it needs some further tweaking
  kable_styling(latex_options = c("scale_down")) %>%
  column_spec(3, image = top_3$logo)
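
For HTML output only, a similar table can probably be produced without kableExtra by writing the `<img>` tags by hand and telling `knitr::kable()` not to escape them. This is a rough sketch, not part of the original answer: the `schools` data frame is a made-up stand-in for `top_3`, the `example.org` URLs are placeholders, and `img_tag()` and its `height` default are hypothetical names chosen for illustration.

```r
# Hypothetical stand-in for the scraped `top_3` tibble; the URLs are placeholders.
schools <- data.frame(
  title = c("School A", "School B"),
  logo  = c("https://example.org/a.png", "https://example.org/b.png"),
  stringsAsFactors = FALSE
)

# Build one <img> tag per logo URL; `height` is an arbitrary display size in pixels.
img_tag <- function(src, height = 50) {
  sprintf('<img src="%s" height="%d" />', src, as.integer(height))
}

schools$logo <- img_tag(schools$logo)

# escape = FALSE keeps the tags as raw HTML instead of escaping them to text.
html_table <- knitr::kable(schools, format = "html", escape = FALSE)
html_table
```

Because escaping is switched off for the whole table, this sketch assumes the titles and URLs contain no characters that need HTML escaping.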

Posted by huangapple on May 21, 2023, 07:34:41
When reposting, please keep the link to this article: https://go.coder-hub.com/76297738.html