How can I extract and display images from web scraped data in R using rvest package?
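Question
I am trying to scrape the names and logos of all the law schools in the U.S. from a Wikipedia page. I was able to create a table of university names and image links. However, I would like a table of university names with the actual images. I am using R.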
library(tidyverse)
library(rvest)

# Read in the data
url = 'https://en.wikipedia.org/wiki/List_of_law_schools_in_the_United_States'
school_url = url %>%
  read_html() %>%
  html_elements('.wikitable a') %>%
  html_attr('href')
# The hrefs already start with "/", so the host gets no trailing slash
school_url = paste0('https://en.wikipedia.org', school_url)
top3_school_urls = head(school_url, 3)

# Web scraper
results = list()
for (school_url in top3_school_urls) {
  message('Scraping URL: ', school_url)
  school_html = read_html(school_url)
  School = school_html %>%
    html_element('h1 .mw-page-title-main') %>%
    html_text2()
  Logo = school_html %>%
    html_element('.infobox-image a') %>%
    html_attr('href')
  Logo = paste0('https://en.wikipedia.org', Logo)
  school_tibble = tibble(School, Logo)
  results[[school_url]] = school_tibble
}
d = bind_rows(results, .id = 'url')
d
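I read the data into R and parsed it using the rvest package. I was able to extract the links to the images, but I would like to take it a step further and have the actual images in the table.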
Answer 1
Score: 1
One quick and easy option is kableExtra:
---
title: "law schools"
output: html_document
---
```{r echo=FALSE, warning=FALSE}
suppressPackageStartupMessages({
library(rvest)
library(stringr)
library(purrr)
library(dplyr)
library(kableExtra)
})
mw_api <- list(
  page  = "https://en.wikipedia.org/api/rest_v1/page/html/",
  media = "https://en.wikipedia.org/api/rest_v1/page/media-list/"
)

top_3 <- read_html(paste0(mw_api$page, "List_of_law_schools_in_the_United_States")) %>%
  # extract link elements only from 2nd column of the table
  html_elements("table.wikitable tbody tr > td:nth-child(2) > a") %>%
  # keep only top 3
  head(3) %>%
  # get link / wiki title and link text from single elements
  map(~ tibble::tibble_row(
    # hrefs look like "./Title"; strip the literal "./" prefix
    wikititle = html_attr(.x, "href") %>% str_remove("^\\./"),
    title     = html_text(.x) %>% str_trim()
  )) %>%
  list_rbind() %>%
  # request media list for titles
  mutate(
    logo = map_chr(wikititle, ~ paste0(mw_api$media, .x) %>%
      jsonlite::read_json() %>%
      pluck("items", 1, "srcset", 1, "src")),
    # src is protocol-relative, so prepend the scheme
    logo = paste0("https:", logo)
  )

# blank the logo column, then let column_spec() draw the images into it
top_3 %>%
  mutate(logo = "") %>%
  kbl(booktabs = TRUE) %>%
  kable_paper(full_width = FALSE) %>%
  column_spec(3, image = top_3$logo)
top_3
#> # A tibble: 3 × 3
#> wikititle title logo
#> <chr> <chr> <chr>
#> 1 Birmingham_School_of_Law Birmingham School of Law https://upload.wikimedia.or…
#> 2 Cumberland_School_of_Law Cumberland School of Law https://upload.wikimedia.or…
#> 3 Samford_University Samford University https://upload.wikimedia.or…
```
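The `pluck("items", 1, "srcset", 1, "src")` step is easier to follow on a small offline example. The sketch below uses a hand-written list, `mock_media`, that only imitates the shape the code above expects from the media-list endpoint. The file name and URLs in it are invented for illustration, and the real response carries more fields.

```r
library(purrr)

# Hypothetical stand-in for the parsed JSON of one media-list response.
# Only the fields read by the code above are included; all values are invented.
mock_media <- list(
  items = list(
    list(
      title = "File:Example_logo.png",
      srcset = list(
        list(src = "//upload.wikimedia.org/example/220px-Example_logo.png", scale = "1x"),
        list(src = "//upload.wikimedia.org/example/330px-Example_logo.png", scale = "1.5x")
      )
    )
  )
)

# First item, first srcset entry, its "src" field
logo_src <- pluck(mock_media, "items", 1, "srcset", 1, "src")

# The value starts with "//", so add a scheme to get a usable URL
logo_url <- paste0("https:", logo_src)
logo_url
```

Note that `pluck()` returns `NULL` when the path does not exist, and `map_chr()` stops with an error on a `NULL` result, so pages without any media items would need separate handling.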
When rendering to PDF, the image files must be downloaded first:
# showWarnings = FALSE avoids a warning if the directory already exists
dir.create("tmp_img", showWarnings = FALSE)

# Download all files listed in top_3$logo and update
# logo to hold local file paths instead of URLs
top_3 <- top_3 %>%
  mutate(logo = map_chr(logo, ~ {
    destfile <- file.path("tmp_img", basename(.x))
    download.file(.x, destfile = destfile, mode = "wb")
    destfile
  }))

top_3 %>%
  mutate(logo = "") %>%
  kbl(booktabs = TRUE) %>%
  # "scale_down" gives a slightly better fit, but it most likely
  # needs some further tweaking
  kable_styling(latex_options = c("scale_down")) %>%
  column_spec(3, image = top_3$logo)
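As a side note on the question's own selector: `.infobox-image a` with `html_attr('href')` returns the link to the file description page (`/wiki/File:...`), not the image file. A minimal sketch of reading the `<img>` element's `src` instead is shown below. It runs on a hand-written HTML fragment so it works offline; the fragment only imitates an infobox, its file names are invented, and real pages may be structured differently.

```r
library(rvest)

# Hypothetical fragment imitating a Wikipedia infobox image cell
page <- minimal_html('
  <table class="infobox">
    <tr><td class="infobox-image">
      <a href="/wiki/File:Example_logo.png">
        <img src="//upload.wikimedia.org/example/220px-Example_logo.png">
      </a>
    </td></tr>
  </table>
')

# Selector from the question: href of the <a>, i.e. the file description page
file_page <- page %>%
  html_element(".infobox-image a") %>%
  html_attr("href")

# Selecting the <img> and reading "src" gives the image file itself
img_src <- page %>%
  html_element(".infobox-image img") %>%
  html_attr("src")

# Assumes the src is protocol-relative ("//..."), as in this fragment
img_url <- paste0("https:", img_src)
```

A vector of such URLs could then be passed to `column_spec(image = ...)` in the same way as `top_3$logo` above.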