May 21, 2023 07:34:41

How can I extract and display images from web scraped data in R using rvest package?

Question

I am currently trying to scrape the names and logos of all the law schools in the U.S. from a Wikipedia page. I was able to create a table of university names and image links. However, I would like to create a table of university names with the actual images. I am using R.

library(tidyverse)
library(rvest)
# Reading in the data
url = 'https://en.wikipedia.org/wiki/List_of_law_schools_in_the_United_States'
school_url = 'https://en.wikipedia.org/wiki/List_of_law_schools_in_the_United_States' %>%
  read_html() %>%
  html_elements('.wikitable a') %>%
  html_attr('href')
school_url = paste('https://en.wikipedia.org/', school_url, sep='')
top3_school_urls = head(school_url, 3)
# Web scraper
results = list()
for (school_url in top3_school_urls) {
  message('Scraping URL: ', school_url)
  school_html = read_html(school_url)
  School = school_html %>%
    html_element('h1 .mw-page-title-main') %>%
    html_text2()
  Logo = school_html %>%
    html_element('.infobox-image a') %>%
    html_attr('href')
  Logo = paste('https://en.wikipedia.org/', Logo, sep='')
  school_tibble = tibble(School, Logo)
  results[[school_url]] = school_tibble
}
d = bind_rows(results, .id = 'url')
d

I read the data into R and parsed it using the rvest package. I was able to extract the links to the images, but I would like to take it a step further and have the actual images in the table.

Answer 1

Score: 1

One quick and easy option is kableExtra:

---
title: "law schools"
output: html_document
---
```{r echo=FALSE, warning=FALSE}
suppressPackageStartupMessages({
  library(rvest)
  library(stringr)
  library(purrr)
  library(dplyr)
  library(kableExtra)
})
mw_api <- list(page  = "https://en.wikipedia.org/api/rest_v1/page/html/",
               media = "https://en.wikipedia.org/api/rest_v1/page/media-list/")
top_3 <- read_html(paste0(mw_api$page, "List_of_law_schools_in_the_United_States")) %>%
  # extract link elements only from the 2nd column of the table
  html_elements("table.wikitable tbody tr > td:nth-child(2) > a") %>%
  # keep only the top 3
  head(3) %>%
  # get link / wiki title and link text from single elements
  map(~ tibble::tibble_row(wikititle = html_attr(.x, "href") %>% str_remove("^\\./"),
                           title = html_text(.x) %>% str_trim())
  ) %>% list_rbind() %>%
  # request the media list for each title and take the first image source
  mutate(logo = map_chr(wikititle, ~ paste0(mw_api$media, .x) %>%
                          jsonlite::read_json() %>%
                          pluck("items", 1, "srcset", 1, "src")),
         logo = paste0("https:", logo))
# blank out the URL text so that only the image is shown in column 3
top_3 %>%
  mutate(logo = "") %>%
  kbl(booktabs = TRUE) %>%
  kable_paper(full_width = FALSE) %>%
  column_spec(3, image = top_3$logo)
top_3
#> # A tibble: 3 × 3
#>   wikititle                title                    logo                        
#>   <chr>                    <chr>                    <chr>                       
#> 1 Birmingham_School_of_Law Birmingham School of Law https://upload.wikimedia.or…
#> 2 Cumberland_School_of_Law Cumberland School of Law https://upload.wikimedia.or…
#> 3 Samford_University       Samford University       https://upload.wikimedia.or…
```

Renders as: [screenshot of the rendered table, not preserved]
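For HTML output only, a similar table can be sketched without `column_spec()` by putting `<img>` tags into the column and switching off escaping in `knitr::kable()`. This is an alternative technique, not what the answer above does. The helper `img_tag()`, the 40-pixel height, and the example.org URLs are hypothetical placeholders standing in for the scraped `top_3` data.

```r
# Hypothetical helper: wrap image URLs in HTML <img> tags.
img_tag <- function(src, height = 40) {
  sprintf('<img src="%s" height="%d">', src, as.integer(height))
}

# Placeholder data standing in for the scraped `top_3` tibble.
schools <- data.frame(
  title = c("School A", "School B"),
  logo  = c("https://example.org/a.png", "https://example.org/b.png"),
  stringsAsFactors = FALSE
)
schools$logo <- img_tag(schools$logo)

# escape = FALSE keeps the <img> markup intact in the HTML table.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(schools, format = "html", escape = FALSE))
}
```

Because the images are plain HTML, this variant would not carry over to PDF output.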

When rendering to PDF, the files must be downloaded first:

dir.create("tmp_img", showWarnings = FALSE)
# Download all files listed in top_3$logo and update
# logo to hold local file paths instead of URLs
top_3 <- top_3 %>%
  mutate(logo = map_chr(logo, ~ {
    destfile = file.path("tmp_img", basename(.x))
    download.file(.x, destfile = destfile, mode = "wb")
    destfile
    }))
top_3 %>%
  mutate(logo = "") %>%
  kbl(booktabs = TRUE) %>%
  # we can use "scale_down" for a slightly better fit, but most likely
  # it needs some further tweaking
  kable_styling(latex_options = c("scale_down")) %>%
  column_spec(3, image = top_3$logo)
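`basename()` on a URL keeps any percent-encoding, and two logos with the same file name would overwrite each other. A minimal sketch of a more defensive naming step, assuming a hypothetical helper `local_logo_path()` that is not part of the answer above (the URLs are placeholders):

```r
# Hypothetical helper: derive a unique, filesystem-friendly local path
# for each image URL.
local_logo_path <- function(urls, dir = "tmp_img") {
  # Drop any query string or fragment before taking the file name.
  no_query <- sub("[?#].*$", "", urls)
  # Decode percent-escapes such as %20 (one element at a time).
  decoded <- vapply(basename(no_query), utils::URLdecode,
                    character(1), USE.NAMES = FALSE)
  # Replace anything outside a conservative character set.
  safe <- gsub("[^A-Za-z0-9._-]", "_", decoded)
  # Prefix with a running index so that names cannot collide.
  file.path(dir, sprintf("%02d_%s", seq_along(urls), safe))
}

urls <- c("https://example.org/img/My%20Logo.png?x=1",
          "https://example.org/other/My%20Logo.png")
local_logo_path(urls)
```

The result could be used as `destfile` in the `download.file()` call, in place of `file.path("tmp_img", basename(.x))`.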
