Web Scraping with R and rvest


I have a project where I need to scrape a series of articles from news sites. I am interested in the headline and body text of each article. In most cases, the site uses a consistent base URL structure, for example:

> https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html
> https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html

As there are so many articles (more than 1,000) to download, I thought of creating a function to download all the data automatically. A vector provides all the web addresses (one per line):

    article
    [1] "https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html"
    [2] "https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html"
    [3] "https://www.americanthinker.com/articles/2018/11/immigrants_will_not_fund_our_retirement.html"
    > str(article)
     chr [1:3] "https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html" ...
    > summary(article)
       Length     Class      Mode
            3 character character

The script would then use this vector as the source of addresses and create a data frame with the title and text of each article. But some errors pop up. Here is the code I wrote, based on a series of Stack Overflow posts:

Packages

    library(rvest)
    library(purrr)
    library(xml2)
    library(dplyr)
    library(readr)

Importing the CSV and extracting the URLs as a vector

    base <- read_csv(file.choose(), col_names = FALSE)
    article <- pull(base, X1)
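
If the list of URLs happens to live in a plain text file with one address per line rather than a CSV, `readLines()` is a lighter-weight alternative (the file name below is only illustrative):

    # assumes a plain text file with one URL per line; "urls.txt" is a placeholder name
    article <- readLines("urls.txt")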

First try

    articles_final <- map_df(article, function(i){
      pages <- read_html(article)
      title <- article %>%
        map_chr(. %>% html_node("h1") %>% html_text())
      content <- article %>%
        map_chr(. %>% html_nodes('.article_body span') %>% html_text() %>% paste(., collapse = ""))
      article_table <- data.frame("Title" = title, "Content" = content)
      return(article_table)
    })

Second try

    map_df(1:3, function(i){
      page <- read_html(sprintf(article, i))
      data.frame(Title = html_text(html_nodes(page, '.h1')),
                 Content = html_text(html_nodes(page, '.article_body span')),
                 Site = "American Thinker")
    }) -> articles_final

In both cases, I am getting the following error while running these functions:

    Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
      Expecting a single string value: [type = character; extent = 3].
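
For reference, this parser error is what `read_html()` reports when it receives the whole three-element `article` vector instead of a single URL: the first attempt passes `article` directly, and the second passes `sprintf(article, i)`, which is again a length-3 vector. A minimal sketch of that one change, keeping the question's own selectors (whether `.article_body span` matches the site's markup is a separate issue, see the answer below), would be:

    # sketch: hand read_html() one URL at a time, not the whole vector
    articles_final <- map_df(article, function(url){
      page <- read_html(url)
      data.frame(
        Title   = html_text(html_node(page, "h1")),
        Content = paste(html_text(html_nodes(page, ".article_body span")), collapse = ""),
        Site    = "American Thinker",
        stringsAsFactors = FALSE
      )
    })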

I need this to download and analyse the articles.

Thank you very much for your help.

Edit

I tried the code suggested below, but it did not work; there seems to be some problem with my code:

    > map_dfc(.x = article,
    +         .f = function(x){
    +           foo <- tibble(Title = read_html(x) %>%
    +                           html_nodes("h1") %>%
    +                           html_text() %>%
    +                           .[nchar(.) > 0],
    +                         Content = read_html(x) %>%
    +                           html_nodes("p") %>%
    +                           html_text(),
    +                         Site = "AmericanThinker") %>%
    +             filter(nchar(Content) > 0)
    +         }
    + ) -> out
    Error: Argument 3 must be length 28, not 46

But, as you can see, a new error pops up.
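
A likely culprit for this second error is `map_dfc()`: it binds the per-article tibbles column-wise, so every article must yield the same number of rows (here 28 vs. 46), whereas `map_dfr()`, which the answer below uses, stacks them row-wise. A minimal illustration of the difference:

    library(purrr)
    library(tibble)

    # two per-article results with different numbers of rows
    dfs <- list(tibble(x = 1:2), tibble(x = 1:3))

    map_dfr(dfs, identity)    # row-binds: 5 rows, works even though the heights differ
    # map_dfc(dfs, identity)  # column-binds: errors because the heights (2 vs. 3) differ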

Answer 1

Score: 2

Here is what I tried for you. I was playing with Selector Gadget and checking the page source. After some inspection, I think you need to use `<title>` and `<div class="article_body">`. The `map()` part loops through the three articles in `article` and creates a data frame; each row represents one article. I think you still need to do some string manipulation to get clean text, but this will help you scrape the contents you need.

    library(tidyverse)
    library(rvest)

    article <- c(
      "https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html",
      "https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html",
      "https://www.americanthinker.com/articles/2018/11/immigrants_will_not_fund_our_retirement.html"
    )

    map_dfr(.x = article,
            .f = function(x){
              tibble(
                Title = read_html(x) %>%
                  html_nodes("title") %>%
                  html_text(),
                Content = read_html(x) %>%
                  html_nodes(xpath = "//div[@class='article_body']") %>%
                  html_text(),
                Site = "AmericanThinker"
              )
            }) -> result

    #  Title                              Content                                                  Site
    #  <chr>                              <chr>                                                    <chr>
    #1 Why Rich People Love Poor Immigra… "Soon after the Immigration Act of 1965 was passed, rea… AmericanT…
    #2 California begins giving driver's… "The largest state in the union began handing out driv… AmericanT…
    #3 Immigrants Will Not Fund Our Reti… "Ask Democrats why they support open borders, and they … AmericanT…
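
The answer mentions that some string manipulation is still needed to get clean text. A minimal sketch of that cleanup, assuming the `result` tibble above and using `stringr::str_squish()` (part of the tidyverse) to collapse the whitespace and newlines that `html_text()` keeps:

    library(dplyr)
    library(stringr)

    result_clean <- result %>%
      mutate(Title   = str_squish(Title),    # trim ends and collapse runs of whitespace
             Content = str_squish(Content))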

  1. <details>
  2. <summary>英文:</summary>
  3. Here is what I tried for you. I was playing with Selector Gadget and checking page source. After some inspection, I think you need to use `&lt;title&gt;` and `&lt;div class=&quot;article_body&quot;&gt;`. The `map()` part is looping through the three articles in `article` and creating a data frame. Each row represents each article. I think you still need to do some string manipulation to have clean text. But this will help you to scrape the contents you need.
  4. library(tidyverse)
  5. library(rvest)
  6. article &lt;- c(&quot;https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html&quot;,
  7. &quot;https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html&quot;,
  8. &quot;https://www.americanthinker.com/articles/2018/11/immigrants_will_not_fund_our_retirement.html&quot;)
  9. map_dfr(.x = article,
  10. .f = function(x){
  11. tibble(Title = read_html(x) %&gt;%
  12. html_nodes(&quot;title&quot;) %&gt;%
  13. html_text(),
  14. Content = read_html(x) %&gt;%
  15. html_nodes(xpath = &quot;//div[@class=&#39;article_body&#39;]&quot;) %&gt;%
  16. html_text(),
  17. Site = &quot;AmericanThinker&quot;)}) -&gt; result
  18. # Title Content Site
  19. # &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
  20. #1 Why Rich People Love Poor Immigra… &quot;Soon after the Immigration Act of 1965 was passed, rea… AmericanT…
  21. #2 California begins giving driver&#39;s… &quot;The largest state in the union began handing out driv… AmericanT…
  22. #3 Immigrants Will Not Fund Our Reti… &quot;Ask Democrats why they support open borders, and they … AmericanT…
  23. </details>
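One practical note for scaling this to the 1,000+ articles mentioned in the question: the answer's function parses each page twice (once for `Title`, once for `Content`). A sketch that reads each page only once and pauses briefly between requests, reusing the answer's selectors, might look like this:

    library(tidyverse)
    library(rvest)

    scrape_one <- function(x){
      page <- read_html(x)   # fetch and parse the page a single time
      Sys.sleep(1)           # short pause between requests to be polite to the server
      tibble(
        Title   = page %>% html_nodes("title") %>% html_text(),
        Content = page %>% html_nodes(xpath = "//div[@class='article_body']") %>% html_text(),
        Site    = "AmericanThinker"
      )
    }

    result <- map_dfr(article, scrape_one)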
