Web Scraping with R and rvest

Question

I have a project in which I need to scrape a series of articles from news sites. I am interested in the headline and body text of each article. In most cases, the site maintains a base URL, for example:

> https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html
> https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html

As there are so many articles (more than 1,000) to download, I thought of creating a function to download all the data automatically. A vector provides all the web addresses (one per line):

article
[1] "https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html"
[2] "https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html"
[3] "https://www.americanthinker.com/articles/2018/11/immigrants_will_not_fund_our_retirement.html"
> str(article)
 chr [1:3] "https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html" ...
> summary(article)
   Length     Class      Mode 
        3 character character 

As a result, the script would use the vector as the source of addresses and create a data frame with the title and text of each article. But some errors pop up. Here is the code I wrote, based on a series of Stack Overflow posts:

Packages

library(rvest)
library(purrr)
library(xml2) 
library(dplyr)
library(readr)

Importing the CSV and pulling it into a vector

base <- read_csv(file.choose(), col_names = FALSE)
article <- pull(base, X1)
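
For reference, `read_csv()` with `col_names = FALSE` names the lone column `X1`, which is why `pull(base, X1)` extracts it. A non-interactive equivalent, assuming a hypothetical one-column file called urls.csv:

# Non-interactive equivalent (the file name "urls.csv" is a placeholder):
base <- read_csv("urls.csv", col_names = FALSE)  # one URL per line, no header
article <- pull(base, X1)                        # character vector of URLs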

First try

articles_final <- map_df(article, function(i){
  pages <- read_html(article)
  title <-
    article %>% map_chr(. %>% html_node("h1") %>% html_text())
  content <-
    article %>% map_chr(. %>% html_nodes('.article_body span') %>% html_text() %>% paste(., collapse = ""))
  article_table <- data.frame("Title" = title, "Content" = content)
  return(article_table)
})

Second try

map_df(1:3, function(i){
  page <- read_html(sprintf(article, i))
  data.frame(Title = html_text(html_nodes(page, '.h1')),
             Content = html_text(html_nodes(page, '.article_body span')),
             Site = "American Thinker"
             )
}) -> articles_final

In both cases, I am getting the following error while running these functions:

Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options):
Expecting a single string value:
[type = character; extent = 3].
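
For what it is worth, this error appears to come from `read_html()` receiving the whole three-element character vector instead of a single URL: the first try passes `article` itself rather than the loop variable `i`, and in the second try `sprintf(article, i)` likewise returns all three strings. A minimal sketch of the per-URL version, assuming the same `h1` and `.article_body span` selectors:

library(rvest)
library(purrr)

# Minimal sketch (not the original code): hand read_html() one URL at a time
# so it always receives a single string.
articles_final <- map_df(article, function(url) {
  page <- read_html(url)
  data.frame(
    Title   = page %>% html_node("h1") %>% html_text(),
    Content = page %>% html_nodes(".article_body span") %>% html_text() %>%
      paste(collapse = ""),
    stringsAsFactors = FALSE
  )
})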

I need this to download and analyse the articles.

Thank you very much for your help.

Edit

I tried the code suggested below, but it did not work; there seems to be some problem with my code:
> map_dfc(.x = article,
+         .f = function(x){
+           foo <- tibble(Title = read_html(x) %>%
+                           html_nodes("h1") %>%
+                           html_text() %>%
+                           .[nchar(.) > 0],
+                         Content = read_html(x) %>%
+                           html_nodes("p") %>%
+                           html_text(),
+                         Site = "AmericanThinker") %>%
+             filter(nchar(Content) > 0)
+           }
+         ) -> out
Error: Argument 3 must be length 28, not 46

But, as you can see, a new error pops up.
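
A note on this second error (my reading, not stated in the thread): `map_dfc()` binds the per-article results column-wise, and the underlying `bind_cols()` demands that every piece have the same number of rows; here one article yields 28 paragraph rows and another 46, hence the message. Row-binding with `map_dfr()`, and collapsing the paragraphs into one string per article, avoids both problems. A hedged sketch:

library(tidyverse)
library(rvest)

# Sketch (assumptions: one <h1> per article; every <p> belongs to the body).
out <- map_dfr(article, function(x) {
  page <- read_html(x)
  tibble(
    Title   = page %>% html_node("h1") %>% html_text(trim = TRUE),
    Content = page %>% html_nodes("p") %>% html_text() %>% paste(collapse = " "),
    Site    = "AmericanThinker"
  )
})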

Answer 1

Score: 2

Here is what I tried for you. I was playing with Selector Gadget and checking the page source. After some inspection, I think you need to use `<title>` and `<div class="article_body">`. The `map_dfr()` part loops through the three articles in `article` and creates a data frame, with one row per article. I think you will still need some string manipulation to get clean text, but this will help you scrape the content you need.

library(tidyverse)
library(rvest)

article <- c(
  "https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html",
  "https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html",
  "https://www.americanthinker.com/articles/2018/11/immigrants_will_not_fund_our_retirement.html"
)

# Loop over the URLs and row-bind one tibble per article
map_dfr(.x = article,
        .f = function(x){
          tibble(
            Title = read_html(x) %>%
              html_nodes("title") %>%     # the page's <title> tag
              html_text(),
            Content = read_html(x) %>%
              html_nodes(xpath = "//div[@class='article_body']") %>%  # article body div
              html_text(),
            Site = "AmericanThinker"
          )
        }) -> result

#  Title                              Content                                                  Site      
#  <chr>                              <chr>                                                    <chr>     
#1 Why Rich People Love Poor Immigra… "Soon after the Immigration Act of 1965 was passed, rea… AmericanT…
#2 California begins giving driver's… "The largest state in  the union began handing out driv… AmericanT…
#3 Immigrants Will Not Fund Our Reti… "Ask Democrats why they support open borders, and they … AmericanT…
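
One practical refinement for the full batch (a suggestion, not part of the answer): the code above parses each page twice, once per column, which doubles the HTTP requests; with more than 1,000 URLs it is also worth pausing between requests and skipping pages that fail to load. A sketch using `purrr::possibly()` with the same selectors:

library(tidyverse)
library(rvest)

# Batch sketch (assumptions: same selectors as the answer; a 1-second pause
# between requests; pages that fail to download are skipped, not fatal).
scrape_one <- possibly(function(x) {
  Sys.sleep(1)                       # be polite to the server
  page <- read_html(x)               # parse once, reuse for both columns
  tibble(
    Title   = page %>% html_node("title") %>% html_text(),
    Content = page %>% html_node(xpath = "//div[@class='article_body']") %>%
      html_text(),
    Site    = "AmericanThinker"
  )
}, otherwise = NULL)

result <- map_dfr(article, scrape_one)  # NULLs from failed pages are dropped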


