Web Scraping with R and rvest


I have a project where I need to scrape a series of articles from news sites. I am interested in the headline and body text of each article. In most cases, the site uses a consistent base URL structure, for example:

> https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html
> https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html

As there are so many articles (more than 1,000) to download, I thought of creating a function to download all the data automatically. A vector provides all the web addresses (one per line):

    article
    [1] "https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html"
    [2] "https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html"
    [3] "https://www.americanthinker.com/articles/2018/11/immigrants_will_not_fund_our_retirement.html"
    > str(article)
     chr [1:3] "https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html" ...
    > summary(article)
       Length     Class      Mode
            3 character character

The script would then use this vector as the source of addresses and create a data frame with the title and text of each article. But some errors pop up. Here is the code I wrote, based on a series of Stack Overflow posts:

Packages

    library(rvest)
    library(purrr)
    library(xml2)
    library(dplyr)
    library(readr)

Importing the CSV and extracting the URLs as a vector

    base <- read_csv(file.choose(), col_names = FALSE)
    article <- pull(base, X1)
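
If the list of URLs happens to live in a plain text file with one address per line rather than a CSV, `readLines()` is a lighter-weight alternative (the file name below is only illustrative):

    # assumes a plain text file with one URL per line; "urls.txt" is a placeholder name
    article <- readLines("urls.txt")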

First try

    articles_final <- map_df(article, function(i){
      pages <- read_html(article)
      title <- article %>%
        map_chr(. %>% html_node("h1") %>% html_text())
      content <- article %>%
        map_chr(. %>% html_nodes('.article_body span') %>% html_text() %>% paste(., collapse = ""))
      article_table <- data.frame("Title" = title, "Content" = content)
      return(article_table)
    })

Second try

    map_df(1:3, function(i){
      page <- read_html(sprintf(article, i))
      data.frame(Title = html_text(html_nodes(page, '.h1')),
                 Content = html_text(html_nodes(page, '.article_body span')),
                 Site = "American Thinker")
    }) -> articles_final

In both cases, I am getting the following error while running these functions:

    Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
      Expecting a single string value: [type = character; extent = 3].
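
For reference, this parser error is what `read_html()` reports when it receives the whole three-element `article` vector instead of a single URL: the first attempt passes `article` directly, and the second passes `sprintf(article, i)`, which is again a length-3 vector. A minimal sketch of that one change, keeping the question's own selectors (whether `.article_body span` matches the site's markup is a separate issue, see the answer below), would be:

    # sketch: hand read_html() one URL at a time, not the whole vector
    articles_final <- map_df(article, function(url){
      page <- read_html(url)
      data.frame(
        Title   = html_text(html_node(page, "h1")),
        Content = paste(html_text(html_nodes(page, ".article_body span")), collapse = ""),
        Site    = "American Thinker",
        stringsAsFactors = FALSE
      )
    })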

I need this to download and analyse the articles.

Thank you very much for your help.

Edit

I tried the code suggested below, but it did not work; there seems to be some problem with my code:

    > map_dfc(.x = article,
    +         .f = function(x){
    +           foo <- tibble(Title = read_html(x) %>%
    +                           html_nodes("h1") %>%
    +                           html_text() %>%
    +                           .[nchar(.) > 0],
    +                         Content = read_html(x) %>%
    +                           html_nodes("p") %>%
    +                           html_text(),
    +                         Site = "AmericanThinker") %>%
    +             filter(nchar(Content) > 0)
    +         }
    + ) -> out
    Error: Argument 3 must be length 28, not 46

But, as you can see, a new error pops up.
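
A likely culprit for this second error is `map_dfc()`: it binds the per-article tibbles column-wise, so every article must yield the same number of rows (here 28 vs. 46), whereas `map_dfr()`, which the answer below uses, stacks them row-wise. A minimal illustration of the difference:

    library(purrr)
    library(tibble)

    # two per-article results with different numbers of rows
    dfs <- list(tibble(x = 1:2), tibble(x = 1:3))

    map_dfr(dfs, identity)    # row-binds: 5 rows, works even though the heights differ
    # map_dfc(dfs, identity)  # column-binds: errors because the heights (2 vs. 3) differ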

Answer 1

Score: 2

Here is what I tried for you. I was playing with Selector Gadget and checking the page source. After some inspection, I think you need to use `<title>` and `<div class="article_body">`. The `map()` part loops through the three articles in `article` and creates a data frame; each row represents one article. I think you still need to do some string manipulation to get clean text, but this will help you scrape the contents you need.

    library(tidyverse)
    library(rvest)

    article <- c(
      "https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html",
      "https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html",
      "https://www.americanthinker.com/articles/2018/11/immigrants_will_not_fund_our_retirement.html"
    )

    map_dfr(.x = article,
            .f = function(x){
              tibble(
                Title = read_html(x) %>%
                  html_nodes("title") %>%
                  html_text(),
                Content = read_html(x) %>%
                  html_nodes(xpath = "//div[@class='article_body']") %>%
                  html_text(),
                Site = "AmericanThinker"
              )
            }) -> result

    #  Title                              Content                                                  Site
    #  <chr>                              <chr>                                                    <chr>
    #1 Why Rich People Love Poor Immigra… "Soon after the Immigration Act of 1965 was passed, rea… AmericanT…
    #2 California begins giving driver's… "The largest state in the union began handing out driv… AmericanT…
    #3 Immigrants Will Not Fund Our Reti… "Ask Democrats why they support open borders, and they … AmericanT…
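
The answer mentions that some string manipulation is still needed to get clean text. A minimal sketch of that cleanup, assuming the `result` tibble above and using `stringr::str_squish()` (part of the tidyverse) to collapse the whitespace and newlines that `html_text()` keeps:

    library(dplyr)
    library(stringr)

    result_clean <- result %>%
      mutate(Title   = str_squish(Title),    # trim ends and collapse runs of whitespace
             Content = str_squish(Content))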

  1. <details>
  2. <summary>英文:</summary>
  3. Here is what I tried for you. I was playing with Selector Gadget and checking page source. After some inspection, I think you need to use `&lt;title&gt;` and `&lt;div class=&quot;article_body&quot;&gt;`. The `map()` part is looping through the three articles in `article` and creating a data frame. Each row represents each article. I think you still need to do some string manipulation to have clean text. But this will help you to scrape the contents you need.
  4. library(tidyverse)
  5. library(rvest)
  6. article &lt;- c(&quot;https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html&quot;,
  7. &quot;https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html&quot;,
  8. &quot;https://www.americanthinker.com/articles/2018/11/immigrants_will_not_fund_our_retirement.html&quot;)
  9. map_dfr(.x = article,
  10. .f = function(x){
  11. tibble(Title = read_html(x) %&gt;%
  12. html_nodes(&quot;title&quot;) %&gt;%
  13. html_text(),
  14. Content = read_html(x) %&gt;%
  15. html_nodes(xpath = &quot;//div[@class=&#39;article_body&#39;]&quot;) %&gt;%
  16. html_text(),
  17. Site = &quot;AmericanThinker&quot;)}) -&gt; result
  18. # Title Content Site
  19. # &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
  20. #1 Why Rich People Love Poor Immigra… &quot;Soon after the Immigration Act of 1965 was passed, rea… AmericanT…
  21. #2 California begins giving driver&#39;s… &quot;The largest state in the union began handing out driv… AmericanT…
  22. #3 Immigrants Will Not Fund Our Reti… &quot;Ask Democrats why they support open borders, and they … AmericanT…
  23. </details>
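One practical note for scaling this to the 1,000+ articles mentioned in the question: the answer's function parses each page twice (once for `Title`, once for `Content`). A sketch that reads each page only once and pauses briefly between requests, reusing the answer's selectors, might look like this:

    library(tidyverse)
    library(rvest)

    scrape_one <- function(x){
      page <- read_html(x)   # fetch and parse the page a single time
      Sys.sleep(1)           # short pause between requests to be polite to the server
      tibble(
        Title   = page %>% html_nodes("title") %>% html_text(),
        Content = page %>% html_nodes(xpath = "//div[@class='article_body']") %>% html_text(),
        Site    = "AmericanThinker"
      )
    }

    result <- map_dfr(article, scrape_one)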
