June 16, 2023 14:12:29

Using rvest to parse a dataframe column of class chr, containing html and non html input

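Question

I'm new to rvest and experimenting with web scraping. Is there a way to use rvest to parse a data frame column of class chr that contains both HTML and non-HTML input, specifically converting the rows with HTML input into parsed text while leaving the other rows unchanged?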
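```R
# Assume my data frame is called 'df' and the character column is named 'body'
# Create a new column called 'clean_body' to store the cleaned data
df$clean_body <- NA
# Iterate over each row and clean the 'body' column
for (i in 1:nrow(df)) {
  # Check that the 'body' value is not NA
  if (!is.na(df$body[i])) {
    # Check whether the value contains HTML tags
    if (grepl("<.*?>", df$body[i])) {
      # If the value contains HTML tags, parse the HTML and store the cleaned text
      df$clean_body[i] <- html_text(read_html(df$body[i]))
    } else {
      # If the value doesn't contain HTML tags, keep the original value
      df$clean_body[i] <- df$body[i]
    }
  }
}
```

Using this logic, I get an error: `Error in UseMethod("xml_text") : no applicable method for 'xml_text' applied to an object of class "xml_document"`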


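# Answer 1
**Score**: 1
Depending on your actual data (content, volume, ratio of HTML to plain-text records), all you might need is to apply `rvest::minimal_html()` to the whole column first.
Because `minimal_html()` does not work on a vector of strings, it has to be applied to each item individually: with something from the `apply` or `purrr` map family, a row-wise operation (`dplyr::rowwise()`), or a for-loop over the items. In the example below it is handled with `sapply()`: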
``` r
library(rvest)
df_ <- tibble::tibble(
  descr = c("text", "body", "html"),
  body = c(
    "test simple text",
    "<body>\n<p>test html body</p>\n<p></p>\n</body>",
    "<!DOCTYPE html>\n<html>\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n<meta charset=\"utf-8\">\n<title>title</title>\n</head>\n<body>\n<p>test html document</p>\n<p></p>\n</body>\n</html>"
  )
)
df_
#> # A tibble: 3 × 2
#>   descr body                                                                    
#>   <chr> <chr>                                                                   
#> 1 text  "test simple text"                                                      
#> 2 body  "<body>\n<p>test html body</p>\n<p></p>\n</body>"                       
#> 3 html  "<!DOCTYPE html>\n<html>\n<head>\n<meta http-equiv=\"Content-Type\" con…
df_$clean_body <- sapply(df_$body, \(text) minimal_html(text) |> html_text() |> trimws(),
                        simplify = TRUE) 
df_[, c("descr", "clean_body", "body")]
#> # A tibble: 3 × 3
#>   descr clean_body                  body                                        
#>   <chr> <chr>                       <chr>                                       
#> 1 text  "test simple text"          "test simple text"                          
#> 2 body  "test html body"            "<body>\n<p>test html body</p>\n<p></p>\n</…
#> 3 html  "title\ntest html document" "<!DOCTYPE html>\n<html>\n<head>\n<meta htt…
```

<sup>Created on 2023-06-16 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
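
As a variation on the per-item approach described in this answer, here is a minimal sketch that also keeps `NA` values and tag-free strings untouched, as the question's loop intended. The helper name `strip_html`, the sample data, and the tag-detection regex are illustrative assumptions, not part of the original answer; the regex is only a rough heuristic for "looks like HTML". It also assumes rvest 1.0.0 or later, where the first argument of `minimal_html()` is the HTML string.

```r
library(rvest)

# Hypothetical helper (not from the original answer): clean a single string.
strip_html <- function(x) {
  # Leave missing values and strings without anything tag-like unchanged
  if (is.na(x) || !grepl("<[^>]+>", x)) {
    return(x)
  }
  # Wrap the fragment in a minimal document, extract its text, trim whitespace
  trimws(html_text(minimal_html(x)))
}

# Made-up sample data for illustration
df <- data.frame(
  body = c("test simple text", "<p>test html body</p>", NA),
  stringsAsFactors = FALSE
)

# vapply() enforces one character value per input item;
# USE.NAMES = FALSE avoids naming the result after the input strings
df$clean_body <- vapply(df$body, strip_html, character(1), USE.NAMES = FALSE)
df
```

Using `vapply()` rather than `sapply()` is a design choice here: it fails loudly if any item does not come back as a single string, which makes unexpected input easier to spot.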

