Using rvest to parse a dataframe column of class chr, containing HTML and non-HTML input

Question

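I'm new to rvest and experimenting with web scraping. Is there a way to use rvest to parse a data frame column of class chr that contains both HTML and non-HTML input, specifically converting the rows with HTML input into parsed text while leaving the other rows unchanged?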

```R
# Assume my data frame is called 'df' and the chr column is named 'body'
# Create a new column called 'clean_body' to store the cleaned data

df$clean_body <- NA

# Iterate over each row and clean the 'body' column
for (i in 1:nrow(df)) {
  # Check that the 'body' value is not NA
  if (!is.na(df$body[i])) {
    # Check whether the value contains HTML tags
    if (grepl("<.*?>", df$body[i])) {
      # If it does, parse the HTML and store the cleaned text
      df$clean_body[i] <- html_text(read_html(df$body[i]))
    } else {
      # If it does not, keep the original value
      df$clean_body[i] <- df$body[i]
    }
  }
}
```

With this logic I get an error: "Error in UseMethod("xml_text") : no applicable method for 'xml_text' applied to an object of class "xml_document""




# Answer 1
**Score**: 1

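Depending on your actual data (content, amount, ratio of HTML to plain-text records), all you might need is to apply `rvest::minimal_html()` to the whole column first.

As `minimal_html()` does not work on a vector of strings, we need a way to apply it to each individual item: use something from the `apply` or `purrr` map families, do a row-wise operation (`dplyr::rowwise()`), or loop through the items. In the example below it is handled with `sapply()`: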

``` r
library(rvest)
df_ <- tibble::tibble(
  descr = c("text", "body", "html"),
  body = c(
    "test simple text",
    "<body>\n<p>test html body</p>\n<p></p>\n</body>",
    "<!DOCTYPE html>\n<html>\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n<meta charset=\"utf-8\">\n<title>title</title>\n</head>\n<body>\n<p>test html document</p>\n<p></p>\n</body>\n</html>"
  )
)
df_
#> # A tibble: 3 × 2
#>   descr body                                                                    
#>   <chr> <chr>                                                                   
#> 1 text  "test simple text"                                                      
#> 2 body  "<body>\n<p>test html body</p>\n<p></p>\n</body>"                       
#> 3 html  "<!DOCTYPE html>\n<html>\n<head>\n<meta http-equiv=\"Content-Type\" con…

df_$clean_body <- sapply(df_$body, \(text) minimal_html(text) |> html_text() |> trimws(),
                        simplify = TRUE) 

df_[, c("descr", "clean_body", "body")]
#> # A tibble: 3 × 3
#>   descr clean_body                  body                                        
#>   <chr> <chr>                       <chr>                                       
#> 1 text  "test simple text"          "test simple text"                          
#> 2 body  "test html body"            "<body>\n<p>test html body</p>\n<p></p>\n</…
#> 3 html  "title\ntest html document" "<!DOCTYPE html>\n<html>\n<head>\n<meta htt…
```

<sup>Created on 2023-06-16 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
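
If only the rows that look like HTML should be touched, as the question asks, the same idea can be sketched with an explicit check per item. This is a minimal sketch rather than a definitive solution: `clean_html` is a hypothetical helper name (not part of rvest), the tag test is a rough regular expression, and it assumes an rvest version that provides `minimal_html()`.

```r
library(rvest)

# Hypothetical helper (not part of rvest): strip HTML from a single string,
# returning NA and plain-text values unchanged.
clean_html <- function(x) {
  # Rough heuristic: treat the value as HTML only if it contains something tag-like
  if (is.na(x) || !grepl("<[^>]+>", x)) {
    return(x)
  }
  trimws(html_text(minimal_html(x)))
}

body <- c(
  "test simple text",
  "<body>\n<p>test html body</p>\n<p></p>\n</body>",
  NA
)

# vapply() enforces a character result of length 1 per item;
# USE.NAMES = FALSE drops the names that sapply() would attach
clean_body <- vapply(body, clean_html, character(1), USE.NAMES = FALSE)
clean_body
```

With a data frame, the last step would be `df$clean_body <- vapply(df$body, clean_html, character(1), USE.NAMES = FALSE)`, assuming `df$body` is a character column as in the question.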
huangapple
  • Posted on 2023-06-16 14:12:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76487370.html