Using rvest to parse a dataframe column of class chr, containing html and non html input
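Question
I'm new to rvest and experimenting with web scraping. Is there a way to use rvest to parse a data frame column of class chr that contains both HTML and non-HTML input, specifically converting the rows with HTML input into parsed text while leaving the other rows unchanged?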
```R
# Assume the data frame is called 'df' and the chr column is named 'body'
# Create a new column called 'clean_body' to store the cleaned data
df$clean_body <- NA
# Iterate over each row and clean the 'body' column
for (i in 1:nrow(df)) {
  # Check that the 'body' value is not NA
  if (!is.na(df$body[i])) {
    # Check whether the value contains HTML tags
    if (grepl("<.*?>", df$body[i])) {
      # If it does, parse the HTML and store the cleaned text
      df$clean_body[i] <- html_text(read_html(df$body[i]))
    } else {
      # Otherwise, keep the original value
      df$clean_body[i] <- df$body[i]
    }
  }
}
```
With this logic, I get an error: `Error in UseMethod("xml_text") : no applicable method for 'xml_text' applied to an object of class "xml_document"`
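# Answer 1
**Score**: 1
Depending on your actual data (content, amount, ratio of HTML to text records), all you might need is to apply `rvest::minimal_html()` to the whole column first.
Since `minimal_html()` does not work on a vector of strings, we need a way to apply it to each individual item: use something from the `apply` family or the `purrr` map family, do a row-wise operation (`dplyr::rowwise()`), or loop through the items with a `for` loop. In the example below, it's handled through `sapply()`: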
``` r
library(rvest)

df_ <- tibble::tibble(
  descr = c("text", "body", "html"),
  body = c(
    "test simple text",
    "<body>\n<p>test html body</p>\n<p></p>\n</body>",
    "<!DOCTYPE html>\n<html>\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n<meta charset=\"utf-8\">\n<title>title</title>\n</head>\n<body>\n<p>test html document</p>\n<p></p>\n</body>\n</html>"
  )
)
df_
#> # A tibble: 3 × 2
#>   descr body
#>   <chr> <chr>
#> 1 text  "test simple text"
#> 2 body  "<body>\n<p>test html body</p>\n<p></p>\n</body>"
#> 3 html  "<!DOCTYPE html>\n<html>\n<head>\n<meta http-equiv=\"Content-Type\" con…

df_$clean_body <- sapply(df_$body, \(text) minimal_html(text) |> html_text() |> trimws(),
                         simplify = TRUE)
df_[, c("descr", "clean_body", "body")]
#> # A tibble: 3 × 3
#>   descr clean_body                  body
#>   <chr> <chr>                       <chr>
#> 1 text  "test simple text"          "test simple text"
#> 2 body  "test html body"            "<body>\n<p>test html body</p>\n<p></p>\n</…
#> 3 html  "title\ntest html document" "<!DOCTYPE html>\n<html>\n<head>\n<meta htt…
```
<sup>Created on 2023-06-16 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
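
If plain-text rows should stay exactly as they are (for instance, parsing can decode entities such as `&amp;` in text that was never meant to be HTML) and the column may contain `NA`, a conditional variant might look like the sketch below. It is only an illustration: `strip_html()` is a hypothetical helper name, not part of rvest, and the tag check is a rough regular-expression heuristic that assumes anything shaped like `<...>` marks HTML.

``` r
library(rvest)

# Hypothetical helper (not an rvest function): parse only strings that
# appear to contain HTML tags; return NA and plain text unchanged.
strip_html <- function(x) {
  x <- as.character(x)
  vapply(x, function(text) {
    # Assumption: a "<...>" pattern is a good enough signal for HTML
    if (is.na(text) || !grepl("<[^>]+>", text)) {
      return(text)
    }
    # minimal_html() accepts a single string, hence the per-item call
    trimws(html_text(minimal_html(text)))
  }, character(1), USE.NAMES = FALSE)
}

body <- c("test simple text", "<p>test <b>html</b> body</p>", NA)
clean_body <- strip_html(body)
```

Using `vapply()` with `character(1)` instead of `sapply()` makes the call fail loudly if any item does not come back as a single string, and `USE.NAMES = FALSE` keeps the input strings from being attached as names to the result. On a data frame, this would be applied as `df$clean_body <- strip_html(df$body)`.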