Question

I tried to map a folder of .html files into RDS, but sometimes the function fails, as shown below:

html_files <- list.files(file_directory, full.names = TRUE, recursive = TRUE)
rip_text <- function(court_file){
  ripped_text <- read_html(court_file, options = "HUGE") |>
    html_text2() |> # Pull out only the text
    str_remove('^.*PubDate":"\\d{4}-\\d\\d-\\d\\d",\n') |>
    str_remove('\\}"; var jsonData.*$')
  return(ripped_text)
}
ripped_files <- map(html_files, rip_text)

Here is the error:
Error in map():
i In index: 19531.
Caused by error in UseMethod():
! no applicable method for 'xml_find_first' applied to an object of class "xml_document"
Run rlang::last_trace() to see where the error occurred.

Answer 1

Score: 1

If you don't mind skipping the few files that raise errors and mainly want the bulk of the data, you could wrap the function with purrr::safely(), as follows. Each call then returns a list with a result and an error element instead of stopping map() at the first failure.

library(rvest)   # read_html(), html_text2()
library(stringr) # str_remove()
library(purrr)   # map(), safely()

html_files <- list.files(file_directory, full.names = TRUE, recursive = TRUE)
rip_text <- function(court_file){
  read_html(court_file, options = "HUGE") |>
    html_text2() |> # Pull out only the text
    str_remove('^.*PubDate":"\\d{4}-\\d\\d-\\d\\d",\n') |>
    str_remove('\\}"; var jsonData.*$')
}
rip_text_safe <- safely(rip_text) # captures errors instead of throwing them
ripped_files <- map(html_files, rip_text_safe)

I can't test this (as I understandably don't have the files), but it should work for you.
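
Once map() has run with a safely()-wrapped function, every element of the output is a list with a result slot and an error slot, so the successful texts still have to be separated from the failures. A minimal sketch of that step is below. It uses a toy function and toy inputs (toy_rip and toy_inputs are hypothetical stand-ins, not part of the original code) so that it runs without the HTML files.

```r
library(purrr)

# Hypothetical stand-in for rip_text(): fails on one input, as a bad file would.
toy_rip <- function(x) {
  if (x == "bad") stop("could not parse")
  toupper(x)
}

toy_inputs <- c("a", "bad", "c")
results <- map(toy_inputs, safely(toy_rip))

# Each element is list(result = ..., error = ...); a failure has a non-NULL error.
failed <- map_lgl(results, \(r) !is.null(r$error))

texts <- map(results[!failed], "result") # successful outputs only
failed_inputs <- toy_inputs[failed]      # inputs worth inspecting by hand
```

With the real data, the same pattern applied to ripped_files and html_files would give the paths of the files that failed, which can then be opened individually to see why parsing broke.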
