Problem in function map(), Caused by error in `UseMethod()`

Question

I tried to map a folder of .html files into RDS, but the function sometimes fails, as shown below.

library(rvest)    # read_html(), html_text2()
library(stringr)  # str_remove()
library(purrr)    # map()

html_files <- list.files(file_directory, full.names = TRUE, recursive = TRUE)

rip_text <- function(court_file) {
  ripped_text <- read_html(court_file, options = "HUGE") |>
    html_text2() |> # Pull out only the text
    str_remove('^.*PubDate":"\\d{4}-\\d\\d-\\d\\d",\n') |>
    str_remove('\\}"; var jsonData.*$')
  return(ripped_text)
}

ripped_files <- map(html_files, rip_text)

Here is the error:

Error in map():
i In index: 19531.
Caused by error in UseMethod():
! no applicable method for 'xml_find_first' applied to an object of class "xml_document"
Run rlang::last_trace() to see where the error occurred.

Answer 1

Score: 1

If you don't mind dropping the few files that fail and just want the bulk of the data, you could wrap the function in purrr::safely(), as follows.

library(rvest)    # read_html(), html_text2()
library(stringr)  # str_remove()
library(purrr)    # map(), safely()

html_files <- list.files(file_directory, full.names = TRUE, recursive = TRUE)

rip_text <- function(court_file) {
  read_html(court_file, options = "HUGE") |>
    html_text2() |> # Pull out only the text
    str_remove('^.*PubDate":"\\d{4}-\\d\\d-\\d\\d",\n') |>
    str_remove('\\}"; var jsonData.*$')
}

# safely() returns a wrapped function that captures errors instead of stopping
rip_text_safe <- safely(rip_text)

ripped_files <- map(html_files, rip_text_safe)

I can't test this (understandably, as I don't have the files), but it should work for you.
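
As a follow-up, here is a minimal sketch of how the output of a safely()-wrapped function might be split into successes and failures. Each element returned by the wrapped function is a list with a `result` and an `error` component, one of which is NULL. The names `toy_rip`, `toy_files`, and `toy_safe` are made up for illustration and stand in for `rip_text`, `html_files`, and `rip_text_safe`; the toy function only mimics a file that fails to parse.

```r
library(purrr)

# Hypothetical stand-in for rip_text(): fails on one particular input.
toy_rip <- function(x) {
  if (x == "bad.html") stop("could not parse file")
  toupper(x)
}

# Made-up file names, used only for this illustration.
toy_files <- c("a.html", "bad.html", "c.html")

toy_safe <- safely(toy_rip)
ripped <- map(toy_files, toy_safe)

# TRUE where the wrapped call captured an error.
failed <- map_lgl(ripped, \(x) !is.null(x$error))

# Inputs that failed, so they can be inspected separately.
failed_files <- toy_files[failed]

# Successful results only, collected into a character vector.
ripped_text <- map_chr(ripped[!failed], "result")
```

With the real data, the same indexing on `html_files` would presumably show which file sits at the failing index (19531 in the question), which could help find out why that particular file cannot be processed.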

Posted by huangapple on July 12, 2023, 21:06:39
Source: https://go.coder-hub.com/76670914.html