rvest: Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "list"


Question



The Environmental Data Initiative (EDI) is a repository for datasets from several locations. I would like to scrape the beginning and end dates of each dataset from a single location (see example link here).

  • Each dataset for the one location contains a link to a metadata URL that lists the start and end date of the dataset (see example link here).

My code below uses a for-loop to extract the unique ID for each dataset (i.e., the Package Id), which is then used to construct the metadata page URL for each Package Id.

However, my for-loop throws an error as it attempts to scrape the begin date from each of the metadata pages.

  • The error: Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "list"

How can I adapt my for-loop to extract the begin and end date of each Package Id?

library(rvest)
library(xml2)
library(dplyr)
library(purrr)

url <- "https://portal.edirepository.org/nis/simpleSearch?defType=edismax&q=*:*&fq=-scope:ecotrends&fq=-scope:lter-landsat*&fq=scope:(knb-lter-and)&fl=id,packageid,title,author,organization,pubdate,coordinates&debug=false&start=0&rows=150"
webpage <- read_html(url)

# Initialize vectors to store the data
package_ids <- character()
time_periods_begin <- character()
time_periods_end <- character()

# Extract the Package Id
package_ids <- webpage %>%
  html_table() %>%
  .[[4]] %>%
  select(`Package Id ▵▿`) %>%
  rename(PackageId = `Package Id ▵▿`)

# Iterate over each PackageId row
for (i in 1:length(package_ids$PackageId)) {
  # Construct the URL for the "View Full Metadata" page
  package_id_link <- paste0("https://portal.edirepository.org/nis/metadataviewer?packageid=", package_ids$PackageId)
  # Navigate to the "View Full Metadata" page
  metadata_page <- map(package_id_link, read_html)
  # Extract the Begin and End (this is where the error lives)
  time_period_begin <- html_nodes(metadata_page, "tr:contains('Begin') td:nth-child(2)") %>%
    html_text() %>%
    trimws()
  time_periods_begin <- c(time_periods_begin, time_period_begin)
  time_period_end <- html_nodes(metadata_page, "tr:contains('End') td:nth-child(2)") %>%
    html_text() %>%
    trimws()
  time_periods_end <- c(time_periods_end, time_period_end)
}
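
The error itself comes from map(package_id_link, read_html) returning a plain list of xml_document objects; html_nodes() dispatches on the class of its first argument and has no method for class "list". One minimal, illustrative way to avoid that inside the loop is to read a single page per iteration (the [i] subscript is an adjustment for illustration, not part of the original code):

# Illustrative sketch only: read one metadata page per iteration so html_nodes()
# receives a single xml_document rather than a list of documents.
package_id_link <- paste0("https://portal.edirepository.org/nis/metadataviewer?packageid=",
                          package_ids$PackageId[i])
metadata_page <- read_html(package_id_link)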

The output should look like this

# Create a data frame with Package Id, Begin, and End
data_frame <- data.frame(PackageId = package_id,
                         Begin = time_periods_begin,
                         End = time_periods_end)
data_frame

            PackageId      Begin        End
1 knb-lter-and.2719.6 1971-06-01 2002-03-11
2 knb-lter-and.2720.8 1958-01-01 1979-01-01
3 knb-lter-and.2721.6 1975-01-01 1995-01-01

Update 1

I can get the PackageID, Begin, and End for a single dataset. With the code above, I can get each dataset's metadata URL. Now I just need to figure out how to extract the PackageID, Begin, and End for each of those 147 metadata URLs.

url <- "https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-and.4525.10"
webpage <- read_html(url)
package_id <- html_text(html_nodes(webpage, "td.rowodd + td.roweven")[1])
# Extract the Begin value
time_periods_begin <- html_text(html_nodes(webpage, "td:contains('Begin:') + td")[1])
# Extract the End value
time_periods_end <- html_text(html_nodes(webpage, "td:contains('End:') + td")[1])
data_frame <- data.frame(PackageId = package_id,
                         Begin = time_periods_begin,
                         End = time_periods_end)
data_frame
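
Building on the single-page snippet above, here is a minimal sketch of applying the same extraction to every Package Id; the helper name get_dates and the lapply()/do.call() pattern are illustrative, not from the original post:

# Sketch only: get_dates() is a hypothetical helper built from the snippet above.
# Reading each page individually with read_html() avoids the "class list" error.
get_dates <- function(pid) {
  page <- read_html(paste0("https://portal.edirepository.org/nis/metadataviewer?packageid=", pid))
  data.frame(PackageId = pid,
             Begin = html_text(html_nodes(page, "td:contains('Begin:') + td")[1]),
             End   = html_text(html_nodes(page, "td:contains('End:') + td")[1]))
}

# package_ids$PackageId is the column scraped from the search-results table earlier
date_list  <- lapply(package_ids$PackageId, get_dates)
data_frame <- do.call(rbind, date_list)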

Answer 1

Score: 1

library(tidyverse)
library(rvest)
library(janitor)

# Read the search-results page for the knb-lter-and scope
page <-
  "http://portal.edirepository.org:80/nis/simpleSearch?defType=edismax&q=*:*&fq=-scope:ecotrends&fq=-scope:lter-landsat*&fq=scope:(knb-lter-and)&fl=id,packageid,title,author,organization,pubdate,coordinates&debug=false&start=0&rows=150" %>%
  read_html()

# Helper: scrape the begin and end dates from one metadata page
scraper <- function(package_id) {
  cat("Scraping", package_id, "\n")
  # Take the first "subgroup" block on the metadata page,
  # whose .roweven cells hold the begin and end dates
  data <- str_c("https://portal.edirepository.org/nis/metadataviewer?packageid=",
                package_id) %>%
    read_html() %>%
    html_elements(".subgroup.onehundred_percent") %>%
    pluck(1) %>%
    html_elements(".roweven") %>%
    html_text2()
  tibble(begin = pluck(data, 1),
         end = pluck(data, 2))
}

# Pull the results table, clean the column names, and scrape dates for every package id
data <- page %>%
  html_table() %>%
  pluck(4) %>%
  clean_names() %>%
  mutate(across(title, ~ str_squish(str_remove_all(., "\\n")))) %>%
  mutate(date = map(package_id, scraper)) %>%
  unnest(date)
   title                                                                                                     creators publication_date package_id begin end
   <chr>                                                                                                     <chr>               <int> <chr>      <chr> <chr>
 1 Invertebrates of the Andrews Experimental Forest: An annotated list of insects and other arthropods, 1971 Andrews              2014 knb-lter-… 1971  2002
 2 Vascular plant list on the Andrews Experimental Forest and nearby Research Natural Areas, 1958 to 1979    Andrews              2014 knb-lter-… 1958  1979
 3 Bird species list for the Andrews Experimental Forest and Upper McKenzie River Basin, 1975 to 1995        Andrews              2014 knb-lter-… 1975  1995
 4 Amphibian and reptile list of the Andrews Experimental Forest, 1975 to 1995                                Andrews              2014 knb-lter-… 1975  1995
 5 Moss species list of the Andrews Experimental Forest, 1991                                                 Andrews              2013 knb-lter-… 1991  1991
 6 Mammal species list of the Andrews Experimental Forest, 1971 to 1976                                       Anthony              2014 knb-lter-… 1971  1976
 7 Ecohydrology and Ecophysiology intensively measured plots in Watershed 1, Andrews Experimental Forest, 20 Andrews              2016 knb-lter-… 2005  2011
 8 A Study of Hyporheic Characteristics Along a Longitudinal Profile of Lookout Creek, Oregon, 2003           Andrews              2013 knb-lter-… 2003  2003
 9 Annual tree productivity in permanent plots within the H.J. Andrews Experimental Forest                    Andrews              2013 knb-lter-… 2000  2004
10 Epiphytic macrolichens in relation to forest management and topography in a western Oregon watershed, 199 Andrews              2014 knb-lter-… 1997  1999
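
As a small optional hardening (a sketch, not part of the original answer): if one of the ~150 metadata pages fails to load, the pipeline above stops. Wrapping scraper() in purrr::possibly() returns NA dates for that page instead; safe_scraper is an assumed name:

# Sketch: a failing request yields NA begin/end dates instead of aborting the run
safe_scraper <- possibly(scraper,
                         otherwise = tibble(begin = NA_character_, end = NA_character_))

data <- page %>%
  html_table() %>%
  pluck(4) %>%
  clean_names() %>%
  mutate(date = map(package_id, safe_scraper)) %>%
  unnest(date)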

Answer 2

Score: 0

Here is how to scrape the Package ID, Begin Date, and End Date from each metadata file:

library(rvest)
library(dplyr)

# EDI webpage for the Andrews LTER datasets
url <- "http://portal.edirepository.org:80/nis/simpleSearch?defType=edismax&q=*:*&fq=-scope:ecotrends&fq=-scope:lter-landsat*&fq=scope:(knb-lter-and)&fl=id,packageid,title,author,organization,pubdate,coordinates&debug=false"
webpage <- read_html(url)

# Extract each of the Package Ids
package_ids <- webpage %>%
  html_table() %>%
  .[[4]] %>%
  select(`Package Id ▵▿`) %>%
  rename(PackageId = `Package Id ▵▿`)
zz <- unique(package_ids$PackageId)

# Iterate over the metadata page of each Package Id
for (i in 1:length(package_ids$PackageId)) {
  curDat = package_ids[package_ids$PackageId == zz[i],]
  # Construct the URL for the "View Full Metadata" page
  package_id_link <- paste0("https://portal.edirepository.org/nis/metadataviewer?packageid=", curDat)
  # Read the "View Full Metadata" page
  webpage <- read_html(package_id_link)
  # Extract the Package ID, Begin Date, and End Date
  package_id <- html_text(html_nodes(webpage, "td.rowodd + td.roweven")[1])
  begin_value <- html_text(html_nodes(webpage, "td:contains('Begin:') + td")[1])
  end_value <- html_text(html_nodes(webpage, "td:contains('End:') + td")[1])
  if (i == 1) {
    packageID = package_id
    time_periods_begin = begin_value
    time_periods_end = end_value
  } else {
    packageID = rbind(packageID, package_id)
    time_periods_begin = rbind(time_periods_begin, begin_value)
    time_periods_end = rbind(time_periods_end, end_value)
  }
}

data_frame <- data.frame(cbind(packageID,
                               time_periods_begin,
                               time_periods_end))
colnames(data_frame)[1:3] <- c('PackageId', 'Begin', 'End')
rownames(data_frame) <- seq(1, NROW(data_frame), 1)
data_frame
              PackageId      Begin        End
1   knb-lter-and.2719.6 1971-06-01 2002-03-11
2   knb-lter-and.2720.8 1958-01-01 1979-01-01
3   knb-lter-and.2721.6 1975-01-01 1995-01-01
4   knb-lter-and.2722.6 1975-01-01 1995-01-01
5   knb-lter-and.2725.6 1991-06-01 1991-08-01
6   knb-lter-and.2726.6 1971-01-01 1976-01-01
7  knb-lter-and.4528.10 2005-09-30 2011-05-05
8   knb-lter-and.4541.3 2003-06-14 2003-11-15
9   knb-lter-and.4544.4 2000-06-01 2004-09-30
10  knb-lter-and.4547.5 1997-09-23 1999-09-15
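
If Begin and End are needed as Date columns rather than character strings, a short optional follow-up (assuming the yyyy-mm-dd format shown in the output above):

# Optional: parse the scraped text into Date objects (assumes yyyy-mm-dd strings)
data_frame$Begin <- as.Date(data_frame$Begin)
data_frame$End   <- as.Date(data_frame$End)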
