Extract unique ID from table on webpage using rvest

Question

I am trying to extract the unique ID (`Package Id`) for each of the 147 data packages that the Environmental Data Initiative (EDI) website lists for the Andrews LTER site. However, I can't figure out which node to pass to `rvest::html_nodes()` to get the Package Id. Any ideas?

What I've been trying:

```R
# Load required libraries
library(rvest)
library(dplyr)

# Define the URL of the website
url <- "http://portal.edirepository.org:80/nis/simpleSearch?defType=edismax&q=*:*&fq=-scope:ecotrends&fq=-scope:lter-landsat*&fq=scope:(knb-lter-and)&fl=id,packageid,title,author,organization,pubdate,coordinates&debug=false"

# Read the HTML content from the website
page <- read_html(url)

# Extract the relevant information
packageIds <- page %>%
  html_nodes("td[class='Package Id']") %>%
  html_text() # returns an empty character vector
```

Answer 1

Score: 1

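You could try something like this. The tricky part is that the original query has to be extended with `&start=0&rows=150` so that the full table is loaded.

You can then call `html_table()`, which in this case returns a list of tables. Pick the list element that holds the actual results table (the fourth one here) and select the `Package Id` column. The column header also contains the sort-arrow glyphs, which is why the code refers to it as `Package Id  ▵▿`.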
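The query-string change described above can be sketched as follows. The variable names `base_url`, `n_rows`, and `full_url` are illustrative and not part of the original answer, and 150 is simply a value larger than the 147 packages expected. Note that the separators are plain `&` characters; `&amp;` is only how an ampersand is escaped inside HTML.

```R
# Search URL from the question, using https as in the answer
base_url <- paste0(
  "https://portal.edirepository.org/nis/simpleSearch",
  "?defType=edismax&q=*:*",
  "&fq=-scope:ecotrends&fq=-scope:lter-landsat*&fq=scope:(knb-lter-and)",
  "&fl=id,packageid,title,author,organization,pubdate,coordinates",
  "&debug=false"
)

# Assumption: 150 rows is enough to cover all 147 packages in one response
n_rows <- 150
full_url <- paste0(base_url, "&start=0&rows=", n_rows)
full_url
```

If the number of packages grows, `n_rows` would need to be raised accordingly.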

```R
# Load required libraries
library(rvest)
library(dplyr)

# Define the URL of the website, with &start=0&rows=150 appended
url <- "https://portal.edirepository.org/nis/simpleSearch?defType=edismax&q=*:*&fq=-scope:ecotrends&fq=-scope:lter-landsat*&fq=scope:(knb-lter-and)&fl=id,packageid,title,author,organization,pubdate,coordinates&debug=false&start=0&rows=150"

# Read the HTML content from the website
page <- read_html(url)

# Extract the relevant information
page %>%
  html_table() %>% # list of all tables on the page
  .[[4]] %>%       # the fourth table holds the search results
  select(`Package Id  ▵▿`) %>%
  rename(package_id = `Package Id  ▵▿`)
#> # A tibble: 147 × 1
#>    package_id
#>    <chr>
#>  1 knb-lter-and.2719.6
#>  2 knb-lter-and.2720.8
#>  3 knb-lter-and.2721.6
#>  4 knb-lter-and.2722.6
#>  5 knb-lter-and.2725.6
#>  6 knb-lter-and.2726.6
#>  7 knb-lter-and.4528.10
#>  8 knb-lter-and.4541.3
#>  9 knb-lter-and.4544.4
#> 10 knb-lter-and.4547.5
#> # … with 137 more rows
```
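The hard-coded index `4` depends on the page layout and could shift if the portal adds or removes a table. As a hedged alternative, the results table can be located by its column name instead. The sketch below runs on a small stand-in page built with `rvest::minimal_html()` (available in rvest 1.0.0 and later); the markup, the `Menu` table, and the titles are made up for illustration and are not the real EDI portal HTML. Names such as `is_results` and `package_ids` are likewise illustrative.

```R
library(rvest)
library(dplyr)

# Stand-in page (hypothetical markup, NOT the real portal HTML)
page <- minimal_html('
  <table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
  <table>
    <tr><th>Package Id (sort)</th><th>Title</th></tr>
    <tr><td>knb-lter-and.2719.6</td><td>Example A</td></tr>
    <tr><td>knb-lter-and.2720.8</td><td>Example B</td></tr>
  </table>
')

# html_table() on a whole document returns one data frame per <table>
tables <- html_table(page)

# Inspect the column names to see which table is the one of interest
lapply(tables, names)

# Flag the tables that have a column whose name starts with "Package Id"
is_results <- vapply(
  tables,
  function(tb) any(startsWith(names(tb), "Package Id")),
  logical(1)
)
results <- tables[[which(is_results)[1]]]

# Select by prefix so that decorations in the header do not matter
package_ids <- results %>%
  select(package_id = starts_with("Package Id")) %>%
  pull(package_id)
package_ids
```

Against the real page, `read_html(url)` would take the place of the `minimal_html()` call; whether the rest works unchanged depends on the actual markup.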

huangapple
  • Posted on June 15, 2023 at 04:50:35
  • Link to this post: https://go.coder-hub.com/76477446.html