Extract unique ID from table on webpage using rvest
Question
I am trying to extract the unique ID (`Package Id`) for each of the 147 data packages on the Environmental Data Initiative (EDI) website for site Andrews LTER. However, I can't figure out which `rvest::html_nodes()` holds the `Package Id`. Any ideas?

What I've been trying:
```R
# Load required libraries
library(rvest)
library(dplyr)

# Define the URL of the website
url <- "http://portal.edirepository.org:80/nis/simpleSearch?defType=edismax&q=*:*&fq=-scope:ecotrends&fq=-scope:lter-landsat*&fq=scope:(knb-lter-and)&fl=id,packageid,title,author,organization,pubdate,coordinates&debug=false"

# Read the HTML content from the website
page <- read_html(url)

# Extract the relevant information
packageIds <- page %>%
  html_nodes("td[class='Package Id']") %>%
  html_text() # results in an empty character string
```
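One way to narrow this down is to inspect what the parsed page actually exposes; a minimal diagnostic sketch against the same `page` object (the selectors here are only for inspection, not the known structure of the EDI page):

```R
# Diagnostic sketch: list the distinct class attributes on <td> cells
# and the text of the <th> headers, to check whether anything in the
# parsed HTML is actually labelled "Package Id".
page %>% html_nodes("td") %>% html_attr("class") %>% unique()
page %>% html_nodes("th") %>% html_text() %>% trimws()
```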
Answer 1
Score: 1
You could try something like this. It was a bit tricky since I needed to append `&start=0&rows=150` to the original query in order to load the full table.

Then you can use `html_table()` to return the contents, which in this case is a list. Select the actual table list element and then the `Package Id` column.
```R
library(rvest)
library(dplyr)

# Define the URL of the website (with &start=0&rows=150 appended)
url <- "https://portal.edirepository.org/nis/simpleSearch?defType=edismax&q=*:*&fq=-scope:ecotrends&fq=-scope:lter-landsat*&fq=scope:(knb-lter-and)&fl=id,packageid,title,author,organization,pubdate,coordinates&debug=false&start=0&rows=150"

# Read the HTML content from the website
page <- read_html(url)

# Extract the relevant information
page %>%
  html_table() %>%
  .[[4]] %>%
  select(`Package Id ▵▿`) %>%
  rename(package_id = `Package Id ▵▿`)
```

```
# A tibble: 147 × 1
   package_id          
   <chr>               
 1 knb-lter-and.2719.6 
 2 knb-lter-and.2720.8 
 3 knb-lter-and.2721.6 
 4 knb-lter-and.2722.6 
 5 knb-lter-and.2725.6 
 6 knb-lter-and.2726.6 
 7 knb-lter-and.4528.10
 8 knb-lter-and.4541.3 
 9 knb-lter-and.4544.4 
10 knb-lter-and.4547.5 
# … with 137 more rows
```
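Hard-coding `.[[4]]` and the decorated column name (`Package Id ▵▿`) ties the code to the current page layout. As a hedged alternative sketch (assuming the results table is the only parsed table whose header mentions "Package Id"), the table and column can be located programmatically:

```R
# Sketch: pick the parsed table whose header mentions "Package Id"
# instead of relying on its position in the list, then pull that column.
tbls <- html_table(page)
idx  <- which(vapply(tbls, function(t) any(grepl("Package Id", names(t))), logical(1)))[1]

packageIds <- tbls[[idx]] %>%
  select(matches("Package Id")) %>%  # tolerant of the sort-arrow suffix in the header
  pull(1)

length(packageIds) # should be 147 if the full table loaded
```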