Web scraping with rvest: get table and span text
Question
I am looking to get the table at this link (https://clinicaltrials.gov/ct2/history/NCT04658186) along with the hover text that appears on some rows.
The result I want is a data frame in which the hover text is a column, placed on the same row as it appears on the webpage.
I tried the code below, which gets the table and the span text separately, but I am unable to figure out how to merge the two together.
library(dplyr)
library(rvest)

# Set the URL of the webpage containing the table
url <- "https://clinicaltrials.gov/ct2/history/NCT04658186"

# Read the HTML code from the webpage
page <- read_html(url)

# Use html_table() to extract the table data
table_data <- page %>%
  html_table(fill = TRUE) %>%
  .[[1]] # Select the first table on the page

# Use html_nodes() and html_attr() to extract the title (hover text)
# of every span element on the page
span_text <- page %>%
  html_nodes("span") %>%
  html_attr("title") %>%
  data.frame()
Thanks for any help in advance.
Answer 1
Score: 0
library(tidyverse)
library(rvest)

page <- "https://clinicaltrials.gov/ct2/history/NCT04658186" %>%
  read_html()

page %>%
  html_table() %>%
  pluck(1) %>%                                  # first table on the page
  mutate(status = page %>%
    html_elements(".w3-bordered.releases") %>%
    pluck(1) %>%                                # the same table, as a node
    html_elements("tbody tr") %>%               # one node per table row
    # one title per row; rows without a matching span yield NA
    map_chr(., ~ .x %>%
      html_element(".recruitmentStatus") %>%
      html_attr("title")))

# A tibble: 58 × 6
   Version A     B     `Submitted Date` Changes                                                 status
     <int> <lgl> <lgl> <chr>            <chr>                                                   <chr>
 1       1 NA    NA    December 1, 2020 None (earliest Version on record)                       NA
 2       2 NA    NA    January 12, 2021 Recruitment Status, Study Status and Contacts/Locations Not yet recruiting --> Recruiting
 3       3 NA    NA    January 29, 2021 Contacts/Locations and Study Status                     NA
 4       4 NA    NA    February 4, 2021 Study Status and Contacts/Locations                     NA
 5       5 NA    NA    March 4, 2021    Study Status and Contacts/Locations                     NA
 6       6 NA    NA    March 18, 2021   Contacts/Locations and Study Status                     NA
 7       7 NA    NA    April 15, 2021   Study Status and Contacts/Locations                     NA
 8       8 NA    NA    May 14, 2021     Study Status and Contacts/Locations                     NA
 9       9 NA    NA    May 27, 2021     Contacts/Locations and Study Status                     NA
10      10 NA    NA    June 10, 2021    Study Status and Contacts/Locations                     NA
# ℹ 48 more rows
# ℹ Use `print(n = ...)` to see more rows
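The row-wise extraction above can probably be written without map_chr(), because html_element() applied to a set of nodes returns one result per input node and uses a missing node where nothing matches. The sketch below illustrates this on a small inline page built with minimal_html(); the three-row table and its contents are made up for illustration and only mimic the class name used on the real page.

```r
library(rvest)

# Hypothetical stand-in for the real page: three rows, one with a hover title
page <- minimal_html('
  <table class="releases">
    <thead><tr><th>Version</th><th>Changes</th></tr></thead>
    <tbody>
      <tr><td>1</td><td>None</td></tr>
      <tr><td>2</td><td><span class="recruitmentStatus" title="Recruiting">Recruitment Status</span>, Study Status</td></tr>
      <tr><td>3</td><td>Study Status</td></tr>
    </tbody>
  </table>')

rows <- html_elements(page, "tbody tr")

# html_element() keeps one entry per row; rows without a matching span
# become missing nodes, for which html_attr() returns NA
status <- rows %>%
  html_element(".recruitmentStatus") %>%
  html_attr("title")

print(status)
```

Because the result has the same length as the number of rows, it can be added to the table as a column in the same way as the map_chr() result.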
Answer 2
Score: 0
In such a case, we can iterate over a list of elements (i.e. table rows) and extract specific bits from each item. With this approach, we end up with a correctly aligned list or vector that can be bound to the previously extracted table:
library(dplyr)
library(rvest)
library(purrr)

# Set the URL of the webpage containing the table
url <- "https://clinicaltrials.gov/ct2/history/NCT04658186"

# Read the HTML code from the webpage
page <- read_html(url)

table_data <- page %>%
  # select the target table first to get a single table from html_table()
  html_element("table") %>%
  html_table(fill = TRUE)

# select all table rows and iterate over them with map_chr();
# map_chr() returns a character vector of the same length as
# the input list (the number of <tr> elements)
recr_stat <- page %>%
  html_elements("tbody tr") %>%
  map_chr(\(tr) html_element(tr, "span.recruitmentStatus") %>% html_attr("title"))

# bind to table:
bind_cols(table_data, `Recruitment Status` = recr_stat) %>%
  relocate(`Recruitment Status`, .before = Changes)
#> # A tibble: 58 × 6
#> Version A B `Submitted Date` `Recruitment Status` Changes
#> <int> <lgl> <lgl> <chr> <chr> <chr>
#> 1 1 NA NA December 1, 2020 <NA> None (…
#> 2 2 NA NA January 12, 2021 Not yet recruiting --> Recruiti… Recrui…
#> 3 3 NA NA January 29, 2021 <NA> Contac…
#> 4 4 NA NA February 4, 2021 <NA> Study …
#> 5 5 NA NA March 4, 2021 <NA> Study …
#> 6 6 NA NA March 18, 2021 <NA> Contac…
#> 7 7 NA NA April 15, 2021 <NA> Study …
#> 8 8 NA NA May 14, 2021 <NA> Study …
#> 9 9 NA NA May 27, 2021 <NA> Contac…
#> 10 10 NA NA June 10, 2021 <NA> Study …
#> # ℹ 48 more rows
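To illustrate why the row-wise approach stays aligned while a page-wide span query does not, here is a minimal sketch on an inline page built with minimal_html(). The four-row table, the "Legend" span and the title values are invented for the example; only the span.recruitmentStatus selector is taken from the code above.

```r
library(rvest)
library(purrr)

# Hypothetical page: one unrelated span outside the table,
# and hover titles on rows 2 and 4 only
page <- minimal_html('
  <p><span>Legend</span></p>
  <table>
    <thead><tr><th>Version</th><th>Changes</th></tr></thead>
    <tbody>
      <tr><td>1</td><td>None</td></tr>
      <tr><td>2</td><td><span class="recruitmentStatus" title="Recruiting">Recruitment Status</span></td></tr>
      <tr><td>3</td><td>Study Status</td></tr>
      <tr><td>4</td><td><span class="recruitmentStatus" title="Completed">Recruitment Status</span></td></tr>
    </tbody>
  </table>')

# Page-wide query: one entry per <span> anywhere on the page,
# so the link between a title and its table row is lost
all_titles <- page %>%
  html_elements("span") %>%
  html_attr("title")

# Row-wise query: exactly one entry per <tr>, NA where the row has no span
row_titles <- page %>%
  html_elements("tbody tr") %>%
  map_chr(\(tr) html_element(tr, "span.recruitmentStatus") %>% html_attr("title"))

# The row-wise vector has one value per table row and can be added as a column
table_data <- page %>%
  html_element("table") %>%
  html_table()
table_data$status <- row_titles

print(length(all_titles))
print(table_data)
```

In this sketch the page-wide vector has three entries for a four-row table, which is the mismatch the question ran into, while the row-wise vector lines up with the table.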
For a more robust approach, we can skip html_table() and extract all required details from every element (here: tr) ourselves. This also works for tableless designs where tabular data is presented through lists or divs, for example.
results <- page %>%
  html_elements("tbody tr") %>%
  map(\(tr) list(
    version  = html_element(tr, "td[headers='VersionNumber']") %>% html_text(),
    date     = html_element(tr, "td[headers='VersionDate']") %>% html_text(),
    recrstat = html_element(tr, "td[headers='Changes'] span.recruitmentStatus") %>% html_attr("title"),
    changes  = html_element(tr, "td[headers='Changes']") %>% html_text()
  )) %>%
  bind_rows()

results %>%
  mutate(version = as.integer(version),
         date = lubridate::mdy(date))
#> # A tibble: 58 × 4
#> version date recrstat changes
#> <int> <date> <chr> <chr>
#> 1 1 2020-12-01 <NA> None (earliest Version …
#> 2 2 2021-01-12 Not yet recruiting --> Recruiting Recruitment Status, Stu…
#> 3 3 2021-01-29 <NA> Contacts/Locations and …
#> 4 4 2021-02-04 <NA> Study Status and Contac…
#> 5 5 2021-03-04 <NA> Study Status and Contac…
#> 6 6 2021-03-18 <NA> Contacts/Locations and …
#> 7 7 2021-04-15 <NA> Study Status and Contac…
#> 8 8 2021-05-14 <NA> Study Status and Contac…
#> 9 9 2021-05-27 <NA> Contacts/Locations and …
#> 10 10 2021-06-10 <NA> Study Status and Contac…
#> # ℹ 48 more rows
<sup>Created on 2023-06-15 with reprex v2.0.2</sup>
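As a rough illustration of the tableless case mentioned above, the same per-element pattern can be applied to a page that lays out its records as divs. Everything about the page below (the div.release containers and the .version, .date and .status class names) is hypothetical and exists only for this sketch.

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical tableless layout: one <div class="release"> per record
page <- minimal_html('
  <div class="release">
    <span class="version">1</span>
    <span class="date">December 1, 2020</span>
  </div>
  <div class="release">
    <span class="version">2</span>
    <span class="date">January 12, 2021</span>
    <span class="status" title="Recruiting">changed</span>
  </div>')

# Build one named list per record, then stack the lists into a tibble;
# a field that is absent from a record simply becomes NA
results <- page %>%
  html_elements("div.release") %>%
  map(\(item) list(
    version = html_element(item, ".version") %>% html_text(),
    date    = html_element(item, ".date") %>% html_text(),
    status  = html_element(item, ".status") %>% html_attr("title")
  )) %>%
  bind_rows() %>%
  mutate(version = as.integer(version))

print(results)
```

Only the selectors change between the table and the div layout; the iteration and binding steps stay the same.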