Web scraping with rvest: get table and span text

Question

I am looking to get the table at this link (https://clinicaltrials.gov/ct2/history/NCT04658186) along with the hover text shown on some rows.

The result I want is a data frame in which the hover text is a column, on the same row as it appears on the webpage.
I tried the code below, where I can get the table and the span text separately, but I am unable to figure out how to merge them together.

library(dplyr)
library(rvest)

# Set the URL of the webpage containing the table
url <- "https://clinicaltrials.gov/ct2/history/NCT04658186"

# Read the HTML code from the webpage
page <- read_html(url)

# Use html_table() to extract the table data
table_data <- page %>%
  html_table(fill = TRUE) %>%
  .[[1]] # Select the first table on the page

# Use html_nodes() and html_attr() to extract the title (hover) text from span elements
span_text <- page %>%
  html_nodes("span") %>%
  html_attr("title") %>% data.frame()

Thanks for any help in advance.

Answer 1

Score: 0

You can add the hover text as a new column with mutate(): select the rows of the table and, for each row, take the title attribute of its .recruitmentStatus element (NA where a row has none):

library(tidyverse)
library(rvest)

page <- "https://clinicaltrials.gov/ct2/history/NCT04658186" %>%
  read_html()

page %>%
  html_table() %>%
  pluck(1) %>%
  mutate(status = page %>%
           html_elements(".w3-bordered.releases") %>%
           pluck(1) %>%
           html_elements("tbody tr") %>%
           map_chr(., ~ .x %>%
                         html_element(".recruitmentStatus") %>%
                         html_attr("title")))

# A tibble: 58 × 6
   Version A     B     `Submitted Date` Changes                                                 status                           
     <int> <lgl> <lgl> <chr>            <chr>                                                   <chr>                             
 1       1 NA    NA    December 1, 2020 None (earliest Version on record)                       NA                                
 2       2 NA    NA    January 12, 2021 Recruitment Status, Study Status and Contacts/Locations Not yet recruiting --> Recruiting
 3       3 NA    NA    January 29, 2021 Contacts/Locations and Study Status                     NA                                
 4       4 NA    NA    February 4, 2021 Study Status and Contacts/Locations                     NA                                
 5       5 NA    NA    March 4, 2021    Study Status and Contacts/Locations                     NA                                
 6       6 NA    NA    March 18, 2021   Contacts/Locations and Study Status                     NA                                
 7       7 NA    NA    April 15, 2021   Study Status and Contacts/Locations                     NA                                
 8       8 NA    NA    May 14, 2021     Study Status and Contacts/Locations                     NA                                
 9       9 NA    NA    May 27, 2021     Contacts/Locations and Study Status                     NA                                
10      10 NA    NA    June 10, 2021    Study Status and Contacts/Locations                     NA                                
# ℹ 48 more rows
# ℹ Use `print(n = ...)` to see more rows
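
The same idea can be tried offline. The sketch below is only an illustration: the inline HTML, the class names nav and releases, and the title text are made up for the example and are not taken from the real page. It shows why it can help to pick the target table first and only then select its rows, so that rows from other tables on the page are not counted.

```r
library(rvest)
library(purrr)

# Hypothetical page with two tables; only the second one carries hover text.
page <- minimal_html('
  <table class="nav"><tbody><tr><td>Home</td></tr></tbody></table>
  <table class="releases">
    <thead><tr><th>Version</th><th>Changes</th></tr></thead>
    <tbody>
      <tr><td>1</td><td>None</td></tr>
      <tr><td>2</td><td><span class="recruitmentStatus" title="Now recruiting">Recruitment Status</span></td></tr>
      <tr><td>3</td><td>Study Status</td></tr>
    </tbody>
  </table>
')

# Pick the target table first, then select rows inside it only.
releases <- html_element(page, "table.releases")
rows <- html_elements(releases, "tbody tr")

# One value per row: the title attribute, or NA when the row has no such span.
status <- map_chr(rows, function(tr) {
  html_attr(html_element(tr, "span.recruitmentStatus"), "title")
})

# Attach the hover text to the parsed table as a new column.
result <- html_table(releases)
result$status <- status
result
```

Selecting "tbody tr" on the whole page here would also pick up the row of the nav table, and the extra value would no longer line up with the rows of the parsed table.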

Answer 2

Score: 0

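In such a case, we can cycle through a list of elements (i.e. table rows) and extract certain bits from each item. With this approach, we'll end up with a correctly aligned list or vector that can be bound to the previously extracted table: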

library(dplyr)
library(rvest)
library(purrr)

# Set the URL of the webpage containing the table
url <- "https://clinicaltrials.gov/ct2/history/NCT04658186"

# Read the HTML code from the webpage
page <- read_html(url)

table_data <- page %>%
  # selecting the target table first to get a single table from html_table()
  html_element("table") %>%
  html_table(fill = TRUE)

# select all table rows, and cycle through those with map_chr(),
# map_chr returns a character vector of the same length as
# the input list (number of <tr> elements)
recr_stat <- page %>% html_elements("tbody tr") %>%
  map_chr(\(tr) html_element(tr, "span.recruitmentStatus") %>%
            html_attr("title"))

# bind to table:
bind_cols(table_data, `Recruitment Status` = recr_stat) %>%
  relocate(`Recruitment Status`, .before = Changes)
#> # A tibble: 58 × 6
#>    Version A     B     `Submitted Date` `Recruitment Status`             Changes
#>      <int> <lgl> <lgl> <chr>            <chr>                            <chr>  
#>  1       1 NA    NA    December 1, 2020 <NA>                             None (…
#>  2       2 NA    NA    January 12, 2021 Not yet recruiting --> Recruiti… Recrui…
#>  3       3 NA    NA    January 29, 2021 <NA>                             Contac…
#>  4       4 NA    NA    February 4, 2021 <NA>                             Study …
#>  5       5 NA    NA    March 4, 2021    <NA>                             Study …
#>  6       6 NA    NA    March 18, 2021   <NA>                             Contac…
#>  7       7 NA    NA    April 15, 2021   <NA>                             Study …
#>  8       8 NA    NA    May 14, 2021     <NA>                             Study …
#>  9       9 NA    NA    May 27, 2021     <NA>                             Contac…
#> 10      10 NA    NA    June 10, 2021    <NA>                             Study …
#> # ℹ 48 more rows
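
To see why the per-row approach stays aligned while a page-wide span selection does not, here is a small offline sketch. The inline HTML and the title text are invented for illustration. It assumes the documented behaviour that html_element() returns a missing node when nothing matches, and that html_attr() returns NA for a missing node.

```r
library(rvest)
library(purrr)

# Made-up three-row table; only the middle row has a span with a title.
page <- minimal_html('
  <table>
    <tbody>
      <tr><td>1</td><td>None</td></tr>
      <tr><td>2</td><td><span class="recruitmentStatus" title="Now recruiting">Recruitment Status</span></td></tr>
      <tr><td>3</td><td>Study Status</td></tr>
    </tbody>
  </table>
')

rows <- html_elements(page, "tbody tr")

# Page-wide selection: one entry per matching span, so row positions are lost.
all_spans <- html_attr(html_elements(page, "span.recruitmentStatus"), "title")

# Per-row selection: rows without a match contribute NA, so there is
# exactly one entry per row.
per_row <- map_chr(rows, function(tr) {
  html_attr(html_element(tr, "span.recruitmentStatus"), "title")
})

# html_element() also accepts the whole set of rows and returns one
# (possibly missing) node per row, which avoids the explicit loop.
vectorised <- html_attr(html_element(rows, "span.recruitmentStatus"), "title")

length(all_spans)
length(per_row)
length(vectorised)
```

Only the last two vectors have the same length as the number of rows, which is what bind_cols() needs.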

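For a more robust approach, we can skip html_table() and extract all required details from every element (here: tr) ourselves. This also works for tableless designs where tabular data is presented through lists or divs, for example: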

results <- page %>% html_elements("tbody tr") %>%
  map(\(tr) list(
    version  = html_element(tr, "td[headers='VersionNumber']") %>%
               html_text(),
    date     = html_element(tr, "td[headers='VersionDate']") %>%
               html_text(),
    recrstat = html_element(tr, "td[headers='Changes'] span.recruitmentStatus") %>%
               html_attr("title"),
    changes  = html_element(tr, "td[headers='Changes']") %>%
               html_text()
    )) %>%
  bind_rows()

results %>%
  mutate(version = as.integer(version),
         date = lubridate::mdy(date))
#> # A tibble: 58 × 4
#>    version date       recrstat                          changes                 
#>      <int> <date>     <chr>                             <chr>                   
#>  1       1 2020-12-01 <NA>                              None (earliest Version …
#>  2       2 2021-01-12 Not yet recruiting --> Recruiting Recruitment Status, Stu…
#>  3       3 2021-01-29 <NA>                              Contacts/Locations and …
#>  4       4 2021-02-04 <NA>                              Study Status and Contac…
#>  5       5 2021-03-04 <NA>                              Study Status and Contac…
#>  6       6 2021-03-18 <NA>                              Contacts/Locations and …
#>  7       7 2021-04-15 <NA>                              Study Status and Contac…
#>  8       8 2021-05-14 <NA>                              Study Status and Contac…
#>  9       9 2021-05-27 <NA>                              Contacts/Locations and …
#> 10      10 2021-10-20 <NA>                              Study Status and Contac…
#> # ℹ 48 more rows

<sup>Created on 2023-06-15 with reprex v2.0.2</sup>
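
As a rough illustration of the tableless case mentioned above, the sketch below applies the same element-by-element extraction to records laid out as divs. The markup and the class names release, version, date and status are hypothetical and exist only for this example.

```r
library(rvest)
library(purrr)
library(dplyr)

# Made-up div-based layout: each record is a div, not a table row.
page <- minimal_html('
  <div class="release">
    <span class="version">1</span>
    <span class="date">December 1, 2020</span>
  </div>
  <div class="release">
    <span class="version">2</span>
    <span class="date">January 12, 2021</span>
    <span class="status" title="Now recruiting">changed</span>
  </div>
')

results <- page %>%
  html_elements("div.release") %>%
  # Build one named list per record; missing pieces become NA.
  map(function(item) list(
    version = html_text2(html_element(item, ".version")),
    date    = html_text2(html_element(item, ".date")),
    status  = html_attr(html_element(item, ".status"), "title")
  )) %>%
  # Stack the per-record lists into a single tibble.
  bind_rows() %>%
  mutate(version = as.integer(version))

results
```

Because every record yields the same named fields, bind_rows() can stack them without any separate alignment step.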

huangapple
  • Published on June 15, 2023, 04:20:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76477258.html