Web scraping with rvest: get table and span text


Question

I am looking to get the table at this link (https://clinicaltrials.gov/ct2/history/NCT04658186) along with the hover text on some rows.

The result I want is a data frame in which the hover text is a column on the same row as it appears on the webpage.
I tried the code below, where I can get the table and the span text separately, but I am unable to figure out how to merge them together.

    library(dplyr)
    library(rvest)

    # Set the URL of the webpage containing the table
    url <- "https://clinicaltrials.gov/ct2/history/NCT04658186"

    # Read the HTML code from the webpage
    page <- read_html(url)

    # Use html_table() to extract the table data
    table_data <- page %>%
      html_table(fill = TRUE) %>%
      .[[1]] # Select the first table on the page

    # Use html_nodes() and html_attr() to extract the title (hover text)
    # of every span element on the page
    span_text <- page %>%
      html_nodes("span") %>%
      html_attr("title") %>%
      data.frame()

Thanks for any help in advance.

Answer 1

Score: 0

One option is to read the `title` attribute row by row and add it to the table as a new column with `mutate()`:

    library(tidyverse)
    library(rvest)

    page <- "https://clinicaltrials.gov/ct2/history/NCT04658186" %>%
      read_html()

    page %>%
      html_table() %>%
      pluck(1) %>%
      # one status value per table row; rows without a status span give NA
      mutate(status = page %>%
        html_elements(".w3-bordered.releases") %>%
        pluck(1) %>%
        html_elements("tbody tr") %>%
        map_chr(., ~ .x %>%
          html_element(".recruitmentStatus") %>%
          html_attr("title")))
    #> # A tibble: 58 × 6
    #>    Version A     B     `Submitted Date` Changes                                                 status
    #>      <int> <lgl> <lgl> <chr>            <chr>                                                   <chr>
    #>  1       1 NA    NA    December 1, 2020 None (earliest Version on record)                       NA
    #>  2       2 NA    NA    January 12, 2021 Recruitment Status, Study Status and Contacts/Locations Not yet recruiting --> Recruiting
    #>  3       3 NA    NA    January 29, 2021 Contacts/Locations and Study Status                     NA
    #>  4       4 NA    NA    February 4, 2021 Study Status and Contacts/Locations                     NA
    #>  5       5 NA    NA    March 4, 2021    Study Status and Contacts/Locations                     NA
    #>  6       6 NA    NA    March 18, 2021   Contacts/Locations and Study Status                     NA
    #>  7       7 NA    NA    April 15, 2021   Study Status and Contacts/Locations                     NA
    #>  8       8 NA    NA    May 14, 2021     Study Status and Contacts/Locations                     NA
    #>  9       9 NA    NA    May 27, 2021     Contacts/Locations and Study Status                     NA
    #> 10      10 NA    NA    June 10, 2021    Study Status and Contacts/Locations                     NA
    #> # ℹ 48 more rows
    #> # ℹ Use `print(n = ...)` to see more rows

Answer 2

Score: 0

In such a case, we can cycle through a list of elements (i.e. table rows) and extract certain bits from each item. With this approach, we end up with a correctly aligned list or vector that can be bound to the previously extracted table:

    library(dplyr)
    library(rvest)
    library(purrr)

    # Set the URL of the webpage containing the table
    url <- "https://clinicaltrials.gov/ct2/history/NCT04658186"

    # Read the HTML code from the webpage
    page <- read_html(url)

    table_data <- page %>%
      # select the target table first to get a single table from html_table()
      html_element("table") %>%
      html_table(fill = TRUE)

    # select all table rows and cycle through them with map_chr();
    # map_chr() returns a character vector of the same length as
    # the input list (the number of <tr> elements)
    recr_stat <- page %>%
      html_elements("tbody tr") %>%
      map_chr(\(tr) html_element(tr, "span.recruitmentStatus") %>% html_attr("title"))

    # bind to table:
    bind_cols(table_data, `Recruitment Status` = recr_stat) %>%
      relocate(`Recruitment Status`, .before = Changes)
    #> # A tibble: 58 × 6
    #>    Version A     B     `Submitted Date` `Recruitment Status`             Changes
    #>      <int> <lgl> <lgl> <chr>            <chr>                            <chr>
    #>  1       1 NA    NA    December 1, 2020 <NA>                             None (…
    #>  2       2 NA    NA    January 12, 2021 Not yet recruiting --> Recruiti… Recrui…
    #>  3       3 NA    NA    January 29, 2021 <NA>                             Contac…
    #>  4       4 NA    NA    February 4, 2021 <NA>                             Study …
    #>  5       5 NA    NA    March 4, 2021    <NA>                             Study …
    #>  6       6 NA    NA    March 18, 2021   <NA>                             Contac…
    #>  7       7 NA    NA    April 15, 2021   <NA>                             Study …
    #>  8       8 NA    NA    May 14, 2021     <NA>                             Study …
    #>  9       9 NA    NA    May 27, 2021     <NA>                             Contac…
    #> 10      10 NA    NA    June 10, 2021    <NA>                             Study …
    #> # ℹ 48 more rows
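
The alignment point in this answer can be checked offline. The sketch below is only an illustration on a made-up HTML snippet built with rvest's `minimal_html()`; the table content and the `demo_page` name are assumptions, not the real ClinicalTrials.gov markup. It contrasts selecting spans across the whole page, which leaves out rows that have no span, with selecting one span per row, which keeps an `NA` placeholder for those rows.

```r
library(rvest)
library(purrr)

# Hypothetical three-row table; only the second row carries a hover title.
demo_page <- minimal_html('
  <table>
    <thead><tr><th>Version</th><th>Changes</th></tr></thead>
    <tbody>
      <tr><td>1</td><td>None</td></tr>
      <tr><td>2</td><td><span class="recruitmentStatus"
            title="Not yet recruiting --&gt; Recruiting">Recruitment Status</span></td></tr>
      <tr><td>3</td><td>Study Status</td></tr>
    </tbody>
  </table>')

# Page-wide selection: one entry per <span> found, so rows without a span
# are simply absent and the vector cannot be matched back to rows.
page_wide <- demo_page %>%
  html_elements("span.recruitmentStatus") %>%
  html_attr("title")

# Row-wise selection: html_element() yields a missing node when nothing
# matches, and html_attr() turns that into NA, giving one entry per row.
row_wise <- demo_page %>%
  html_elements("tbody tr") %>%
  map_chr(\(tr) html_attr(html_element(tr, "span.recruitmentStatus"), "title"))

length(page_wide) # 1
length(row_wise)  # 3

# Guard against misalignment before attaching the new column.
table_data <- html_table(html_element(demo_page, "table"))
stopifnot(nrow(table_data) == length(row_wise))
table_data$status <- row_wise
```

The `stopifnot()` check is an optional safeguard: if a page contained several `tbody` elements, the row count and the vector length could differ, and the mismatch would surface before the column is attached.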

For a more robust approach, we can skip html_table() and extract all required details from every element (here: tr) ourselves. This also works for tableless designs where tabular data is presented through lists or divs, for example.

    results <- page %>%
      html_elements("tbody tr") %>%
      map(\(tr) list(
        version  = html_element(tr, "td[headers='VersionNumber']") %>% html_text(),
        date     = html_element(tr, "td[headers='VersionDate']") %>% html_text(),
        recrstat = html_element(tr, "td[headers='Changes'] span.recruitmentStatus") %>% html_attr("title"),
        changes  = html_element(tr, "td[headers='Changes']") %>% html_text()
      )) %>%
      bind_rows()

    results %>%
      mutate(version = as.integer(version),
             date = lubridate::mdy(date))
    #> # A tibble: 58 × 4
    #>    version date       recrstat                          changes
    #>      <int> <date>     <chr>                             <chr>
    #>  1       1 2020-12-01 <NA>                              None (earliest Version …
    #>  2       2 2021-01-12 Not yet recruiting --> Recruiting Recruitment Status, Stu…
    #>  3       3 2021-01-29 <NA>                              Contacts/Locations and …
    #>  4       4 2021-02-04 <NA>                              Study Status and Contac…
    #>  5       5 2021-03-04 <NA>                              Study Status and Contac…
    #>  6       6 2021-03-18 <NA>                              Contacts/Locations and …
    #>  7       7 2021-04-15 <NA>                              Study Status and Contac…
    #>  8       8 2021-05-14 <NA>                              Study Status and Contac…
    #>  9       9 2021-05-27 <NA>                              Contacts/Locations and …
    #> 10      10 2021-10-20 <NA>                              Study Status and Contac…
    #> # ℹ 48 more rows

Created on 2023-06-15 with reprex v2.0.2
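
To illustrate the tableless case mentioned in the answer above, here is a small sketch on invented markup in which each record is a `div` rather than a table row. The class names (`release`, `version`, `date`, `status`) and the `demo_page` object are hypothetical; the assumption is that the same per-record extraction pattern carries over to real pages once the selectors are adapted.

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical div-based layout; the first record has no status element.
demo_page <- minimal_html('
  <div class="release">
    <span class="version">1</span>
    <span class="date">December 1, 2020</span>
  </div>
  <div class="release">
    <span class="version">2</span>
    <span class="date">January 12, 2021</span>
    <span class="status" title="Recruiting">Recruitment Status</span>
  </div>')

results <- demo_page %>%
  html_elements("div.release") %>%
  # build one named list per record; missing pieces become NA
  map(\(rec) list(
    version = html_text(html_element(rec, ".version"), trim = TRUE),
    date    = html_text(html_element(rec, ".date"), trim = TRUE),
    status  = html_attr(html_element(rec, ".status"), "title")
  )) %>%
  bind_rows() %>%
  mutate(version = as.integer(version))

results
```

Because every field is looked up inside its own record, a record that lacks the status element still produces a row, with `NA` in the `status` column.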

huangapple
  • Posted on 2023-06-15 04:20:33
  • Link to this article: https://go.coder-hub.com/76477258.html