Web scrape all items on a page with Selenium

Question

I am trying to retrieve all the items listed on a web page. For instance, I would like to get the first one,

"Hillebrand Boudewynsz. van der Aa (1661 - 1717)"

and then the other 49 on the page. My code is below. I am using Selenium (through RSelenium) and trying to select the items by XPath or CSS, but I am not sure what the right path is. Either option is welcome.

This is the relevant line from the code:

    # Finding element
    object <- remDr$findElement(using = "xpath", "/html/body/div[2]/div/ul/li[1]/a")

And this is the website:

https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse

    rm(list = ls())
    library(tidyverse)
    # install.packages("robotstxt")
    library(robotstxt)
    # install.packages("RSelenium")
    library(rvest)
    library(RSelenium)
    # install.packages("netstat")
    library(netstat)
    library(wdman)

    selenium()
    # see path
    selenium_object <- selenium(retcommand = TRUE, check = FALSE)
    # binman::list_versions("chromedriver")

    # start the server
    remote_driver <- rsDriver(
      browser = "chrome",
      chromever = "113.0.5672.63",
      verbose = FALSE,
      port = free_port()
    )

    # create a client object
    remDr <- remote_driver$client
    # open a browser
    remDr$open()
    # maximize window size
    remDr$maxWindowSize()
    # navigate to the website
    remDr$navigate("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse")

    # Finding element
    object <- remDr$findElement(using = "xpath", "/html/body/div[2]/div/ul/li[1]/a")
    #---------------------------------------------------------------------

Answer 1

Score: 0

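This should work for you. The script opens the page in headless Chrome, waits until the list links are present, parses the page source with BeautifulSoup, and writes each name to `vondel.csv`: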

    import csv
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup

    url = "https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse"
    all_data = []

    # Selenium configuration
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run in headless mode
    driver = webdriver.Chrome(options=chrome_options)
    wait = WebDriverWait(driver, 10)

    try:
        # Open the page
        driver.get(url)
        # Wait for the list links to load
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#setwidth > ul > li > a")))
        # Get the HTML content of the page
        html = driver.page_source
    finally:
        driver.quit()  # Close the browser even if an exception occurs

    # Extract the necessary data using BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    products = soup.select("#setwidth > ul > li > a")
    for title in products:
        title_text = title.get_text(strip=True) if title else ""
        all_data.append([title_text])

    # Write data to a CSV file
    with open("vondel.csv", "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Title"])
        writer.writerows(all_data)
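Since the question asks specifically for an XPath, it may also be worth noting that the XPath in the question ends in `li[1]/a`, which can only ever match the first list item. A possible way to get every item with Selenium alone is to drop that positional index and ask for all matching elements. The sketch below is a minimal illustration in Python, not a tested scraper for this site: the helper names `xpath_for_all_items` and `collect_link_texts` are hypothetical, it assumes the XPath from the question is otherwise valid for the page, and it assumes `driver` is a Selenium 4 WebDriver that has already navigated there. In RSelenium the corresponding call would be `findElements` (plural) rather than `findElement`.

```python
import re


def xpath_for_all_items(first_item_xpath: str) -> str:
    """Drop the positional index from the last `li[n]` step so the
    XPath matches every list item instead of a single one."""
    # The greedy ".*" makes the pattern hit the last "/li[<digits>]" step.
    return re.sub(r"^(.*/li)\[\d+\]", r"\1", first_item_xpath)


def collect_link_texts(driver, first_item_xpath: str) -> list:
    """Return the stripped text of every element matched by the generalized XPath.

    `driver` is assumed to be a Selenium 4 WebDriver already on the page.
    "xpath" is the string value of selenium's By.XPATH constant.
    """
    xpath = xpath_for_all_items(first_item_xpath)
    elements = driver.find_elements("xpath", xpath)
    return [element.text.strip() for element in elements]


if __name__ == "__main__":
    # XPath copied from the question; prints /html/body/div[2]/div/ul/li/a
    print(xpath_for_all_items("/html/body/div[2]/div/ul/li[1]/a"))
```

With a live driver, `collect_link_texts(driver, "/html/body/div[2]/div/ul/li[1]/a")` would return one string per matched link, assuming the page structure matches the XPath.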

Answer 2

Score: 0

RSelenium / Selenium should be your last resort when scraping websites. This site can easily be scraped with `rvest` in R. This is how you can scrape **all** links from its 26 pages and save them in a tibble / data frame.

    library(tidyverse)
    library(rvest)

    scraper <- function(page) {
      cat("Scraping page", page, "\n")
      str_c("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse&field=surname&strtchar=A&page=", page) %>%
        read_html() %>%
        html_elements("ul") %>%
        pluck(2) %>%
        html_elements("a") %>%
        map_dfr(., ~ tibble(
          name = .x %>%
            html_text2(),
          # fixed("..") matches a literal ".."; a bare ".." is a regex for any two characters
          link = .x %>%
            html_attr("href") %>%
            str_replace(fixed(".."), "https://www.vondel.humanities.uva.nl/ecartico")
        )) %>%
        mutate(page = page)
    }

    df <- map_dfr(1:26, scraper)

    # A tibble: 1,269 × 3
       name                                            link                                                         page
       <chr>                                           <chr>                                                       <int>
     1 Hillebrand Boudewynsz. van der Aa (1661 - 1717) https://www.vondel.humanities.uva.nl/ecartico/persons/414       1
     2 Boudewijn Pietersz van der Aa (? - ?)           https://www.vondel.humanities.uva.nl/ecartico/persons/10566     1
     3 Pieter Boudewijnsz. van der Aa (1659 - 1733)    https://www.vondel.humanities.uva.nl/ecartico/persons/10567     1
     4 Boudewyn van der Aa (1672 - ca. 1714)           https://www.vondel.humanities.uva.nl/ecartico/persons/10568     1
     5 Machtelt van der Aa (? - ?)                     https://www.vondel.humanities.uva.nl/ecartico/persons/27132     1
     6 Claas van der Aa I (? - ?)                      https://www.vondel.humanities.uva.nl/ecartico/persons/33780     1
     7 Claas van der Aa II (? - ?)                     https://www.vondel.humanities.uva.nl/ecartico/persons/33781     1
     8 Willem van der Aa (? - ?)                       https://www.vondel.humanities.uva.nl/ecartico/persons/33782     1
     9 Johanna van der Aa (? - ?)                      https://www.vondel.humanities.uva.nl/ecartico/persons/59894     1
    10 Hans von Aachen (1552 - 1615)                   https://www.vondel.humanities.uva.nl/ecartico/persons/9203      1
    # ℹ 1,259 more rows
    # ℹ Use `print(n = ...)` to see more rows
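The answer above relies on two small URL manipulations: appending the page number to the browse URL, and turning each relative `href` into an absolute link. As a rough sketch of the same two steps in Python's standard library (the function names `page_url` and `absolute_link` are hypothetical, the query parameters are copied from the answer's URL, and the `../persons/414` form of the `href` is an assumption inferred from the answer's `str_replace` call and its output):

```python
from urllib.parse import urlencode, urljoin

# Browse page taken from the question and the answer above.
BASE_URL = "https://www.vondel.humanities.uva.nl/ecartico/persons/index.php"


def page_url(page: int, start_char: str = "A") -> str:
    """Build the browse URL for one result page.

    The parameter names mirror the URL used in the answer; their exact
    meaning on the server side is not confirmed here.
    """
    query = urlencode({
        "subtask": "browse",
        "field": "surname",
        "strtchar": start_char,
        "page": page,
    })
    return f"{BASE_URL}?{query}"


def absolute_link(href: str) -> str:
    """Resolve an href relative to the browse page instead of editing the text."""
    return urljoin(BASE_URL, href)


if __name__ == "__main__":
    for page in range(1, 4):
        print(page_url(page))
    # Assumed relative href; resolves to .../ecartico/persons/414
    print(absolute_link("../persons/414"))
```

Resolving the link with `urljoin` avoids depending on where `..` happens to appear in the string, which is the weak point of a plain text replacement.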

huangapple
  • Published on June 5, 2023, 10:11:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76403162.html