web scrape all items in a page with selenium

Question
I am trying to scrape all the items on the webpage. For instance, I would like to get the first one,
"Hillebrand Boudewynsz. van der Aa (1661 - 1717)"
and then the other 49 on the page.
My code is below. I am trying to use Selenium to retrieve the items through XPath or CSS, but I am not sure what the right path is; both options are welcome.
This is the relevant line of code:
#Finding element
object<-remDr$findElement(using="xpath","/html/body/div[2]/div/ul/li[1]/a")
#---------------------------------------------------------------------
And the website:
https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse
rm(list = ls())

library(tidyverse)
#install.packages("robotstxt")
library(robotstxt)
#install.packages("RSelenium")
library(rvest)
library(RSelenium)
#install.packages("netstat")
library(netstat)
library(wdman)

selenium()
# see path
selenium_object <- selenium(retcommand = T, check = F)
#binman::list_versions("chromedriver")

# start the server
remote_driver <- rsDriver(
  browser = "chrome",
  chromever = "113.0.5672.63",
  verbose = F,
  port = free_port()
)
# create a client object
remDr <- remote_driver$client
# open a browser
remDr$open()
# maximize the window
remDr$maxWindowSize()
# navigate to the website
remDr$navigate("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse")
# find the element
object <- remDr$findElement(using = "xpath", "/html/body/div[2]/div/ul/li[1]/a")
#---------------------------------------------------------------------
Answer 1 (score: 0)
That should work for you!
import csv
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse"
all_data = []

# Selenium configuration
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in headless mode
driver = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(driver, 10)

try:
    # Access the page
    driver.get(url)
    # Wait for the entries to load
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#setwidth > ul > li > a")))
    # Get the HTML content of the page
    html = driver.page_source
finally:
    driver.quit()  # Close the browser even if an exception occurs

# Extract the necessary data with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
products = soup.select('#setwidth > ul > li > a')
for title in products:
    title_text = title.get_text(strip=True) if title else ""
    all_data.append([title_text])

# Write the data to a CSV file
with open("vondel.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Title'])
    writer.writerows(all_data)
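The key move in this answer is the CSS selector `#setwidth > ul > li > a`, which matches every person link inside the `#setwidth` container rather than just the first. A minimal offline sketch of that selection, using a hypothetical HTML fragment modeled on the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the structure of the ecartico browse page
html = """
<div id="setwidth">
  <ul>
    <li><a href="../persons/414">Hillebrand Boudewynsz. van der Aa (1661 - 1717)</a></li>
    <li><a href="../persons/10566">Boudewijn Pietersz van der Aa (? - ?)</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# The same selector used against the live page collects all <a> items at once
titles = [a.get_text(strip=True) for a in soup.select("#setwidth > ul > li > a")]
print(titles)
```

Run against the real page, the list would contain all 50 entries instead of two.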
Answer 2 (score: 0)
RSelenium / Selenium should be your last resort when scraping websites. This site can easily be scraped with rvest in R. This is how you can scrape all links from its 26 pages, saved in a tibble / data frame.
library(tidyverse)
library(rvest)

scraper <- function(page) {
  cat("Scraping page", page, "\n")
  str_c("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse&field=surname&strtchar=A&page=", page) %>%
    read_html() %>%
    html_elements("ul") %>%
    pluck(2) %>%
    html_elements("a") %>%
    map_dfr(., ~ tibble(
      name = .x %>% html_text2(),
      link = .x %>%
        html_attr("href") %>%
        str_replace("..", "https://www.vondel.humanities.uva.nl/ecartico")
    )) %>%
    mutate(page = page)
}

df <- map_dfr(1:26, scraper)
# A tibble: 1,269 × 3
   name                                            link                                                        page
   <chr>                                           <chr>                                                      <int>
 1 Hillebrand Boudewynsz. van der Aa (1661 - 1717) https://www.vondel.humanities.uva.nl/ecartico/persons/414      1
 2 Boudewijn Pietersz van der Aa (? - ?)           https://www.vondel.humanities.uva.nl/ecartico/persons/10566    1
 3 Pieter Boudewijnsz. van der Aa (1659 - 1733)    https://www.vondel.humanities.uva.nl/ecartico/persons/10567    1
 4 Boudewyn van der Aa (1672 - ca. 1714)           https://www.vondel.humanities.uva.nl/ecartico/persons/10568    1
 5 Machtelt van der Aa (? - ?)                     https://www.vondel.humanities.uva.nl/ecartico/persons/27132    1
 6 Claas van der Aa I (? - ?)                      https://www.vondel.humanities.uva.nl/ecartico/persons/33780    1
 7 Claas van der Aa II (? - ?)                     https://www.vondel.humanities.uva.nl/ecartico/persons/33781    1
 8 Willem van der Aa (? - ?)                       https://www.vondel.humanities.uva.nl/ecartico/persons/33782    1
 9 Johanna van der Aa (? - ?)                      https://www.vondel.humanities.uva.nl/ecartico/persons/59894    1
10 Hans von Aachen (1552 - 1615)                   https://www.vondel.humanities.uva.nl/ecartico/persons/9203     1
# ℹ 1,259 more rows
# ℹ Use `print(n = ...)` to see more rows
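The `str_replace("..", "https://www.vondel.humanities.uva.nl/ecartico")` step rewrites the relative hrefs (e.g. `../persons/414`) into absolute links by hand. An equivalent, more general approach is to resolve each href against the page URL, which yields the same absolute links for these entries; a sketch in Python using only the standard library:

```python
from urllib.parse import urljoin

page_url = "https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse"

# Relative hrefs as they appear in the page source
hrefs = ["../persons/414", "../persons/10566"]

# urljoin resolves ".." against the path component of page_url
links = [urljoin(page_url, h) for h in hrefs]
print(links)
```

Unlike the literal string replacement, URL resolution also stays correct if the site ever serves hrefs in a different relative form.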