web scrape all items in a page with selenium

Question
I am trying to scrape all the items on the webpage. For instance, I would like to get the first one,
"Hillebrand Boudewynsz. van der Aa (1661 - 1717)"
and then the other 49 on the page.
My code is below. I am trying to use Selenium to retrieve the items through XPath or CSS, but I am not sure what the right path is; both options are welcome.
This is the relevant line of code:
#Finding element
object<-remDr$findElement(using="xpath","/html/body/div[2]/div/ul/li[1]/a")
#---------------------------------------------------------------------
And the website:
https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse
rm(list = ls())

library(tidyverse)
#install.packages("robotstxt")
library(robotstxt)
#install.packages("RSelenium")
library(rvest)
library(RSelenium)
#install.packages("netstat")
library(netstat)
library(wdman)

selenium()
# see path
selenium_object <- selenium(retcommand = T, check = F)
#binman::list_versions("chromedriver")

# start the server
remote_driver <- rsDriver(
  browser = "chrome",
  chromever = "113.0.5672.63",
  verbose = F,
  port = free_port()
)
# create a client object
remDr <- remote_driver$client
# open a browser
remDr$open()
# maximize the window
remDr$maxWindowSize()
# navigate to the website
remDr$navigate("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse")
# find the element
object <- remDr$findElement(using = "xpath", "/html/body/div[2]/div/ul/li[1]/a")
#---------------------------------------------------------------------
Answer 1 (score: 0)
That should work for you!
import csv
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse"
all_data = []

# Selenium configuration
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in headless mode
driver = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(driver, 10)

try:
    # Access the page
    driver.get(url)
    # Wait for the entries to load
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#setwidth > ul > li > a")))
    # Get the HTML content of the page
    html = driver.page_source
finally:
    driver.quit()  # Close the browser even if an exception occurs

# Extract the necessary data with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
products = soup.select('#setwidth > ul > li > a')
for title in products:
    title_text = title.get_text(strip=True) if title else ""
    all_data.append([title_text])

# Write the data to a CSV file
with open("vondel.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Title'])
    writer.writerows(all_data)
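The key move in this answer is the CSS selector `#setwidth > ul > li > a`, which matches every person link inside the `#setwidth` container rather than just the first. A minimal offline sketch of that selection, using a hypothetical HTML fragment modeled on the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the structure of the ecartico browse page
html = """
<div id="setwidth">
  <ul>
    <li><a href="../persons/414">Hillebrand Boudewynsz. van der Aa (1661 - 1717)</a></li>
    <li><a href="../persons/10566">Boudewijn Pietersz van der Aa (? - ?)</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# The same selector used against the live page collects all <a> items at once
titles = [a.get_text(strip=True) for a in soup.select("#setwidth > ul > li > a")]
print(titles)
```

Run against the real page, the list would contain all 50 entries instead of two.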
Answer 2 (score: 0)
RSelenium / Selenium should be your last resort when scraping websites. This site can easily be scraped with rvest in R. This is how you can scrape all links from its 26 pages, saved in a tibble / data frame.
library(tidyverse)
library(rvest)

scraper <- function(page) {
  cat("Scraping page", page, "\n")
  str_c("https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse&field=surname&strtchar=A&page=", page) %>%
    read_html() %>%
    html_elements("ul") %>%
    pluck(2) %>%
    html_elements("a") %>%
    map_dfr(., ~ tibble(
      name = .x %>% html_text2(),
      link = .x %>%
        html_attr("href") %>%
        str_replace("..", "https://www.vondel.humanities.uva.nl/ecartico")
    )) %>%
    mutate(page = page)
}

df <- map_dfr(1:26, scraper)
# A tibble: 1,269 × 3
   name                                            link                                                        page
   <chr>                                           <chr>                                                      <int>
 1 Hillebrand Boudewynsz. van der Aa (1661 - 1717) https://www.vondel.humanities.uva.nl/ecartico/persons/414      1
 2 Boudewijn Pietersz van der Aa (? - ?)           https://www.vondel.humanities.uva.nl/ecartico/persons/10566    1
 3 Pieter Boudewijnsz. van der Aa (1659 - 1733)    https://www.vondel.humanities.uva.nl/ecartico/persons/10567    1
 4 Boudewyn van der Aa (1672 - ca. 1714)           https://www.vondel.humanities.uva.nl/ecartico/persons/10568    1
 5 Machtelt van der Aa (? - ?)                     https://www.vondel.humanities.uva.nl/ecartico/persons/27132    1
 6 Claas van der Aa I (? - ?)                      https://www.vondel.humanities.uva.nl/ecartico/persons/33780    1
 7 Claas van der Aa II (? - ?)                     https://www.vondel.humanities.uva.nl/ecartico/persons/33781    1
 8 Willem van der Aa (? - ?)                       https://www.vondel.humanities.uva.nl/ecartico/persons/33782    1
 9 Johanna van der Aa (? - ?)                      https://www.vondel.humanities.uva.nl/ecartico/persons/59894    1
10 Hans von Aachen (1552 - 1615)                   https://www.vondel.humanities.uva.nl/ecartico/persons/9203     1
# ℹ 1,259 more rows
# ℹ Use `print(n = ...)` to see more rows
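The `str_replace("..", "https://www.vondel.humanities.uva.nl/ecartico")` step rewrites the relative hrefs (e.g. `../persons/414`) into absolute links by hand. An equivalent, more general approach is to resolve each href against the page URL, which yields the same absolute links for these entries; a sketch in Python using only the standard library:

```python
from urllib.parse import urljoin

page_url = "https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse"

# Relative hrefs as they appear in the page source
hrefs = ["../persons/414", "../persons/10566"]

# urljoin resolves ".." against the path component of page_url
links = [urljoin(page_url, h) for h in hrefs]
print(links)
```

Unlike the literal string replacement, URL resolution also stays correct if the site ever serves hrefs in a different relative form.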