Web-scraping with headless-chrome (Rust), clicking doesn't seem to work
Question

I'm relatively new to Rust and completely new to the web (and to scraping).
I tried to implement a web scraper as a pet project to get more comfortable with Rust and with the web stack.
I use headless-chrome to go to a website and scrape it for links, which I will investigate later.
So, I open a tab, navigate to the website, then scrape the URLs, and finally want to click on the next button. Even though I find the next button (with a CSS selector) and I use click(), nothing happens.
In the next iteration, I scrape the same list again (so it clearly didn't move to the next page).
```rust
use headless_chrome::Tab;
use std::error::Error;
use std::sync::Arc;
use std::{thread, time};

pub fn scrape(tab: Arc<Tab>) {
    let url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC&sf=TIMESTAMP";
    if let Err(_) = tab.navigate_to(url) {
        println!("Failed to navigate to {}", url);
        return;
    }
    if let Err(e) = tab.wait_until_navigated() {
        println!("Failed to wait for navigation: {}", e);
        return;
    }
    if let Ok(gdpr_accept_button) = tab.wait_for_element(".sc-gsDKAQ.fILFKg") {
        if let Err(e) = gdpr_accept_button.click() {
            println!("Failed to click GDPR accept button: {}", e);
            return;
        }
    } else {
        println!("No GDPR popup to acknowledge found.");
    }

    let mut links = Vec::<String>::new();
    loop {
        let mut skipped: usize = 0;
        let new_urls_count: usize;
        match parse_list(&tab) {
            Ok(urls) => {
                new_urls_count = urls.len();
                for url in urls {
                    if !links.contains(&url) {
                        links.push(url);
                    } else {
                        skipped += 1;
                    }
                }
            }
            Err(_) => {
                println!("No more houses found: stopping");
                break;
            }
        }
        if skipped == new_urls_count {
            println!("Only previously loaded houses found: stopping");
            break;
        }
        if let Ok(button) = tab.wait_for_element("[class=\"arrowButton-20ae5\"]") {
            if let Err(e) = button.click() {
                println!("Failed to click next page button: {}", e);
                break;
            } else {
                println!("Clicked next page button");
            }
        } else {
            println!("No next page button found: stopping");
            break;
        }
        if let Err(e) = tab.wait_until_navigated() {
            println!("Failed to load next page: {}", e);
            break;
        }
    }

    println!("Found {} houses:", links.len());
    for link in links {
        println!("\t{}", link);
    }
}

fn parse_list(tab: &Arc<Tab>) -> Result<Vec<String>, Box<dyn Error>> {
    let elements = tab.find_elements("div[class*=\"EstateItem\"] > a")?; // ".EstateItem-1c115"
    let mut links = Vec::<String>::new();
    for element in elements {
        if let Some(url) = element
            .call_js_fn(
                "function() { return this.getAttribute(\"href\"); }",
                vec![],
                true,
            )?
            .value
        {
            links.push(url.to_string());
        }
    }
    Ok(links)
}
```
When I call this code in main, I get the following output:

```
No GDPR popup to acknowledge found.
Clicked next page button
Only previously loaded houses found: stopping
Found 20 houses:
...
```

My problem is that I don't understand why clicking the next button doesn't do anything. As I am new to both Rust and web applications, I can't tell whether it's a problem with how I use the crate (headless-chrome) or with my understanding of web scraping.
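A small helper like the following (just a sketch with a hypothetical name, reusing the parse_list function from the code above) is what I would use to check whether the click actually changes the listing:

```rust
use headless_chrome::Tab;
use std::error::Error;
use std::sync::Arc;

// Debugging sketch (hypothetical helper, not part of the scraper itself):
// clicks whatever the selector matches and reports whether the first listing
// URL changed afterwards. Relies on parse_list from the code above.
fn click_and_compare(tab: &Arc<Tab>, selector: &str) -> Result<(), Box<dyn Error>> {
    let before = parse_list(tab)?.first().cloned();
    let button = tab.wait_for_element(selector)?;
    button.click()?;
    tab.wait_until_navigated()?;
    let after = parse_list(tab)?.first().cloned();
    println!("first listing before click: {:?}", before);
    println!("first listing after click:  {:?}", after);
    Ok(())
}
```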
Answer 1
Score: 2
tl;dr: replace the code that clicks the next-page button with the following:
```rust
if let Ok(button) = tab.wait_for_element(r#"*[class^="Pagination"] button:last-child"#) {
    // Expl: both the left and right arrow buttons have the same class, which is
    // why the original selector doesn't work.
    if let Err(e) = button.click() {
        println!("Failed to click next page button: {}", e);
        break;
    } else {
        println!("Clicked next page button");
    }
} else {
    println!("No next page button found: stopping");
    break;
}
// Expl: Rust is too fast, so we need to wait for the page to load
std::thread::sleep(std::time::Duration::from_secs(5)); // Wait for 5 seconds
if let Err(e) = tab.wait_until_navigated() {
    println!("Failed to load next page: {}", e);
    break;
}
```
- The original code clicks the right arrow button on the first page, but on every page after that it clicks the left arrow button, because the CSS selector matches the left arrow button as well; and, by virtue of coming first in the DOM tree, the left button is the one returned.
- The original code is simply too fast: Chrome needs a moment to load the next page. Should you find a fixed sleep intolerable performance-wise, check the events available here and wait for the browser to emit one: <https://docs.rs/headless_chrome/latest/headless_chrome/protocol/cdp/Accessibility/events/struct.LoadCompleteEvent.html>. A polling alternative is sketched right after this list.
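To illustrate the second point, here is a minimal sketch that polls until the listing actually changes instead of always sleeping a fixed five seconds; it assumes the parse_list helper from the question is in scope:

```rust
use headless_chrome::Tab;
use std::sync::Arc;
use std::{thread, time::Duration};

// Sketch only: polls until the first listing URL differs from the one seen on
// the previous page, giving up after `attempts` tries. Relies on the parse_list
// function defined in the question above.
fn wait_for_new_listings(tab: &Arc<Tab>, previous_first: &str, attempts: u32) -> bool {
    for _ in 0..attempts {
        if let Ok(urls) = parse_list(tab) {
            if urls.first().map(String::as_str) != Some(previous_first) {
                return true; // the listing changed, so the next page has rendered
            }
        }
        thread::sleep(Duration::from_millis(500));
    }
    false // still showing the old listing after all attempts
}
```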
As a final suggestion, all of the work above is unnecessary: the URL pattern is obviously https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC&sf=TIMESTAMP&sp={PAGINATION}. You can find all the pages of this site simply by scraping the pagination elements; you might as well ditch Chrome, perform basic HTTP requests, and parse the returned HTML. For this purpose, check out <https://docs.rs/scraper/latest/scraper/> and <https://docs.rs/reqwest/latest/reqwest/>. If performance is mission-critical for this spider, reqwest can also be used with tokio to scrape the pages asynchronously/concurrently. A minimal sketch of the plain-HTTP approach follows, and fuller implementations are in the update below.
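For example, a minimal blocking sketch of that approach, assuming reqwest is built with its blocking feature and that the page markup matches the EstateItem selector used in the question:

```rust
use scraper::{Html, Selector};
use std::error::Error;

// Fetches one result page over plain HTTP and extracts the listing links,
// using the same kind of CSS selector as the headless-chrome version.
fn scrape_page(url: &str) -> Result<Vec<String>, Box<dyn Error>> {
    let body = reqwest::blocking::get(url)?.text()?;
    let doc = Html::parse_document(&body);
    let selector = Selector::parse(r#"div[class*="EstateItem"] > a"#).unwrap();
    Ok(doc
        .select(&selector)
        .filter_map(|a| a.value().attr("href").map(str::to_owned))
        .collect())
}
```

Calling this once per page number from the pagination pattern above would collect every listing without a browser.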
UPDATE:
Below are Rust and Python implementations of the suggestion above. Rust libraries that parse HTML/XML and evaluate XPath seem to be rare and relatively unreliable, however.
```rust
use reqwest::Client;
use std::error::Error;
use std::sync::Arc;
use sxd_xpath::{Context, Factory, Value};

async fn get_page_count(client: &reqwest::Client, url: &str) -> Result<i32, Box<dyn Error>> {
    let res = client.get(url).send().await?;
    let body = res.text().await?;
    let pages_count = body
        .split("\"pagesCount\":")
        .nth(1)
        .unwrap()
        .split(",")
        .next()
        .unwrap()
        .trim()
        .parse::<i32>()?;
    Ok(pages_count)
}

async fn scrape_one(client: &Client, url: &str) -> Result<Vec<String>, Box<dyn Error>> {
    let res = client.get(url).send().await?;
    let body = res.text().await?;
    let package = sxd_html::parse_html(&body);
    let doc = package.as_document();
    let factory = Factory::new();
    let ctx = Context::new();
    let houses_selector = factory
        .build("//*[contains(@class, 'EstateItem')]")?
        .unwrap();
    let houses = houses_selector.evaluate(&ctx, doc.root())?;
    if let Value::Nodeset(houses) = houses {
        let mut data = Vec::new();
        for house in houses {
            let title_selector = factory.build(".//h2/text()")?.unwrap();
            let title = title_selector.evaluate(&ctx, house)?.string();
            let a_selector = factory.build(".//a/@href")?.unwrap();
            let href = a_selector.evaluate(&ctx, house)?.string();
            data.push(format!("{} - {}", title, href));
        }
        return Ok(data);
    }
    Err("No data found".into())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC";
    let client = reqwest::Client::builder()
        .user_agent(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
        )
        .build()?;
    let client = Arc::new(client);
    let page_count = get_page_count(&client, url).await?;
    let mut tasks = Vec::new();
    for i in 1..=page_count {
        let url = format!("{}&sf={}", url, i);
        let client = client.clone();
        tasks.push(tokio::spawn(async move {
            scrape_one(&client, &url).await.unwrap()
        }));
    }
    let results = futures::future::join_all(tasks).await;
    for result in results {
        println!("{:?}", result?);
    }
    Ok(())
}
```
The Python version (here `session` is presumably an aiohttp ClientSession, and the top-level awaits assume an already-running event loop, e.g. a notebook):

```python
import asyncio
import re

from lxml import etree  # HTML parsing and XPath (assumed)

# `session` is assumed to be an aiohttp.ClientSession created inside the running
# event loop; the top-level awaits below assume an async context (e.g. a notebook).


async def page_count(url):
    req = await session.get(url)
    return int(re.search(r'"pagesCount":\s*(\d+)', await req.text()).group(1))


async def scrape_one(url):
    req = await session.get(url)
    tree = etree.HTML(await req.text())
    houses = tree.xpath("//*[contains(@class, 'EstateItem')]")
    data = [
        dict(title=house.xpath(".//h2/text()")[0], href=house.xpath(".//a/@href")[0])
        for house in houses
    ]
    return data


url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC"
result = await asyncio.gather(
    *[
        scrape_one(url + f"&sf={i}")
        for i in range(1, await page_count(url + "&sf=1") + 1)
    ]
)
```