Web-scraping with headless-chrome (Rust), clicking doesn't seem to work

Question

I'm relatively new to Rust and completely new to web scraping.
I tried to implement a web scraper as a pet project to get more comfortable with Rust and with the web stack.

I use headless-chrome to go to a website and scrape it for links, which I will investigate later.
So, I open a tab, navigate to the website, then scrape the URLs, and finally want to click on the next button. Even though I find the next button (with a CSS selector) and call click(), nothing happens.

In the next iteration, I scrape the same list again (so clearly it didn't move to the next page).

```rust
use headless_chrome::Tab;
use std::error::Error;
use std::sync::Arc;
use std::{thread, time};

pub fn scrape(tab: Arc<Tab>) {
    let url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC&sf=TIMESTAMP";
    if let Err(_) = tab.navigate_to(url) {
        println!("Failed to navigate to {}", url);
        return;
    }
    if let Err(e) = tab.wait_until_navigated() {
        println!("Failed to wait for navigation: {}", e);
        return;
    }
    // Dismiss the GDPR consent popup if it shows up.
    if let Ok(gdpr_accept_button) = tab.wait_for_element(".sc-gsDKAQ.fILFKg") {
        if let Err(e) = gdpr_accept_button.click() {
            println!("Failed to click GDPR accept button: {}", e);
            return;
        }
    } else {
        println!("No GDPR popup to acknowledge found.");
    }
    let mut links = Vec::<String>::new();
    loop {
        let mut skipped: usize = 0;
        let new_urls_count: usize;
        match parse_list(&tab) {
            Ok(urls) => {
                new_urls_count = urls.len();
                for url in urls {
                    if !links.contains(&url) {
                        links.push(url);
                    } else {
                        skipped += 1;
                    }
                }
            }
            Err(_) => {
                println!("No more houses found: stopping");
                break;
            }
        }
        // If every URL on this page was already known, we never left the page.
        if skipped == new_urls_count {
            println!("Only previously loaded houses found: stopping");
            break;
        }
        if let Ok(button) = tab.wait_for_element("[class=\"arrowButton-20ae5\"]") {
            if let Err(e) = button.click() {
                println!("Failed to click next page button: {}", e);
                break;
            } else {
                println!("Clicked next page button");
            }
        } else {
            println!("No next page button found: stopping");
            break;
        }
        if let Err(e) = tab.wait_until_navigated() {
            println!("Failed to load next page: {}", e);
            break;
        }
    }
    println!("Found {} houses:", links.len());
    for link in links {
        println!("\t{}", link);
    }
}

fn parse_list(tab: &Arc<Tab>) -> Result<Vec<String>, Box<dyn Error>> {
    let elements = tab.find_elements("div[class*=\"EstateItem\"] > a")?; // ".EstateItem-1c115"
    let mut links = Vec::<String>::new();
    for element in elements {
        // Extract the href attribute of each listing's anchor element.
        if let Some(url) = element
            .call_js_fn(
                "function() { return this.getAttribute(\"href\"); }",
                vec![],
                true,
            )?
            .value
        {
            links.push(url.to_string());
        }
    }
    Ok(links)
}
```

When I call this code in main, I get the following output:

```
No GDPR popup to acknowledge found.
Clicked next page button
Only previously loaded houses found: stopping
Found 20 houses:
...
```

My problem is that I don't understand why clicking the next button doesn't do anything. As I am new to Rust and to web applications, I can't tell whether it's a problem with how I use the crate (headless-chrome) or with my understanding of web scraping.


Answer 1

Score: 2

tl;dr: replace the code that clicks the next-page button with the following:

```rust
if let Ok(button) = tab.wait_for_element(r#"*[class^="Pagination"] button:last-child"#) {
    // Expl: both the left and right arrow buttons have the same class,
    // so the original selector doesn't work reliably.
    if let Err(e) = button.click() {
        println!("Failed to click next page button: {}", e);
        break;
    } else {
        println!("Clicked next page button");
    }
} else {
    println!("No next page button found: stopping");
    break;
}
// Expl: Rust is too fast, so we need to wait for the page to load.
std::thread::sleep(std::time::Duration::from_secs(5)); // Wait for 5 seconds
if let Err(e) = tab.wait_until_navigated() {
    println!("Failed to load next page: {}", e);
    break;
}
```
1. The original code clicks the right arrow button on the first page, but the left arrow button on every page after that, because the CSS class selector matches the left arrow as well; and by virtue of being first in the DOM tree, the left arrow is the element returned.
2. The original code simply runs too fast; Chrome needs a moment to load the next page. Should you find the fixed sleep's performance intolerable, wait for the browser to emit the load-complete event instead: <https://docs.rs/headless_chrome/latest/headless_chrome/protocol/cdp/Accessibility/events/struct.LoadCompleteEvent.html>. A polling middle ground is sketched below.
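If full CDP event handling is more than you want, a middle ground between that and a fixed sleep is to poll until the list content actually changes. The following is only a minimal sketch under the assumption that the first listing's href differs from page to page; `wait_for_new_page`, the 500 ms interval, and the 20-attempt budget are illustrative choices, not part of the crate:

```rust
use headless_chrome::Tab;
use std::sync::Arc;
use std::{thread, time::Duration};

// Hypothetical helper (not part of headless_chrome): poll until the first
// listing's href differs from the one captured before clicking "next".
fn wait_for_new_page(tab: &Arc<Tab>, old_first_href: &str) -> bool {
    for _ in 0..20 {
        if let Ok(element) = tab.wait_for_element("div[class*=\"EstateItem\"] > a") {
            if let Ok(result) = element.call_js_fn(
                "function() { return this.getAttribute(\"href\"); }",
                vec![],
                true,
            ) {
                if let Some(href) = result.value {
                    if href.to_string() != old_first_href {
                        return true; // the list changed: the next page has rendered
                    }
                }
            }
        }
        thread::sleep(Duration::from_millis(500)); // poll every half second
    }
    false // gave up after ~10 seconds
}
```

Here `old_first_href` would be the first URL returned by the previous `parse_list` call, compared in the same `to_string()` form that `parse_list` stores.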

As a final suggestion, all the work above is unnecessary: the URL pattern is obviously https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC&sf=TIMESTAMP&sp={PAGINATION}. You can find all the pages on this site by simply scraping the pagination elements, so you might as well ditch Chrome entirely, perform basic HTTP requests, and parse the returned HTML. For this purpose, check out <https://docs.rs/scraper/latest/scraper/> and <https://docs.rs/reqwest/latest/reqwest/>. If performance is mission-critical for this spider, reqwest can also be used with tokio to scrape the pages asynchronously/concurrently.
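As an illustration of that suggestion, a minimal sketch of a single-page fetch with reqwest's blocking client and the scraper crate might look like this; the EstateItem selector is carried over from the question, and pagination is left out:

```rust
use scraper::{Html, Selector};
use std::error::Error;

// Needs reqwest with the "blocking" feature enabled.
fn scrape_page(url: &str) -> Result<Vec<String>, Box<dyn Error>> {
    // A plain HTTP request instead of driving a headless browser.
    let body = reqwest::blocking::get(url)?.text()?;
    let document = Html::parse_document(&body);
    // Same listing selector as in the question.
    let selector = Selector::parse(r#"div[class*="EstateItem"] > a"#)
        .map_err(|e| format!("bad selector: {e:?}"))?;
    Ok(document
        .select(&selector)
        .filter_map(|a| a.value().attr("href"))
        .map(str::to_owned)
        .collect())
}
```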

UPDATE:

Below are Rust and Python implementations of my suggestion above. Rust libraries for parsing HTML/XML and evaluating XPath seem to be rare and relatively unreliable, however.

```rust
use reqwest::Client;
use std::error::Error;
use std::sync::Arc;
use sxd_xpath::{Context, Factory, Value};

async fn get_page_count(client: &reqwest::Client, url: &str) -> Result<i32, Box<dyn Error>> {
    let res = client.get(url).send().await?;
    let body = res.text().await?;
    // Pull the page count out of the embedded "pagesCount": N JSON.
    let pages_count = body
        .split("\"pagesCount\":")
        .nth(1)
        .unwrap()
        .split(",")
        .next()
        .unwrap()
        .trim()
        .parse::<i32>()?;
    Ok(pages_count)
}

async fn scrape_one(client: &Client, url: &str) -> Result<Vec<String>, Box<dyn Error>> {
    let res = client.get(url).send().await?;
    let body = res.text().await?;
    let package = sxd_html::parse_html(&body);
    let doc = package.as_document();
    let factory = Factory::new();
    let ctx = Context::new();
    let houses_selector = factory
        .build("//*[contains(@class, 'EstateItem')]")?
        .unwrap();
    let houses = houses_selector.evaluate(&ctx, doc.root())?;
    if let Value::Nodeset(houses) = houses {
        let mut data = Vec::new();
        for house in houses {
            let title_selector = factory.build(".//h2/text()")?.unwrap();
            let title = title_selector.evaluate(&ctx, house)?.string();
            let a_selector = factory.build(".//a/@href")?.unwrap();
            let href = a_selector.evaluate(&ctx, house)?.string();
            data.push(format!("{} - {}", title, href));
        }
        return Ok(data);
    }
    Err("No data found".into())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC";
    let client = reqwest::Client::builder()
        .user_agent(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
        )
        .build()?;
    let client = Arc::new(client);
    let page_count = get_page_count(&client, url).await?;
    let mut tasks = Vec::new();
    for i in 1..=page_count {
        let url = format!("{}&sf={}", url, i);
        let client = client.clone();
        tasks.push(tokio::spawn(async move {
            scrape_one(&client, &url).await.unwrap()
        }));
    }
    // Scrape all pages concurrently.
    let results = futures::future::join_all(tasks).await;
    for result in results {
        println!("{:?}", result?);
    }
    Ok(())
}
```
The Python version:

```python
# Assumes a running event loop (e.g. a notebook), since the session is
# created and awaited at top level.
import asyncio
import re

import aiohttp
from lxml import etree

session = aiohttp.ClientSession()


async def page_count(url):
    req = await session.get(url)
    return int(re.search(r'"pagesCount":\s*(\d+)', await req.text()).group(1))


async def scrape_one(url):
    req = await session.get(url)
    tree = etree.HTML(await req.text())
    houses = tree.xpath("//*[contains(@class, 'EstateItem')]")
    data = [
        dict(title=house.xpath(".//h2/text()")[0], href=house.xpath(".//a/@href")[0])
        for house in houses
    ]
    return data


url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC"
result = await asyncio.gather(
    *[
        scrape_one(url + f"&sf={i}")
        for i in range(1, await page_count(url + "&sf=1") + 1)
    ]
)
```
