How to extract information from a complex web page using Selenium via Ruby

Question

As an experiment, I wanted to test how to use Ruby, Selenium, and WebDriver to access a complex website. For this test I picked flights.google.com, to see how to find something on a page that looks visually simple but is actually a complex, dynamically generated view.

Seems easy enough. For example, on flights.google.com, entering two endpoints, say SFO to LAX, gives the URL https://www.google.com/travel/flights/search?tfs=CBwQAhogEgoyMDIzLTA2LTE0KABqBwgBEgNTRk9yBwgBEgNMQVgaIBIKMjAyMy0wNi0xOCgAagcIARIDTEFYcgcIARIDU0ZPQAFIAXABggELCP___________wGYAQE

Once the page is generated you get nicely displayed lists of flights, but the elements that hold a result do not have readable names:

<div class="yR1fYc" jsaction="click:O1htCb;gP4E0b:O1htCb;DIjhEc:YmNhJf" jsname="BXUrOb">
  <div class="mxvQLc ceis6c uj4xv uVdL1c A8qKrc" jsname="HSrbLb">…</div>
</div>

Typically with WebDriver and Selenium I would use:

require 'selenium-webdriver'
require 'capybara'

driver = Selenium::WebDriver.for :chrome
driver.get 'https://www.google.com/flights/'

and then use the find-element approach with some named element:

# The Ruby bindings take the locator as a keyword rather than find_elements_by_*
flights = driver.find_elements(class_name: 'flight')

In this case the class names are far more cryptic, so I'm not sure how to tackle the problem, especially if I assume the class names may be dynamically generated.

Any suggestions or approaches?

Answer 1

Score: 2

TL;DR

You may not be able to permanently solve your problem if the obfuscation is deliberate and continuously evolving. However, there are some best practices you can follow with web sites that aren't deliberately breaking standards. You may just need to reframe your solution to be less brittle in the face of dynamic content and find out what doesn't change on the page.

Analysis and Recommendations

If you're dealing with dynamically-generated content, you need one or more of the following:

  1. For JavaScript-dependent sites, a driver that can execute JavaScript; not all of them can.
    • Your Chrome driver does, but other JavaScript-capable drivers may render the page differently, so it may be worth trying some others.
    • If Google is deliberately using features of Chrome to obfuscate pages, using a different JavaScript driver or engine may help. YMMV with this one.
  2. Some sort of parent-child, nth-element, or container-based search to find what you want if you can't rely on a given class or ID name.
  3. A willingness to switch to a full-page regexp or fixed-string search if you know that there's a reliable prefix, suffix, or other string-based logic to identify text near the data or HTML elements you want.
  4. Consider testing against the HTML-only version of sites when available.
    • You could test with felinks or similar to see what renders without JavaScript.
    • You can search for their screen-reader or accessibility-enabled pages for the content you're trying to find.
    • You can see if they support the Web Accessibility Initiative's Accessible Rich Internet Applications (WAI-ARIA) attributes or similar gracefully degrading interfaces that you can parse properly.
  5. Consider API access instead.
    • Sometimes web-scraping is the wrong way to solve a data retrieval problem when there are RESTful or GraphQL options available.
    • For example, Google used to offer Google Flight Search (GFS) but seems to have discontinued it. I didn't find an alternative, but I only spent about 30 seconds trying to find one.
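
As a rough illustration of item 3, the sketch below pulls prices out of raw page source by anchoring on a fixed suffix instead of a class name. The HTML fragment, the "round trip" suffix, and the helper name prices_before are invented for the example; the real Google Flights markup will differ.

```ruby
# Hypothetical page source. With Selenium this would come from
# driver.page_source after the page has finished rendering.
PAGE_SOURCE = <<~'HTML'
  <div class="yR1fYc"><span class="a1b2c3">$129</span> round trip</div>
  <div class="yR1fYc"><span class="d4e5f6">$98</span> round trip</div>
  <div class="zz9"><span>$15</span> bag fee</div>
HTML

# Return every dollar amount that is followed (after optional tags and
# whitespace) by the fixed suffix text. Class names are ignored entirely.
def prices_before(html, suffix)
  pattern = /\$(\d+(?:\.\d{2})?)(?:\s|<[^>]*>)*#{Regexp.escape(suffix)}/
  html.scan(pattern).flatten.map(&:to_f)
end

puts prices_before(PAGE_SOURCE, 'round trip').inspect
```

This only holds up as long as the visible text around the data stays stable, which is usually a safer bet than generated class names.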

For example, if you can rely on something like "First Name" being somewhere in a given div container, you could use an XPath expression to find the parent container and then do a more structured or permissive search from there.
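
Here is a minimal sketch of that idea. To keep it runnable without a browser it uses REXML from Ruby's standard distribution on a made-up, well-formed fragment; the markup, the class names, and the helper value_near_label are assumptions for illustration only.

```ruby
require 'rexml/document'

# Hypothetical fragment with meaningless, generated-looking class names.
FRAGMENT = <<~'XML'
  <body>
    <div class="x9Qz"><label>Last Name</label><input value="Lovelace"/></div>
    <div class="k3Lp"><label>First Name</label><input value="Ada"/></div>
  </body>
XML

# Anchor on the visible label text, step up to the enclosing div, then
# search inside that container only. Assumes label_text has no single quote.
def value_near_label(xml, label_text)
  doc = REXML::Document.new(xml)
  container = REXML::XPath.first(doc, "//label[text()='#{label_text}']/parent::div")
  return nil unless container

  field = REXML::XPath.first(container, './/input')
  field && field.attributes['value']
end

puts value_near_label(FRAGMENT, 'First Name')
```

With Selenium, the same XPath string could be passed as driver.find_element(xpath: ...), and the returned element can then be searched with its own find_elements call.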

Capybara is pretty powerful, but sometimes you have to do your own parsing. If that happens, take a look at what Nokogiri has to offer, and see if you can do what you want even if you have to perform multiple extraction/search steps to get it done.

Not everything has a one-liner as a solution. Dynamic sites that are designed to break standard tools like mechanize or CSS-based parsers are generally doing that for a reason, I suppose, even though it often breaks compatibility with standards. Assuming they aren't also deliberately breaking accessibility features like ALT text or other accessibility attributes, you might think about leveraging those or similar accessibility features since commercial sites are often required to meet ADA or Section 508 standards, no matter how much they abuse the DOM.
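
One way to lean on accessibility attributes is a CSS attribute selector on aria-label. The sketch below is only that: the helper names are hypothetical, and whether a given page exposes useful aria-label text (the "US dollars" fragment here is a guess) has to be checked in the browser's inspector.

```ruby
# Build a CSS attribute selector that matches elements whose aria-label
# contains the given text (the *= operator is a substring match).
def aria_label_selector(text, tag: '*')
  escaped = text.gsub('\\') { '\\\\' }.gsub('"') { '\\"' }
  %(#{tag}[aria-label*="#{escaped}"])
end

# Works with any object that responds to find_elements(css: ...), such as
# a Selenium::WebDriver driver or a previously located element.
def find_by_aria_label(scope, text, tag: '*')
  scope.find_elements(css: aria_label_selector(text, tag: tag))
end

puts aria_label_selector('US dollars', tag: 'li')
```

With a live session this would be called as find_by_aria_label(driver, 'US dollars', tag: 'li').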

Answer 2

Score: 1

You can still rely on the structure, since the structure is not random; a CSS selector like this should be able to give you the list elements of the results without relying on class names:

body > c-wiz:nth-of-type(2) > div:nth-of-type(1) > div:nth-of-type(2) >
c-wiz:nth-of-type(1) > div:nth-of-type(1) >
c-wiz:nth-of-type(1) > div:nth-of-type(2) > div:nth-of-type(2) > div:nth-of-type(3) >
ul:nth-of-type(1) > li
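
A selector this long is easier to maintain if it is assembled from a list of steps, so that a layout change means editing one entry. The sketch below rebuilds the same selector; RESULT_PATH and structural_selector are names made up for the example.

```ruby
# Each step is [tag, position among same-tag siblings], copied from the
# selector given in this answer.
RESULT_PATH = [
  ['c-wiz', 2], ['div', 1], ['div', 2], ['c-wiz', 1], ['div', 1],
  ['c-wiz', 1], ['div', 2], ['div', 2], ['div', 3], ['ul', 1]
].freeze

# Join the steps into a child-combinator chain from root down to leaf.
def structural_selector(path, root: 'body', leaf: 'li')
  steps = path.map { |tag, n| "#{tag}:nth-of-type(#{n})" }
  [root, *steps, leaf].join(' > ')
end

puts structural_selector(RESULT_PATH)
```

The result can be passed to driver.find_elements(css: ...). Because the list is rendered by JavaScript, it is probably worth wrapping the lookup in a Selenium::WebDriver::Wait so the search retries until the items appear.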

huangapple
  • Posted on May 30, 2023, 10:53:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76361338.html