
HTML/XML: Understanding How "Scroll Bars" Work

Question

I am working with the R programming language and trying to learn about how to use Selenium to interact with webpages.

For example, using Google Maps - I am trying to find the name, address and longitude/latitude of all pizza shops around a certain area. As I understand, this would involve entering the location you are interested in, clicking the "nearby" button, entering what you are looking for (e.g. "pizza"), scrolling all the way to the bottom to make sure all pizza shops are loaded - and then copying the names, addresses and longitudes/latitudes of all pizza locations.

I have been teaching myself how to use Selenium in R and have been able to solve parts of this problem myself. Here is what I have done so far:

Part 1: Searching for an address (e.g. Statue of Liberty, New York, USA) and returning a longitude/latitude:

```r
library(RSelenium)
library(wdman)
library(netstat)

selenium()
selenium_object <- selenium(retcommand = T, check = F)

remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = F, port = free_port())

remDr <- remote_driver$client
remDr$navigate("https://www.google.com/maps")

search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("Statue of Liberty", key = "enter"))

Sys.sleep(5)

url <- remDr$getCurrentUrl()[[1]]

# The map centre appears in the URL as "@lat,lng" - extract it with backreferences
long_lat <- gsub(".*@(-?[0-9.]+),(-?[0-9.]+),.*", "\\1,\\2", url)
long_lat <- unlist(strsplit(long_lat, ","))

> long_lat
[1] "40.7269409"  "-74.0906116"
```

Part 2: Searching for all Pizza shops around a certain location:

```r
library(RSelenium)
library(wdman)
library(netstat)

selenium()
selenium_object <- selenium(retcommand = T, check = F)

remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = F, port = free_port())

remDr <- remote_driver$client

remDr$navigate("https://www.google.com/maps")

Sys.sleep(5)

# Centre the map on the point of interest
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("40.7256456,-74.0909442", key = "enter"))

Sys.sleep(5)

# Then search for "pizza" around that point
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$clearElement()
search_box$sendKeysToElement(list("pizza", key = "enter"))

Sys.sleep(5)
```

But from here, I do not know how to proceed. I do not know how to scroll the results all the way to the bottom so that every available result is loaded - and I do not know how to start extracting the names.

Doing some research (i.e. inspecting the HTML code), I made the following observations:

  • The name of a restaurant location can be found in the following tag: `<a class="hfpxzc" aria-label=`

  • The address of a restaurant location can be found in the following tag: `<div class="W4Efsd">`

In the end, I would be looking for a result like this:

        name                            address longitude latitude
1 pizza land 123 fake st, city, state, zip code    45.212  -75.123

Can someone please show me how to proceed?

Note: Seeing as more people likely use Selenium through Python - I am more than happy to learn how to solve this problem in Python and then try to convert the answer into R code.

Thanks!


UPDATE: Some further progress with addresses

```r
remDr$navigate("https://www.google.com/maps")

Sys.sleep(5)

search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("40.7256456,-74.0909442", key = "enter"))

Sys.sleep(5)

search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$clearElement()
search_box$sendKeysToElement(list("pizza", key = "enter"))

Sys.sleep(5)

# Names live in the aria-label of the result links (a.hfpxzc, per the inspection above);
# without this step, `names` below would refer to the base R function and fail
name_elements <- remDr$findElements(using = 'css selector', 'a.hfpxzc')
names <- lapply(name_elements, function(x) x$getElementAttribute("aria-label")[[1]])

address_elements <- remDr$findElements(using = 'css selector', '.W4Efsd')
addresses <- lapply(address_elements, function(x) x$getElementText()[[1]])

result <- data.frame(name = unlist(names), address = unlist(addresses))
```
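To also get a longitude/latitude per shop, one possible approach is to parse each result link's `href`. This is only a sketch, not tested: it assumes the href still embeds the coordinates as `!3d<lat>!4d<lng>` tokens (which Google may change at any time), and that the name and address vectors above have already been aligned one-per-shop (note that `.W4Efsd` can match more than one div per card):

```r
# Hypothetical: parse each result's coordinates out of its link URL.
# Hrefs are assumed to look like ".../place/...!3d40.1234!4d-74.5678..." - verify in DevTools first.
hrefs <- sapply(name_elements, function(x) x$getElementAttribute("href")[[1]])
latitude  <- as.numeric(sub(".*!3d(-?[0-9.]+).*", "\\1", hrefs))
longitude <- as.numeric(sub(".*!4d(-?[0-9.]+).*", "\\1", hrefs))

result <- data.frame(name = unlist(names), address = unlist(addresses),
                     longitude = longitude, latitude = latitude)
```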

Answer 1

Score: 7


*I see that you updated your question to include a Python answer, so here's how it's done in Python. You can use the same method for R.*



The page is *lazy loaded*, which means the data is paginated and loaded in as you scroll.


So what you need to do is keep scrolling the *last* loaded element of the data into view, which triggers the page to load more content.


### Finding how more data is loaded

You need to find out how the data is loaded. Here's what I did:

First, disable internet access for your browser in the Network tab (F12 -> Network -> Offline).

Then, scroll to the last loaded element; you will see a loading indicator (since there is no internet access, it will just hang).

Now, here comes the important part: find out what HTML tag this loading indicator lives under. In this case, the element is under the `div.qjESne` CSS selector.



### Working with Selenium


You can call the JavaScript `scrollIntoView()` function, which scrolls a particular element into view within the browser's viewport.
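In RSelenium the same JavaScript call can be issued through `executeScript`. A minimal sketch, assuming `element` is a webElement you have already located:

```r
# Scroll a previously located webElement into view via injected JavaScript
remDr$executeScript("arguments[0].scrollIntoView(true);", args = list(element))
```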

##### Finding out when to break


To find out when to stop scrolling in order to load more data, we need to find out what element appears when there's no more data.

If you scroll until there are no more results, you will see the "You've reached the end of the list." message, which is an element under the CSS selector `span.HlvSq`.




### Code examples


##### Scrolling the page

```py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

URL = "https://www.google.com/maps/search/Restaurants/@40.7256843,-74.1138399,14z/data=!4m8!2m7!3m5!1sRestaurants!2s40.7256456,-74.0909442!4m2!1d-74.0909442!2d40.7256456!6e5?entry=ttu"

driver = webdriver.Chrome()

driver.get(URL)

# Waits 10 seconds for the elements to load before scrolling
wait = WebDriverWait(driver, 10)
elements = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
)

while True:
    new_elements = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
    )

    # Pick the last element in the list - this is the one we want to scroll to
    last_element = elements[-1]
    # Scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView(true);", last_element)

    # Update the elements list
    elements = new_elements

    # Check if there are any new elements loaded - the "You've reached the end of the list." message
    if driver.find_elements(By.CSS_SELECTOR, "span.HlvSq"):
        print("No more elements")
        break
```
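For the R side, the same loop can be sketched with RSelenium (untested, and it relies on the same `div.qjESne` / `span.HlvSq` selectors, which Google can rename at any time):

```r
# Keep scrolling the last loaded card into view until the
# "You've reached the end of the list." marker (span.HlvSq) appears
repeat {
  elements <- remDr$findElements(using = "css selector", "div.qjESne")
  last_element <- elements[[length(elements)]]
  remDr$executeScript("arguments[0].scrollIntoView(true);", args = list(last_element))

  Sys.sleep(1)  # give the next page of results a moment to load

  if (length(remDr$findElements(using = "css selector", "span.HlvSq")) > 0) break
}
```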
##### Getting the data

If you inspect the page, you will see that the data sits in the result cards under the CSS selector `div.lI9IFe`.

What you need to do is wait until the scrolling has finished, and then collect all the data under that `div.lI9IFe` selector.

##### Code example

```py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

URL = "https://www.google.com/maps/search/Restaurants/@40.7256843,-74.1138399,14z/data=!4m8!2m7!3m5!1sRestaurants!2s40.7256456,-74.0909442!4m2!1d-74.0909442!2d40.7256456!6e5?entry=ttu"

driver = webdriver.Chrome()
driver.get(URL)

# Waits 10 seconds for the elements to load before scrolling
wait = WebDriverWait(driver, 10)
elements = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
)
titles = []
links = []
addresses = []

while True:
    new_elements = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
    )

    # Pick the last element in the list - this is the one we want to scroll to
    last_element = elements[-1]
    # Scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView(true);", last_element)

    # Update the elements list
    elements = new_elements
    # time.sleep(0.1)

    # Check if there are any new elements loaded - the "You've reached the end of the list." message
    if driver.find_elements(By.CSS_SELECTOR, "span.HlvSq"):
        # now we can parse the data since all the elements loaded
        for data in driver.find_elements(By.CSS_SELECTOR, "div.lI9IFe"):
            title = data.find_element(
                By.CSS_SELECTOR, "div.qBF1Pd.fontHeadlineSmall"
            ).text
            restaurant = data.find_element(
                By.CSS_SELECTOR, ".W4Efsd > span:nth-of-type(2)"
            ).text

            titles.append(title)
            addresses.append(restaurant)

        # This converts the list of titles and links into a dataframe
        df = pd.DataFrame(list(zip(titles, addresses)), columns=["title", "addresses"])
        print(df)
        break
```

Prints:

```
                             title               addresses
0                   Domino's Pizza  · 741 Communipaw Ave A
1        Tommy's Family Restaurant       · 349 Central Ave
2     VIP RESTAURANT LLC BARSHAY'S           · 175 Sip Ave
3    The Hutton Restaurant and Bar         · 225 Hutton St
4                        Barge Inn            · 324 3rd St
..                             ...                     ...
116            Bettie's Restaurant     · 579 West Side Ave
117               Mahboob-E-El Ahi     · 580 Montgomery St
118                Samosa Paradise        · 804 Newark Ave
119                     TACO DRIVE        · 195 Newark Ave
120                Two Boots Pizza        · 133 Newark Ave

[121 rows x 2 columns]
```
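A rough RSelenium equivalent of this parsing step, once the scrolling has finished (again a sketch built on the same assumed selectors):

```r
# Read each result card (div.lI9IFe) and pull out the title and address
cards <- remDr$findElements(using = "css selector", "div.lI9IFe")

titles <- sapply(cards, function(card) {
  card$findChildElement(using = "css selector", "div.qBF1Pd.fontHeadlineSmall")$getElementText()[[1]]
})
addresses <- sapply(cards, function(card) {
  card$findChildElement(using = "css selector", ".W4Efsd > span:nth-of-type(2)")$getElementText()[[1]]
})

df <- data.frame(title = titles, address = addresses)
print(df)
```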

Answer 2

Score: 4


That is already a good start. I can name a few things I did to proceed, but note I mainly work with Python.

For locating elements within the DOM tree I suggest using XPath. It has a human-readable syntax and is quite easy to learn.

https://devhints.io/xpath

There you can find an overview of all the ways to locate elements, along with a linked test bench by "Whitebeam.org" to practice on. It also helps with understanding how to extract names.
It will look something like this:

This returns an object for the given XPath expression:

restaurant_adr <- remDr$findElement(using = 'xpath', "//*/*[@class='W4Efsd']")

On this object we then need to reference the desired attribute, probably the text. In RSelenium the call would be:

restaurant_adr$getElementText()[[1]]

For scrolling there is the Selenium Actions API wheel support (https://www.selenium.dev/documentation/webdriver/actions_api/wheel/), but it has no documentation for R.

Or you could use JavaScript for scrolling.

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

https://cran.r-project.org/web/packages/js/vignettes/intro.html
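One caveat worth checking: in Google Maps the results live in a scrollable side panel rather than the page body, so scrolling the window itself may do nothing. A hedged RSelenium sketch that scrolls the panel instead - `div[role='feed']` is my guess at the panel's selector and should be verified in DevTools:

```r
# Scroll the results side panel (not the window) to its bottom.
# div[role='feed'] is assumed to be the scrollable results container - verify before use.
panel <- remDr$findElement(using = "css selector", "div[role='feed']")
remDr$executeScript("arguments[0].scrollTo(0, arguments[0].scrollHeight);", args = list(panel))
```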

Helpful resources:

https://statsandr.com/blog/web-scraping-in-r/

https://betterdatascience.com/r-web-scraping/

https://scrapfly.io/blog/web-scraping-with-r/#http-clients-crul
