HTML/XML: Understanding How "Scroll Bars" Work
Question
I am working with the R programming language and trying to learn about how to use Selenium to interact with webpages.
For example, using Google Maps - I am trying to find the name, address and longitude/latitude of all Pizza shops around a certain area. As I understand it, this would involve entering the location you are interested in, clicking the "nearby" button, entering what you are looking for (e.g. "pizza"), scrolling all the way to the bottom to make sure all pizza shops are loaded, and then copying the names, addresses and longitude/latitudes of all pizza locations.
I have been teaching myself how to use Selenium in R and have been able to solve parts of this problem myself. Here is what I have done so far:
Part 1: Searching for an address (e.g. Statue of Liberty, New York, USA) and returning its longitude/latitude:
```r
library(RSelenium)
library(wdman)
library(netstat)

selenium()
selenium_object <- selenium(retcommand = T, check = F)

remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = F, port = free_port())
remDr <- remote_driver$client

remDr$navigate("https://www.google.com/maps")

search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("Statue of Liberty", key = "enter"))

Sys.sleep(5)

# Pull the "latitude,longitude" pair out of the current URL
url <- remDr$getCurrentUrl()[[1]]
long_lat <- gsub(".*@(-?[0-9.]+),(-?[0-9.]+),.*", "\\1,\\2", url)
long_lat <- unlist(strsplit(long_lat, ","))
```

```
> long_lat
[1] "40.7269409" "-74.0906116"
```
Part 2: Searching for all Pizza shops around a certain location:
```r
library(RSelenium)
library(wdman)
library(netstat)

selenium()
selenium_object <- selenium(retcommand = T, check = F)

remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = F, port = free_port())
remDr <- remote_driver$client

remDr$navigate("https://www.google.com/maps")
Sys.sleep(5)

# Search for the coordinates of the area of interest
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("40.7256456,-74.0909442", key = "enter"))
Sys.sleep(5)

# Clear the search box and search for "pizza" around that location
search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$clearElement()
search_box$sendKeysToElement(list("pizza", key = "enter"))
Sys.sleep(5)
```
But from here, I do not know how to proceed. I do not know how to scroll the page all the way to the bottom to view all of the available results, and I do not know how to start extracting the names.
Doing some research (i.e. inspecting the HTML code), I made the following observations:
- The name of a restaurant location can be found in the following tag:

  `<a class="hfpxzc" aria-label=`

- The address of a restaurant location can be found in the following tag:

  `<div class="W4Efsd">`
In the end, I would be looking for a result like this:
```
  name        address                              longitude  latitude
1 pizza land  123 fake st, city, state, zip code   45.212     -75.123
```
Can someone please show me how to proceed?
Note: Seeing as more people likely use Selenium through Python, I am more than happy to learn how to solve this problem in Python and then try to convert the answer into R code.
Thanks!
References:
- https://medium.com/python-point/python-crawling-restaurant-data-ab395d121247
- https://www.youtube.com/watch?v=GnpJujF9dBw
- https://www.youtube.com/watch?v=U1BrIPmhx10
UPDATE: Some further progress with addresses
```r
remDr$navigate("https://www.google.com/maps")
Sys.sleep(5)

search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("40.7256456,-74.0909442", key = "enter"))
Sys.sleep(5)

search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$clearElement()
search_box$sendKeysToElement(list("pizza", key = "enter"))
Sys.sleep(5)

# Grab the visible address elements and read their text
address_elements <- remDr$findElements(using = 'css selector', '.W4Efsd')
addresses <- lapply(address_elements, function(x) x$getElementText()[[1]])

# Note: "names" is not defined anywhere above; it still needs to be collected
result <- data.frame(name = unlist(names), address = unlist(addresses))
```
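For reference, a minimal, untested sketch of how the missing `names` vector could be collected, based on the `a.hfpxzc` / `aria-label` observation earlier in the question (the selector and attribute are assumptions that may change over time):

```r
# Hypothetical sketch: read each place name from the aria-label attribute
# of the a.hfpxzc link elements listed in the results panel
name_elements <- remDr$findElements(using = 'css selector', 'a.hfpxzc')
names <- lapply(name_elements, function(x) x$getElementAttribute("aria-label")[[1]])
```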
Answer 1
Score: 7
*I see that you updated your question to include a Python answer, so here's how it's done in Python. You can use the same method for R.*
The page is *lazy loaded*, which means that as you scroll, the data is paginated and loaded.
So what you need to do is keep finding the *last* HTML tag of the loaded data, which will in turn load more content.
### Finding how more data is loaded
You need to find out how the data is loaded. Here's what I did:
First, disable internet access for your browser in the Network tab (F12 -> Network -> Offline).
Then, scroll to the last loaded element and you will see a loading indicator (since there is no internet access, it will just hang).
Now comes the important part: find out which HTML tag this loading indicator sits under:
(screenshot: the DevTools Elements panel with the loading-indicator element highlighted)
As you can see, that element is under the `div.qjESne` CSS selector.
### Working with Selenium
You can call the JavaScript `scrollIntoView()` function, which scrolls a particular element into view within the browser's viewport.
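For anyone following along in R, a rough RSelenium equivalent of that call might look like the sketch below (assuming `remDr` is the RSelenium client from the question and `webElem` is a hypothetical element returned by `findElements()`):

```r
# Scroll a given element into view inside the browser's viewport
remDr$executeScript("arguments[0].scrollIntoView(true);", list(webElem))
```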
##### Finding out when to break
To find out when to stop scrolling, we need to find out what element appears when there is no more data to load.
If you scroll until there are no more results, you will see:
(screenshot: the "You've reached the end of the list." message)
which is an element under the CSS selector `span.HlvSq`.
### Code examples
##### Scrolling the page
```py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
URL = "https://www.google.com/maps/search/Restaurants/@40.7256843,-74.1138399,14z/data=!4m8!2m7!3m5!1sRestaurants!2s40.7256456,-74.0909442!4m2!1d-74.0909442!2d40.7256456!6e5?entry=ttu"
driver = webdriver.Chrome()
driver.get(URL)
# Waits 10 seconds for the elements to load before scrolling
wait = WebDriverWait(driver, 10)
elements = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
)

while True:
    new_elements = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
    )

    # Pick the last element in the list - this is the one we want to scroll to
    last_element = elements[-1]

    # Scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView(true);", last_element)

    # Update the elements list
    elements = new_elements

    # Check if there are any new elements loaded - the "You've reached the end of the list." message
    if driver.find_elements(By.CSS_SELECTOR, "span.HlvSq"):
        print("No more elements")
        break
```
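Since the question itself is in R, here is a rough, untested RSelenium translation of the scrolling loop above. It reuses `remDr` from the question; the `div.qjESne` and `span.HlvSq` selectors are taken from this answer and may change over time:

```r
elements <- remDr$findElements(using = "css selector", "div.qjESne")

while (TRUE) {
  new_elements <- remDr$findElements(using = "css selector", "div.qjESne")

  # Scroll the last known result into view to trigger loading of the next batch
  last_element <- elements[[length(elements)]]
  remDr$executeScript("arguments[0].scrollIntoView(true);", list(last_element))

  # Update the elements list and give the lazy loader a moment to catch up
  elements <- new_elements
  Sys.sleep(1)

  # The "You've reached the end of the list." marker means nothing more will load
  if (length(remDr$findElements(using = "css selector", "span.HlvSq")) > 0) {
    message("No more elements")
    break
  }
}
```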
Getting the data
If you inspect the page, you will see that the data is under the cards under the CSS selector of div.lI9IFe
.
What you need to do, is wait until the scrolling has finished, and then you get all the data under the CSS selector of div.lI9IFe
Code example
```py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

URL = "https://www.google.com/maps/search/Restaurants/@40.7256843,-74.1138399,14z/data=!4m8!2m7!3m5!1sRestaurants!2s40.7256456,-74.0909442!4m2!1d-74.0909442!2d40.7256456!6e5?entry=ttu"

driver = webdriver.Chrome()
driver.get(URL)

# Waits 10 seconds for the elements to load before scrolling
wait = WebDriverWait(driver, 10)
elements = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
)

titles = []
links = []
addresses = []

while True:
    new_elements = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
    )

    # Pick the last element in the list - this is the one we want to scroll to
    last_element = elements[-1]

    # Scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView(true);", last_element)

    # Update the elements list
    elements = new_elements

    # time.sleep(0.1)

    # Check if there are any new elements loaded - the "You've reached the end of the list." message
    if driver.find_elements(By.CSS_SELECTOR, "span.HlvSq"):
        # now we can parse the data since all the elements loaded
        for data in driver.find_elements(By.CSS_SELECTOR, "div.lI9IFe"):
            title = data.find_element(
                By.CSS_SELECTOR, "div.qBF1Pd.fontHeadlineSmall"
            ).text
            restaurant = data.find_element(
                By.CSS_SELECTOR, ".W4Efsd > span:nth-of-type(2)"
            ).text

            titles.append(title)
            addresses.append(restaurant)

        # This converts the list of titles and links into a dataframe
        df = pd.DataFrame(list(zip(titles, addresses)), columns=["title", "addresses"])

        print(df)
        break
```
Prints:
```
     title                          addresses
0    Domino's Pizza                 · 741 Communipaw Ave A
1    Tommy's Family Restaurant      · 349 Central Ave
2    VIP RESTAURANT LLC BARSHAY'S   · 175 Sip Ave
3    The Hutton Restaurant and Bar  · 225 Hutton St
4    Barge Inn                      · 324 3rd St
..   ...                            ...
116  Bettie's Restaurant            · 579 West Side Ave
117  Mahboob-E-El Ahi               · 580 Montgomery St
118  Samosa Paradise                · 804 Newark Ave
119  TACO DRIVE                     · 195 Newark Ave
120  Two Boots Pizza                · 133 Newark Ave

[121 rows x 2 columns]
```
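The code above stops at names and addresses. If longitude/latitude are also needed, one possible (unverified) approach is to parse them out of each result link's `href`, assuming those place URLs still embed coordinates as `!3d<lat>!4d<lng>`; a hedged R sketch:

```r
# Unverified sketch: extract lat/lng from the href of each a.hfpxzc result link,
# assuming the URL still contains a "!3d<lat>!4d<lng>" segment
link_elements <- remDr$findElements(using = "css selector", "a.hfpxzc")
hrefs <- sapply(link_elements, function(x) x$getElementAttribute("href")[[1]])

coords <- regmatches(hrefs, regexec("!3d(-?[0-9.]+)!4d(-?[0-9.]+)", hrefs))
latitude  <- sapply(coords, function(m) if (length(m) == 3) as.numeric(m[2]) else NA)
longitude <- sapply(coords, function(m) if (length(m) == 3) as.numeric(m[3]) else NA)
```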
Answer 2
Score: 4
That is already a good start. I can name a few things I did to proceed, but note that I mainly worked with Python.
For locating elements within the DOM tree I suggest using XPath. It has a human-readable syntax and is quite easy to learn.
Here you can find an overview of all the ways to locate elements, along with a linked test bench by "Whitebeam.org" to practice on. It also helps with understanding how to extract names.
It will look something like this:

```r
# Returns an object for the given XPath expression
restaurant_adr <- remDr$findElement(using = 'xpath', "//*/*[@class='W4Efsd']")
```
On this object we need to reference the desired attribute, probably `.text()` (I am not sure about the exact syntax in R):

restaurant_adr.text()
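For what it is worth, in RSelenium the element text is usually read with `getElementText()` rather than a `.text()` method; a small sketch reusing the XPath above:

```r
restaurant_adr <- remDr$findElement(using = 'xpath', "//*/*[@class='W4Efsd']")
restaurant_adr$getElementText()[[1]]
```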
To scroll, there is https://www.selenium.dev/documentation/webdriver/actions_api/wheel/, but it has no documentation for R.
Or you could use JavaScript for scrolling:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
https://cran.r-project.org/web/packages/js/vignettes/intro.html
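If the goal is simply to run that JavaScript inside the Selenium-controlled browser from R, a minimal sketch would be the following (note that on Google Maps the results sit in a scrollable side panel, so scrolling the window itself may not load more results):

```r
# Execute the same window scroll through the RSelenium client
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
```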
Helpful resources:
- https://statsandr.com/blog/web-scraping-in-r/
- https://betterdatascience.com/r-web-scraping/
- https://scrapfly.io/blog/web-scraping-with-r/#http-clients-crul