Error message <selenium.common.exceptions.InvalidSelectorException> when extracting information from a website using Selenium WebDriver

Question
This website https://findmasa.com/city/los-angeles/ contains many murals. I want to use Python to extract information from the subpages that open when clicking the address button, such as https://findmasa.com/view/map#b1cc410b. The information I want to get includes mural id, artist, address, city, latitude, longitude, and link.
When I run the code below, it works for the first four subpages but stops at the fifth sublink, https://findmasa.com/view/map#1456a64a, with the error message selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified (Session info: chrome=114.0.5735.199). Can anyone help me identify the problem and provide a solution? Thank you.
from requests_html import HTMLSession
import warnings
import csv
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC

warnings.filterwarnings("ignore", category=DeprecationWarning)  ## ignore the DeprecationWarning messages

s = HTMLSession()

## define a function to get mural links from different categories
def get_mural_links(page):
    url = f'https://findmasa.com/city/los-angeles/{page}'
    links = []
    r = s.get(url)
    artworks = r.html.find('ul.list-works-cards div.top p')
    for item in artworks:
        links.append(item.find('a', first=True).attrs['href'])
    return links

## define a function to extract the information of interest from a single link
def parse_mural(url):
    ## get the mural id (the URL fragment after '#')
    spl = '#'
    id = url.partition(spl)[2]
    ## create a Chrome driver instance
    driver = Chrome()
    driver.get(url)
    ## wait for the li element to be present on the page
    li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li#{id}')))
    data_lat = li_element.get_attribute('data-lat')
    data_lng = li_element.get_attribute('data-lng')
    city = li_element.find_elements(By.TAG_NAME, 'p')[2].text
    link = url
    try:
        artist = li_element.find_element(By.TAG_NAME, 'a').text
    except Exception:
        artist = 'No Data'
    try:
        address = li_element.find_elements(By.TAG_NAME, 'p')[1].text
    except Exception:
        address = 'No Data'
    info = {
        'ID': id,
        'ARTIST': artist,
        'LOCATION': address,
        'CITY': city,
        'LATITUDE': data_lat,
        'LONGITUDE': data_lng,
        'LINK': link,
    }
    driver.quit()  ## close the browser window opened for this mural
    return info

## define a function to save the results to a csv file
def save_csv(results):
    keys = results[0].keys()
    with open('LAmural_MASA.csv', 'w', newline='') as f:  ## newline='' removes the blank rows between murals
        dict_writer = csv.DictWriter(f, keys)
        dict_writer.writeheader()
        dict_writer.writerows(results)

## define the main function for this file to export the results
def main():
    results = []
    for x in range(1, 3):
        urls = get_mural_links(x)
        for url in urls:  ## iterate over the links themselves, not their indices
            results.append(parse_mural(url))
    save_csv(results)

## won't run when imported into other files
if __name__ == '__main__':
    main()
Answer 1
Score: 1
As I've answered here, to fix the InvalidSelectorException that you're getting for some URLs, or more precisely for some id values, use the notation li[id="id_value"] instead of li#id_value. A CSS identifier may not begin with a digit, so an id such as 1456a64a turns li#1456a64a into an invalid selector, while the attribute form accepts any id string; that is why the first four subpages (with ids like b1cc410b) worked and the fifth failed.
Use this:
li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li[id="{id}"]')))
Instead of:
li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li#{id}')))
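For a quick standalone check, here is a minimal sketch (assuming a local Chrome/chromedriver setup and that the subpage is reachable) that applies the attribute-selector fix to the sublink that previously failed:

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC

## the fragment id 1456a64a starts with a digit, so li#1456a64a is not a valid CSS selector
url = 'https://findmasa.com/view/map#1456a64a'
id = url.partition('#')[2]

driver = Chrome()
driver.get(url)
## the attribute form matches the same element without the CSS-identifier restriction
li_element = WebDriverWait(driver, 60).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, f'li[id="{id}"]')))
print(li_element.get_attribute('data-lat'), li_element.get_attribute('data-lng'))
driver.quit()

The locator (By.ID, id) should also work here, since Selenium 4 translates By.ID into the same attribute-style CSS selector under the hood.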