July 20, 2023 10:18:38

Error message <selenium.common.exceptions.InvalidSelectorException> when extracting information from a website using Selenium WebDriver

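Question

The website https://findmasa.com/city/los-angeles/ lists many murals. I want to use Python to extract information from the subpages that pop up when clicking the address button, such as https://findmasa.com/view/map#b1cc410b. The information I want includes the mural ID, artist, address, city, latitude, longitude, and link.

When I run the code below, it works for the first four subpages but stops at the fifth, https://findmasa.com/view/map#1456a64a, with the error message selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified (Session info: chrome=114.0.5735.199). Can anyone help me identify the problem and provide a solution? Thank you.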

from requests_html import HTMLSession
import warnings
import csv
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC
warnings.filterwarnings("ignore", category=DeprecationWarning)  # ignore DeprecationWarning messages
s = HTMLSession()
# Define a function to get the mural links from a listing page
def get_mural_links(page):
    url = f'https://findmasa.com/city/los-angeles/{page}'
    links = []
    r = s.get(url)
    artworks = r.html.find('ul.list-works-cards div.top p')
    for item in artworks:
        links.append(item.find('a', first=True).attrs['href'])
    return links
# Define a function to extract the information of interest from a mural link
def parse_mural(url):
    # Get the mural ID (the part of the URL after '#')
    spl = '#'
    id = url.partition(spl)[2]
    # Create a Chrome driver instance
    driver = Chrome()
    driver.get(url)
    # Wait for the li element to be present on the page
    li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li#{id}')))
    data_lat = li_element.get_attribute('data-lat')
    data_lng = li_element.get_attribute('data-lng')
    city = li_element.find_elements(By.TAG_NAME, 'p')[2].text
    link = url
    try:
        artist = li_element.find_element(By.TAG_NAME, 'a').text
    except:
        artist = 'No Data'
    try:
        address = li_element.find_elements(By.TAG_NAME, 'p')[1].text
    except:
        address = 'No Data'
    info = {
        'ID': id,
        'ARTIST': artist,
        'LOCATION': address,
        'CITY': city,
        'LATITUDE': data_lat,
        'LONGITUDE': data_lng,
        'LINK': link,
    }
    return info
# Define a function to save the results to a CSV file
def save_csv(results):
    keys = results[0].keys()
    with open('LAmural_MASA.csv', 'w', newline='') as f:  # newline='' avoids blank rows between murals
        dict_writer = csv.DictWriter(f, keys)
        dict_writer.writeheader()
        dict_writer.writerows(results)
# Define the main function that exports the results
def main():
    results = []
    for x in range(1, 3):
        urls = get_mural_links(x)
        for url in urls:
            results.append(parse_mural(url))
            save_csv(results)
if __name__ == '__main__':
    main()


Answer 1

Score: 1

As I've answered here,

To fix the InvalidSelectorException that you get for some URLs (or, more precisely, for some ID values), use the notation li[id="id_value"] instead of li#id_value.

Use this:

li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li[id="{id}"]')))

Instead of:

li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li#{id}')))
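
For context on why only some IDs fail: in CSS, # must be followed by a valid identifier, and an unescaped identifier cannot start with a digit. The ID 1456a64a starts with a digit, so li#1456a64a is rejected, while b1cc410b happens to be fine. An attribute selector takes a quoted string instead, so it works for either. The sketch below is only an illustration of that difference. The helper names li_selector and is_plain_css_identifier are hypothetical (they are not part of Selenium), and the identifier check is a simplified, ASCII-only approximation of the CSS rule.

```python
import re


def is_plain_css_identifier(value: str) -> bool:
    """Rough check: can `value` follow '#' in a CSS selector without escaping?

    Simplified, ASCII-only approximation of the CSS identifier grammar:
    an optional leading hyphen, then a letter or underscore, then letters,
    digits, underscores, or hyphens.
    """
    return re.fullmatch(r"-?[A-Za-z_][A-Za-z0-9_-]*", value) is not None


def li_selector(element_id: str) -> str:
    """Build an attribute selector for <li id="..."> that accepts any id."""
    # Escape backslashes and double quotes so the id stays inside the CSS string.
    escaped = element_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'li[id="{escaped}"]'


if __name__ == "__main__":
    for mural_id in ("b1cc410b", "1456a64a"):
        print(mural_id, is_plain_css_identifier(mural_id), li_selector(mural_id))
```

In parse_mural, the wait could then be written as EC.presence_of_element_located((By.CSS_SELECTOR, li_selector(id))), which behaves the same as the f-string in the answer for the IDs shown here.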

