Error message <selenium.common.exceptions.InvalidSelectorException> when extracting information from a website using Selenium WebDriver

Question

The website https://findmasa.com/city/los-angeles/ lists many murals. I want to use Python to extract information from the subpages that pop up when clicking the address button, such as https://findmasa.com/view/map#b1cc410b. The information I want includes the mural id, artist, address, city, latitude, longitude, and link.

When I run the code below, it works for the first four subpages but stops at the fifth, https://findmasa.com/view/map#1456a64a, with the error message selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified (Session info: chrome=114.0.5735.199). Can anyone help me identify the problem and provide a solution? Thank you.

from requests_html import HTMLSession
import warnings
import csv

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC

warnings.filterwarnings("ignore", category=DeprecationWarning)  ## ignore the Deprecation warning message

s = HTMLSession()

## define a function to get mural links from different categories
def get_mural_links(page):
    url = f'https://findmasa.com/city/los-angeles/{page}'
    links = []
    r = s.get(url)
    artworks = r.html.find('ul.list-works-cards div.top p')
    for item in artworks:
        links.append(item.find('a', first=True).attrs['href'])
    return links

## define a function to get interested info from a list of links
def parse_mural(url):
    ## get mural id (the part of the URL after '#')
    spl = '#'
    id = url.partition(spl)[2]

    ## create a Chrome driver instance
    driver = Chrome()
    driver.get(url)

    # wait for the li element to be present on the page
    li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li#{id}')))

    data_lat = li_element.get_attribute('data-lat')
    data_lng = li_element.get_attribute('data-lng')
    city = li_element.find_elements(By.TAG_NAME, 'p')[2].text
    link = url

    try:
        artist = li_element.find_element(By.TAG_NAME, 'a').text
    except:
        artist = 'No Data'

    try:
        address = li_element.find_elements(By.TAG_NAME, 'p')[1].text
    except:
        address = 'No Data'

    info = {
        'ID': id,
        'ARTIST': artist,
        'LOCATION': address,
        'CITY': city,
        'LATITUDE': data_lat,
        'LONGITUDE': data_lng,
        'LINK': link,
    }
    return info

## define a function to save the results to a csv file
def save_csv(results):
    keys = results[0].keys()

    with open('LAmural_MASA.csv', 'w', newline='') as f:  ## newline='' helps remove the blank rows in b/t each mural
        dict_writer = csv.DictWriter(f, keys)
        dict_writer.writeheader()
        dict_writer.writerows(results)

## define the main function for this file to export results
def main():
    results = []
    for x in range(1, 3):
        urls = get_mural_links(x)
        for url in range(len(urls)):
            results.append(parse_mural(urls[url]))
    save_csv(results)

## won't run when imported into other files
if __name__ == '__main__':
    main()

Answer 1

Score: 1

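To fix the InvalidSelectorException that you're getting for some URLs (or, more precisely, for some id values), use the attribute-selector notation li[id="id_value"] instead of li#id_value. A CSS ID selector written with # must be followed by a valid CSS identifier, and an identifier cannot start with a digit unless it is escaped. That is why li#b1cc410b works while li#1456a64a is rejected as an invalid selector; the quoted attribute form accepts any id value.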

Use this:

li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li[id="{id}"]')))

Instead of:

li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li#{id}')))
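
If it helps to see the difference between the two notations without starting a browser, here is a minimal sketch in plain Python. The helper names hash_selector_is_safe and li_selector are hypothetical (they are not part of Selenium), and the regular expression is a deliberately simplified stand-in for the CSS identifier grammar: it only accepts ASCII identifiers, so it may reject some ids that a browser would accept.

```python
import re

# Simplified assumption, not the full CSS grammar: an unescaped ID selector
# such as li#value needs `value` to look like a CSS identifier, and an
# identifier cannot begin with a digit.
_SIMPLE_IDENT = re.compile(r'-?[A-Za-z_][A-Za-z0-9_-]*')


def hash_selector_is_safe(element_id: str) -> bool:
    """Hypothetical helper: True if f'li#{element_id}' should parse as CSS."""
    return _SIMPLE_IDENT.fullmatch(element_id) is not None


def li_selector(element_id: str) -> str:
    """Hypothetical helper: build the attribute selector li[id="..."]."""
    # Escape backslashes and double quotes so the value stays inside the quotes.
    escaped = element_id.replace('\\', '\\\\').replace('"', '\\"')
    return f'li[id="{escaped}"]'


# The two ids from the question: the first starts with a letter,
# the second starts with a digit.
for mural_id in ('b1cc410b', '1456a64a'):
    print(mural_id, hash_selector_is_safe(mural_id), li_selector(mural_id))
```

The string returned by li_selector(id) can be passed to By.CSS_SELECTOR in place of f'li#{id}'; the quote escaping is only a precaution for ids that contain unusual characters.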

huangapple
  • Posted on July 20, 2023, 10:18:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76726273.html