Error message <selenium.common.exceptions.InvalidSelectorException> when extracting information from a website using Selenium WebDriver


Question


This website, https://findmasa.com/city/los-angeles/, contains many murals. I want to use Python to extract information from the subpages that pop up when clicking the address button, such as https://findmasa.com/view/map#b1cc410b. The information I want includes the mural id, artist, address, city, latitude, longitude, and link.

When I run the code below, it works for the first four subpages but stops at the fifth, https://findmasa.com/view/map#1456a64a, with the error message selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified (Session info: chrome=114.0.5735.199). Can anyone help me identify the problem and provide a solution? Thank you.

    from requests_html import HTMLSession
    import warnings
    import csv
    from selenium.webdriver import Chrome
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    import selenium.webdriver.support.expected_conditions as EC

    warnings.filterwarnings("ignore", category=DeprecationWarning)  # ignore the Deprecation warning message

    s = HTMLSession()

    # define a function to get mural links from different categories
    def get_mural_links(page):
        url = f'https://findmasa.com/city/los-angeles/{page}'
        links = []
        r = s.get(url)
        artworks = r.html.find('ul.list-works-cards div.top p')
        for item in artworks:
            links.append(item.find('a', first=True).attrs['href'])
        return links

    # define a function to get the info of interest from a mural link
    def parse_mural(url):
        # get mural id (the part of the URL after '#')
        spl = '#'
        id = url.partition(spl)[2]

        # create a Chrome driver instance
        driver = Chrome()
        driver.get(url)

        # wait for the li element to be present on the page
        li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li#{id}')))

        data_lat = li_element.get_attribute('data-lat')
        data_lng = li_element.get_attribute('data-lng')
        city = li_element.find_elements(By.TAG_NAME, 'p')[2].text
        link = url

        try:
            artist = li_element.find_element(By.TAG_NAME, 'a').text
        except:
            artist = 'No Data'

        try:
            address = li_element.find_elements(By.TAG_NAME, 'p')[1].text
        except:
            address = 'No Data'

        info = {
            'ID': id,
            'ARTIST': artist,
            'LOCATION': address,
            'CITY': city,
            'LATITUDE': data_lat,
            'LONGITUDE': data_lng,
            'LINK': link,
        }
        return info

    # define a function to save the results to a csv file
    def save_csv(results):
        keys = results[0].keys()
        with open('LAmural_MASA.csv', 'w', newline='') as f:  # newline='' removes the blank rows between murals
            dict_writer = csv.DictWriter(f, keys)
            dict_writer.writeheader()
            dict_writer.writerows(results)

    # define the main function for this file to export results
    def main():
        results = []
        for x in range(1, 3):
            urls = get_mural_links(x)
            for url in range(len(urls)):
                results.append(parse_mural(urls[url]))
        save_csv(results)

    # won't run when imported into other files
    if __name__ == '__main__':
        main()

Answer 1

Score: 1


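As I've answered elsewhere, to fix the InvalidSelectorException that you get for some URLs (more precisely, for some id values), use the notation li[id="id_value"] instead of li#id_value. The # form requires a valid CSS identifier, which cannot begin with an unescaped digit, so an id such as 1456a64a is rejected while b1cc410b is accepted. The attribute form takes the id as a quoted string and has no such restriction.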
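To check which links are affected before launching a browser, the id can be pulled out of each URL and tested for a leading digit, since an unescaped digit at the start is what makes a `#id` selector invalid. The following is a minimal sketch using only the standard library; the helper names `mural_id` and `starts_with_digit` are made up for illustration, and the two URLs are the ones quoted in the question.

```python
from urllib.parse import urlsplit


def mural_id(url: str) -> str:
    """Return the part of the URL after '#', which the site uses as the li id."""
    return urlsplit(url).fragment


def starts_with_digit(element_id: str) -> bool:
    """True if the id begins with a digit, which a plain '#id' CSS selector rejects."""
    return element_id[:1].isdigit()


# The two URLs quoted in the question: the first works, the second fails.
for url in (
    "https://findmasa.com/view/map#b1cc410b",
    "https://findmasa.com/view/map#1456a64a",
):
    element_id = mural_id(url)
    verdict = "needs li[id=\"...\"]" if starts_with_digit(element_id) else "li#... is fine"
    print(f"{element_id}: {verdict}")
```

This only covers the leading-digit case seen here; other characters that are not allowed in a CSS identifier would need the same treatment.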

Use this:

    li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li[id="{id}"]')))

Instead of:

    li_element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'li#{id}')))
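If the selector is built in more than one place, it may be convenient to wrap the attribute form in a small helper. The sketch below is one possible way to do that; `li_by_id_selector` is a hypothetical name, not part of Selenium, and it assumes ids contain no newlines. It also backslash-escapes quotes and backslashes so the id stays a valid CSS string, which the ids in the question do not need but which costs nothing.

```python
def li_by_id_selector(element_id: str) -> str:
    """Build a CSS selector matching <li id="..."> for any id value.

    The id is placed inside a quoted attribute value, so it does not have to
    be a valid CSS identifier (ids starting with a digit work).
    """
    # Escape backslashes first, then double quotes, for use inside a CSS string.
    escaped = element_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'li[id="{escaped}"]'


print(li_by_id_selector("1456a64a"))
print(li_by_id_selector("b1cc410b"))
```

The returned string can then be passed as the second item of the locator tuple, for example `(By.CSS_SELECTOR, li_by_id_selector(id))`, in place of the f-string in the question's `parse_mural`.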

huangapple
  • Posted on July 20, 2023 at 10:18:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76726273.html