Why is nothing printed when I use Selenium to scrape Letterboxd?

Question

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def search_letterboxd_by_genre(genre):
        url = f"https://letterboxd.com/films/genre/{genre}/"

        # Set up the Selenium webdriver
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # Run browser in headless mode (without GUI)
        driver = webdriver.Chrome(options=options)
        driver.get(url)

        # Wait for the page to load
        driver.implicitly_wait(5)

        # Find all movie elements with the class name "film-poster"
        movie_elements = driver.find_elements(By.CSS_SELECTOR, "frame")
        if movie_elements:
            for movie_element in movie_elements:
                try:
                    movie_title = movie_element.find_element(By.TAG_NAME, "img").get_attribute("alt")
                    print(movie_title)
                except:
                    print("Error extracting movie title.")
        else:
            print("No movies found for the given genre.")

        # Close the browser
        driver.quit()

This is the code. (I do call the function and have an input statement; I just haven't copied them here.)

At first I was using Beautiful Soup, but I read somewhere that Selenium can handle the JavaScript so that the movie titles would print. It is still not working.

Answer 1

Score: 0


**Root cause:** After the URL is loaded, a <kbd>Consent</kbd> pop-up appears. You need to dismiss that pop-up first, and only then try to scrape.

Refer to the working code below:

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    url = "https://letterboxd.com/films/genre/action/"

    # Set up the Selenium webdriver
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # Run browser in headless mode (without GUI)
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    driver.maximize_window()

    # Implicit wait: element lookups retry for up to 10 seconds
    driver.implicitly_wait(10)

    # Dismiss the Consent pop-up before scraping
    driver.find_element(By.XPATH, "//p[text()='Consent']").click()
    time.sleep(10)  # Give the film list time to render

    movie_elements = driver.find_elements(By.XPATH, "//div[@id='films-browser-list-container']//li//a//span[1]")
    for movie in movie_elements:
        print(movie.get_attribute("innerText"))

    # Close the browser
    driver.quit()

Console output:

    Everything Everywhere All at Once (2022)
    Spider-Man: Into the Spider-Verse (2018)
    Inception (2010)
    The Dark Knight (2008)
    Spider-Man: No Way Home (2021)
    Avengers: Infinity War (2018)
    Spider-Man: Across the Spider-Verse (2023)
    Avengers: Endgame (2019)
    Baby Driver (2017)
    Black Panther (2018)
    Guardians of the Galaxy (2014)
    Kill Bill: Vol. 1 (2003)
    The Matrix (1999)
    Scott Pilgrim vs. the World (2010)
    Spider-Man: Homecoming (2017)
    Avatar: The Way of Water (2022)
    Thor: Ragnarok (2017)
    The Lord of the Rings: The Fellowship of the Ring (2001)
    Doctor Strange in the Multiverse of Madness (2022)
    Star Wars (1977)
    Top Gun: Maverick (2022)
    Deadpool (2016)
    Spider-Man: Far from Home (2019)
    Mad Max: Fury Road (2015)
    Dunkirk (2017)
    The Avengers (2012)
    Avatar (2009)
    Guardians of the Galaxy Vol. 3 (2023)
    Puss in Boots: The Last Wish (2022)
    Guardians of the Galaxy Vol. 2 (2017)
    The Empire Strikes Back (1980)
    Tenet (2020)
    Bullet Train (2022)
    Doctor Strange (2016)
    Iron Man (2008)
    Captain America: Civil War (2016)
    Spider-Man (2002)
    Star Wars: The Force Awakens (2015)
    The Incredibles (2004)
    The Lord of the Rings: The Return of the King (2003)
    Captain America: The Winter Soldier (2014)
    The Dark Knight Rises (2012)
    John Wick (2014)
    Thor: Love and Thunder (2022)
    The Lord of the Rings: The Two Towers (2002)
    Spider-Man 2 (2004)
    Batman Begins (2005)
    Avengers: Age of Ultron (2015)
    Star Wars: The Last Jedi (2017)
    Black Widow (2021)
    The Suicide Squad (2021)
    Shang-Chi and the Legend of the Ten Rings (2021)
    Captain Marvel (2019)
    Return of the Jedi (1983)
    Rogue One: A Star Wars Story (2016)
    Ant-Man (2015)
    Black Panther: Wakanda Forever (2022)
    Logan (2017)
    Star Wars: Episode III – Revenge of the Sith (2005)
    Captain America: The First Avenger (2011)
    Oldboy (2003)
    The Northman (2022)
    Léon: The Professional (1994)
    Star Wars: The Rise of Skywalker (2019)
    The Amazing Spider-Man (2012)
    Kill Bill: Vol. 2 (2004)
    Raiders of the Lost Ark (1981)
    Star Wars: Episode I The Phantom Menace (1999)
    Deadpool 2 (2018)
    Iron Man 3 (2013)
    Eternals (2021)
    Scarface (1983)

    Process finished with exit code 0
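
As a possible refinement of the approach above, the fixed `time.sleep(10)` could be replaced with Selenium's explicit waits, so the script continues as soon as the Consent button or the film list is available. The following is only a sketch under assumptions: the function names `extract_titles` and `scrape_genre` and the two constants are hypothetical, the XPath expressions are copied from the answer and depend on the page markup at the time it was written, and the Consent pop-up is treated as optional in case it does not appear for every visitor.

```python
# XPath expressions copied from the answer above; they may break if the site changes.
CONSENT_XPATH = "//p[text()='Consent']"
TITLE_XPATH = "//div[@id='films-browser-list-container']//li//a//span[1]"


def extract_titles(elements):
    """Return the stripped, non-empty innerText of each element."""
    titles = []
    for element in elements:
        text = (element.get_attribute("innerText") or "").strip()
        if text:
            titles.append(text)
    return titles


def scrape_genre(genre, timeout=10):
    """Hypothetical helper: open a genre page, dismiss Consent, return titles."""
    # Selenium is imported here so extract_titles can be used without it.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(f"https://letterboxd.com/films/genre/{genre}/")
        wait = WebDriverWait(driver, timeout)

        # Click the Consent button if it shows up within the timeout.
        try:
            wait.until(
                EC.element_to_be_clickable((By.XPATH, CONSENT_XPATH))
            ).click()
        except TimeoutException:
            pass  # Assumption: no pop-up was shown, so carry on.

        # Wait until at least one title element is present in the DOM.
        try:
            elements = wait.until(
                EC.presence_of_all_elements_located((By.XPATH, TITLE_XPATH))
            )
        except TimeoutException:
            return []  # Nothing matched within the timeout.
        return extract_titles(elements)
    finally:
        driver.quit()  # Always close the browser, even on errors.
```

Calling `scrape_genre("action")` and printing each item would be the equivalent of the loop in the answer. Separately, note that in the question `By.CSS_SELECTOR, "frame"` selects `<frame>` tags rather than a class; a class selector is written with a leading dot, for example `.film-poster`, although whether that class is present on the page is not confirmed here.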

huangapple
  • Posted on August 8, 2023 at 20:47:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76859726.html