Why isn't anything printed when I use Selenium to scrape Letterboxd?
Question
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def search_letterboxd_by_genre(genre):
        url = f"https://letterboxd.com/films/genre/{genre}/"
        # Set up the Selenium webdriver
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # Run browser in headless mode (without GUI)
        driver = webdriver.Chrome(options=options)
        driver.get(url)
        # Wait for the page to load
        driver.implicitly_wait(5)
        # Find all movie elements with the class name "film-poster"
        movie_elements = driver.find_elements(By.CSS_SELECTOR, "frame")
        if movie_elements:
            for movie_element in movie_elements:
                try:
                    movie_title = movie_element.find_element(By.TAG_NAME, "img").get_attribute("alt")
                    print(movie_title)
                except:
                    print("Error extracting movie title.")
        else:
            print("No movies found for the given genre.")
        # Close the browser
        driver.quit()
This is the code (I have called the function and added an input statement, but just haven't copied them here).
At first I was using Beautiful Soup, but I read somewhere that Selenium can handle JavaScript-rendered pages, so the movie titles should print. It is still not working.
Answer 1
Score: 0
Root cause: After the URL is opened, a <kbd>Consent</kbd> pop-up appears. You need to dismiss that pop-up first, and only then try to scrape.
Refer to the working code below:
    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    url = "https://letterboxd.com/films/genre/action/"

    # Set up the Selenium webdriver
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # Run browser in headless mode (without GUI)
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    driver.maximize_window()

    # Implicit wait: element lookups keep retrying for up to 10 seconds
    driver.implicitly_wait(10)

    # Dismiss the Consent pop-up before scraping
    driver.find_element(By.XPATH, "//p[text()='Consent']").click()
    time.sleep(10)

    # Collect the first span inside each film link and print its text
    movie_elements = driver.find_elements(By.XPATH, "//div[@id='films-browser-list-container']//li//a//span[1]")
    for movie in movie_elements:
        print(movie.get_attribute("innerText"))

    # Close the browser
    driver.quit()
Console output:
Everything Everywhere All at Once (2022)
Spider-Man: Into the Spider-Verse (2018)
Inception (2010)
The Dark Knight (2008)
Spider-Man: No Way Home (2021)
Avengers: Infinity War (2018)
Spider-Man: Across the Spider-Verse (2023)
Avengers: Endgame (2019)
Baby Driver (2017)
Black Panther (2018)
Guardians of the Galaxy (2014)
Kill Bill: Vol. 1 (2003)
The Matrix (1999)
Scott Pilgrim vs. the World (2010)
Spider-Man: Homecoming (2017)
Avatar: The Way of Water (2022)
Thor: Ragnarok (2017)
The Lord of the Rings: The Fellowship of the Ring (2001)
Doctor Strange in the Multiverse of Madness (2022)
Star Wars (1977)
Top Gun: Maverick (2022)
Deadpool (2016)
Spider-Man: Far from Home (2019)
Mad Max: Fury Road (2015)
Dunkirk (2017)
The Avengers (2012)
Avatar (2009)
Guardians of the Galaxy Vol. 3 (2023)
Puss in Boots: The Last Wish (2022)
Guardians of the Galaxy Vol. 2 (2017)
The Empire Strikes Back (1980)
Tenet (2020)
Bullet Train (2022)
Doctor Strange (2016)
Iron Man (2008)
Captain America: Civil War (2016)
Spider-Man (2002)
Star Wars: The Force Awakens (2015)
The Incredibles (2004)
The Lord of the Rings: The Return of the King (2003)
Captain America: The Winter Soldier (2014)
The Dark Knight Rises (2012)
John Wick (2014)
Thor: Love and Thunder (2022)
The Lord of the Rings: The Two Towers (2002)
Spider-Man 2 (2004)
Batman Begins (2005)
Avengers: Age of Ultron (2015)
Star Wars: The Last Jedi (2017)
Black Widow (2021)
The Suicide Squad (2021)
Shang-Chi and the Legend of the Ten Rings (2021)
Captain Marvel (2019)
Return of the Jedi (1983)
Rogue One: A Star Wars Story (2016)
Ant-Man (2015)
Black Panther: Wakanda Forever (2022)
Logan (2017)
Star Wars: Episode III – Revenge of the Sith (2005)
Captain America: The First Avenger (2011)
Oldboy (2003)
The Northman (2022)
Léon: The Professional (1994)
Star Wars: The Rise of Skywalker (2019)
The Amazing Spider-Man (2012)
Kill Bill: Vol. 2 (2004)
Raiders of the Lost Ark (1981)
Star Wars: Episode I – The Phantom Menace (1999)
Deadpool 2 (2018)
Iron Man 3 (2013)
Eternals (2021)
Scarface (1983)
Process finished with exit code 0