Why is nothing printed when I use Selenium to scrape Letterboxd?

Question

from selenium import webdriver
from selenium.webdriver.common.by import By

def search_letterboxd_by_genre(genre):
    url = f"https://letterboxd.com/films/genre/{genre}/"

    # Set up the Selenium webdriver
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # Run browser in headless mode (without GUI)
    driver = webdriver.Chrome(options=options)

    driver.get(url)

    # Wait for the page to load
    driver.implicitly_wait(5)

    # Find all movie elements with the class name "film-poster"
    movie_elements = driver.find_elements(By.CSS_SELECTOR, "frame")

    if movie_elements:
        for movie_element in movie_elements:
            try:
                movie_title = movie_element.find_element(By.TAG_NAME, "img").get_attribute("alt")
                print(movie_title)
            except:
                print("Error extracting movie title.")
    else:
        print("No movies found for the given genre.")

    # Close the browser
    driver.quit()

This is the code. (I do call the function and read the genre with an input statement; I just haven't copied that part here.)

At first I was using Beautiful Soup, but I read somewhere that Selenium can handle content rendered by JavaScript, so I expected the movie titles to print. It is still not working.

Answer 1

Score: 0

Root cause: after the URL loads, a Consent pop-up appears. You need to dismiss that pop-up first, and only then try to scrape.

Refer to the working code below:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://letterboxd.com/films/genre/action/"

# Set up the Selenium webdriver
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # Run browser in headless mode (without GUI)
driver = webdriver.Chrome(options=options)

driver.get(url)
driver.maximize_window()
# Wait up to 10 seconds for elements to appear before a lookup fails
driver.implicitly_wait(10)
# Dismiss the Consent pop-up before scraping
driver.find_element(By.XPATH, "//p[text()='Consent']").click()
# Give the film list time to load after the pop-up closes
time.sleep(10)
movie_elements = driver.find_elements(By.XPATH, "//div[@id='films-browser-list-container']//li//a//span[1]")

for movie in movie_elements:
    print(movie.get_attribute("innerText"))

# Close the browser
driver.quit()

Console output:

Everything Everywhere All at Once (2022)
Spider-Man: Into the Spider-Verse (2018)
Inception (2010)
The Dark Knight (2008)
Spider-Man: No Way Home (2021)
Avengers: Infinity War (2018)
Spider-Man: Across the Spider-Verse (2023)
Avengers: Endgame (2019)
Baby Driver (2017)
Black Panther (2018)
Guardians of the Galaxy (2014)
Kill Bill: Vol. 1 (2003)
The Matrix (1999)
Scott Pilgrim vs. the World (2010)
Spider-Man: Homecoming (2017)
Avatar: The Way of Water (2022)
Thor: Ragnarok (2017)
The Lord of the Rings: The Fellowship of the Ring (2001)
Doctor Strange in the Multiverse of Madness (2022)
Star Wars (1977)
Top Gun: Maverick (2022)
Deadpool (2016)
Spider-Man: Far from Home (2019)
Mad Max: Fury Road (2015)
Dunkirk (2017)
The Avengers (2012)
Avatar (2009)
Guardians of the Galaxy Vol. 3 (2023)
Puss in Boots: The Last Wish (2022)
Guardians of the Galaxy Vol. 2 (2017)
The Empire Strikes Back (1980)
Tenet (2020)
Bullet Train (2022)
Doctor Strange (2016)
Iron Man (2008)
Captain America: Civil War (2016)
Spider-Man (2002)
Star Wars: The Force Awakens (2015)
The Incredibles (2004)
The Lord of the Rings: The Return of the King (2003)
Captain America: The Winter Soldier (2014)
The Dark Knight Rises (2012)
John Wick (2014)
Thor: Love and Thunder (2022)
The Lord of the Rings: The Two Towers (2002)
Spider-Man 2 (2004)
Batman Begins (2005)
Avengers: Age of Ultron (2015)
Star Wars: The Last Jedi (2017)
Black Widow (2021)
The Suicide Squad (2021)
Shang-Chi and the Legend of the Ten Rings (2021)
Captain Marvel (2019)
Return of the Jedi (1983)
Rogue One: A Star Wars Story (2016)
Ant-Man (2015)
Black Panther: Wakanda Forever (2022)
Logan (2017)
Star Wars: Episode III – Revenge of the Sith (2005)
Captain America: The First Avenger (2011)
Oldboy (2003)
The Northman (2022)
Léon: The Professional (1994)
Star Wars: The Rise of Skywalker (2019)
The Amazing Spider-Man (2012)
Kill Bill: Vol. 2 (2004)
Raiders of the Lost Ark (1981)
Star Wars: Episode I – The Phantom Menace (1999)
Deadpool 2 (2018)
Iron Man 3 (2013)
Eternals (2021)
Scarface (1983)

Process finished with exit code 0
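
As a possible refinement of the answer above, the fixed `time.sleep(10)` can be swapped for explicit waits, and the Consent click can be made optional in case the pop-up is not shown in every session. The following is only a sketch under assumptions: the two XPath locators are copied from the answer and may stop matching if Letterboxd changes its markup, and `build_genre_url`, `clean_titles`, and `scrape_genre` are hypothetical helper names introduced here for illustration.

```python
from typing import Iterable, List, Optional

BASE_URL = "https://letterboxd.com/films/genre/{genre}/"

# Locators copied from the answer above; they may break if the site changes.
CONSENT_XPATH = "//p[text()='Consent']"
TITLES_XPATH = "//div[@id='films-browser-list-container']//li//a//span[1]"


def build_genre_url(genre: str) -> str:
    """Return the genre listing URL, normalising case and surrounding spaces."""
    return BASE_URL.format(genre=genre.strip().lower())


def clean_titles(raw_titles: Iterable[Optional[str]]) -> List[str]:
    """Strip whitespace and drop empty or repeated titles, keeping page order."""
    seen = set()
    titles = []
    for raw in raw_titles:
        title = (raw or "").strip()
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


def scrape_genre(genre: str, timeout: float = 10) -> List[str]:
    """Open the genre page, dismiss the consent pop-up if shown, return titles."""
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(build_genre_url(genre))
        wait = WebDriverWait(driver, timeout)
        try:
            # Click the consent button as soon as it becomes clickable.
            wait.until(EC.element_to_be_clickable((By.XPATH, CONSENT_XPATH))).click()
        except TimeoutException:
            pass  # Assumption: no pop-up was shown in this session.
        try:
            # Wait until at least one title element is present in the DOM.
            elements = wait.until(
                EC.presence_of_all_elements_located((By.XPATH, TITLES_XPATH))
            )
        except TimeoutException:
            return []  # Nothing matched; the locator may be out of date.
        return clean_titles(e.get_attribute("innerText") for e in elements)
    finally:
        driver.quit()  # Always release the browser, even on errors.
```

Calling `scrape_genre("action")` and printing each returned title would stand in for the loop in the answer. The explicit waits return as soon as their condition holds, so the script does not always pay the full ten seconds, and an empty list signals that the locator found nothing rather than failing silently.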

Posted by huangapple on 2023-08-08 20:47:50
Source: https://go.coder-hub.com/76859726.html