Filtering Out Blurry Images During Web Scraping with Selenium and XPath in Python

Question

I am currently working on a web scraping project in Python where I'm trying to scrape regular images from Reddit using Selenium and XPath. I have been successful in avoiding scraping of ads, videos, long images, and the like, but I've run into a problem when trying to exclude blurry images.

I have tried various XPaths to target and exclude these blurry images, but without success. My current solution doesn't raise any errors; it just doesn't behave as intended in this specific scenario, so I end up scraping blurry images that I want to skip.

Does anyone have suggestions or strategies for refining my XPath, or perhaps another approach to solve this problem? I'd appreciate any guidance on this.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import os
import time
import urllib.request

options = Options()
options.add_experimental_option("detach", True)
options.add_argument("--disable-notifications")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

reddit = ['https://www.reddit.com/r/Funnypics/']
scroll_pause_time = 2  # Pause between scrolls; adjust for machine and network speed
media = set()  # Unique image URLs collected so far

def redditscraper(subreddit):
    i = 1
    driver.get(subreddit)
    time.sleep(10)  # Allow 10 seconds for the web page to open
    screen_height = driver.execute_script("return window.screen.height;")  # Screen height in pixels
    start_time = time.time()
    while True:
        # Stop scrolling after 60 seconds
        if time.time() - start_time >= 60:
            break
        # Scroll one screen height each time
        driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
        i += 1
        time.sleep(scroll_pause_time)
        # Update the scroll height after each scroll, as it can change once more posts load
        scroll_height = driver.execute_script("return document.body.scrollHeight;")
        # Find image elements, excluding ads, videos, and long images
        images = driver.find_elements(By.XPATH, "//div[contains(@class, '_3Oa0THmZ3f5iZXAQ0hBJ0k ')] //img[not(ancestor::div[contains(@class, '_1NSbknF8ucHV2abfCZw2Z1 ') and not(ancestor::div[@data-testid='shreddit-player-wrapper']) and .//a[@data-adclicklocation='media'] and not(contains(@class, '_2_tDEnGMLxpM6uOa2kaDB3 ImageBox-image _1XWObl-3b9tPy64oaG6fax _3oBPn1sFwq76ZAxXgwRhhn'))  ])]")
        for img in images:
            src = img.get_attribute('src')
            if src:  # Skip images without a src attribute
                media.add(src)
        # Break the loop when the height we need to scroll to is larger than the total scroll height
        if screen_height * i > scroll_height:
            break

redditscraper(reddit[0])

# Set the directory where the images will be saved
save_dir = "images"
# Create the directory if it does not exist
os.makedirs(save_dir, exist_ok=True)

# Loop over the URLs in the media set and download each one (saved with a .gif extension)
for i, url in enumerate(media):
    filename = os.path.join(save_dir, f"media{i}.gif")
    urllib.request.urlretrieve(url, filename)

print(f"Total Number of Animated GIFs and Videos Stored is {len(media)}")

Answer 1

Score: 1

Blurred images contain "blur" in their src attribute, so you can add the condition below to your XPath:

not(contains(@src, 'blur'))
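
If changing the XPath is inconvenient, the same "blur" check could also be applied in Python once the src values have been collected. The sketch below is only an illustration: `is_blurred` and `keep_sharp` are hypothetical helper names, the URLs are made up, and it assumes (as stated above) that a blurred image always has "blur" somewhere in its src.

```python
from typing import Iterable, Optional, Set


def is_blurred(src: Optional[str], marker: str = "blur") -> bool:
    """Return True when the URL contains the marker substring."""
    return src is not None and marker in src


def keep_sharp(srcs: Iterable[Optional[str]]) -> Set[str]:
    """Drop missing URLs and URLs flagged as blurred; return the rest as a set."""
    return {src for src in srcs if src and not is_blurred(src)}


# Hypothetical URLs, for illustration only.
sample = [
    "https://preview.example.com/a.jpg?blur=40&format=pjpg",
    "https://i.example.com/b.jpg",
    None,  # get_attribute('src') can return None
]
print(keep_sharp(sample))
```

Inside the scraping loop this would be used as `media.update(keep_sharp(img.get_attribute('src') for img in images))`. A plain substring test would also discard a sharp image whose URL merely happens to contain "blur", so the marker may need tightening for other sites.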

Complete XPath:

//div[(contains(@class, '_3Oa0THmZ3f5iZXAQ0hBJ0k'))] //img[not(contains(@src, 'blur')) and not(ancestor::div[contains(@class, '_1NSbknF8ucHV2abfCZw2Z1 ') and not(ancestor::div[@data-testid='shreddit-player-wrapper']) and .//a[@data-adclicklocation='media'] and not(contains(@class, '_2_tDEnGMLxpM6uOa2kaDB3 ImageBox-image _1XWObl-3b9tPy64oaG6fax _3oBPn1sFwq76ZAxXgwRhhn'))])]
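
One way to keep this long expression readable in the scraper is to assemble it from named pieces. This is a rough sketch rather than a tested drop-in: `build_image_xpath` and `collect_image_urls` are hypothetical helper names, the class names are copied from the XPath above, and it assumes `driver` is an already created Selenium WebDriver.

```python
# The three pieces of the complete XPath, copied from the expression above.
CONTAINER = "//div[(contains(@class, '_3Oa0THmZ3f5iZXAQ0hBJ0k'))]"
NOT_BLURRED = "not(contains(@src, 'blur'))"
NOT_EXCLUDED_ANCESTOR = (
    "not(ancestor::div[contains(@class, '_1NSbknF8ucHV2abfCZw2Z1 ')"
    " and not(ancestor::div[@data-testid='shreddit-player-wrapper'])"
    " and .//a[@data-adclicklocation='media']"
    " and not(contains(@class, '_2_tDEnGMLxpM6uOa2kaDB3 ImageBox-image"
    " _1XWObl-3b9tPy64oaG6fax _3oBPn1sFwq76ZAxXgwRhhn'))])"
)


def build_image_xpath() -> str:
    """Join the container, the blur condition, and the ancestor condition."""
    return f"{CONTAINER} //img[{NOT_BLURRED} and {NOT_EXCLUDED_ANCESTOR}]"


def collect_image_urls(driver) -> set:
    """Return the src of every image currently matched by the XPath."""
    # Imported here so that build_image_xpath() stays usable without Selenium.
    from selenium.webdriver.common.by import By

    urls = set()
    for img in driver.find_elements(By.XPATH, build_image_xpath()):
        src = img.get_attribute("src")
        if src:  # skip images that have no src attribute
            urls.add(src)
    return urls
```

In the question's loop, `media.update(collect_image_urls(driver))` would then replace the `find_elements` call and the list comprehension.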

huangapple
  • Posted on May 29, 2023 at 11:15:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76354484.html