May 29, 2023 11:15:10

Filtering Out Blurry Images During Web Scraping with Selenium and XPath in Python

Question

I am currently working on a web scraping project in Python where I'm trying to scrape regular images from Reddit using Selenium and XPath. I have been successful in avoiding scraping of ads, videos, long images, and the like, but I've run into a problem when trying to exclude blurry images.

I have tried various XPath expressions to target and exclude these blurry images, but without success. My current solution doesn't raise any errors; it just doesn't work as intended in this specific scenario, so I'm still scraping blurry images that I would like to avoid.

Does anyone have suggestions or strategies for refining my XPath, or perhaps another approach to solve this problem? I'd appreciate any guidance on this.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import os
import time
import urllib.request

options = Options()
options.add_experimental_option("detach", True)
options.add_argument("--disable-notifications")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

reddit = ['https://www.reddit.com/r/Funnypics/']

time.sleep(10)  # Allow 10 seconds for the browser to open
scroll_pause_time = 2  # Pause between scrolls; adjust to the machine's speed
screen_height = driver.execute_script("return window.screen.height;")  # Screen height in pixels
media = set()

def redditscraper(subreddit):
    i = 1
    start_time = time.time()
    driver.get(subreddit)
    while True:
        # Stop scrolling after 60 seconds
        if time.time() - start_time >= 60:
            break

        # Scroll one screen height each time
        driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
        i += 1
        time.sleep(scroll_pause_time)

        # Update the scroll height after each scroll, as it can change once more posts load
        scroll_height = driver.execute_script("return document.body.scrollHeight;")

        # Find image elements with XPath
        images = driver.find_elements(By.XPATH, "//div[contains(@class, '_3Oa0THmZ3f5iZXAQ0hBJ0k ')] //img[not(ancestor::div[contains(@class, '_1NSbknF8ucHV2abfCZw2Z1 ') and not(ancestor::div[@data-testid='shreddit-player-wrapper']) and .//a[@data-adclicklocation='media'] and not(contains(@class, '_2_tDEnGMLxpM6uOa2kaDB3 ImageBox-image _1XWObl-3b9tPy64oaG6fax _3oBPn1sFwq76ZAxXgwRhhn'))  ])]")
        for img in images:
            src = img.get_attribute('src')
            if src:  # Skip images without a src, which cannot be downloaded
                media.add(src)

        # Break the loop when the height we need to scroll to is larger than the total scroll height
        if screen_height * i > scroll_height:
            break

redditscraper(reddit[0])

# Set the directory where the GIF images will be saved
save_dir = "images"

# Create the directory if it does not exist
os.makedirs(save_dir, exist_ok=True)

# Loop over the elements in the media set and download each GIF image
for i, url in enumerate(media):
    filename = os.path.join(save_dir, f"media{i}.gif")
    urllib.request.urlretrieve(url, filename)

print(f"Total Number of Animated GIFs and Videos Stored is {len(media)}")

Answer 1

Score: 1

Blurred images contain "blur" in their src attribute, so you can add the following condition to your XPath:

not(contains(@src, 'blur'))

Complete XPath:

//div[(contains(@class, '_3Oa0THmZ3f5iZXAQ0hBJ0k'))] //img[not(contains(@src, 'blur')) and not(ancestor::div[contains(@class, '_1NSbknF8ucHV2abfCZw2Z1 ') and not(ancestor::div[@data-testid='shreddit-player-wrapper']) and .//a[@data-adclicklocation='media'] and not(contains(@class, '_2_tDEnGMLxpM6uOa2kaDB3 ImageBox-image _1XWObl-3b9tPy64oaG6fax _3oBPn1sFwq76ZAxXgwRhhn'))])]
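
The same check can also be done in Python once the src values have been collected, which may be easier to debug than one long XPath. The following is only a minimal sketch, assuming blurred previews really do carry "blur" in their src as described in this answer; the helper names is_blurred and keep_sharp and the sample URLs are made up for illustration.

```python
def is_blurred(src, marker="blur"):
    """Return True if the URL contains the marker (case-sensitive, like XPath contains())."""
    return bool(src) and marker in src


def keep_sharp(urls):
    """Return the set of non-empty URLs that do not look like blurred previews."""
    return {url for url in urls if url and not is_blurred(url)}


# Hypothetical src values, as img.get_attribute('src') might return them
collected = [
    "https://example.com/images/a.jpg?blur=40",  # assumed blurred preview
    "https://example.com/images/b.jpg",          # regular image
    None,                                        # element without a src
]

media = keep_sharp(collected)
print(media)
```

In the scraper, this would amount to passing the src values gathered in the loop through keep_sharp before adding them to media.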
