Question

I am using icrawler in a project.
icrawler is a good tool for downloading images from Google search, but how can I get each image's URL?
I noticed that the URL I want is displayed in the terminal.
Terminal output:

2023-05-06 20:25:22,389 - INFO - parser - parsing result page https://www.google.com/search?q=programing+gif&ijn=0&start=0&tbs=&tbm=isch
2023-05-06 20:25:23,579 - INFO - downloader - image #1  https://cdn.videoplasty.com/animation/chill-coding-programming-lo-fi-animation-stock-animation-21874-1024x576.jpg

Answer 1

Score: 1

Following Swifty's comment, here is an approach that hooks into the logging objects icrawler uses to print its messages.

There are different logging objects for the various components of the crawler. The download messages are logged by your_crawler.downloader.logger. You can add a function to process the log messages using some_logger.addFilter(function_name). Note that when a filter function returns a falsy value (including the implicit None), the logger drops the record, so the message is no longer printed. If you still want to see it in the terminal, print it from the function or have the function return True.
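To illustrate how the return value of a filter function decides whether a message is still printed, here is a minimal sketch that uses only the standard logging module. The logger name demo_downloader, the ListHandler class, both filter functions, and the example.com URLs are hypothetical stand-ins, not part of icrawler.

```python
import logging

captured = []  # every message the filter functions have seen


def keep_and_record(record):
    # Store the formatted message, then return True so the record
    # still reaches the logger's handlers.
    captured.append(record.getMessage())
    return True


def record_and_drop(record):
    # No return statement: the implicit None is falsy, so the
    # logger discards the record after this function runs.
    captured.append(record.getMessage())


class ListHandler(logging.Handler):
    """Collects the messages that actually get emitted."""

    def __init__(self):
        super().__init__()
        self.emitted = []

    def emit(self, record):
        self.emitted.append(record.getMessage())


demo_logger = logging.getLogger("demo_downloader")  # stand-in for the downloader logger
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
demo_logger.addHandler(handler)

# First message: the filter returns True, so the handler receives it.
demo_logger.addFilter(keep_and_record)
demo_logger.info("image #%s\t%s", 1, "https://example.com/a.jpg")
demo_logger.removeFilter(keep_and_record)

# Second message: the filter returns None, so the handler never sees it.
demo_logger.addFilter(record_and_drop)
demo_logger.info("image #%s\t%s", 2, "https://example.com/b.jpg")
demo_logger.removeFilter(record_and_drop)
```

Both filters see their message, but only the first message reaches the handler. Returning True is therefore an alternative to re-printing the message by hand.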

The function below uses a regular expression to search for the string image #n followed by a tab and the URL. Each URL found is appended to a list, which is then printed.
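The extraction step can be tried on its own, without running a crawler. The sketch below assumes the downloader message has the form shown in the output of this answer (image #n, a tab, then the URL). The helper name extract_url is hypothetical.

```python
import re

# "image #<number>", a tab character, then the URL up to the end of the line.
URL_PATTERN = re.compile(r"image #\d+\t(.*)")


def extract_url(message):
    """Return the URL from a downloader message, or None if it does not match."""
    match = URL_PATTERN.search(message)
    return match.group(1) if match else None


sample = ("image #1\thttps://ichef.bbci.co.uk/news/320/cpsprodpb/CCDA/"
          "production/_129624425_gettyimages-1487975069.jpg")
url = extract_url(sample)

# A message from another component has no "image #n" part and yields None.
no_url = extract_url("parsing result page http://www.bbc.com/news")
```

If the message format differs in another icrawler version, only URL_PATTERN would need to change.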

In the example I used GreedyImageCrawler because I couldn't get the Google or Bing crawlers to work. GreedyImageCrawler didn't work properly either (it doesn't exit once the maximum number of images has been downloaded), but it does provide a working proof of concept for the URL extraction.

import re
from icrawler.builtin import GreedyImageCrawler

all_urls = []

def checkCrawlURL(log_input):
    # re-print the captured log message; this function returns None,
    # so the logger itself drops the record
    message = log_input.getMessage()
    print("INFO - downloader -", message)
    # extract the url that follows "image #n" and a tab
    res = re.search(r"image #\d+\t(.*)", message)
    if res:
        # add extracted url to list
        all_urls.append(res.group(1))
        print(all_urls)

greedy_crawler = GreedyImageCrawler(storage={'root_dir': 'image_dir'})
# pass download logger messages through function
greedy_crawler.downloader.logger.addFilter(checkCrawlURL)
greedy_crawler.crawl(domains='http://www.bbc.com/news', max_num=10,
                     min_size=None, max_size=None)

Result:

2023-05-06 19:18:23,031 - INFO - icrawler.crawler - start crawling...
2023-05-06 19:18:23,032 - INFO - icrawler.crawler - starting 1 feeder threads...
2023-05-06 19:18:23,032 - INFO - icrawler.crawler - starting 1 parser threads...
2023-05-06 19:18:23,033 - INFO - icrawler.crawler - starting 1 downloader threads...
2023-05-06 19:18:23,187 - INFO - parser - parsing result page http://www.bbc.com/news
INFO - downloader - image #1	https://ichef.bbci.co.uk/news/320/cpsprodpb/CCDA/production/_129624425_gettyimages-1487975069.jpg
['https://ichef.bbci.co.uk/news/320/cpsprodpb/CCDA/production/_129624425_gettyimages-1487975069.jpg']
INFO - downloader - image #2	https://ichef.bbci.co.uk/news/320/cpsprodpb/11A78/production/_129621327_f5b5bd4421299cab6e7587311796a63f8808b7db-1.jpg
['https://ichef.bbci.co.uk/news/320/cpsprodpb/CCDA/production/_129624425_gettyimages-1487975069.jpg', 'https://ichef.bbci.co.uk/news/320/cpsprodpb/11A78/production/_129621327_f5b5bd4421299cab6e7587311796a63f8808b7db-1.jpg']
[...]
