How to get the image source from icrawler

Question

I am using icrawler in a project.
icrawler is a good tool for downloading images from a Google search, but how do I get each image's URL?
I noticed that the URL I want is displayed in the terminal.
Output from the terminal:

2023-05-06 20:25:22,389 - INFO - parser - parsing result page https://www.google.com/search?q=programing+gif&ijn=0&start=0&tbs=&tbm=isch
2023-05-06 20:25:23,579 - INFO - downloader - image #1  https://cdn.videoplasty.com/animation/chill-coding-programming-lo-fi-animation-stock-animation-21874-1024x576.jpg

Answer 1

Score: 1

Following Swifty's comment, here is an approach that hooks into the logging objects icrawler uses to print its messages and filters them.

Each component of the crawler has its own logging object. The download messages are logged by your_crawler.downloader.logger. You can pass every record from a logger through a function by calling some_logger.addFilter(function_name). Note that a filter function which returns nothing (None, which is falsy) causes the logger to drop the record, so if you still want to see the message in the terminal you should either print it from the function or have the function return True.
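The effect of a filter's return value can be checked without icrawler. The following is a minimal sketch using only the standard logging module; the logger name filter_demo, the ListHandler class and both filter functions are made up for illustration and are not part of icrawler.

```python
import logging

captured = []  # every message seen by a filter function


def keep_and_record(record):
    """Record the message, then let the logger emit it as usual."""
    captured.append(record.getMessage())
    return True  # truthy: the record is passed on to the handlers


def record_and_drop(record):
    """Record the message, but suppress the normal output."""
    captured.append(record.getMessage())
    return False  # falsy (like an implicit None): the record is dropped


class ListHandler(logging.Handler):
    """Hypothetical handler that stores emitted messages in a list."""

    def __init__(self):
        super().__init__()
        self.emitted = []

    def emit(self, record):
        self.emitted.append(record.getMessage())


demo_logger = logging.getLogger("filter_demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo away from the root logger
handler = ListHandler()
demo_logger.addHandler(handler)

demo_logger.addFilter(keep_and_record)
demo_logger.info("first message")   # seen by the filter and emitted

demo_logger.removeFilter(keep_and_record)
demo_logger.addFilter(record_and_drop)
demo_logger.info("second message")  # seen by the filter, never emitted

print(captured)
print(handler.emitted)
```

Both messages reach the filter, but only the first one reaches the handler, which is why a filter that returns nothing has to print the message itself.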

The checkCrawlURL function in the icrawler example further down uses a regular expression to search each message for the string image #n followed by the URL. Each URL found is appended to a list, and the list is printed.
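The extraction step can be tried on its own against a message in the format shown in the terminal output. This is only a sketch: the helper name extract_url is made up, and the pattern uses \s+ and \S+ rather than a literal tab so that it also accepts messages where the separator is spaces, which is an assumption about how the messages may be formatted.

```python
import re

# Matches "image #<n>", some whitespace, then captures the URL.
# \s+ accepts both a tab and spaces between the number and the URL.
IMAGE_LINE = re.compile(r"image #\d+\s+(\S+)")


def extract_url(message):
    """Return the URL from a downloader message, or None if there is none."""
    match = IMAGE_LINE.search(message)
    return match.group(1) if match else None


sample = (
    "image #1\thttps://ichef.bbci.co.uk/news/320/cpsprodpb/CCDA/"
    "production/_129624425_gettyimages-1487975069.jpg"
)
print(extract_url(sample))
print(extract_url("parsing result page http://www.bbc.com/news"))
```

Messages that do not describe a downloaded image return None, so they can be skipped without extra checks.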

In the icrawler example I have used GreedyImageCrawler because I couldn't get the Google or Bing crawlers to work. GreedyImageCrawler didn't work properly either (it doesn't exit once the maximum number of images has been downloaded), but it does provide a working proof of concept for the URL extraction.

import re
from icrawler.builtin import GreedyImageCrawler

# URLs extracted from the downloader's log messages
all_urls = []

def checkCrawlURL(log_input):
    message = log_input.getMessage()
    # re-print captured log message (the original record is dropped below)
    print("INFO - downloader -", message)
    # extract url: "image #n", a tab, then the url
    res = re.search(r"image #\d+\t(.*)", message)
    if res:
        # add extracted url to list
        all_urls.append(res.group(1))
        print(all_urls)
    # a falsy return value tells the logger to suppress the record
    return False

greedy_crawler = GreedyImageCrawler(storage={'root_dir': 'image_dir'})
# pass download logger messages through function
greedy_crawler.downloader.logger.addFilter(checkCrawlURL)
greedy_crawler.crawl(domains='http://www.bbc.com/news', max_num=10,
                     min_size=None, max_size=None)

Result:

2023-05-06 19:18:23,031 - INFO - icrawler.crawler - start crawling...
2023-05-06 19:18:23,032 - INFO - icrawler.crawler - starting 1 feeder threads...
2023-05-06 19:18:23,032 - INFO - icrawler.crawler - starting 1 parser threads...
2023-05-06 19:18:23,033 - INFO - icrawler.crawler - starting 1 downloader threads...
2023-05-06 19:18:23,187 - INFO - parser - parsing result page http://www.bbc.com/news
INFO - downloader - image #1	https://ichef.bbci.co.uk/news/320/cpsprodpb/CCDA/production/_129624425_gettyimages-1487975069.jpg
['https://ichef.bbci.co.uk/news/320/cpsprodpb/CCDA/production/_129624425_gettyimages-1487975069.jpg']
INFO - downloader - image #2	https://ichef.bbci.co.uk/news/320/cpsprodpb/11A78/production/_129621327_f5b5bd4421299cab6e7587311796a63f8808b7db-1.jpg
['https://ichef.bbci.co.uk/news/320/cpsprodpb/CCDA/production/_129624425_gettyimages-1487975069.jpg', 'https://ichef.bbci.co.uk/news/320/cpsprodpb/11A78/production/_129621327_f5b5bd4421299cab6e7587311796a63f8808b7db-1.jpg']
[...]
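If the original timestamped log lines should be kept instead of re-printed, one variation is a filter object that stores the URLs and returns True. The sketch below is an untested-against-icrawler illustration: the URLCollector class, the stand_in_downloader logger and the example.com URLs are all made up, and the simulated messages only assume the "image #n, whitespace, URL" format seen in the result above.

```python
import logging
import re


class URLCollector(logging.Filter):
    """Hypothetical filter that collects image URLs from log records."""

    pattern = re.compile(r"image #\d+\s+(\S+)")

    def __init__(self):
        super().__init__()
        self.urls = []

    def filter(self, record):
        match = self.pattern.search(record.getMessage())
        if match:
            self.urls.append(match.group(1))
        return True  # keep the record so it is still logged normally


collector = URLCollector()

# Stand-in logger for the demo; this is not icrawler's own logger.
stand_in = logging.getLogger("stand_in_downloader")
stand_in.setLevel(logging.INFO)
stand_in.addFilter(collector)

# Simulated messages in the format shown in the result above.
stand_in.info("image #%s\t%s", 1, "https://example.com/a.jpg")
stand_in.info("image #%s\t%s", 2, "https://example.com/b.jpg")
stand_in.info("no url in this message")

print(collector.urls)
```

With icrawler, the same object would presumably be attached with greedy_crawler.downloader.logger.addFilter(collector) and the list read from collector.urls after the crawl finishes.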

Posted by huangapple on 2023-05-06 23:12:32
Source: https://go.coder-hub.com/76189631.html