Scrapy not saving output to jsonline

Question

I put together a crawler to extract URLs found on a website and save them to a jsonline file:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'medscape_crawler'
    allowed_domains = ['medscape.com']
    start_urls = ['https://www.medscape.com/']
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'DOWNLOAD_DELAY': 2,
        'FEEDS': {'medscape_links.jsonl': {'format': 'jsonlines'}},
        'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
        'JOBDIR': 'crawl_state',
    }

    def parse(self, response):
        yield {'url': response.url}  # Save this page's URL

        for href in response.css('a::attr(href)').getall():
            if href.startswith('http://') or href.startswith('https://'):
                yield response.follow(href, self.parse)

process = CrawlerProcess()
process.crawl(MySpider)
process.start()

The crawler successfully collects the links but does not populate the jsonline file with any output:

2023-05-28 21:20:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://portugues.medscape.com> (referer: https://www.medscape.com/)
2023-05-28 21:20:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://portugues.medscape.com>
{'url': 'https://portugues.medscape.com'}

The jsonline file remains empty, and adding 'FEED_EXPORT_BATCH_ITEM_COUNT': 10 does not trigger earlier writes either.

Any help would be greatly appreciated.

Thank you!

Answer 1

Score: 2

It does work; you may want to clean out the active.json file inside the crawl_state directory, since the job state persisted there by JOBDIR can make a resumed crawl treat pages as already seen and produce no new output.
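A minimal sketch of clearing that state, assuming the crawl_state JOBDIR from the question (the exact location of active.json varies by Scrapy version, so this searches the whole directory; deleting the crawl_state directory outright also resets the job):

from pathlib import Path

# Remove any active.json under the JOBDIR ('crawl_state' in the question);
# depending on the Scrapy version it sits in crawl_state/requests.queue/.
for state_file in Path('crawl_state').rglob('active.json'):
    state_file.unlink()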
If you want to save to different files, use FEED_URI_PARAMS:

custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'DOWNLOAD_DELAY': 2,
    'FEEDS': {'json_files/batch-%(batch_id)d.jsonl': {'format': 'jsonlines'}},
    'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
    'JOBDIR': 'crawl_state',
}

If you pause your job, then you may want to set overwrite to False (if you're not saving to different files; I haven't tested it, though).
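A sketch of what that per-feed option could look like, reusing the single-file feed from the question (overwrite is a standard per-feed option since Scrapy 2.4):

custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'DOWNLOAD_DELAY': 2,
    # overwrite=False appends to an existing file on a resumed run
    # instead of truncating it
    'FEEDS': {'medscape_links.jsonl': {'format': 'jsonlines', 'overwrite': False}},
    'JOBDIR': 'crawl_state',
}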

Answer 2

Score: 1

Although you have defined custom_settings, to make these settings take effect you need to pass a Settings object containing them when you create the CrawlerProcess:

...
process = CrawlerProcess(MySpider.custom_settings)
process.crawl(MySpider)
process.start()
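If you want the explicit Settings object this answer describes rather than the plain custom_settings dict, a sketch along these lines should be equivalent (untested here):

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

# Build a Settings object from the spider's custom_settings dict;
# 'spider' is the priority Scrapy itself uses for per-spider settings.
settings = Settings()
settings.setdict(MySpider.custom_settings, priority='spider')

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()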
