Scrapy not saving output to jsonline

Question


I put together a crawler to extract the URLs found on a website and save them to a JSON Lines (.jsonl) file:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'medscape_crawler'
    allowed_domains = ['medscape.com']
    start_urls = ['https://www.medscape.com/']
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'DOWNLOAD_DELAY': 2,
        'FEEDS': {'medscape_links.jsonl': {'format': 'jsonlines'}},
        'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
        'JOBDIR': 'crawl_state',
    }

    def parse(self, response):
        yield {'url': response.url}  # Save this page's URL

        for href in response.css('a::attr(href)').getall():
            if href.startswith('http://') or href.startswith('https://'):
                yield response.follow(href, self.parse)

process = CrawlerProcess()
process.crawl(MySpider)
process.start()

The crawler successfully collects the links but writes nothing to the JSON Lines file:

2023-05-28 21:20:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://portugues.medscape.com> (referer: https://www.medscape.com/)
2023-05-28 21:20:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://portugues.medscape.com>
{'url': 'https://portugues.medscape.com'}

The jsonline file remains empty. Adding 'FEED_EXPORT_BATCH_ITEM_COUNT': 10 does not trigger earlier, incremental saves either.
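For reference, I expect a successful export to leave medscape_links.jsonl with one JSON object per line (standard JSON Lines format), something like:

{"url": "https://www.medscape.com/"}
{"url": "https://portugues.medscape.com"}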

Any help would be greatly appreciated.

Thank you!

Answer 1

Score: 2


It does work; you may want to clean out the active.json file inside the crawl_state directory (the crawl state persisted by JOBDIR).
If you want to save to different files, parameterize the feed URI, e.g. with the built-in %(batch_id)d placeholder (FEED_URI_PARAMS lets you add custom parameters):

custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'DOWNLOAD_DELAY': 2,
    'FEEDS': {'json_files/batch-%(batch_id)d.jsonl': {'format': 'jsonlines'}},
    'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
    'JOBDIR': 'crawl_state',
}

If you pause your job, you may want to set overwrite to False so a resumed run doesn't truncate the file (only relevant if you're not saving to different files; I haven't tested it, though).
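A minimal sketch of that option, assuming the single-file setup from the question (overwrite is a per-feed option of FEEDS):

custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'DOWNLOAD_DELAY': 2,
    # overwrite=False appends to an existing file instead of replacing it
    'FEEDS': {'medscape_links.jsonl': {'format': 'jsonlines', 'overwrite': False}},
    'JOBDIR': 'crawl_state',
}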

Answer 2

Score: 1


Although you have defined custom_settings, for those settings to take effect you need to pass a Settings object (or a plain settings dict) containing them when constructing the CrawlerProcess:

...
process = CrawlerProcess(MySpider.custom_settings)
process.crawl(MySpider)
process.start()
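If the spider lives inside a Scrapy project, an alternative pattern (a sketch, not from the original answer) is to seed the process with the project settings; the spider's custom_settings are then merged on top by the crawler:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Project-wide settings from settings.py; the spider's
# custom_settings still override them per spider
process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start()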
