Scrapy not saving output to jsonline

Question
I put together a crawler to extract URLs found on a website and save them to a jsonline file:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'medscape_crawler'
    allowed_domains = ['medscape.com']
    start_urls = ['https://www.medscape.com/']

    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'DOWNLOAD_DELAY': 2,
        'FEEDS': {'medscape_links.jsonl': {'format': 'jsonlines'}},
        'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
        'JOBDIR': 'crawl_state',
    }

    def parse(self, response):
        yield {'url': response.url}  # Save this page's URL
        for href in response.css('a::attr(href)').getall():
            if href.startswith('http://') or href.startswith('https://'):
                yield response.follow(href, self.parse)

process = CrawlerProcess()
process.crawl(MySpider)
process.start()
The crawler successfully collects the links but does not populate the jsonline file with any outputs:
2023-05-28 21:20:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://portugues.medscape.com> (referer: https://www.medscape.com/)
2023-05-28 21:20:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://portugues.medscape.com>
{'url': 'https://portugues.medscape.com'}
The jsonline file remains empty. Adding 'FEED_EXPORT_BATCH_ITEM_COUNT': 10 does not trigger earlier writes either.
Any help would be greatly appreciated.
Thank you!
Answer 1

Score: 2
It does work; you may want to clean the active.json file inside the crawl_state directory.
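For instance, a minimal sketch of clearing that state before re-running the spider, assuming the default JOBDIR layout (Scrapy persists resume state such as the request queue's active.json and requests.seen there); the directory name crawl_state matches the JOBDIR setting above:

import shutil
from pathlib import Path

# Sketch: remove the JOBDIR so the next run starts fresh instead of
# resuming from stale persisted state.
state_dir = Path('crawl_state')  # must match the JOBDIR setting
if state_dir.exists():
    shutil.rmtree(state_dir)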
If you want to save to different files, use FEED_URI_PARAMS:
custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'DOWNLOAD_DELAY': 2,
    'FEEDS': {'json_files/batch-%(batch_id)d.jsonl': {'format': 'jsonlines'}},
    'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
    'JOBDIR': 'crawl_state',
}
If you pause your job, you may want to set overwrite to False (if you're not saving to different files; I haven't tested it, though).
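A sketch of what that could look like; overwrite is a per-feed option of the FEEDS setting, though, as noted above, its interaction with resumed jobs is untested:

'FEEDS': {
    'medscape_links.jsonl': {
        'format': 'jsonlines',
        'overwrite': False,  # append to the existing file instead of truncating it
    },
},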
Answer 2

Score: 1
Although you have defined custom_settings, to make these custom settings take effect you need to pass a Settings object containing them when creating the CrawlerProcess:
...
process = CrawlerProcess(MySpider.custom_settings)
process.crawl(MySpider)
process.start()
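An equivalent sketch that builds an explicit Settings object rather than passing the raw dict (Settings.setdict and the 'spider' priority level are part of Scrapy's settings API):

from scrapy.settings import Settings

# Sketch: populate a Settings object from the spider's custom_settings
# and hand it to CrawlerProcess explicitly.
settings = Settings()
settings.setdict(MySpider.custom_settings, priority='spider')

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()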