crawlee - How to add the same URL back to the requestQueue
Question
How do I enqueue the same URL that I am currently handling the request for?
I have this code and want to scrape the same URL again (possibly with a delay). I added environment variables so that cached results will be deleted, according to this answer.
import { RequestQueue, CheerioCrawler, Configuration } from "crawlee";

const config = Configuration.getGlobalConfig();
config.set('persistStorage', false);
config.set('purgeOnStart', false);

const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: "https://www.google.com/" });

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ $, request }) {
        console.log("Do something with scraped data...");
        await crawler.addRequests([{ url: "https://www.google.com/" }]);
    }
})

await crawler.run();
Answer 1
Score: 0
I found a solution:
Adding a uniqueKey to the request object, for example a counter that is incremented every time before we enqueue a new request, solves this problem. The queue deduplicates requests by their uniqueKey (which defaults to the URL), so giving each request a fresh key lets the same URL be accepted again.
{ url: "https://www.google.com/", uniqueKey: counter.toString() }