crawlee - How to add the same URL back to the requestQueue
Question
How do I enqueue the same URL that I am currently handling the request for?
I have this code and want to scrape the same URL again (possibly with a delay). I added environment variables so that cached results will be deleted, according to this answer.
import { RequestQueue, CheerioCrawler, Configuration } from "crawlee";

const config = Configuration.getGlobalConfig();
config.set('persistStorage', false);
config.set('purgeOnStart', false);

const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: "https://www.google.com/" });

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ $, request }) {
        console.log("Do something with scraped data...");
        await crawler.addRequests([{ url: "https://www.google.com/" }]);
    }
})

await crawler.run();
Answer 1
Score: 0
I found a solution:
Adding a uniqueKey to the request object, for example a counter that is incremented every time before we enqueue a new request, solves this problem. The queue deduplicates requests by their uniqueKey (which defaults to the URL), so giving each request a fresh key lets the same URL be accepted again.
{ url: "https://www.google.com/", uniqueKey: counter.toString() }