May 23, 2023 01:35:32

Puppeteer only runs three times on Heroku

Question

I'm working on a website that uses puppeteer to scrape data from another website. When I run the npm server on my local machine, it scrapes the data just fine. When I deploy it to Heroku, however, it only handles the first three class codes I'm looking for and then stops.

I'm essentially trying to scrape data about classes from my school's website, so I run this line in a for loop:

`let data = await crawler.scrapeData(classesTaken[i].code)`

This calls the function below. I have replaced the actual website URL for my own privacy.

    // Launch a new Chromium instance for this class code
    const browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox'
      ]
    })
    const page = await browser.newPage()

    // Search for the class code and open the first result
    await page.goto("website url")
    await page.type('#crit-keyword', code)
    await page.click('#search-button')

    await page.waitForSelector(".result__headline")

    await page.click(".result__headline")

    await page.waitForSelector("div.text:nth-child(2)")

    // Extract the title (converted to title case) and the description
    let data = await page.evaluate(() => {
        let classTitle = document.querySelector("div.text:nth-child(2)").textContent
            .toLowerCase().split(' ')
            .map((s) => s.charAt(0).toUpperCase() + s.substring(1)).join(' ').replace('Ii', "II")
        let classDesc = document.querySelector(".section--description > div:nth-child(2)").textContent.replace('Lec/lab/rec.', '').trim()

        return {
            title: classTitle,
            desc: classDesc
        }
    })

    console.log(`== Finished grabbing ${code}`)

    return data

This runs perfectly fine on my own local server. However, when I push to my Heroku website, it only runs the first three class codes. I have a feeling this could be due to my dyno running out of memory, but I don't know how to make it wait for there to be available memory.
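One point about the loop may help narrow this down: because each `scrapeData` call is awaited inside the for loop, the next call starts only after the previous one has settled. Overlapping scrapes are therefore unlikely to be what uses up memory, and something that accumulates from one iteration to the next is a more plausible cause. The sketch below illustrates the sequential behaviour with a stub. `makeFakeCrawler` and `scrapeAll` are hypothetical names, and the stub does not launch a browser.

```javascript
// Stub standing in for the real crawler: it only records how many calls overlap.
function makeFakeCrawler() {
  const stats = { inFlight: 0, maxInFlight: 0 };

  async function scrapeData(code) {
    stats.inFlight += 1;
    stats.maxInFlight = Math.max(stats.maxInFlight, stats.inFlight);
    await new Promise((resolve) => setTimeout(resolve, 10)); // pretend to do work
    stats.inFlight -= 1;
    return { title: code, desc: '' };
  }

  return { scrapeData, stats };
}

// Same shape as the loop in the question: one awaited call per class code.
async function scrapeAll(crawler, classesTaken) {
  const results = [];
  for (let i = 0; i < classesTaken.length; i++) {
    results.push(await crawler.scrapeData(classesTaken[i].code));
  }
  return results;
}

const crawler = makeFakeCrawler();
scrapeAll(crawler, [{ code: 'CS 475' }, { code: 'CS 331' }, { code: 'CS 370' }])
  .then((results) => {
    // prints "3 scraped; max in flight: 1"
    console.log(`${results.length} scraped; max in flight: ${crawler.stats.maxInFlight}`);
  });
```

With the calls running one at a time, memory would only keep growing if each call leaves something behind after it returns.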

Here are the deploy logs:

    2023-05-22T17:29:18.421015+00:00 app[web.1]: == Finished grabbing CS 475
    2023-05-22T17:29:19.098698+00:00 app[web.1]: == Finished grabbing CS 331
    2023-05-22T17:29:19.783377+00:00 app[web.1]: == Finished grabbing CS 370
    2023-05-22T17:29:49.992190+00:00 app[web.1]: /app/node_modules/puppeteer/lib/cjs/puppeteer/common/util.js:317
    2023-05-22T17:29:49.992208+00:00 app[web.1]:     const timeoutError = new Errors_js_1.TimeoutError(`waiting for ${taskName} failed: timeout ${timeout}ms exceeded`);
    2023-05-22T17:29:49.992209+00:00 app[web.1]:                          ^
    2023-05-22T17:29:49.992209+00:00 app[web.1]:
    2023-05-22T17:29:49.992210+00:00 app[web.1]: TimeoutError: waiting for target failed: timeout 30000ms exceeded
    2023-05-22T17:29:49.992211+00:00 app[web.1]:     at waitWithTimeout (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/util.js:317:26)
    2023-05-22T17:29:49.992230+00:00 app[web.1]:     at Browser.waitForTarget (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/Browser.js:405:56)
    2023-05-22T17:29:49.992230+00:00 app[web.1]:     at ChromeLauncher.launch (/app/node_modules/puppeteer/lib/cjs/puppeteer/node/ChromeLauncher.js:100:31)
    2023-05-22T17:29:49.992230+00:00 app[web.1]:     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    2023-05-22T17:29:49.992231+00:00 app[web.1]:     at async Object.scrapeData (/app/crawler.js:9:21)
    2023-05-22T17:29:49.992231+00:00 app[web.1]:     at async getClassData (file:///app/server.mjs:40:16)
    2023-05-22T17:29:49.992234+00:00 app[web.1]:

I read somewhere that clearing the build cache might help, using these commands:

    $ heroku plugins:install heroku-builds
    $ heroku builds:cache:purge --app your-app-name

I have tried that and it didn't do anything. I also followed the troubleshooting notes for Heroku on the puppeteer GitHub.

I believe it might have something to do with my dyno memory because of this related post. If that is the case, I would like to figure out how to wait until there is available memory to use.

EDIT: I am now running the browser in headless mode as well, and it results in the exact same error.

Answer 1

Score: 2

Upon further logging, I discovered that I was leaking memory by opening the browser and never closing it. After I added the line `await browser.close()` right before the return statement of the `scrapeData()` function, the memory leak stopped and the server was able to process all of the class codes correctly.
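A slightly more defensive variant of this fix is sketched below. It puts `browser.close()` in a `finally` block rather than directly before the `return`, so the browser is also closed when a wait times out partway through. This is a sketch under assumptions, not the original function: `puppeteer` and `url` are passed in as parameters (the question's real signature takes only `code`), and the title formatting is left out for brevity.

```javascript
// Sketch only. `puppeteer` and `url` are parameters here (an assumption, not the
// original signature) so the function can be exercised with a stub; in the real
// crawler they would be require('puppeteer') and the school site's URL.
async function scrapeData(puppeteer, url, code) {
  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  try {
    const page = await browser.newPage();

    await page.goto(url);
    await page.type('#crit-keyword', code);
    await page.click('#search-button');

    await page.waitForSelector('.result__headline');
    await page.click('.result__headline');
    await page.waitForSelector('div.text:nth-child(2)');

    // Same selectors as in the question, minus the title-case formatting.
    const data = await page.evaluate(() => ({
      title: document.querySelector('div.text:nth-child(2)').textContent,
      desc: document
        .querySelector('.section--description > div:nth-child(2)')
        .textContent.replace('Lec/lab/rec.', '')
        .trim(),
    }));

    console.log(`== Finished grabbing ${code}`);
    return data;
  } finally {
    // Runs whether the scrape succeeded or threw (for example on a timeout),
    // so no Chromium process is left behind.
    await browser.close();
  }
}
```

Inside the loop this would be called as `await scrapeData(require('puppeteer'), siteUrl, classesTaken[i].code)`, where `siteUrl` is a hypothetical variable holding the real URL.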
