Puppeteer only runs three times on Heroku

Question

I'm working on a website that uses Puppeteer to scrape data from another website. When I run the npm server on my local machine, it scrapes the data just fine. When I deploy it to Heroku, however, it only gets through the first three of the items I'm looking for and then stops.

I'm essentially trying to scrape data about classes from my school's website, so I run this line in a for loop:

`let data = await crawler.scrapeData(classesTaken[i].code)`

That line calls the function below. I have replaced the actual website URL for my own privacy.

    const browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox'
      ]
    })
    const page = await browser.newPage()
    
    await page.goto("website url")
    await page.type('#crit-keyword', code)
    await page.click('#search-button')

    await page.waitForSelector(".result__headline")

    await page.click(".result__headline")

    await page.waitForSelector("div.text:nth-child(2)")

    let data = await page.evaluate(() => {
        let classTitle = document.querySelector("div.text:nth-child(2)").textContent
            .toLowerCase().split(' ')
            .map((s) => s.charAt(0).toUpperCase() + s.substring(1)).join(' ').replace('Ii', "II")
        let classDesc =  document.querySelector(".section--description > div:nth-child(2)").textContent.replace('Lec/lab/rec.', '').trim()

        return {
            title: classTitle,
            desc: classDesc
        }
    })

    console.log(`== Finished grabbing ${code}`)

    return data

This runs perfectly fine on my own local server. However, when I push to my Heroku website, it only runs the first three class codes. I have a feeling this could be due to my dyno running out of memory, but I don't know how to make it wait until memory is available.

Here are the deploy logs:

2023-05-22T17:29:18.421015+00:00 app[web.1]: == Finished grabbing CS 475
2023-05-22T17:29:19.098698+00:00 app[web.1]: == Finished grabbing CS 331
2023-05-22T17:29:19.783377+00:00 app[web.1]: == Finished grabbing CS 370

2023-05-22T17:29:49.992190+00:00 app[web.1]: /app/node_modules/puppeteer/lib/cjs/puppeteer/common/util.js:317
2023-05-22T17:29:49.992208+00:00 app[web.1]:     const timeoutError = new Errors_js_1.TimeoutError(`waiting for ${taskName} failed: timeout ${timeout}ms exceeded`);
2023-05-22T17:29:49.992209+00:00 app[web.1]:                          ^
2023-05-22T17:29:49.992209+00:00 app[web.1]:
2023-05-22T17:29:49.992210+00:00 app[web.1]: TimeoutError: waiting for target failed: timeout 30000ms exceeded
2023-05-22T17:29:49.992211+00:00 app[web.1]:     at waitWithTimeout (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/util.js:317:26)
2023-05-22T17:29:49.992230+00:00 app[web.1]:     at Browser.waitForTarget (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/Browser.js:405:56)
2023-05-22T17:29:49.992230+00:00 app[web.1]:     at ChromeLauncher.launch (/app/node_modules/puppeteer/lib/cjs/puppeteer/node/ChromeLauncher.js:100:31)
2023-05-22T17:29:49.992230+00:00 app[web.1]:     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2023-05-22T17:29:49.992231+00:00 app[web.1]:     at async Object.scrapeData (/app/crawler.js:9:21)
2023-05-22T17:29:49.992231+00:00 app[web.1]:     at async getClassData (file:///app/server.mjs:40:16)
2023-05-22T17:29:49.992234+00:00 app[web.1]:

I read somewhere that I should try clearing the build cache with these commands:

$ heroku plugins:install heroku-builds
$ heroku builds:cache:purge --app your-app-name

I tried that and it didn't change anything. I also followed the troubleshooting notes for Heroku on the Puppeteer GitHub.

The reason I believe it might have something to do with my dyno's memory is this related post. If that is the case, I would like to figure out how to wait until memory is available.

EDIT: I am now running the browser in headless mode as well, but this results in the exact same error.

Answer 1

Score: 2

Upon further logging, I discovered that I was leaking memory by opening the browser and never closing it. After I added the line `await browser.close()` right before the return statement of the `scrapeData()` function, the memory leak stopped and the server was able to process all of the class codes correctly.
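
One caveat with calling `await browser.close()` just before the `return` is that it is skipped if any earlier step throws (for example, a `waitForSelector` timeout), so the browser would still leak on failures. A minimal sketch of one way to guard against that, assuming a small wrapper is acceptable: `withBrowser` and `launchBrowser` are hypothetical names introduced here for illustration, not part of Puppeteer or of the original `crawler.js`.

```javascript
// Sketch only: `withBrowser` and `launchBrowser` are hypothetical names,
// not part of Puppeteer or of the original crawler.js.
// `launchBrowser` is any async function that resolves to an object with
// an async close() method, such as a Puppeteer Browser.
async function withBrowser(launchBrowser, task) {
  const browser = await launchBrowser()
  try {
    // Whatever `task` resolves to (e.g. the scraped { title, desc })
    // is passed back to the caller.
    return await task(browser)
  } finally {
    // Runs on success and on failure, so the browser process
    // is released either way.
    await browser.close()
  }
}
```

Inside `scrapeData()`, this could be used as `withBrowser(() => puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] }), async (browser) => { /* newPage, goto, evaluate, return data */ })`. The launch function is passed in rather than hard-coded so the same helper works with whatever launch options the app already uses.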

huangapple
  • Posted on May 23, 2023, 01:35:32
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76308675.html