How to prevent Puppeteer from crawling my website content


Question

I know that Puppeteer is a simple and powerful tool that makes it easy to scrape website data.

As far as I know, a headless browser exposes many properties that differ from a normal browser.

But if I use the following method to connect to an already-open browser, can it still be detected?

First, edit the desktop Chrome shortcut so that the browser starts with remote debugging enabled:
C:\Users\13632\AppData\Local\Google\Chrome\Application\chrome.exe --remote-debugging-port=9222

const axios = require('axios')
const puppeteer = require('puppeteer')

async function main() {
    // Ask the running Chrome instance for its DevTools WebSocket endpoint.
    const response = await axios.get(`http://127.0.0.1:9222/json/version`);
    const webSocketDebuggerUrl = response.data.webSocketDebuggerUrl;

    // Attach Puppeteer to the already-open browser instead of launching a new one.
    const browser = await puppeteer.connect({
        browserWSEndpoint: webSocketDebuggerUrl,
        ignoreDefaultArgs: ["--enable-automation"],
        slowMo: 100,
        defaultViewport: { width: 1280, height: 600 },
    });

    // Find the tab that already has the target site open and take control of it.
    const target = await browser.waitForTarget(t => t.url().includes("your url"))
    const page = await target.page();
}

main()

The method above connects to an already-open browser, which is an ordinary Chrome instance. It seems impossible to tell that it is being driven by an automation tool. Is there any other way to judge whether the other party is a human or a machine?
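For reference, a quick probe run from inside main() above (a sketch, assuming the connection succeeded and `page` points at the target tab) illustrates why the usual headless fingerprints do not show up in this setup:

// Sketch: add inside main() after `const page = await target.page();`
// In a manually launched Chrome started only with --remote-debugging-port,
// these values look like a normal browser: no "HeadlessChrome" marker in the
// user agent, and navigator.webdriver is not set to true.
const webdriverFlag = await page.evaluate(() => navigator.webdriver);
const userAgent = await page.evaluate(() => navigator.userAgent);
console.log({ webdriverFlag, userAgent });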


Answer 1

Score: 1

Browser profiling and automation detection (and beating it) is an entire subfield of its own. Some drivers (chromedriver; I've not used Puppeteer) set flags to indicate automated use, but these are easily defeated. (See, for instance, undetected chromedriver, a package that tries not to be detectable.)

Then there's user profiling (bots tend to click in predictable ways), running JS in the browser to try to detect the environment, blacklisting IPs (most bots are behind proxies), and so on.
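As a rough illustration of the "run JS in the browser" idea, here is a minimal sketch (not the answerer's code) that collects a few weak signals and scores them. Every one of these can be spoofed, so treat the score as a hint rather than proof; the /bot-telemetry endpoint is hypothetical.

// Minimal browser-side probe: each signal is weak and easily spoofed on its own,
// so several are combined into a simple score.
function collectAutomationSignals() {
  const signals = {
    // true when Chrome runs headless or was started with --enable-automation
    webdriverFlag: navigator.webdriver === true,
    // older headless builds exposed no plugins and no languages
    noPlugins: navigator.plugins.length === 0,
    noLanguages: !navigator.languages || navigator.languages.length === 0,
    // window.chrome exists in regular desktop Chrome but was missing in old headless builds
    missingChromeObject: typeof window.chrome === 'undefined',
  };
  const score = Object.values(signals).filter(Boolean).length;
  return { signals, score };
}

// Report the result to the server (hypothetical endpoint) so it can feed
// rate-limiting or challenge decisions.
navigator.sendBeacon('/bot-telemetry', JSON.stringify(collectAutomationSignals()));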

Ask yourself: what are you afraid of? Then defend against that. Anything you put on the Internet can and will be crawled, but you can make it hard to do disruptive things like booking all the concert tickets and then reselling them at a 500% markup. Specific challenges like that have specific answers; but there is no foolproof way to detect automated browsers, and chasing one is a waste of effort.
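As one concrete example of making disruptive actions expensive, here is a sketch of naive per-IP rate limiting in Express (an assumed stack; the window and limit numbers are arbitrary). A real deployment would use a shared store such as Redis and combine this with the behavioral and browser signals mentioned above.

const express = require('express');

const app = express();
const WINDOW_MS = 60 * 1000;   // 1-minute window (arbitrary)
const MAX_REQUESTS = 30;       // per IP per window (arbitrary)
const hits = new Map();        // in-memory only; use a shared store in production

// Count requests per client IP and reject clients that exceed the limit.
app.use((req, res, next) => {
  const now = Date.now();
  const entry = hits.get(req.ip) || { count: 0, windowStart: now };
  if (now - entry.windowStart > WINDOW_MS) {
    entry.count = 0;
    entry.windowStart = now;
  }
  entry.count += 1;
  hits.set(req.ip, entry);
  if (entry.count > MAX_REQUESTS) {
    return res.status(429).send('Too many requests');
  }
  next();
});

app.get('/tickets', (req, res) => res.send('ticket listing'));
app.listen(3000);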


huangapple
  • Posted on 2023-01-09 00:33:47
  • Please keep this link when reposting: https://go.coder-hub.com/75049507.html