Clean way to scrape web pages from a Manifest V3 Chrome extension

Question

My Chrome extension scrapes a variety of web pages, and I haven't yet found an approach that fully works. Here is what I've tried that comes close:

  1. From the background script, I can fetch the page and then run the HTML through htmlparser2 to parse it (I can't get a document, but for simple extraction this is OK). This is fine for static sites, but it doesn't work for sites that render their content with JavaScript.

  2. I can create a tab with extension-supplied HTML and, in that tab, load the pages I'm trying to scrape in an iframe (after using declarativeNetRequest to remove X-Frame-Options and related headers). Unfortunately, I then run into the same-origin policy, which means I can't access the content of the iframe: specifically, iframe.contentDocument ends up as null. I tried injecting a script into the iframe using chrome.scripting.executeScript, thinking I could post a message and get it to respond, but I don't have permission to inject scripts into chrome-extension:// tabs, even though it's my own tab! (This seems dumb, but maybe it's by design.)
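
For reference, the header-stripping rule mentioned in item 2 could look roughly like the sketch below. It is only an illustration: the function names `buildFrameHeaderRule` and `installFrameHeaderRule`, the rule id, and the choice of a session rule are assumptions rather than details from the question. It also assumes the extension declares the `declarativeNetRequest` permission and has host permissions for the target sites.

```javascript
// Builds a declarativeNetRequest rule that removes the response headers
// which stop a page from being embedded in an iframe.
// `ruleId` and `domains` are illustrative parameters.
function buildFrameHeaderRule(ruleId, domains) {
  return {
    id: ruleId,
    priority: 1,
    action: {
      type: 'modifyHeaders',
      responseHeaders: [
        { header: 'x-frame-options', operation: 'remove' },
        // A CSP frame-ancestors directive can also block embedding.
        // Removing the whole header is a blunt choice made for brevity.
        { header: 'content-security-policy', operation: 'remove' },
      ],
    },
    condition: {
      requestDomains: domains,
      // Only touch documents loaded into frames, never top-level tabs.
      resourceTypes: ['sub_frame'],
    },
  };
}

// Installs the rule for the current browser session only, replacing any
// earlier rule that used the same id.
async function installFrameHeaderRule(ruleId, domains) {
  await chrome.declarativeNetRequest.updateSessionRules({
    removeRuleIds: [ruleId],
    addRules: [buildFrameHeaderRule(ruleId, domains)],
  });
}
```

Calling `installFrameHeaderRule(1, ['example.com'])` from the background script would apply the rule to frames loaded from that domain. Restricting `resourceTypes` to `sub_frame` keeps the user's normal top-level navigation untouched.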

I know I could create a new tab for each URL I want to scrape. However, to do that I'd need a lax content_scripts policy (I have dozens of URLs), and I really don't want to inject a content script into the user's regular browsing tabs (although I will if I find no other solution). Also, the distraction of tabs showing and hiding, or of the favicon/title on a tab changing, is pretty poor UX.

Firefox has hidden tabs, which would be nice, but they're not supported in Chrome.

Is there a cleaner approach?

Answer 1

Score: 3

  1. Use the chrome.offscreen API to create a hidden document that has access to the DOM.
  2. Add a rule to strip X-Frame-Options.
  3. For each site:
    1. Register a content script that runs on the site's URL using chrome.scripting.registerContentScripts with allFrames: true and persistAcrossSessions: false.
    2. In the offscreen document, create an iframe pointing to the site.
    3. Process its DOM inside your content script.
    4. Send the results back via messaging.
    5. In the offscreen document, remove the iframe.
    6. Unregister the content script.
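
The steps above could be wired together roughly as follows. This is a minimal sketch, not a definitive implementation: the file names `offscreen.html` and `scraper.js`, the message types (`load-frame`, `remove-frame`, `scrape-result`), and the function names are all hypothetical, and it assumes the `offscreen` and `scripting` permissions plus host permissions for the target site. Error handling and timeouts are left out.

```javascript
// --- service worker (background) side ---

// Creates the offscreen document once. 'offscreen.html' is a hypothetical
// page bundled with the extension.
async function ensureOffscreenDocument() {
  const contexts = await chrome.runtime.getContexts({
    contextTypes: ['OFFSCREEN_DOCUMENT'],
  });
  if (contexts.length > 0) return;
  await chrome.offscreen.createDocument({
    url: 'offscreen.html',
    reasons: ['DOM_SCRAPING'],
    justification: 'Load a page in an iframe and read its rendered DOM',
  });
}

// Runs steps 3.1 to 3.6 for one site. `matchPattern` must cover `url`.
async function scrapeSite(url, matchPattern) {
  const id = `scrape-${crypto.randomUUID()}`;
  await ensureOffscreenDocument();
  await chrome.scripting.registerContentScripts([{
    id,
    js: ['scraper.js'], // hypothetical content script file
    matches: [matchPattern],
    allFrames: true,
    persistAcrossSessions: false,
  }]);
  try {
    // Start listening for the content script's result before the
    // offscreen document is asked to load the iframe.
    const result = new Promise((resolve) => {
      const onMessage = (msg) => {
        if (msg && msg.type === 'scrape-result') {
          chrome.runtime.onMessage.removeListener(onMessage);
          resolve(msg.data);
        }
      };
      chrome.runtime.onMessage.addListener(onMessage);
    });
    await chrome.runtime.sendMessage({ type: 'load-frame', url });
    const data = await result;
    await chrome.runtime.sendMessage({ type: 'remove-frame' });
    return data;
  } finally {
    await chrome.scripting.unregisterContentScripts({ ids: [id] });
  }
}

// --- offscreen.js side (loaded by offscreen.html) ---

// Adds and removes the iframe on request from the service worker.
function registerOffscreenHandlers() {
  let frame = null;
  chrome.runtime.onMessage.addListener((msg, sender, sendResponse) => {
    if (msg.type === 'load-frame') {
      frame = document.createElement('iframe');
      frame.src = msg.url;
      document.body.appendChild(frame);
      sendResponse(true);
    } else if (msg.type === 'remove-frame') {
      if (frame) frame.remove();
      frame = null;
      sendResponse(true);
    }
  });
}
```

In this sketch the content script is expected to send a `{ type: 'scrape-result', data }` message once it has read the DOM. A real implementation would probably add a timeout so that a page that never loads does not leave `scrapeSite` waiting forever.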

To make the content script run only inside your iframe:

  1. Add a dummy random id to the URL and use it when registering the content script:

    // Append a random, meaningless query parameter to the target URL.
    let u = new URL(url);
    u.searchParams.set(Math.random(), '');
    url = u.href;
    

    In theory a site could reject a URL that carries an unknown parameter, but this is unlikely.

  2. Wrap the entire content script in a condition:

    if (location.ancestorOrigins.contains(chrome.runtime.getURL('').slice(0, -1))) {
       .....
    }
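
Both safeguards can be factored into small helpers, as in the sketch below. The names `tagUrl`, `isInsideExtensionFrame` and `runScraper`, the `scrape-result` message shape, and the scraped `title` field are hypothetical. The match pattern also relies on an assumption worth verifying: that Chrome applies the trailing wildcard of a match pattern to the query string as well as the path.

```javascript
// Appends a throwaway query parameter to `url` and derives a match pattern
// that only matches URLs on the same host containing that parameter name.
function tagUrl(url, token = crypto.randomUUID()) {
  const u = new URL(url);
  u.searchParams.set(token, '');
  return {
    taggedUrl: u.href,
    // Assumption: the wildcard is matched against path plus query string.
    matchPattern: `${u.protocol}//${u.hostname}/*${token}*`,
  };
}

// True when one of the frame's ancestors is a page of this extension.
// In the browser `ancestorOrigins` is location.ancestorOrigins, and
// `extensionOrigin` is chrome.runtime.getURL('').slice(0, -1).
function isInsideExtensionFrame(ancestorOrigins, extensionOrigin) {
  return Array.from(ancestorOrigins).includes(extensionOrigin);
}

// Content script entry point: does nothing in the user's ordinary tabs.
function runScraper() {
  const origin = chrome.runtime.getURL('').slice(0, -1);
  if (!isInsideExtensionFrame(location.ancestorOrigins, origin)) return;
  chrome.runtime.sendMessage({
    type: 'scrape-result',
    data: { title: document.title }, // illustrative payload
  });
}
```

The background script would load `taggedUrl` in the iframe and pass `matchPattern` to chrome.scripting.registerContentScripts, while the content script file would simply call `runScraper()`. Keeping the origin check in addition to the tagged URL means the script stays inert even if the user happens to open the tagged URL in a normal tab.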
    

huangapple
  • Posted on May 17, 2023, 10:51:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76268275.html