Clean way to scrape web pages from Manifest V3 chrome extension
Question
My Chrome extension scrapes a variety of web pages. I haven't found an approach that fully works yet. What I've tried that comes close:
- From the background script, I can `fetch`, and then run the HTML through htmlparser2 to parse it (I can't get a document, but for simple extraction this is OK). This is fine for static sites, but doesn't work for sites that render content with JavaScript.
- I can create a tab with extension-supplied HTML, and in the tab load the targets that I'm attempting to scrape in an iframe (after using `declarativeNetRequest` to remove `X-Frame-Options` and related headers). Unfortunately, I then run into the same-origin policy, which means that I can't access the content of the iframe: specifically, `iframe.contentDocument` ends up as null. I tried injecting a script into the iframe using `chrome.scripting.executeScript`, thinking I could post a message and get it to respond, but I don't have permission to inject scripts on chrome-extension:// tabs, even though it's my own tab! (This seems dumb, but maybe it's by design.)
I know I could create a new tab per url I want to scrape; however, in order to do that, I'd need a lax contentScripts policy (I have dozens of urls), and I really don't want to be injecting a contentScript into the user's regular browsing tabs (although I will if I find no other solution). Also, the distraction of tabs showing and hiding, or the favicon / title on the tab changing, is pretty poor UX.
Firefox has hidden tabs, which would be nice, but they're not supported in Chrome.
Is there a cleaner approach?
Answer 1
Score: 3
- Use the `chrome.offscreen` API to create a hidden document with access to the DOM
- Add a rule to strip `X-Frame-Options`
- For each site:
  - register a content script that runs on the URL of the site using `chrome.scripting.registerContentScripts` with `allFrames: true` and `persistAcrossSessions: false`
  - in the offscreen document, create an iframe pointing to the site
  - process its DOM inside your content script
  - send the results back via messaging
  - in the offscreen document, remove the iframe
  - unregister the content script
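The steps above could be sketched on the background (service worker) side roughly as follows. This is only an illustration under several assumptions: the file names `offscreen.html` and `scraper.js`, the rule id, the `{ type: 'scrape' }` message shape, and the idea that the offscreen page creates the iframe, waits for the content script's result, removes the iframe, and replies are all hypothetical. Removing the whole `content-security-policy` header is one blunt reading of "related headers". The `chromeApi` parameter stands in for the global `chrome` object, and the manifest is assumed to grant the `offscreen`, `scripting`, and `declarativeNetRequest` permissions plus host access to the target sites.

```javascript
// Hypothetical names: adjust to your extension's own files.
const OFFSCREEN_PAGE = 'offscreen.html'; // extension page that hosts the iframe
const SCRAPER_FILE = 'scraper.js';       // content script that reads the DOM

// Builds a declarativeNetRequest rule that removes response headers which
// would otherwise stop the target page from loading inside an iframe.
function buildStripFrameHeadersRule(ruleId, urlFilter) {
  return {
    id: ruleId,
    priority: 1,
    condition: { urlFilter, resourceTypes: ['sub_frame'] },
    action: {
      type: 'modifyHeaders',
      responseHeaders: [
        { header: 'x-frame-options', operation: 'remove' },
        // Assumption: CSP is dropped entirely because of frame-ancestors.
        { header: 'content-security-policy', operation: 'remove' },
      ],
    },
  };
}

// One-time setup: install the header rule and create the offscreen document.
async function setUp(chromeApi, urlFilter) {
  await chromeApi.declarativeNetRequest.updateSessionRules({
    removeRuleIds: [1],
    addRules: [buildStripFrameHeadersRule(1, urlFilter)],
  });
  await chromeApi.offscreen.createDocument({
    url: OFFSCREEN_PAGE,
    reasons: ['IFRAME_SCRAPING'],
    justification: 'Load target pages in an iframe so a content script can read their DOM',
  });
}

// Per-site flow: register the script, let the offscreen page do the framing,
// then unregister the script even if something failed along the way.
async function scrapeSite(chromeApi, url, scriptId) {
  await chromeApi.scripting.registerContentScripts([{
    id: scriptId,
    matches: [url],
    js: [SCRAPER_FILE],
    allFrames: true,
    persistAcrossSessions: false,
  }]);
  try {
    // Assumed protocol: the offscreen page answers with the scraped data.
    return await chromeApi.runtime.sendMessage({ type: 'scrape', url });
  } finally {
    await chromeApi.scripting.unregisterContentScripts({ ids: [scriptId] });
  }
}
```

In the extension itself these would be called as `setUp(chrome, '||example.com')` once and `scrapeSite(chrome, url, 'scraper-1')` per target. Passing the API object in keeps the functions easy to exercise with a stub outside the browser.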
To make the content script run only inside your iframe:
- Add a dummy random id to the URL and use it when registering the content script:

      let u = new URL(url);
      u.searchParams.set(Math.random(), '');
      url = u.href;

  Theoretically, an unknown parameter may be rejected by some sites, but it's unlikely.
- Wrap the entire content script in a condition:

      if (location.ancestorOrigins.contains(chrome.runtime.getURL('').slice(0, -1))) { ..... }
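The two safeguards above can also be written as small helpers, which makes them easier to check in isolation. This is a sketch, not the answer's exact code: the function names `tagUrl` and `isFramedByExtension` are made up for illustration, and the id gets an `x` prefix simply so it is never an empty string.

```javascript
// Returns a copy of `url` with an extra, effectively unique query parameter.
// The same tagged URL would then be used both for the iframe's src and when
// registering the content script, so the script only matches that one load.
function tagUrl(url, id = 'x' + Math.random().toString(36).slice(2)) {
  const u = new URL(url);
  u.searchParams.set(id, '');
  return u.href;
}

// Returns true when one of the frame's ancestors has the extension's origin.
// `ancestorOrigins` may be a DOMStringList (location.ancestorOrigins) or a
// plain array; `extensionOrigin` has no trailing slash.
function isFramedByExtension(ancestorOrigins, extensionOrigin) {
  return Array.from(ancestorOrigins).includes(extensionOrigin);
}

// Intended use at the top of the content script (browser only):
//
//   const origin = chrome.runtime.getURL('').slice(0, -1);
//   if (isFramedByExtension(location.ancestorOrigins, origin)) {
//     // ... scrape the DOM and send the result back via messaging ...
//   }
```

The origin check is the stronger of the two guards, since it does not depend on the target site tolerating an unknown query parameter.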