Question

I am trying to save the static content of an HTML page. However, what gets captured is the dynamic content, such as the scripts. Is there a way to capture the raw content?

Here is the sample code:

import {chromium} from 'playwright'; // Web scraper library
import * as fs from 'fs';

(async function () {
    const chromeBrowser = await chromium.launch({ headless: true }); // Chromium launch and options
    const context = await chromeBrowser.newContext({ ignoreHTTPSErrors: true,
        userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    });
    const page = await context.newPage();
    await page.goto("https://emposedesigns.wixsite.com/empose/games", { waitUntil: 'networkidle', timeout: 60000 });
    let content = await page.content();
    fs.writeFileSync('test.html', content);
    console.log("done");
})();

Answer 1

Score: 1

When web scraping, after determining what your goal is, it's important to think about how you'd achieve the goal as a normal visitor to the site. Although some shortcuts exist (and are usually taken for web scraping, but not for testing), for the most part, Playwright is designed to replicate the user's actions 1:1.

The goal here is to get the text of the privacy policy. If we navigate to the page as a user, no privacy policy is visible. It's possible that the policy is present statically in the HTML. We can check that by viewing the page source, but in this case it isn't there.
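The page-source check can also be scripted. The sketch below assumes that the body of the main navigation response is a fair stand-in for "view source". `fetchRawHtml` and `sourceContains` are hypothetical helper names, not part of Playwright. `page.goto()` resolves to the main response, and `response.text()` returns its body as the server sent it, before any script has run. The raw HTML will still include whatever `<script>` tags the server sent, since those are part of the source.

```javascript
// Hypothetical helper: does the raw (unrendered) HTML mention the phrase?
function sourceContains(html, phrase) {
  return html.toLowerCase().includes(phrase.toLowerCase());
}

// Hypothetical helper: return the HTML exactly as the server sent it.
async function fetchRawHtml(url) {
  // Loaded lazily so the pure helper above can be used without Playwright.
  const playwright = require("playwright");
  const browser = await playwright.chromium.launch();
  try {
    const page = await browser.newPage();
    const response = await page.goto(url, {waitUntil: "domcontentloaded"});
    if (!response) {
      throw new Error("navigation produced no response");
    }
    return await response.text(); // body of the main document response
  } finally {
    await browser.close();
  }
}

// Usage (not run here):
// fetchRawHtml("<Your URL>")
//   .then(html => console.log(sourceContains(html, "Privacy Policy")));
```

If the check logs `false`, the text is most likely added after load by a script or served from a separate frame, so a rendered-page approach is needed.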

The policy is shown after clicking a link that has the text "Privacy Policy". After the browser renders the change triggered by the click, there's an iframe that contains the policy.

Here's one way to replicate this in Playwright:

const fs = require("node:fs/promises");
const playwright = require("playwright"); // ^1.30.1

const url = "<Your URL>";

let browser;
(async () => {
  browser = await playwright.chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.getByText("Privacy Policy").click();
  const text = await page.frameLocator("iframe")
    .locator('[data-custom-class="body"]')
    .textContent(); // or .innerHTML()
  console.log(text.trim());
  await fs.writeFile("policy.txt", text.trim());
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Now, if the goal is to get the privacy policy as quickly as possible, and you don't care about replicating user actions for testing purposes, you could navigate directly to the iframe's src URL. Assuming that URL is stable, this is the easiest way to get to the result: no clicking or iframes required.
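A minimal sketch of that shortcut follows, under two assumptions: the page shows a single iframe after the click, and the frame's document contains the same `[data-custom-class="body"]` element used above. `resolveFrameUrl` and `scrapeFrameDirectly` are hypothetical names. The click is only needed once to discover the `src`; after that, the resolved URL could be hard-coded.

```javascript
// Hypothetical helper: resolve an iframe src (possibly relative)
// against the URL of the page that embeds it.
function resolveFrameUrl(src, pageUrl) {
  return new URL(src, pageUrl).href;
}

// Hypothetical helper: discover the iframe URL, then load it directly.
async function scrapeFrameDirectly(url) {
  // Loaded lazily so the pure helper above can be used without Playwright.
  const playwright = require("playwright");
  const browser = await playwright.chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, {waitUntil: "domcontentloaded"});

    // Discovery step: click once to make the iframe appear, then read its src.
    await page.getByText("Privacy Policy").click();
    const src = await page.locator("iframe").getAttribute("src");
    if (!src) {
      throw new Error("iframe has no src attribute");
    }

    // Shortcut: open the frame's document as a top-level page.
    await page.goto(resolveFrameUrl(src, page.url()));
    const text = await page
      .locator('[data-custom-class="body"]')
      .textContent();
    return (text ?? "").trim();
  } finally {
    await browser.close();
  }
}

// Usage (not run here):
// scrapeFrameDirectly("<Your URL>").then(console.log);
```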
