Save static content of HTML but it appears to be dynamic content with scripts

Question

I am trying to save the static content of an HTML page. However, the captured output contains dynamic content such as scripts. Is there a way to capture the raw content instead?

Here is the sample code:

import { chromium } from 'playwright'; // Web scraper library
import * as fs from 'fs';

(async function () {
    const chromeBrowser = await chromium.launch({ headless: true }); // Chromium launch and options
    const context = await chromeBrowser.newContext({ ignoreHTTPSErrors: true,
        userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    });
    const page = await context.newPage();
    await page.goto("https://emposedesigns.wixsite.com/empose/games", { waitUntil: 'networkidle', timeout: 60000 });
    let content = await page.content();
    fs.writeFileSync('test.html', content);
    await chromeBrowser.close(); // Close the browser so the Node process can exit
    console.log("done");
})();

Answer 1

Score: 1

When web scraping, after determining what your goal is, it's important to think about how you'd achieve the goal as a normal visitor to the site. Although some shortcuts exist (and are usually taken for web scraping, but not for testing), for the most part, Playwright is designed to replicate the user's actions 1:1.

The goal here is to get the text of the privacy policy. If we navigate to the page as a user, no such privacy policy is visible. It's possible that the policy is in the HTML statically. We can check that by viewing page source, but in this case it's not present.

The policy is shown after clicking a link that has the text "Privacy Policy". After the browser renders the change triggered by the click, there's an iframe that contains the policy.

Here's one way to replicate this in Playwright:

const fs = require("node:fs/promises");
const playwright = require("playwright"); // ^1.30.1

const url = "<Your URL>";

let browser;
(async () => {
  browser = await playwright.chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.getByText("Privacy Policy").click();
  const text = await page.frameLocator("iframe")
    .locator('[data-custom-class="body"]')
    .textContent(); // or .innerHTML()
  console.log(text.trim());
  await fs.writeFile("policy.txt", text.trim());
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Now, if the goal is to get the privacy policy as quickly as possible, and you don't care about replicating user actions for testing purposes, you could navigate directly to the iframe's src URL. Assuming that URL is stable, this is the easiest way to get to the result: no clicking or iframes required.

huangapple
  • Posted on March 31, 2023, 16:12:50
  • Please retain this link when reposting: https://go.coder-hub.com/75896265.html