How to find child links at any level


Question


I have the following piece of HTML inside a page I loaded using Puppeteer, and I'm trying to get all the child links (not just direct children, but descendants at any level).

<ul class="ptf">
    <li class="pti">
        <div data-testid="pagetree-item-expander" class="pe" role="button" tabindex="0" aria-expanded="false"></div>
        <a href="/jsw/docs/start" data-testid="atlas_link">kl</a>
        <ul class="ptf" style="display:none">
            <li class="pti">
                <a href="/jsw/docs/what/" data-testid="atlas_link">ij</a>
            </li>
            <li class="pti">
                <a href="/jsw/docs/where/" data-testid="atlas_link">gh</a>
            </li>
            <li class="pti">
                <a href="/jsw/docs/common/" data-testid="atlas_link">ef</a>
            </li>
        </ul>
    </li>
    <li class="pti">
        <div data-testid="pagetree-item-expander" class="pe" role="button" tabindex="0" aria-expanded="false"></div>
        <a href="/jsw/docs/ge/" data-testid="atlas_link">cd</a>
        <ul class="ptf" style="display:none">
            <li class="pti">
                <a href="/jsw/docs/wha/" data-testid="atlas_link">ab</a>
            </li>
        </ul>
    </li>
</ul>

I tried the following, but it doesn't list any children. What am I doing wrong?

const links = await page.$x("//*[@id=\"root\"]/div[2]/div/li[5]/ul//a");

for (let i = 0; i < links.length; i++) {
  const textContent = await links[i].getProperty("href");
  const srcText = await textContent.jsonValue();
  console.log(srcText);
}

Context: I'm looking to get the URLs of all child links within this link:
[image]

Expected outcome: A flat array with the following first 10 URLs:

[
  "https://support.atlassian.com/jira-software-cloud/docs/get-started-with-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/what-is-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/where-do-i-find-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/common-jira-software-configurations-for-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/view-a-sample-advanced-roadmaps-plan/",
  "https://support.atlassian.com/jira-software-cloud/docs/create-a-new-plan-in-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/how-do-i-navigate-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/change-your-advanced-roadmaps-plan-settings/",
  "https://support.atlassian.com/jira-software-cloud/docs/how-do-i-read-my-advanced-roadmaps-plan/",
  "https://support.atlassian.com/jira-software-cloud/docs/what-do-the-symbols-in-advanced-roadmaps-mean/"
]

Answer 1

Score: 1


This appears to be an XY problem. The data is in the page source as a JSON string, so you can get it without any dependencies or imports by using Node 18's native fetch:

fetch("<Your URL>")
  .then(res => {
    // Fail fast on non-2xx responses.
    if (!res.ok) {
      throw Error(res.statusText);
    }

    return res.text();
  })
  .then(html => {
    // The nav tree is embedded in the page source as JSON
    // on a line starting with `pageTree:`.
    const pageTree = JSON.parse(
      html.match(/^ *pageTree: (.*);*$/m)[1]
    );
    console.log(JSON.stringify(pageTree, null, 2));
    // Find the "Advanced Roadmaps" section, then list the slugs
    // of its first child's children.
    const hrefs = pageTree
      .find(({title}) =>
        title.toLowerCase().includes("advanced roadmaps")
      )
      .childList[0].childList.map(({slug}) => slug);
    console.log(hrefs);
  })
  .catch(err => console.error(err));

Output:

<giant JSON structure with the entire nav tree>
[
  '/jira-software-cloud/docs/what-is-advanced-roadmaps/',
  '/jira-software-cloud/docs/where-do-i-find-advanced-roadmaps/',
  '/jira-software-cloud/docs/common-jira-software-configurations-for-advanced-roadmaps/',
  '/jira-software-cloud/docs/view-a-sample-advanced-roadmaps-plan/',
  '/jira-software-cloud/docs/create-a-new-plan-in-advanced-roadmaps/',
  '/jira-software-cloud/docs/how-do-i-navigate-advanced-roadmaps/',
  '/jira-software-cloud/docs/change-your-advanced-roadmaps-plan-settings/',
  '/jira-software-cloud/docs/how-do-i-read-my-advanced-roadmaps-plan/',
  '/jira-software-cloud/docs/what-do-the-symbols-in-advanced-roadmaps-mean/',
  '/jira-software-cloud/docs/what-keyboard-shortcuts-are-available-in-advanced-roadmaps/',
  '/jira-software-cloud/docs/add-teams-and-releases-to-your-advanced-roadmaps-plan/',
  '/jira-software-cloud/docs/build-out-your-plan-in-advanced-roadmaps/',
  '/jira-software-cloud/docs/planning-tools-in-advanced-roadmaps/',
  '/jira-software-cloud/docs/create-different-views-of-your-advanced-roadmaps-plan/',
  '/jira-software-cloud/docs/how-ted-uses-advanced-roadmaps-scenarios-and-capacity/',
  '/jira-software-cloud/docs/how-veronica-uses-advanced-roadmaps-cross-project-planning/'
]

This runs in a fraction of the time Puppeteer would take: 0.879s on my decade-old laptop. The JSON format could change at any time, but the DOM is just as likely to change.
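Since the embedded format can change, it may be worth failing loudly when the `pageTree` line is missing. Here is a minimal sketch of that idea, run against a made-up HTML string rather than the real page. The `extractPageTree` name, the sample markup, and the trailing comma/semicolon stripping are my assumptions, not something taken from the site:

```javascript
// Hypothetical helper: pull the JSON that follows "pageTree: " on its own line.
// Assumes the whole JSON value sits on a single line of the page source.
function extractPageTree(html) {
  const match = html.match(/^ *pageTree: (.*)$/m);
  if (!match) {
    throw new Error("pageTree line not found; the page format may have changed");
  }
  // Assumption: the line may end with a comma or semicolon, which JSON.parse rejects.
  return JSON.parse(match[1].replace(/[,;]\s*$/, ""));
}

// Made-up sample in the assumed shape, for illustration only.
const sampleHtml = [
  "<script>",
  "window.__APP_INITIAL_STATE__ = {",
  '  pageTree: [{"title":"Advanced Roadmaps","slug":"/docs/ar/","childList":[]}],',
  "};",
  "</script>",
].join("\n");

const sampleTree = extractPageTree(sampleHtml);
console.log(sampleTree[0].slug);
```

If the line is absent, the helper throws a descriptive error instead of the `TypeError` you would get from indexing into a `null` match.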

See this answer for a detailed walkthrough of how to find your data like this. It's written in Python but all of the concepts apply to Node.

If your requests are being blocked (and you added a user agent header), or for some reason you really want/need to use Puppeteer, the data in question is attached to the window, so you can use:

const puppeteer = require("puppeteer"); // ^20.2.0

const url = "<Your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  // Block every request except the main document; the data is in the initial HTML.
  await page.setRequestInterception(true);
  page.on("request", req => {
    req.url().replace(/\/$/, "") === url.replace(/\/$/, "")
      ? req.continue()
      : req.abort();
  });
  await page.goto(url, {waitUntil: "domcontentloaded"});
  // Read the nav tree straight off the window object.
  const hrefs = await page.evaluate(() =>
    window.__APP_INITIAL_STATE__.pageTree
      .at(-1)
      .childList[0].childList.map(({slug}) => slug)
  );
  console.log(hrefs);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

This took 3-4x as long to run as the fetch version for me.
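The `.childList[0].childList.map(...)` expression reads a single level of the tree and yields relative slugs, while the question asks for links at any depth as absolute URLs. A rough sketch of a recursive walk follows, assuming every node has the `{title, slug, childList}` shape seen above. The `collectSlugs` and `toAbsolute` helpers, the sample data, and the `https://example.com` base are hypothetical:

```javascript
// Hypothetical helper: depth-first walk collecting the slug of every descendant.
// Assumes nodes look like {title, slug, childList}; childList may be missing on leaves.
function collectSlugs(node) {
  const children = node.childList ?? [];
  return children.flatMap(child => [child.slug, ...collectSlugs(child)]);
}

// Hypothetical helper: resolve relative slugs against a base URL.
function toAbsolute(slugs, base) {
  return slugs.map(slug => new URL(slug, base).href);
}

// Made-up sample data in the assumed shape.
const sampleNode = {
  title: "Get started",
  slug: "/jsw/docs/start/",
  childList: [
    {
      title: "What",
      slug: "/jsw/docs/what/",
      childList: [{title: "Deeper", slug: "/jsw/docs/what/deeper/", childList: []}],
    },
    {title: "Where", slug: "/jsw/docs/where/"},
  ],
};

const urls = toAbsolute(collectSlugs(sampleNode), "https://example.com");
console.log(urls);
```

To include the parent link itself, as in the question's expected output, prepend `node.slug` to the result before resolving.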

Posted by huangapple on May 15, 2023, 08:41:57
Permalink: https://go.coder-hub.com/76250251.html