How to find child links at any level
Question
I have the following piece of HTML inside a page I loaded using Puppeteer, and I'm trying to get all the child links (not just direct children, but child <a> elements at any level).
<ul class="ptf">
<li class="pti">
<div data-testid="pagetree-item-expander" class="pe" role="button" tabindex="0" aria-expanded="false"></div>
<a href="/jsw/docs/start" data-testid="atlas_link">kl</a>
<ul class="ptf" style="display:none">
<li class="pti">
<a href="/jsw/docs/what/" data-testid="atlas_link">ij</a>
</li>
<li class="pti">
<a href="/jsw/docs/where/" data-testid="atlas_link">gh</a>
</li>
<li class="pti">
<a href="/jsw/docs/common/" data-testid="atlas_link">ef</a>
</li>
</ul>
<li class="pti">
<div data-testid="pagetree-item-expander" class="pe" role="button" tabindex="0" aria-expanded="false"></div>
<a href="/jsw/docs/ge/" data-testid="atlas_link">cd</a>
<ul class="ptf" style="display:none">
<li class="pti">
<a href="/jsw/docs/wha/" data-testid="atlas_link">ab</a>
</li>
</li>
</ul>
I tried the following but it's not listing down any children. What am I doing wrong?
const links = await page.$x("//*[@id=\"root\"]/div[2]/div/li[5]/ul//a");
for (let i = 0; i < links.length; i++) {
  const textContent = await links[i].getProperty("href");
  const srcText = await textContent.jsonValue();
  console.log(srcText);
}
Context: I'm looking to get URLs of all child links within this link:
Expected outcome: A flat array with the following first 10 URLs:
["https://support.atlassian.com/jira-software-cloud/docs/get-started-with-advanced-roadmaps/",
"https://support.atlassian.com/jira-software-cloud/docs/what-is-advanced-roadmaps/",
"https://support.atlassian.com/jira-software-cloud/docs/where-do-i-find-advanced-roadmaps/",
"https://support.atlassian.com/jira-software-cloud/docs/common-jira-software-configurations-for-advanced-roadmaps/",
"https://support.atlassian.com/jira-software-cloud/docs/view-a-sample-advanced-roadmaps-plan/",
"https://support.atlassian.com/jira-software-cloud/docs/create-a-new-plan-in-advanced-roadmaps/",
"https://support.atlassian.com/jira-software-cloud/docs/how-do-i-navigate-advanced-roadmaps/",
"https://support.atlassian.com/jira-software-cloud/docs/change-your-advanced-roadmaps-plan-settings/",
"https://support.atlassian.com/jira-software-cloud/docs/how-do-i-read-my-advanced-roadmaps-plan/",
"https://support.atlassian.com/jira-software-cloud/docs/what-do-the-symbols-in-advanced-roadmaps-mean/"]
Answer 1
Score: 1
This appears to be an XY problem. The data is in the page source as a JSON string, so you can get it without any dependencies or imports by using Node 18's native fetch:
fetch("<Your URL>")
  .then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }
    return res.text();
  })
  .then(html => {
    // The nav tree is embedded in the page source on a "pageTree: ..." line
    const pageTree = JSON.parse(
      html.match(/^ *pageTree: (.*);*$/m)[1]
    );
    console.log(JSON.stringify(pageTree, null, 2));
    // Pick the "advanced roadmaps" section and read the slugs of its pages
    const hrefs = pageTree
      .find(({title}) =>
        title.toLowerCase().includes("advanced roadmaps")
      )
      .childList[0].childList.map(({slug}) => slug);
    console.log(hrefs);
  })
  .catch(err => console.error(err));
Output:
<giant JSON structure with the entire nav tree>
[
'/jira-software-cloud/docs/what-is-advanced-roadmaps/',
'/jira-software-cloud/docs/where-do-i-find-advanced-roadmaps/',
'/jira-software-cloud/docs/common-jira-software-configurations-for-advanced-roadmaps/',
'/jira-software-cloud/docs/view-a-sample-advanced-roadmaps-plan/',
'/jira-software-cloud/docs/create-a-new-plan-in-advanced-roadmaps/',
'/jira-software-cloud/docs/how-do-i-navigate-advanced-roadmaps/',
'/jira-software-cloud/docs/change-your-advanced-roadmaps-plan-settings/',
'/jira-software-cloud/docs/how-do-i-read-my-advanced-roadmaps-plan/',
'/jira-software-cloud/docs/what-do-the-symbols-in-advanced-roadmaps-mean/',
'/jira-software-cloud/docs/what-keyboard-shortcuts-are-available-in-advanced-roadmaps/',
'/jira-software-cloud/docs/add-teams-and-releases-to-your-advanced-roadmaps-plan/',
'/jira-software-cloud/docs/build-out-your-plan-in-advanced-roadmaps/',
'/jira-software-cloud/docs/planning-tools-in-advanced-roadmaps/',
'/jira-software-cloud/docs/create-different-views-of-your-advanced-roadmaps-plan/',
'/jira-software-cloud/docs/how-ted-uses-advanced-roadmaps-scenarios-and-capacity/',
'/jira-software-cloud/docs/how-veronica-uses-advanced-roadmaps-cross-project-planning/'
]
This runs in a fraction of the time Puppeteer would take, 0.879s on my decade-old laptop. Although it's possible the JSON format could change at any time, it's just as likely that the DOM could as well.
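The slugs printed above are site-relative, while the question asks for absolute URLs. A minimal sketch of the conversion, assuming the origin is the one shown in the question's expected output; `BASE_URL` and `toAbsolute` are hypothetical names introduced here for illustration:

```javascript
// Assumption: the origin is taken from the URLs in the question's expected output.
const BASE_URL = "https://support.atlassian.com";

// toAbsolute is a hypothetical helper, not part of any library.
function toAbsolute(slugs, base = BASE_URL) {
  // The built-in URL constructor resolves a relative path against a base URL.
  return slugs.map(slug => new URL(slug, base).href);
}

// Two of the slugs from the output above, used as sample input.
const sampleSlugs = [
  "/jira-software-cloud/docs/what-is-advanced-roadmaps/",
  "/jira-software-cloud/docs/where-do-i-find-advanced-roadmaps/",
];
console.log(toAbsolute(sampleSlugs));
```

Using the URL constructor rather than string concatenation avoids doubled or missing slashes if the base ever ends with one.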
See this answer for a detailed walkthrough of how to find your data like this. It's written in Python but all of the concepts apply to Node.
If your requests are being blocked (and you added a user agent header), or for some reason you really want/need to use Puppeteer, the data in question is attached to the window, so you can use:
const puppeteer = require("puppeteer"); // ^20.2.0
const url = "<Your URL>";
let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  // Abort every request except the main document to speed up the load
  await page.setRequestInterception(true);
  page.on("request", req => {
    req.url().replace(/\/$/, "") === url.replace(/\/$/, "")
      ? req.continue()
      : req.abort();
  });
  await page.goto(url, {waitUntil: "domcontentloaded"});
  // The nav tree is attached to the window, so read it in the page context
  const hrefs = await page.evaluate(() =>
    window.__APP_INITIAL_STATE__.pageTree
      .at(-1)
      .childList[0].childList.map(({slug}) => slug)
  );
  console.log(hrefs);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
This took 3-4x as long to run as the fetch version for me.
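Both versions read a single level of childList. If links at every depth are wanted, as the question title suggests, the tree can be walked recursively. This is only a sketch: it assumes every node has the shape {title, slug, childList} implied by the property names used above, and the real pageTree may differ. `collectSlugs` and `sampleTree` are hypothetical, with sample slugs borrowed from the question's HTML:

```javascript
// Assumption: each node looks like {title, slug, childList}; childList may be
// missing or empty on leaf nodes.
function collectSlugs(node) {
  const own = node.slug ? [node.slug] : [];
  // Recurse into every child and flatten the results (pre-order traversal).
  const nested = (node.childList ?? []).flatMap(child => collectSlugs(child));
  return [...own, ...nested];
}

// Hypothetical sample tree, not the real site data.
const sampleTree = {
  title: "kl",
  slug: "/jsw/docs/start",
  childList: [
    {title: "ij", slug: "/jsw/docs/what/", childList: []},
    {
      title: "gh",
      slug: "/jsw/docs/where/",
      childList: [{title: "ef", slug: "/jsw/docs/common/"}],
    },
  ],
};
console.log(collectSlugs(sampleTree));
```

With the fetch version, something like pageTree.find(...) could be passed to collectSlugs in place of the fixed .childList[0].childList chain, provided the node shape assumption holds.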