Get src from img tag with puppeteer
Question
I want to get the URL from the src attribute of an img tag. This is the part of the HTML that contains the img tag and its src attribute:
<img alt="" class="responsive-img" src="https://i.guim.co.uk/img/media/6167380a1330877b8265353f2756b127c2226824/0_81_4256_2554/master/4256.jpg?width=300&amp;quality=85&amp;auto=format&amp;fit=max&amp;s=787aa5ddd44a8a66a06120452e228503">
Below are the HTML and the script I used, since that is what I have tried so far:
This is the HTML:
<div class="fc-item__container">
<div class="fc-item__media-wrapper">
<div class="fc-item__image-container u-responsive-ratio">
<picture><!--[if IE 9]><video style="display: none;"><![endif]-->
<source
media="(min-width: 980px) and (-webkit-min-device-pixel-ratio: 1.25), (min-width: 980px) and (min-resolution: 120dpi)"
srcset="https://i.guim.co.uk/img/media/6167380a1330877b8265353f2756b127c2226824/0_81_4256_2554/master/4256.jpg?width=140&amp;quality=45&amp;auto=format&amp;fit=max&amp;dpr=2&amp;s=35c7a9a7cc4e5ebd8fcdfcb67177a8f4 280w">
<source media="(min-width: 980px)"
srcset="https://i.guim.co.uk/img/media/6167380a1330877b8265353f2756b127c2226824/0_81_4256_2554/master/4256.jpg?width=140&amp;quality=85&amp;auto=format&amp;fit=max&amp;s=f68f029ce1b60ed96581f28a29062e3b 140w">
<source
media="(min-width: 740px) and (-webkit-min-device-pixel-ratio: 1.25), (min-width: 740px) and (min-resolution: 120dpi)"
srcset="https://i.guim.co.uk/img/media/6167380a1330877b8265353f2756b127c2226824/0_81_4256_2554/master/4256.jpg?width=140&amp;quality=45&amp;auto=format&amp;fit=max&amp;dpr=2&amp;s=35c7a9a7cc4e5ebd8fcdfcb67177a8f4 280w">
<source media="(min-width: 740px)"
srcset="https://i.guim.co.uk/img/media/6167380a1330877b8265353f2756b127c2226824/0_81_4256_2554/master/4256.jpg?width=140&amp;quality=85&amp;auto=format&amp;fit=max&amp;s=f68f029ce1b60ed96581f28a29062e3b 140w">
<!--[if IE 9]></video><![endif]-->
<img alt="" class="responsive-img"
src="https://i.guim.co.uk/img/media/6167380a1330877b8265353f2756b127c2226824/0_81_4256_2554/master/4256.jpg?width=300&amp;quality=85&amp;auto=format&amp;fit=max&amp;s=787aa5ddd44a8a66a06120452e228503">
</picture>
</div>
</div>
</div>
This is the Puppeteer script:
const fs = require("node:fs/promises");
const puppeteer = require("puppeteer"); // ^19.4.1
const url = "https://www.theguardian.com/international";
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setJavaScriptEnabled(false);
await page.setRequestInterception(true);
page.on("request", req => {
if (req.url() !== url) {
req.abort();
}
else {
req.continue();
}
});
await page.goto(url, { waitUntil: "domcontentloaded" });
const img_src = await page.$$eval(".fc-item__container", els =>
els.map(e => {
const text = s => e.querySelector(s)?.textContent.trim();
return {
src: e.querySelector(".fc-item__media-wrapper .responsive-img src"),
image: text(".fc-item__media-wrapper .responsive-img"),
};
})
);
console.log(img_src);
await fs.writeFile("img_src.json", JSON.stringify(img_src, null, 2));
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
The script runs, but all I get are null values and empty strings, like this:
[
{
"src": null,
"image": ""
},
{
"src": null,
"image": ""
}
]
As you can see, I tried two variations, but neither gives any result.
return {
src: e.querySelector(".fc-item__media-wrapper .responsive-img src"),
image: text(".fc-item__media-wrapper .responsive-img"),
};
Any help is much appreciated.
Answer 1
Score: 1
First of all, great job on blocking requests and disabling JS! This speeds up the script considerably and means we can rely purely on the view-source: HTML, which simplifies matters.
A problem is:
e.querySelector(".fc-item__media-wrapper .responsive-img src"),
This says "return the <src> tag within an element with class="responsive-img" within an element with class="fc-item__media-wrapper"". Since src is an attribute rather than an element, nothing matches and querySelector returns null. You probably mean:
e.querySelector(".fc-item__media-wrapper .responsive-img")
?.getAttribute("src")
As for the "text", I'm not sure what that refers to, since there's no text anywhere inside the .fc-item__media-wrapper element.
If you're looking for the kicker text or headline, here's one approach:
const fs = require("node:fs/promises");
const puppeteer = require("puppeteer"); // ^19.6.3
const url = "<Your URL>";
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setJavaScriptEnabled(false);
await page.setRequestInterception(true);
page.on("request", req => {
if (req.url() !== url) {
req.abort();
}
else {
req.continue();
}
});
await page.goto(url, {waitUntil: "domcontentloaded"});
const data = await page.$$eval(".fc-item__container", els =>
els.map(e => {
const $ = s => e.querySelector(s);
const text = s => $(s)?.textContent.trim();
return {
src: $(".fc-item__media-wrapper .responsive-img")
?.getAttribute("src"),
kicker: text(".fc-item__kicker"),
headline: text(".fc-item__headline"),
};
})
);
console.log(data);
await fs.writeFile("img_src.json", JSON.stringify(data, null, 2));
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
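Because the script above uses optional chaining, a container with no matching element produces undefined for that field, and JSON.stringify leaves undefined properties out of the written file. As a minimal sketch of one way to handle this, the hypothetical helpers below (fillMissing and onlyWithImage are illustrative names, not part of the answer, and the sample rows are invented) replace missing fields with null and optionally keep only rows that have an image URL. It assumes each row has the { src, kicker, headline } shape returned above.

```javascript
// Hypothetical helpers for post-processing the scraped rows.
// Assumption: each row has the shape { src, kicker, headline }.

// Replace fields that are undefined (or absent) with null so that
// every key still appears in the JSON output.
const fillMissing = rows =>
  rows.map(({ src, kicker, headline }) => ({
    src: src ?? null,
    kicker: kicker ?? null,
    headline: headline ?? null,
  }));

// Keep only the rows that actually carry an image URL.
const onlyWithImage = rows => rows.filter(row => row.src !== null);

// Sample data invented for illustration; real rows come from page.$$eval().
const sample = [
  { src: "https://example.com/a.jpg", kicker: "Live", headline: "First story" },
  { src: undefined, kicker: undefined, headline: "Story without an image" },
];

const filled = fillMissing(sample);
const withImages = onlyWithImage(filled);

console.log(JSON.stringify(filled, null, 2));
console.log(withImages.length);
```

In the script above, this could be applied as fillMissing(data) just before the fs.writeFile call, if a stable set of keys in img_src.json matters.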
By the way, once you've gotten to the point where you're blocking all requests and have disabled JS, you can often just use fetch (or axios, if you're not on Node 18 yet) with Cheerio. This simplifies matters and further speeds things up:
const cheerio = require("cheerio"); // 1.0.0-rc.12
const url = "<Your URL>";
fetch(url)
.then(res => {
if (!res.ok) {
throw Error(res.statusText);
}
return res.text();
})
.then(html => {
const $ = cheerio.load(html);
const data = [...$(".fc-item__container")].map(e => ({
src: $(e).find(".fc-item__media-wrapper .responsive-img").attr("src"),
kicker: $(e).find(".fc-item__kicker").text().trim(),
headline: $(e).find(".fc-item__headline").text().trim(),
}));
console.log(data);
})
.catch(err => console.error(err));
See also this related question from OP.