Why doesn't Cheerio XML parsing with Crawlee return text() for *some* keys?
Question
Consider an XML file like this one (a Google News RSS feed) with an item like this:
<item>
  <title>Test Like a Dragon Ishin...</title>
  <link>https://news.google.com/rss/articles/CBMie2....</link>
  <guid isPermaLink="false">CBMie2h0dHB...</guid>
  <pubDate>Fri, 17 Feb 2023 15:00:03 GMT</pubDate>
  <description>Test Like a Dragon Ishin...</description>
  <source url="https://www.jeuxvideo.com">jeuxvideo.com</source>
</item>
I'm trying to learn Cheerio (via Crawlee) and wrote the following (mostly working) function:
const crawler = new CheerioCrawler({
  async requestHandler({ request, response, body, contentType, $ }) {
    $("item").each(function (i, ref) {
      const el = $(ref);
      const title = el.find("title").text();
      const link = el.find('link').text();
      const published_on = el.find('pubdate').text();
      const published_by = $("source").text();
      const snippet = el.find("description").text();
      console.log("TITLE: ", title);
      console.log("LINK: ", link); // DOESN'T WORK
      console.log("PUBLISHED_ON: ", published_on);
      console.log("PUBLISHED_BY: ", published_by); // DOESN'T WORK
      console.log("SNIPPET: ", snippet )
      console.log("AUTHOR: ", author);
    });
  },
});
It might be obvious (except to me), but I don't understand why I can't retrieve the link and published_by content, whereas it works for the other fields.
Any clue?
Thanks a lot.
Answer 1
Score: 0
Passing {xml: true} as a Cheerio option works:
const cheerio = require("cheerio"); // ^1.0.0-rc.12
const html = `
<item>
  <title>Test Like a Dragon Ishin...</title>
  <link>https://news.google.com/rss/articles/CBMie2....</link>
  <guid isPermaLink="false">CBMie2h0dHB...</guid>
  <pubDate>Fri, 17 Feb 2023 15:00:03 GMT</pubDate>
  <description>Test Like a Dragon Ishin...</description>
  <source url="https://www.jeuxvideo.com">jeuxvideo.com</source>
</item>
`;
const $ = cheerio.load(html, {xml: true});
const items = [...$("item")].map(e => {
  const text = s => $(e).find(s).text().trim();
  return {
    title: text("title"),
    link: text("link"),
    publishedOn: text("pubDate"),
    publishedBy: text("source"),
    snippet: text("description"),
    guid: text("guid"),
    url: $(e).find("source").attr("url"),
    isPermaLink: $(e).find("guid").attr("isPermaLink"),
  };
});
console.log(items);
I'm not familiar with Crawlee, and scanning the docs doesn't show an obvious way to add this option to $. Presumably, you can use $ = cheerio.load(body, {xml: true}) inside the handler though.
Note also that I'm using pubDate rather than pubdate since XML is case-sensitive, while HTML isn't.
By the way, console.log($.html()) is a good way to debug this sort of problem; it shows what Cheerio parsed the text to.
Answer 2
Score: 0
This seems to be a bug in CheerioCrawler.
The cause is that the option xmlMode: true is not set on the DomHandler and WritableStream interfaces used to process the streaming XML input.
See my response in Crawlee issue #1794, "Why CheerioCrawler parsing doesn't return text() for some XML keys?"