Why doesn't Cheerio XML parsing with Crawlee return text() for *some* keys?

Question

Consider an XML file like this one (a Google News RSS feed), with items like this:

<item>
  <title>Test Like a Dragon Ishin...</title>
  <link>https://news.google.com/rss/articles/CBMie2....</link>
  <guid isPermaLink="false">CBMie2h0dHB...</guid>
  <pubDate>Fri, 17 Feb 2023 15:00:03 GMT</pubDate>
  <description>Test Like a Dragon Ishin...</description>
  <source url="https://www.jeuxvideo.com">jeuxvideo.com</source>
</item>

I'm trying to learn Cheerio (via Crawlee) and wrote the following (mostly working) handler:

import { CheerioCrawler } from "crawlee";

const crawler = new CheerioCrawler({
    async requestHandler({ request, response, body, contentType, $ }) {
        $("item").each(function (i, ref) {
            const el = $(ref);
            const title = el.find("title").text();
            const link = el.find("link").text();
            const published_on = el.find("pubdate").text();
            const published_by = $("source").text();
            const snippet = el.find("description").text();

            console.log("TITLE: ", title);
            console.log("LINK: ", link);                 // DOESN'T WORK
            console.log("PUBLISHED_ON: ", published_on);
            console.log("PUBLISHED_BY: ", published_by); // DOESN'T WORK
            console.log("SNIPPET: ", snippet);
        });
    },
});

It might be obvious to everyone but me, but I don't understand why I can't retrieve the link and published_by content when the other fields work.

Any clue?
Thanks a lot.

Answer 1

Score: 0

Passing {xml: true} as a Cheerio option works:

const cheerio = require("cheerio"); // ^1.0.0-rc.12

const html = `
<item>
  <title>Test Like a Dragon Ishin...</title>
  <link>https://news.google.com/rss/articles/CBMie2....</link>
  <guid isPermaLink="false">CBMie2h0dHB...</guid>
  <pubDate>Fri, 17 Feb 2023 15:00:03 GMT</pubDate>
  <description>Test Like a Dragon Ishin...</description>
  <source url="https://www.jeuxvideo.com">jeuxvideo.com</source>
</item>
`;
const $ = cheerio.load(html, {xml: true});
const items = [...$("item")].map(e => {
  const text = s => $(e).find(s).text().trim();
  return {
    title: text("title"),
    link: text("link"),
    publishedOn: text("pubDate"),
    publishedBy: text("source"),
    snippet: text("description"),
    guid: text("guid"),
    url: $(e).find("source").attr("url"),
    isPermaLink: $(e).find("guid").attr("isPermaLink"),
  };
});
console.log(items);

I'm not familiar with Crawlee, and scanning the docs doesn't show an obvious way to add this option to $. Presumably, you can use $ = cheerio.load(body, {xml: true}) inside the handler though.

Note also that I'm using pubDate rather than pubdate since XML is case-sensitive, while HTML isn't.

By the way, console.log($.html()) is a good way to debug this sort of problem; it shows what Cheerio parsed the text to.

Answer 2

Score: 0

This seems to be a bug in CheerioCrawler.
The cause is that the option xmlMode: true is not set on the DomHandler and WritableStream interface used to process the streaming XML input, so the feed is parsed as HTML.

See my response in Crawlee issue #1794, "Why CheerioCrawler parsing doesn't return text() for some XML keys?"

huangapple
  • Posted on February 19, 2023, 17:59:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75499291.html