2023-02-19 17:59:13

Why doesn't Cheerio XML parsing with Crawlee return text() for *some* keys?

Question

Consider an XML file like this one (a Google News RSS feed) with an item like this:

<item>
  <title>Test Like a Dragon Ishin...</title>
  <link>https://news.google.com/rss/articles/CBMie2....</link>
  <guid isPermaLink="false">CBMie2h0dHB...</guid>
  <pubDate>Fri, 17 Feb 2023 15:00:03 GMT</pubDate>
  <description>Test Like a Dragon Ishin...</description>
  <source url="https://www.jeuxvideo.com">jeuxvideo.com</source>
</item>

I'm trying to learn Cheerio (via Crawlee) and wrote the following (mostly working) function:

const crawler = new CheerioCrawler({
    async requestHandler({ request, response, body, contentType, $ }) {
        $("item").each(function (i, ref) {
            const el = $(ref);
            const title = el.find("title").text();
            const link = el.find('link').text();
            const published_on = el.find('pubdate').text();
            const published_by = $("source").text();
            const snippet = el.find("description").text();

            console.log("TITLE: ", title);
            console.log("LINK: ", link);                 // DOESN'T WORK
            console.log("PUBLISHED_ON: ", published_on);
            console.log("PUBLISHED_BY: ", published_by); // DOESN'T WORK
            console.log("SNIPPET: ", snippet )
            console.log("AUTHOR: ", author);
        });
    },
});

It might be obvious (except to me), but I do not understand why I can't retrieve the link and published_by content, whereas it works for the other fields.

Any clue?
Thanks a lot.

Answer 1

Score: 0

Passing {xml: true} as a Cheerio option works:

const cheerio = require("cheerio"); // ^1.0.0-rc.12

const html = `
<item>
  <title>Test Like a Dragon Ishin...</title>
  <link>https://news.google.com/rss/articles/CBMie2....</link>
  <guid isPermaLink="false">CBMie2h0dHB...</guid>
  <pubDate>Fri, 17 Feb 2023 15:00:03 GMT</pubDate>
  <description>Test Like a Dragon Ishin...</description>
  <source url="https://www.jeuxvideo.com">jeuxvideo.com</source>
</item>
`;
const $ = cheerio.load(html, {xml: true});
const items = [...$("item")].map(e => {
  const text = s => $(e).find(s).text().trim();
  return {
    title: text("title"),
    link: text("link"),
    publishedOn: text("pubDate"),
    publishedBy: text("source"),
    snippet: text("description"),
    guid: text("guid"),
    url: $(e).find("source").attr("url"),
    isPermaLink: $(e).find("guid").attr("isPermaLink"),
  };
});
console.log(items);

I'm not familiar with Crawlee, and scanning the docs doesn't show an obvious way to add this option to $. Presumably, you can use $ = cheerio.load(body, {xml: true}) inside the handler though.

Note also that I'm using pubDate rather than pubdate since XML is case-sensitive, while HTML isn't.

By the way, console.log($.html()) is a good way to debug this sort of problem; it shows what Cheerio parsed the text to.

Answer 2

Score: 0

This seems to be a bug in CheerioCrawler.
The cause is that the option xmlMode: true is not set on the DomHandler and WritableStream interfaces that process the streaming XML input.

See my response in Crawlee issue #1794, "Why CheerioCrawler parsing doesn't return text() for some XML keys?"
