January 3, 2020, 23:57:56

How deep is the visible area of HtmlAgilityPack?

Question

I need to grab some posts from a blog. All went well until I wanted to get the post creation date. The DOM tree for it is:

div class="stories-feed__container"
  -> article
     -> div class="story__main"
       -> div class="story__footer"
         -> div class="story__user user"
           -> div class="user__info-item"
             -> time datetime="date and time in UTC format"

So I wrote the code:

    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc = web.Load("https://pikabu.ru/@serhiy1994");
    string postDate = doc.DocumentNode.SelectSingleNode("//div[contains(@class, 'stories-feed__container')]/article[2]/div[contains(@class, 'story__main')]/div[contains(@class, 'story__footer')]/div[contains(@class, 'story__user user')]/div[contains(@class, 'user__info-item')]/time").GetAttributeValue("datetime", "NULL"); // e.g. for the 2nd article on the page

It throws a NullReferenceException.
But if you stop at the div class="story__user user" level, e.g.,

    string postDate = doc.DocumentNode.SelectSingleNode("//div[contains(@class, 'stories-feed__container')]/article[2]/div[contains(@class, 'story__main')]/div[contains(@class, 'story__footer')]/div[contains(@class, 'story__user user')]").InnerHtml;

it works properly and returns the inner HTML.
So I think HtmlAgilityPack has something like a "maximum visibility level", and you can't work with markup nested deeper than that.

Am I right, or am I coding something wrong?

The original page code is here: https://pastebin.com/jFC0XD9C

Answer 1

Score: 2

HtmlAgilityPack parses the entire document, no matter how deeply the elements are nested, so there is no depth limit. You also don't have to provide the entire path to reach the item you are looking for.

This will search the entire document and look for the first <div> tag whose class attribute is exactly user__info-item, then select its <time> child. If there are multiple matches, you can change SelectSingleNode to SelectNodes and loop through the results to get the dates.

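    HtmlWeb web = new HtmlWeb();
    // web.Load already returns a parsed document, so no separate "new HtmlDocument()" is needed
    HtmlAgilityPack.HtmlDocument doc = web.Load("https://pikabu.ru/@serhiy1994");
    // "//" matches at any depth, so the full path from the root is not required
    var postDate = doc.DocumentNode.SelectSingleNode("//div[@class='user__info-item']/time");
    // SelectSingleNode returns null when nothing matches
    Console.WriteLine(postDate?.InnerText);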

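What's wrong with your code?

Your code doesn't work because the XPath skips a div that sits between story__user user and user__info-item: <div class="user__info user__info_left">. A single / matches only direct children, so SelectSingleNode finds nothing and returns null, and calling GetAttributeValue on that null throws the NullReferenceException.

If you write your code like this, it works.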

    var timeNode = doc.DocumentNode.SelectSingleNode("//div[@class='story__main']/div[@class='story__footer']/div[@class='story__user user']/div[@class='user__info user__info_left']/div[@class='user__info-item']/time");
    Console.WriteLine(timeNode.InnerText);
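
To see the difference between / and // without hitting the live site, here is a minimal sketch that parses an inline fragment. The markup and the datetime value are made up for illustration and only mimic the structure described in the question; the real page has more elements around it.

```csharp
using System;
using HtmlAgilityPack;

// Made-up stand-in for one post's footer (not copied from the real page).
const string html = @"
<div class='story__user user'>
  <div class='user__info user__info_left'>
    <div class='user__info-item'>
      <time datetime='2020-01-03T20:57:56+00:00'>3 Jan</time>
    </div>
  </div>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// "/" matches direct children only. user__info-item is a grandchild of
// "story__user user" here, so this path matches nothing.
var direct = doc.DocumentNode.SelectSingleNode(
    "//div[@class='story__user user']/div[@class='user__info-item']/time");

// "//" matches at any depth, so the intermediate wrapper no longer matters.
var nested = doc.DocumentNode.SelectSingleNode(
    "//div[@class='story__user user']//div[@class='user__info-item']/time");

Console.WriteLine(direct == null); // True
// The question wanted the datetime attribute rather than the inner text.
Console.WriteLine(nested?.GetAttributeValue("datetime", "NULL"));
```

Using // between the two div steps trades precision for robustness: the query keeps working if the site adds or renames a wrapper, but it may also match more elements than intended on a busier page.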

Another way

Another way to do it is by searching for a parent div. Once you find the parent tag, search under that tag to find what you are looking for.

    var nodes = doc.DocumentNode.SelectNodes("//div[@class='story__user user']");
    // SelectNodes returns null (not an empty collection) when nothing matches
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            // Search within each node using .// notation
            var timeNode = node.SelectSingleNode(".//div[@class='user__info-item']/time");
            Console.WriteLine(timeNode?.InnerText);
        }
    }

Tested Code here
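
Since the question asked for the datetime attribute rather than the visible text, the loop above could be adapted roughly as follows. This is a sketch under assumptions: the inline markup is invented, the second post deliberately has no <time> element to show the null handling, and parsing with DateTimeOffset.TryParse assumes the attribute holds an ISO 8601 timestamp.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using HtmlAgilityPack;

// Made-up markup with two posts; the second one has no <time> element.
const string html = @"
<article><div class='story__user user'><div class='user__info user__info_left'>
  <div class='user__info-item'><time datetime='2020-01-03T20:57:56+00:00'>3 Jan</time></div>
</div></div></article>
<article><div class='story__user user'></div></article>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var dates = new List<DateTimeOffset>();
var users = doc.DocumentNode.SelectNodes("//div[@class='story__user user']");
if (users != null) // SelectNodes returns null when there are no matches
{
    foreach (HtmlNode user in users)
    {
        // The leading "." keeps the search inside the current node.
        HtmlNode time = user.SelectSingleNode(".//time");
        // Fall back to an empty string when the element or attribute is missing.
        string raw = time?.GetAttributeValue("datetime", "") ?? "";
        if (DateTimeOffset.TryParse(raw, CultureInfo.InvariantCulture,
                                    DateTimeStyles.None, out var parsed))
        {
            dates.Add(parsed);
        }
    }
}

Console.WriteLine(dates.Count); // 1
```

Posts without a parsable timestamp are skipped silently here; whether to skip, log, or fail is a design choice that depends on how the dates are used afterwards.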
