How deep is the visible area of HtmlAgilityPack?
Question
I need to grab some posts from a blog. All went well until I wanted to get the post creation date. The DOM tree for it is:
div class="stories-feed__container"
-> article
-> div class="story__main"
-> div class="story__footer"
-> div class="story__user user"
-> div class="user__info-item"
-> time datetime="date and time in UTC format".
So I wrote the code:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc = web.Load("https://pikabu.ru/@serhiy1994");
string postDate = doc.DocumentNode.SelectSingleNode("//div[contains(@class, 'stories-feed__container')]/article[2]/div[contains(@class, 'story__main')]/div[contains(@class, 'story__footer')]/div[contains(@class, 'story__user user')]/div[contains(@class, 'user__info-item')]/time").GetAttributeValue("datetime", "NULL"); // e.g. for the 2nd article on the page
But it throws a NullReferenceException.
However, if I stop at the div class="story__user user" level, e.g.
string postDate = doc.DocumentNode.SelectSingleNode("//div[contains(@class, 'stories-feed__container')]/article[2]/div[contains(@class, 'story__main')]/div[contains(@class, 'story__footer')]/div[contains(@class, 'story__user user')]").InnerHtml;
it works properly and returns the inner HTML.
So I think there is something like a "maximum visibility level" in HtmlAgilityPack, and you can't work with markup nested deeper than that.
Am I right, or am I coding something wrong?
The original page code is here: https://pastebin.com/jFC0XD9C
Answer 1
Score: 2
HtmlAgilityPack parses the entire document, no matter how deeply an element is nested, so there is no maximum depth. You also don't have to provide the entire path to reach the item you are looking for.
The code below searches the whole document for the first <div> tag whose class attribute is exactly user__info-item and takes its <time> child. If there are multiple matching tags, you can change SelectSingleNode to SelectNodes and loop through the results to get the dates.
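HtmlWeb web = new HtmlWeb();
// web.Load already returns a parsed HtmlDocument, so no separate "new HtmlDocument()" is needed.
HtmlAgilityPack.HtmlDocument doc = web.Load("https://pikabu.ru/@serhiy1994");
var postDate = doc.DocumentNode.SelectSingleNode("//div[@class='user__info-item']/time");
// InnerText is the text shown on the page; the machine-readable value is in the datetime attribute.
Console.WriteLine(postDate.InnerText);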
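What's wrong with your code?
Your code doesn't work because the path skips a div: <div class="user__info user__info_left"> sits between the story__user user div and the user__info-item div. A single / step matches only direct children, so the query finds nothing, SelectSingleNode returns null, and calling GetAttributeValue on that null is what throws the NullReferenceException.
If you write your code like this, it works.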
// Every intermediate element is listed, including the user__info user__info_left wrapper.
var timeNode = doc.DocumentNode.SelectSingleNode("//div[@class='story__main']/div[@class='story__footer']/div[@class='story__user user']/div[@class='user__info user__info_left']/div[@class='user__info-item']/time");
Console.WriteLine(timeNode.InnerText);
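To see why the skipped wrapper matters, here is a minimal sketch that runs against a small inline fragment rather than the live page. The markup, the datetime value, and the names ChildVersusDescendantDemo and ReadDate are made up for illustration; only the class names come from the question and answer above.

```csharp
using System;
using HtmlAgilityPack;

public static class ChildVersusDescendantDemo
{
    // Hypothetical, simplified markup modelled on the structure described above.
    const string SampleHtml = @"
<div class='story__user user'>
  <div class='user__info user__info_left'>
    <div class='user__info-item'>
      <time datetime='2019-05-01T10:00:00+03:00'>1 May</time>
    </div>
  </div>
</div>";

    // Returns the datetime attribute of the node matched by xpath,
    // or null when the XPath matches nothing.
    public static string ReadDate(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.GetAttributeValue("datetime", "");
    }

    public static void Main()
    {
        // "/" is the child axis: user__info-item is not a direct child of
        // story__user user, so this query is expected to match nothing.
        string skipped = ReadDate("//div[@class='story__user user']/div[@class='user__info-item']/time");
        Console.WriteLine(skipped ?? "not found");

        // "//" is the descendant axis: intermediate wrappers no longer matter.
        string found = ReadDate("//div[@class='story__user user']//time");
        Console.WriteLine(found ?? "not found");
    }
}
```

Checking the result for null before reading from it turns a missing element into an ordinary case to handle instead of a NullReferenceException, and reading the datetime attribute (rather than InnerText) gives the machine-readable value the question asked for.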
Another way
Another way to do it is by searching for a parent div. Once you find the parent tag, search under that tag to find what you are looking for.
// Note: SelectNodes returns null, not an empty collection, when nothing matches.
var nodes = doc.DocumentNode.SelectNodes("//div[@class='story__user user']");
foreach (HtmlNode node in nodes)
{
    // The leading "." makes the search relative to this node; "//" then matches descendants at any depth.
    var timeNode = node.SelectSingleNode(".//div[@class='user__info-item']/time");
    Console.WriteLine(timeNode.InnerText);
}
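If you want the dates of all posts as values rather than printed text, the same parent-then-child idea can be wrapped in a helper. This is only a sketch under a few assumptions: PostDateCollector and CollectDates are hypothetical names, the sample markup is invented, each post is assumed to live in its own article element, and the datetime attribute is assumed to hold an ISO 8601 timestamp.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using HtmlAgilityPack;

public static class PostDateCollector
{
    // Collects one parsed date per <article> that contains a <time datetime="..."> element.
    public static List<DateTimeOffset> CollectDates(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var dates = new List<DateTimeOffset>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection articles = doc.DocumentNode.SelectNodes("//article");
        if (articles == null)
        {
            return dates;
        }

        foreach (HtmlNode article in articles)
        {
            // Relative search: first <time> with a datetime attribute inside this article.
            HtmlNode time = article.SelectSingleNode(".//time[@datetime]");
            if (time == null)
            {
                continue; // e.g. an ad block or a post without a date
            }

            string raw = time.GetAttributeValue("datetime", "");
            DateTimeOffset parsed;
            if (DateTimeOffset.TryParse(raw, CultureInfo.InvariantCulture, DateTimeStyles.None, out parsed))
            {
                dates.Add(parsed);
            }
        }

        return dates;
    }

    public static void Main()
    {
        // Hypothetical markup: the second article has no <time> element.
        const string html = @"
<div class='stories-feed__container'>
  <article><div class='user__info-item'><time datetime='2019-05-01T10:00:00+03:00'>1 May</time></div></article>
  <article><div class='user__info-item'>no date here</div></article>
</div>";

        foreach (DateTimeOffset date in CollectDates(html))
        {
            Console.WriteLine(date.ToString("o", CultureInfo.InvariantCulture));
        }
    }
}
```

For the live page you would pass in the HTML obtained from web.Load instead of the inline string; whether every post there really sits in its own article element should be checked against the actual markup.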