xidel: wrong order of results on hacker news
Question
To scrape Hacker News, I use:
xidel -e '//span[@class="titleline"]/a/@href|//span[@class="titleline"]' https://news.ycombinator.com/newest
But the output is not in the expected order: the URL comes after the text, which makes it very difficult to parse.
Am I missing something to get the right order?
I get:
There Is No Such Thing as a Microservice (youtube.com)
https://www.youtube.com/watch?v=FXCLLsCGY0s
I expect:
https://www.youtube.com/watch?v=FXCLLsCGY0s
There Is No Such Thing as a Microservice (youtube.com)
Or even better:
https://www.youtube.com/watch?v=FXCLLsCGY0s There Is No Such Thing as a Microservice (youtube.com)
Answer 1
Score: 2
Please see "Using / on sequences rather than on sets" for why this happens and why you should use the XPath 3 simple map operator (!) in this case:
$ xidel -s "https://news.ycombinator.com/newest" -e '
  //span[@class="titleline"]/a ! (@href,.)
'
(Also, please specify the input first.)
For a simple string concatenation this isn't necessary:
-e '//span[@class="titleline"]/a/join((@href,.))'
-e '//span[@class="titleline"]/a/concat(@href," ",.)'
-e '//span[@class="titleline"]/a/x"{@href} {.}"'
(Bonus) Output to JSON:
$ xidel -s "https://news.ycombinator.com/newest" -e '
  array{
    //span[@class="titleline"]/a/{
      "title":.,
      "url":@href
    }
  }
'
Answer 2
Score: 1
Nodes are returned in document order, not in the order the paths are written in the XPath expression, so additional parsing is needed. With xmllint and awk:
xmllint --html --recover --xpath '//span[@class="titleline"]/a/@href|//span[@class="titleline"]/a/text()' tmp.html 2>/dev/null|\
gawk 'BEGIN{RS="\n? href="; FS="\n"}{ print $1, $2}' | tr -d '"'
Result:
https://github.com/thesephist/ink Ink: Minimal, functional programming language inspired by modern JavaScript, Go
https://controlleddigitallending.org/whitepaper/ A White Paper on Controlled Digital Lending of Library Books
item?id=35471687 Ask HN: Connect Guitar to Tesla?
Note: the XPath in this answer does not strictly need awk, since href comes before a/text() in document order, as the OP expected. The awk step is included for reference on how to change the output order.
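If you keep the expression from the question, another option is to swap each pair of lines after the fact. The following is only a sketch, assuming the output strictly alternates one title line followed by one URL line; the function name swap_pairs and the sample input are made up for illustration:

```shell
# swap_pairs (hypothetical helper): read alternating "title" / "url" lines
# on stdin and print "url title" on a single line per pair.
swap_pairs() {
  awk 'NR % 2 == 1 { title = $0; next }   # odd lines: remember the title
       { print $0, title }'               # even lines: print URL, then title
}

# Made-up sample input mimicking the order described in the question.
printf '%s\n' \
  'First story (example.com)' \
  'https://example.com/first' \
  'Second story (example.org)' \
  'https://example.org/second' | swap_pairs
```

This breaks as soon as an entry lacks a link or a title, so fixing the order in the XPath expression itself remains the more robust approach.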
Answer 3
Score: 0
Found a better way:
$ xidel -e '//span[@class="titleline"]/a/@href|//span[@class="titleline"]/a/text()' https://news.ycombinator.com/newest